Changes for page Data Model
Last modified by Robert Schaub on 2025/12/24 11:48
Summary: Page properties (1 modified, 0 added, 0 removed)
Details: Content
= Data Model =

FactHarbor's data model is **simple, focused, designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
**Additional cached fields in claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers background refresh if too old

**Update Strategy**:

* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)

...

* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Need to keep caches in sync

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track reliability of information sources over time
**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")

...

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage.
Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
**Key**: Automated source reliability tracking

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:

{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability
    Never triggered by individual claim analysis
    """
    # ...
    source.last_updated = now()
    source.save()
    # Job runs: Sunday 2 AM UTC
    # Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:

{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores
    READ source scores, never UPDATE them
    """
    # ...
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:

* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases

...

```
...
→ NYT score: 0.89 (trending up)
→ Blog X score: 0.48 (trending down)
```

=== 1.4 Scenario ===

**Purpose**: Different interpretations or contexts for evaluating claims
**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
**Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario

...

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:

* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")

...

**Example**:
For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":

* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

=== User Reputation System ===

**V1.0 Approach**: Simple manual role assignment
**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.

=== Roles (Manual Assignment) ===

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

...
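The `likelihood_range` strings used by Verdict records (e.g., "0.40-0.65 (uncertain)") follow a regular shape that is easy to parse. The helper below is an illustrative sketch only: the "low-high (label)" format is assumed from this document's examples, not from a formal specification.

```python
import re

def parse_likelihood_range(value: str):
    """Return (low, high, label) from a likelihood_range string.

    Format assumed from the document's examples, e.g.
    "0.40-0.65 (uncertain)" or "0.75-0.85 (likely true)".
    """
    m = re.match(r"\s*(\d?\.\d+)\s*-\s*(\d?\.\d+)\s*\(([^)]+)\)\s*$", value)
    if m is None:
        raise ValueError(f"unrecognized likelihood_range: {value!r}")
    return float(m.group(1)), float(m.group(2)), m.group(3)
```

For example, `parse_likelihood_range("0.75-0.85 (likely true)")` yields `(0.75, 0.85, "likely true")`.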
* System configuration
* Access to all features
* Founder-appointed initially

=== Contribution Tracking (Simple) ===

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

...

* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

=== Promotion Process ===

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews user's contribution history
3. Moderator promotes user to contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

=== V2.0+ Evolution ===

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes bottleneck
* Clear patterns of abuse emerge requiring automation

...

* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
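The manual promotion rules above can be sketched as a tiny permission table. This is a hedged illustration, not the system's implementation: the role names follow the document (reader/contributor/moderator, plus the founder-appointed admin), but the `promote` helper and the exact tuple shape are assumptions.

```python
# Allowed manual promotions, per the process described above:
# a moderator promotes a reader to contributor; an admin promotes a
# trusted contributor to moderator. No automated escalation exists.
ALLOWED_PROMOTIONS = {
    ("moderator", "reader", "contributor"),
    ("admin", "reader", "contributor"),      # assumption: admin can also do this
    ("admin", "contributor", "moderator"),
}

def promote(actor_role: str, user_role: str, new_role: str) -> str:
    """Return the user's new role if this manual promotion is allowed."""
    if (actor_role, user_role, new_role) not in ALLOWED_PROMOTIONS:
        raise PermissionError(
            f"{actor_role} cannot promote {user_role} to {new_role}"
        )
    return new_role
```

Anything outside the table (for example, a reader promoting anyone) raises `PermissionError`, mirroring "no automated privilege escalation".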
=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes

=== Edit History Details ===

**What Gets Edited**:

* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence

...

* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to previous version

**Retention Policy** (5 years total):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
**Use Cases**:

* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases.

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking
**Usage**:

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements
**Usage**:

* **Error capture**: When users flag issues or system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor

...

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.5 User Class Diagram ==

{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

...

```
...
Claim V2: "The sky is blue during daytime"
→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```

== 2.5. Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs compute dynamically?
**Trade-off**:

* **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: 1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: 2 KB per evidence (with excerpt)
* **Growth**: 3-10 evidence per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: 500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: 2 KB per edit
* **Growth**: Linear with edits (roughly 20% of claims are edited)
* **Retention**: Hot storage 3 months → warm storage 2 years → archive to S3 for 3 years → delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: 500 bytes per flag
* **Growth**: 5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: 1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: 200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when claim first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when system significantly improves OR claim edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: 2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)

...

* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)

...
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios ====

⚠️ CRITICAL DECISION

* **What**: 2-5 possible interpretations of claim with assumptions
* **Current design**: Stored in Scenario table
* **Alternative**: Compute on-demand when user views claim details
* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by 20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs

...

* **Speed**: Computed = 5-8 seconds delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED):

1. **Store scenarios** initially when claim analyzed
2. **Mark as stale** when system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time claim viewed

...
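The approved scenario strategy (store, mark stale, recompute on next view, 30-day cache) can be sketched as a small read path. This is a minimal illustration under stated assumptions: `generate_scenarios` stands in for the expensive LLM call, and the field names (`scenarios`, `stale`, `scenarios_cached_at`) are hypothetical, not the actual schema.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # per the approved strategy above

def get_scenarios(claim: dict, generate_scenarios, now=None):
    """Return cached scenarios unless stale or expired;
    otherwise recompute (the costly LLM path) and refresh the cache."""
    now = now or datetime.now(timezone.utc)
    cached = claim.get("scenarios")
    fresh = (
        cached is not None
        and not claim.get("stale", False)
        and now - claim["scenarios_cached_at"] < CACHE_TTL
    )
    if fresh:
        return cached
    claim["scenarios"] = generate_scenarios(claim)  # $0.005-0.01 LLM call
    claim["scenarios_cached_at"] = now
    claim["stale"] = False
    return claim["scenarios"]
```

A system-improvement deployment would simply set `stale = True` on affected claims; the next view then pays the recompute cost once.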
* **Speed**: 2-3 seconds (acceptable)

**Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in User table
* **Alternative**: Compute from Edit table

...
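The stored-counter decision for user reputation can be illustrated with a sketch of both options: a cached `contributions_count` kept in sync on every contribution (the chosen, read-heavy-friendly design) versus deriving the count from the Edit table on demand (in SQL terms, a `COUNT` over edits). The class and function names here are illustrative assumptions.

```python
# Sketch: stored counter (chosen) vs computing the count from edits.
# The stored counter costs O(1) per read and one increment per write.

class User:
    def __init__(self):
        self.contributions_count = 0  # stored field, per the decision above

def record_contribution(user: User, edit_log: list) -> None:
    """Append an edit record and keep the cached counter in sync."""
    edit_log.append({"user": user})
    user.contributions_count += 1

def computed_count(user: User, edit_log: list) -> int:
    """The COMPUTE alternative: derive the count on demand."""
    return sum(1 for e in edit_log if e["user"] is user)
```

Both paths agree as long as every write goes through `record_contribution`; the stored counter just avoids scanning the edit log on every user action.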
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
|-----------|---------|---------|----------------|----------|-----------|
| Claim core | ✅ | - | 1 KB | STORE | Essential |
| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
| Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |

**Total storage per claim**: 18 KB (without edits and flags)
**For 1 million claims**:

* **Storage**: 18 GB (manageable)
* **PostgreSQL**: $50/month (standard instance)
* **Redis cache**: $20/month (1 GB instance)
* **S3 archives**: $5/month (old edits)
* **Total**: $75/month infrastructure

**LLM cost savings by caching**:

* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
* **Total savings**: $35K per 1M claims vs recomputing every time

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute:**

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:

* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)

=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims

...

* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable, LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 3. What Gets Stored in the Database ==

=== 3.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:

...
**QualityMetric Table**:

* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

=== 3.2 What's NOT Stored (Computed on-the-fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested
* **Risk scores**: Recalculated based on current factors
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from Elasticsearch index

=== 3.3 Cache Layer (Redis) ===

**Cached for performance**:

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)

=== 3.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

=== 3.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync.

== 4. Related Pages ==

* [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]