Changes for page Data Model
Last modified by Robert Schaub on 2025/12/24 11:48
Summary
- Page properties (1 modified, 0 added, 0 removed)
Details
- Page properties
- Content
... ... @@ -1,32 +1,25 @@ 1 1 = Data Model = 2 - 3 3 FactHarbor's data model is **simple, focused, and designed for automated processing**. 4 - 5 5 == 1. Core Entities == 6 - 7 7 === 1.1 Claim === 8 - 9 9 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count 10 - 11 11 ==== Performance Optimization: Denormalized Fields ==== 12 - 13 13 **Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by ~70%. 14 14 **Additional cached fields in claims table**: 15 - 16 16 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores 17 -* Avoids joining evidence table for listing/preview 18 -* Updated when evidence is added/removed 19 -* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]` 10 + * Avoids joining evidence table for listing/preview 11 + * Updated when evidence is added/removed 12 + * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]` 20 20 * **source_names** (TEXT[]): Array of source names for quick display 21 -* Avoids joining through evidence to sources 22 -* Updated when sources change 23 -* Format: `["New York Times", "Nature Journal", ...]` 14 + * Avoids joining through evidence to sources 15 + * Updated when sources change 16 + * Format: `["New York Times", "Nature Journal", ...]` 24 24 * **scenario_count** (INTEGER): Number of scenarios for this claim 25 -* Quick metric without counting rows 26 -* Updated when scenarios added/removed 18 + * Quick metric without counting rows 19 + * Updated when scenarios added/removed 27 27 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed 28 -* Helps invalidate stale caches 29 -* Triggers background refresh if too old 21 + * Helps invalidate stale caches 22 + * Triggers background refresh if too old 30 30 **Update Strategy**: 31 31 * **Immediate**: Update on claim edit (user-facing) 
32 32 * **Deferred**: Update via background job every hour (non-critical) ... ... @@ -35,18 +35,13 @@ 35 35 * ✅ 70% fewer joins on common queries 36 36 * ✅ Much faster claim list/search pages 37 37 * ✅ Better user experience 38 -* ⚠️ Small storage increase (10%) 31 +* ⚠️ Small storage increase (~10%) 39 39 * ⚠️ Need to keep caches in sync 40 - 41 41 === 1.2 Evidence === 42 - 43 43 Fields: claim_id, source_id, excerpt, url, relevance_score, supports 44 - 45 45 === 1.3 Source === 46 - 47 47 **Purpose**: Track reliability of information sources over time 48 48 **Fields**: 49 - 50 50 * **id** (UUID): Unique identifier 51 51 * **name** (text): Source name (e.g., "New York Times", "Nature Journal") 52 52 * **domain** (text): Website domain (e.g., "nytimes.com") ... ... @@ -64,21 +64,17 @@ 64 64 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage 65 65 Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency** 66 66 **Key**: Automated source reliability tracking 67 - 68 68 ==== Source Scoring Process (Separation of Concerns) ==== 69 - 70 70 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis. 71 71 **The Problem**: 72 - 73 73 * Source scores should influence claim verdicts 74 74 * Claim verdicts should update source scores 75 75 * But: Direct feedback creates circular dependency and potential feedback loops 76 76 **The Solution**: Temporal separation 77 - 78 78 ==== Weekly Background Job (Source Scoring) ==== 79 - 80 80 Runs independently of claim analysis: 81 -{{code language="python"}}def update_source_scores_weekly(): 64 +{{code language="python"}} 65 +def update_source_scores_weekly(): 82 82 """ 83 83 Background job: Calculate source reliability 84 84 Never triggered by individual claim analysis ... ... 
@@ -98,12 +98,12 @@ 98 98 source.last_updated = now() 99 99 source.save() 100 100 # Job runs: Sunday 2 AM UTC 101 - # Never during claim processing {{/code}}102 - 85 + # Never during claim processing 86 +{{/code}} 103 103 ==== Real-Time Claim Analysis (AKEL) ==== 104 - 105 105 Uses source scores but never updates them: 106 -{{code language="python"}}def analyze_claim(claim_text): 89 +{{code language="python"}} 90 +def analyze_claim(claim_text): 107 107 """ 108 108 Real-time: Analyze claim using current source scores 109 109 READ source scores, never UPDATE them ... ... @@ -120,12 +120,10 @@ 120 120 verdict = synthesize_verdict(evidence_list) 121 121 # NEVER update source scores here 122 122 # That happens in weekly background job 123 - return verdict {{/code}}124 - 107 + return verdict 108 +{{/code}} 125 125 ==== Monthly Audit (Quality Assurance) ==== 126 - 127 127 Moderator review of flagged source scores: 128 - 129 129 * Verify scores make sense 130 130 * Detect gaming attempts 131 131 * Identify systematic biases ... ... @@ -165,14 +165,11 @@ 165 165 → NYT score: 0.89 (trending up) 166 166 → Blog X score: 0.48 (trending down) 167 167 ``` 168 - 169 169 === 1.4 Scenario === 170 - 171 171 **Purpose**: Different interpretations or contexts for evaluating claims 172 172 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated. 173 173 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim) 174 174 **Fields**: 175 - 176 176 * **id** (UUID): Unique identifier 177 177 * **claim_id** (UUID): Foreign key to claim (one-to-many) 178 178 * **description** (text): Human-readable description of the scenario ... ... @@ -193,7 +193,6 @@ 193 193 **Purpose**: Assessment of a claim within a specific scenario context. 
Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence. 194 194 195 195 **Core Fields**: 196 - 197 197 * **id** (UUID): Primary key 198 198 * **scenario_id** (UUID FK): The scenario being assessed 199 199 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)") ... ... @@ -209,7 +209,6 @@ 209 209 210 210 **Example**: 211 211 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)": 212 - 213 213 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"] 214 214 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"] 215 215 * Edit entity records the complete before/after change with timestamp and reason ... ... @@ -217,18 +217,12 @@ 217 217 **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios. 218 218 219 219 === 1.6 User === 220 - 221 221 Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count 222 - 223 -=== User Reputation System === 224 - 198 +=== User Reputation System === 225 225 **V1.0 Approach**: Simple manual role assignment 226 226 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary. 227 - 228 228 === Roles (Manual Assignment) === 229 - 230 230 **reader** (default): 231 - 232 232 * View published claims and evidence 233 233 * Browse and search content 234 234 * No editing permissions ... ... 
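The before/after tracking described for Verdicts (and applied uniformly through the centralized Edit entity) can be sketched as follows. This is a minimal illustration: the helper name `record_edit` and the `CONTENT_UPDATE` edit type are assumptions for the sketch, not the actual FactHarbor API, though the field names follow the Edit fields listed on this page.

```python
# Sketch: building an Edit row that captures a complete before/after verdict
# change, per the "Key Design" note above. Illustrative names, not real APIs.
import json
from datetime import datetime, timezone

def record_edit(entity_type, entity_id, user_id, before, after, reason):
    """Return an Edit record with full before/after state as JSON."""
    return {
        "entity_type": entity_type,
        "entity_id": entity_id,
        "user_id": user_id,
        "before_state": json.dumps(before),
        "after_state": json.dumps(after),
        "edit_type": "CONTENT_UPDATE",  # assumed type name for illustration
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

# The verdict example from the text: likelihood shifts after new evidence.
edit = record_edit(
    "verdict", "v-123", "u-456",
    {"likelihood_range": "0.40-0.65 (uncertain)"},
    {"likelihood_range": "0.70-0.85 (likely true)"},
    "New clinical trial evidence added",
)
```

Because both states are stored whole, a revert is just re-applying `before_state`.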
@@ -247,11 +247,8 @@ 247 247 * System configuration 248 248 * Access to all features 249 249 * Founder-appointed initially 250 - 251 251 === Contribution Tracking (Simple) === 252 - 253 253 **Basic metrics only**: 254 - 255 255 * `contributions_count`: Total number of contributions 256 256 * `created_at`: Account age 257 257 * `last_active`: Recent activity ... ... @@ -260,26 +260,19 @@ 260 260 * No automated privilege escalation 261 261 * No reputation decay 262 262 * No threshold-based promotions 263 - 264 264 === Promotion Process === 265 - 266 266 **Manual review by moderators/admins**: 267 - 268 268 1. User demonstrates value through contributions 269 269 2. Moderator reviews user's contribution history 270 270 3. Moderator promotes user to contributor role 271 271 4. Admin promotes trusted contributors to moderator 272 272 **Criteria** (guidelines, not automated): 273 - 274 274 * Quality of contributions 275 275 * Consistency over time 276 276 * Collaborative behavior 277 277 * Understanding of project goals 278 - 279 279 === V2.0+ Evolution === 280 - 281 281 **Add complex reputation when**: 282 - 283 283 * 100+ active contributors 284 284 * Manual role management becomes bottleneck 285 285 * Clear patterns of abuse emerge requiring automation ... ... @@ -289,16 +289,11 @@ 289 289 * Reputation decay for inactive users 290 290 * Track record scoring for contributors 291 291 See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers. 
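The V1.0 role model above (three roles, manual moderator-driven promotion, no automated escalation) is simple enough to sketch directly. All names here are illustrative, not the actual schema:

```python
# Sketch of V1.0 role-based permissions: three roles, manual assignment.
# Permission and role names are illustrative, not the real FactHarbor schema.

PERMISSIONS = {
    "reader": {"view", "search"},
    "contributor": {"view", "search", "edit", "add_evidence"},
    "moderator": {"view", "search", "edit", "add_evidence", "hide", "promote"},
}

def can(role: str, action: str) -> bool:
    """Check whether a role may perform an action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())

def promote(user: dict, new_role: str, promoted_by_role: str) -> dict:
    """Manual promotion only: a moderator promotes a user to contributor."""
    if promoted_by_role != "moderator" or new_role != "contributor":
        raise PermissionError("promotion is a manual moderator action in V1.0")
    return {**user, "role": new_role}
```

No reputation thresholds appear anywhere in the check, which is exactly the point of deferring the complex system to V2.0+.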
292 - 293 293 === 1.7 Edit === 294 - 295 295 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at 296 296 **Purpose**: Complete audit trail for all content changes 297 - 298 298 === Edit History Details === 299 - 300 300 **What Gets Edited**: 301 - 302 302 * **Claims** (20% edited): assertion, domain, status, scores, analysis 303 303 * **Evidence** (10% edited): excerpt, relevance_score, supports 304 304 * **Scenarios** (5% edited): description, assumptions, confidence ... ... @@ -315,14 +315,12 @@ 315 315 * `MODERATION_ACTION`: Hide/unhide for abuse 316 316 * `REVERT`: Rollback to previous version 317 317 **Retention Policy** (5 years total): 318 - 319 319 1. **Hot storage** (3 months): PostgreSQL, instant access 320 320 2. **Warm storage** (2 years): Partitioned, slower queries 321 321 3. **Cold storage** (3 years): S3 compressed, download required 322 322 4. **Deletion**: After 5 years (except legal holds) 323 -**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit) 278 +**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit) 324 324 **Use Cases**: 325 - 326 326 * View claim history timeline 327 327 * Detect vandalism patterns 328 328 * Learn from user corrections (system improvement) ... ... 
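The retention tiers above reduce to a small age-to-tier mapping. A sketch, with cutoffs approximating the stated 3-month / 2-year / 5-year boundaries (the function name is illustrative):

```python
# Sketch of the 5-year edit retention policy: hot (3 months, PostgreSQL),
# warm (2 years, partitioned), cold (to 5 years total, S3), then deletion.
# Day counts approximate the month/year boundaries in the text.
from datetime import datetime, timedelta

def retention_tier(edit_created_at: datetime, now: datetime) -> str:
    """Map an edit's age onto its storage tier under the retention policy."""
    age = now - edit_created_at
    if age <= timedelta(days=90):             # ~3 months: instant access
        return "hot"
    if age <= timedelta(days=90 + 2 * 365):   # + ~2 years: slower queries
        return "warm"
    if age <= timedelta(days=5 * 365):        # up to ~5 years total: S3
        return "cold"
    return "delete"                           # past 5 years (barring legal holds)
```

A background job would run this over edit timestamps to pick candidates for each migration step.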
@@ -329,17 +329,12 @@ 329 329 * Legal compliance (audit trail) 330 330 * Rollback capability 331 331 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases 332 - 333 333 === 1.8 Flag === 334 - 335 335 Fields: entity_id, reported_by, issue_type, status, resolution_note 336 - 337 -=== 1.9 QualityMetric === 338 - 288 +=== 1.9 QualityMetric === 339 339 **Fields**: metric_type, category, value, target, timestamp 340 340 **Purpose**: Time-series quality tracking 341 341 **Usage**: 342 - 343 343 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times 344 344 * **Quality dashboard**: Real-time display with trend charts 345 345 * **Alerting**: Automatic alerts when metrics exceed thresholds ... ... @@ -346,13 +346,10 @@ 346 346 * **A/B testing**: Compare control vs treatment metrics 347 347 * **Improvement validation**: Measure before/after changes 348 348 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}` 349 - 350 -=== 1.10 ErrorPattern === 351 - 298 +=== 1.10 ErrorPattern === 352 352 **Fields**: error_category, claim_id, description, root_cause, frequency, status 353 353 **Purpose**: Capture errors to trigger system improvements 354 354 **Usage**: 355 - 356 356 * **Error capture**: When users flag issues or system detects problems 357 357 * **Pattern analysis**: Weekly grouping by category and frequency 358 358 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor ... ... @@ -364,13 +364,9 @@ 364 364 {{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}} 365 365 366 366 == 1.5 User Class Diagram == 367 - 368 368 {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}} 369 - 370 370 == 2. 
Versioning Strategy == 371 - 372 372 **All Content Entities Are Versioned**: 373 - 374 374 * **Claim**: Every edit creates new version (V1→V2→V3...) 375 375 * **Evidence**: Changes tracked in edit history 376 376 * **Scenario**: Modifications versioned ... ... @@ -391,91 +391,68 @@ 391 391 Claim V2: "The sky is blue during daytime" 392 392 → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"} 393 393 ``` 394 - 395 395 == 2.5. Storage vs Computation Strategy == 396 - 397 397 **Critical architectural decision**: What to persist in databases vs compute dynamically? 398 398 **Trade-off**: 399 - 400 400 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs 401 401 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible 402 - 403 403 === Recommendation: Hybrid Approach === 404 - 405 405 **STORE (in PostgreSQL):** 406 - 407 407 ==== Claims (Current State + History) ==== 408 - 409 409 * **What**: assertion, domain, status, created_at, updated_at, version 410 410 * **Why**: Core entity, must be persistent 411 411 * **Also store**: confidence_score (computed once, then cached) 412 -* **Size**: 1 KB per claim 347 +* **Size**: ~1 KB per claim 413 413 * **Growth**: Linear with claims 414 414 * **Decision**: ✅ STORE - Essential 415 - 416 416 ==== Evidence (All Records) ==== 417 - 418 418 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at 419 419 * **Why**: Hard to re-gather, user contributions, reproducibility 420 -* **Size**: 2 KB per evidence (with excerpt) 353 +* **Size**: ~2 KB per evidence (with excerpt) 421 421 * **Growth**: 3-10 evidence per claim 422 422 * **Decision**: ✅ STORE - Essential for reproducibility 423 - 424 424 ==== Sources (Track Records) ==== 425 - 426 426 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency 427 427 * **Why**: Continuously updated, expensive to 
recompute 428 -* **Size**: 500 bytes per source 359 +* **Size**: ~500 bytes per source 429 429 * **Growth**: Slow (limited number of sources) 430 430 * **Decision**: ✅ STORE - Essential for quality 431 - 432 432 ==== Edit History (All Versions) ==== 433 - 434 434 * **What**: before_state, after_state, user_id, reason, timestamp 435 435 * **Why**: Audit trail, legal requirement, reproducibility 436 -* **Size**: 2 KB per edit 437 -* **Growth**: Linear with edits (A portion of claims get edited) 365 +* **Size**: ~2 KB per edit 366 +* **Growth**: Linear with edits (a portion of claims get edited) 438 438 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total 439 439 * **Decision**: ✅ STORE - Essential for accountability 440 - 441 441 ==== Flags (User Reports) ==== 442 - 443 443 * **What**: entity_id, reported_by, issue_type, description, status 444 444 * **Why**: Error detection, system improvement triggers 445 -* **Size**: 500 bytes per flag 372 +* **Size**: ~500 bytes per flag 446 446 * **Growth**: 5-10% of claims get flagged 447 447 * **Decision**: ✅ STORE - Essential for improvement 448 - 449 449 ==== ErrorPatterns (System Improvement) ==== 450 - 451 451 * **What**: error_category, claim_id, description, root_cause, frequency, status 452 452 * **Why**: Learning loop, prevent recurring errors 453 -* **Size**: 1 KB per pattern 378 +* **Size**: ~1 KB per pattern 454 454 * **Growth**: Slow (limited patterns, many fixed) 455 455 * **Decision**: ✅ STORE - Essential for learning 456 - 457 457 ==== QualityMetrics (Time Series) ==== 458 - 459 459 * **What**: metric_type, category, value, target, timestamp 460 460 * **Why**: Trend analysis, cannot recreate historical metrics 461 -* **Size**: 200 bytes per metric 384 +* **Size**: ~200 bytes per metric 462 462 * **Growth**: Hourly = 8,760 per year per metric type 463 463 * **Retention**: 2 years hot, then aggregate and archive 464 464 * **Decision**: ✅
STORE - Essential for monitoring 465 465 **STORE (Computed Once, Then Cached):** 466 - 467 467 ==== Analysis Summary ==== 468 - 469 469 * **What**: Neutral text summary of claim analysis (200-500 words) 470 470 * **Computed**: Once by AKEL when claim first analyzed 471 471 * **Stored in**: Claim table (text field) 472 472 * **Recomputed**: Only when system significantly improves OR claim edited 473 473 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often 474 -* **Size**: 2 KB per claim 395 +* **Size**: ~2 KB per claim 475 475 * **Decision**: ✅ STORE (cached) - Cost-effective 476 - 477 477 ==== Confidence Score ==== 478 - 479 479 * **What**: 0-100 score of analysis confidence 480 480 * **Computed**: Once by AKEL 481 481 * **Stored in**: Claim table (integer field) ... ... @@ -483,9 +483,7 @@ 483 483 * **Why store**: Cheap to store, expensive to compute, users need it fast 484 484 * **Size**: 4 bytes per claim 485 485 * **Decision**: ✅ STORE (cached) - Performance critical 486 - 487 487 ==== Risk Score ==== 488 - 489 489 * **What**: 0-100 score of claim risk level 490 490 * **Computed**: Once by AKEL 491 491 * **Stored in**: Claim table (integer field) ... ... 
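The "computed once, then cached" rule applied above to analysis summaries, confidence scores, and risk scores might look like this minimal sketch. `akel_generate` and `save` are stand-ins for the real AKEL call and persistence layer, not actual APIs:

```python
# Sketch of compute-once-then-cache: the expensive LLM call runs only when
# the cached value is missing or the claim is marked stale. Illustrative names.

def get_analysis_summary(claim, akel_generate, save):
    """Return the cached summary; regenerate only if missing or stale."""
    if claim.get("analysis_summary") is None or claim.get("stale", False):
        claim["analysis_summary"] = akel_generate(claim["assertion"])  # $0.01-0.05/call
        claim["stale"] = False
        save(claim)
    return claim["analysis_summary"]

# Usage: a fake AKEL that counts calls shows the cache absorbing repeat reads.
calls = []
def fake_akel(text):
    calls.append(text)
    return f"Summary of: {text}"

claim = {"assertion": "The sky is blue"}
first = get_analysis_summary(claim, fake_akel, save=lambda c: None)
second = get_analysis_summary(claim, fake_akel, save=lambda c: None)
```

Read-heavy traffic (95% reads) hits only the cached field; the LLM cost is paid once per claim per invalidation.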
@@ -494,17 +494,13 @@ 494 494 * **Size**: 4 bytes per claim 495 495 * **Decision**: ✅ STORE (cached) - Performance critical 496 496 **COMPUTE DYNAMICALLY (Never Store):** 497 - 498 -==== Scenarios ==== 499 - 500 - ⚠️ CRITICAL DECISION 501 - 414 +==== Scenarios ==== ⚠️ CRITICAL DECISION 502 502 * **What**: 2-5 possible interpretations of claim with assumptions 503 503 * **Current design**: Stored in Scenario table 504 504 * **Alternative**: Compute on-demand when user views claim details 505 -* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim 418 +* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim 506 506 * **Compute cost**: $0.005-0.01 per request (LLM API call) 507 -* **Frequency**: Viewed in detail by 20% of users 420 +* **Frequency**: Viewed in detail by ~20% of users 508 508 * **Trade-off analysis**: 509 509 - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access 510 510 - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs ... ... @@ -512,17 +512,12 @@ 512 512 * **Speed**: Computed = 5-8 seconds delay, Stored = instant 513 513 * **Decision**: ✅ STORE (hybrid approach below) 514 514 **Scenario Strategy** (APPROVED): 515 - 516 516 1. **Store scenarios** initially when claim analyzed 517 517 2. **Mark as stale** when system improves significantly 518 518 3. **Recompute on next view** if marked stale 519 519 4. **Cache for 30 days** if frequently accessed 520 520 5. **Result**: Best of both worlds - speed + freshness 521 - 522 -==== Verdict Synthesis ==== 523 - 524 - 525 - 433 +==== Verdict Synthesis ==== 526 526 * **What**: Final conclusion text synthesizing all scenarios 527 527 * **Compute cost**: $0.002-0.005 per request 528 528 * **Frequency**: Every time claim viewed ... ... 
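The approved scenario strategy (store on first analysis, mark stale on system improvements, recompute on next view, cache 30 days) can be sketched as below. All field and function names are illustrative; `generate` stands in for the LLM call:

```python
# Sketch of the hybrid scenario strategy: lazy recomputation with a
# staleness flag and a 30-day cache window. Illustrative names only.
from datetime import datetime, timedelta

def get_scenarios(claim, generate, now):
    """Return cached scenarios, recomputing if absent, stale, or expired."""
    cached_at = claim.get("scenarios_cached_at")
    fresh = (
        claim.get("scenarios") is not None
        and not claim.get("scenarios_stale", False)
        and cached_at is not None
        and now - cached_at <= timedelta(days=30)
    )
    if not fresh:
        claim["scenarios"] = generate(claim["assertion"])  # LLM, ~$0.005-0.01
        claim["scenarios_stale"] = False
        claim["scenarios_cached_at"] = now
    return claim["scenarios"]

# Usage: two views within 30 days cost one LLM call; a system improvement
# flips the stale flag and triggers exactly one recompute on the next view.
llm_calls = []
def gen(text):
    llm_calls.append(text)
    return [f"interpretation of {text}"]

t0 = datetime(2025, 1, 1)
claim = {"assertion": "Exercise improves mental health"}
get_scenarios(claim, gen, t0)
get_scenarios(claim, gen, t0 + timedelta(days=5))
claim["scenarios_stale"] = True  # e.g., system improvement deployed
get_scenarios(claim, gen, t0 + timedelta(days=6))
```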
@@ -530,23 +530,17 @@ 530 530 * **Speed**: 2-3 seconds (acceptable) 531 531 **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale 532 532 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur 533 - 534 534 ==== Search Results ==== 535 - 536 536 * **What**: Lists of claims matching search query 537 537 * **Compute from**: Elasticsearch index 538 538 * **Cache**: 15 minutes in Redis for popular queries 539 539 * **Why not store permanently**: Constantly changing, infinite possible queries 540 - 541 541 ==== Aggregated Statistics ==== 542 - 543 543 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc. 544 544 * **Compute from**: Database queries 545 545 * **Cache**: 1 hour in Redis 546 546 * **Why not store**: Can be derived, relatively cheap to compute 547 - 548 548 ==== User Reputation ==== 549 - 550 550 * **What**: Score based on contributions 551 551 * **Current design**: Stored in User table 552 552 * **Alternative**: Compute from Edit table ... ... 
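The dynamically computed items above share one cache-aside pattern: compute, cache with a TTL (15 minutes for search results, 1 hour for aggregations), and serve from cache until expiry. In this sketch a dict stands in for Redis; only the TTL values come from the text:

```python
# Cache-aside sketch for computed-dynamically data. A dict plays the role
# of Redis; keys, class, and method names are illustrative.
import time

class TTLCache:
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get_or_compute(self, key, ttl_seconds, compute):
        """Serve a cached value if unexpired, else compute and cache it."""
        entry = self._store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]
        value = compute()
        self._store[key] = (value, time.monotonic() + ttl_seconds)
        return value

cache = TTLCache()
compute_calls = [0]
def run_search():
    compute_calls[0] += 1
    return ["claim-1", "claim-2"]

# Search results: 15-minute TTL (aggregations would use 60 * 60).
results = cache.get_or_compute("search:climate", 15 * 60, run_search)
again = cache.get_or_compute("search:climate", 15 * 60, run_search)
```

Nothing is persisted permanently, which matches the rationale: these values are derivable and change constantly.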
@@ -556,43 +556,37 @@ 556 556 * **Frequency**: Read on every user action 557 557 * **Compute cost**: Simple COUNT query, milliseconds 558 558 * **Decision**: ✅ STORE - Performance critical, read-heavy 559 - 560 560 === Summary Table === 561 - 562 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\ 563 -|-----|-|-||----|-----|\\ 564 -| Claim core | ✅ | - | 1 KB | STORE | Essential |\\ 565 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\ 566 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\ 567 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\ 568 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\ 569 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\ 570 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\ 571 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\ 572 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\ 573 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\ 574 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\ 575 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\ 576 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\ 462 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale | 463 +|-----------|---------|---------|----------------|----------|-----------| 464 +| Claim core | ✅ | - | 1 KB | STORE | Essential | 465 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility | 466 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record | 467 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit | 468 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective | 469 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access | 470 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access | 471 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | 
Balance cost/speed | 472 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access | 473 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement | 474 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning | 475 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending | 476 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic | 577 577 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable | 578 -**Total storage per claim**: 18 KB (without edits and flags) 478 +**Total storage per claim**: ~18 KB (without edits and flags) 579 579 **For 1 million claims**: 580 - 581 -* **Storage**: 18 GB (manageable) 582 -* **PostgreSQL**: $50/month (standard instance) 583 -* **Redis cache**: $20/month (1 GB instance) 584 -* **S3 archives**: $5/month (old edits) 585 -* **Total**: $75/month infrastructure 480 +* **Storage**: ~18 GB (manageable) 481 +* **PostgreSQL**: ~$50/month (standard instance) 482 +* **Redis cache**: ~$20/month (1 GB instance) 483 +* **S3 archives**: ~$5/month (old edits) 484 +* **Total**: ~$75/month infrastructure 586 586 **LLM cost savings by caching**: 587 587 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims 588 588 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims 589 589 * Verdict stored: Save $0.003 per claim = $3K per 1M claims 590 -* **Total savings**: $35K per 1M claims vs recomputing every time 591 - 489 +* **Total savings**: ~$35K per 1M claims vs recomputing every time 592 592 === Recomputation Triggers === 593 - 594 594 **When to mark cached data as stale and recompute:** 595 - 596 596 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores) 597 597 2. **Evidence added** → Recompute: scenarios, verdict, confidence score 598 598 3. **Source track record changes >10 points** → Recompute: confidence score, verdict ... ... @@ -599,15 +599,11 @@ 599 599 4. 
**System improvement deployed** → Mark affected claims stale, recompute on next view 600 600 5. **Controversy detected** (high flag rate) → Recompute: risk score 601 601 **Recomputation strategy**: 602 - 603 603 * **Eager**: Immediately recompute (for user edits) 604 604 * **Lazy**: Recompute on next view (for system improvements) 605 605 * **Batch**: Nightly re-evaluation of stale claims (if <1000) 606 - 607 607 === Database Size Projection === 608 - 609 609 **Year 1**: 10K claims 610 - 611 611 * Storage: 180 MB 612 612 * Cost: $10/month 613 613 **Year 3**: 100K claims ... ... @@ -621,21 +621,15 @@ 621 621 * Cost: $300/month 622 622 * Optimization: Archive old claims to S3 ($5/TB/month) 623 623 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial. 624 - 625 625 == 3. Key Simplifications == 626 - 627 627 * **Two content states only**: Published, Hidden 628 628 * **Three user roles only**: Reader, Contributor, Moderator 629 629 * **No complex versioning**: Linear edit history 630 630 * **Reputation-based permissions**: Not role hierarchy 631 631 * **Source track records**: Continuous evaluation 632 - 633 633 == 3. What Gets Stored in the Database == 634 - 635 635 === 3.1 Primary Storage (PostgreSQL) === 636 - 637 637 **Claims Table**: 638 - 639 639 * Current state only (latest version) 640 640 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at 641 641 **Evidence Table**: ... ... 
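The five triggers in the Recomputation Triggers section above reduce to a small dispatch table mapping each event to the cached artifacts it invalidates. Trigger keys here are illustrative shorthand for the events named in the text:

```python
# Dispatch-table sketch of the recomputation triggers. Artifact names follow
# the text; trigger keys and the function name are illustrative.

RECOMPUTE_ON = {
    "claim_edited":       {"analysis", "scenarios", "verdict", "scores"},  # eager
    "evidence_added":     {"scenarios", "verdict", "confidence_score"},
    "source_score_shift": {"confidence_score", "verdict"},  # >10-point change
    "system_improvement": {"analysis", "scenarios", "verdict", "scores"},  # lazy
    "controversy":        {"risk_score"},  # high flag rate
}

def artifacts_to_recompute(trigger: str) -> set:
    """Return which cached artifacts to mark stale for a given trigger."""
    return RECOMPUTE_ON.get(trigger, set())
```

The eager/lazy/batch split then decides *when* each stale set is actually recomputed, per the strategy list above.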
@@ -662,44 +662,31 @@ 662 662 **QualityMetric Table**: 663 663 * Time-series quality data 664 664 * Fields: id, metric_type, metric_category, value, target, timestamp 665 - 666 666 === 3.2 What's NOT Stored (Computed on-the-fly) === 667 - 668 668 * **Verdicts**: Synthesized from evidence + scenarios when requested 669 669 * **Risk scores**: Recalculated based on current factors 670 670 * **Aggregated statistics**: Computed from base data 671 671 * **Search results**: Generated from Elasticsearch index 672 - 673 673 === 3.3 Cache Layer (Redis) === 674 - 675 675 **Cached for performance**: 676 - 677 677 * Frequently accessed claims (TTL: 1 hour) 678 678 * Search results (TTL: 15 minutes) 679 679 * User sessions (TTL: 24 hours) 680 680 * Source track records (TTL: 1 hour) 681 - 682 682 === 3.4 File Storage (S3) === 683 - 684 684 **Archived content**: 685 - 686 686 * Old edit history (>3 months) 687 687 * Evidence documents (archived copies) 688 688 * Database backups 689 689 * Export files 690 - 691 691 === 3.5 Search Index (Elasticsearch) === 692 - 693 693 **Indexed for search**: 694 - 695 695 * Claim assertions (full-text) 696 696 * Evidence excerpts (full-text) 697 697 * Scenario descriptions (full-text) 698 698 * Source names (autocomplete) 699 699 Synchronized from PostgreSQL via change data capture or periodic sync. 700 - 701 701 == 4. Related Pages == 702 - 703 -* [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]] 576 +* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]] 704 704 * [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]] 705 705 * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]