Changes for page Data Model
Last modified by Robert Schaub on 2026/02/08 08:32
Page properties (1 modified, 0 added, 0 removed)
... ...
@@ -17,26 +17,32 @@
17 17 {{/warning}}
18 18
19 19 FactHarbor's data model is **simple, focused, designed for automated processing**.
20 +
20 20 == 1. Core Entities ==
22 +
21 21 === 1.1 Claim ===
24 +
22 22 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
26 +
23 23 ==== Performance Optimization: Denormalized Fields ====
28 +
24 24 **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
25 25 **Additional cached fields in claims table**:
31 +
26 26 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
27 - * Avoids joining evidence table for listing/preview
28 - * Updated when evidence is added/removed
29 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
33 +* Avoids joining evidence table for listing/preview
34 +* Updated when evidence is added/removed
35 +* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
30 30 * **source_names** (TEXT[]): Array of source names for quick display
31 - * Avoids joining through evidence to sources
32 - * Updated when sources change
33 - * Format: `["New York Times", "Nature Journal", ...]`
37 +* Avoids joining through evidence to sources
38 +* Updated when sources change
39 +* Format: `["New York Times", "Nature Journal", ...]`
34 34 * **scenario_count** (INTEGER): Number of scenarios for this claim
35 - * Quick metric without counting rows
36 - * Updated when scenarios added/removed
41 +* Quick metric without counting rows
42 +* Updated when scenarios added/removed
37 37 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
38 - * Helps invalidate stale caches
39 - * Triggers background refresh if too old
44 +* Helps invalidate stale caches
45 +* Triggers background refresh if too old
40 40 **Update Strategy**:
41 41 * **Immediate**: Update on claim edit (user-facing)
42 42 * **Deferred**: Update via background job every hour (non-critical)
... ...
@@ -45,13 +45,18 @@
45 45 * ✅ 70% fewer joins on common queries
46 46 * ✅ Much faster claim list/search pages
47 47 * ✅ Better user experience
48 -* ⚠️ Small storage increase ( ~10%)
54 +* ⚠️ Small storage increase (10%)
49 49 * ⚠️ Need to keep caches in sync
56 +
50 50 === 1.2 Evidence ===
58 +
51 51 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
60 +
52 52 === 1.3 Source ===
62 +
53 53 **Purpose**: Track reliability of information sources over time
54 54 **Fields**:
65 +
55 55 * **id** (UUID): Unique identifier
56 56 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
57 57 * **domain** (text): Website domain (e.g., "nytimes.com")
... ...
@@ -74,17 +74,21 @@
74 74 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
75 75 Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
76 76 **Key**: Automated source reliability tracking
88 +
77 77 ==== Source Scoring Process (Separation of Concerns) ====
90 +
78 78 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
79 79 **The Problem**:
93 +
80 80 * Source scores should influence claim verdicts
81 81 * Claim verdicts should update source scores
82 82 * But: Direct feedback creates circular dependency and potential feedback loops
83 83 **The Solution**: Temporal separation
98 +
84 84 ==== Weekly Background Job (Source Scoring) ====
100 +
85 85 Runs independently of claim analysis:
86 -{{code language="python"}}
87 -def update_source_scores_weekly():
102 +{{code language="python"}}def update_source_scores_weekly():
88 88 """
89 89 Background job: Calculate source reliability
90 90 Never triggered by individual claim analysis
... ...
@@ -104,12 +104,12 @@
104 104 source.last_updated = now()
105 105 source.save()
106 106 # Job runs: Sunday 2 AM UTC
107 - # Never during claim processing
108 - {{/code}}
122 + # Never during claim processing{{/code}}
123 +
109 109 ==== Real-Time Claim Analysis (AKEL) ====
125 +
110 110 Uses source scores but never updates them:
111 -{{code language="python"}}
112 -def analyze_claim(claim_text):
127 +{{code language="python"}}def analyze_claim(claim_text):
113 113 """
114 114 Real-time: Analyze claim using current source scores
115 115 READ source scores, never UPDATE them
... ...
@@ -126,10 +126,12 @@
126 126 verdict = synthesize_verdict(evidence_list)
127 127 # NEVER update source scores here
128 128 # That happens in weekly background job
129 - return verdict
130 - {{/code}}
144 + return verdict{{/code}}
145 +
131 131 ==== Monthly Audit (Quality Assurance) ====
147 +
132 132 Moderator review of flagged source scores:
149 +
133 133 * Verify scores make sense
134 134 * Detect gaming attempts
135 135 * Identify systematic biases
... ...
@@ -169,6 +169,7 @@
169 169 → NYT score: 0.89 (trending up)
170 170 → Blog X score: 0.48 (trending down)
171 171 ```
189 +
172 172 === 1.4 Scenario ===
173 173
174 174 {{warning}}
... ...
@@ -179,6 +179,7 @@
179 179 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
180 180 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
181 181 **Fields**:
200 +
182 182 * **id** (UUID): Unique identifier
183 183 * **claim_id** (UUID): Foreign key to claim (one-to-many)
184 184 * **description** (text): Human-readable description of the scenario
... ...
@@ -199,6 +199,7 @@
199 199 **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
200 200
201 201 **Core Fields**:
221 +
202 202 * **id** (UUID): Primary key
203 203 * **scenario_id** (UUID FK): The scenario being assessed
204 204 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
... ...
@@ -214,6 +214,7 @@
214 214
215 215 **Example**:
216 216 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
237 +
217 217 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
218 218 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
219 219 * Edit entity records the complete before/after change with timestamp and reason
... ...
@@ -221,12 +221,18 @@
221 221 **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
222 222
223 223 === 1.6 User ===
245 +
224 224 Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
225 -=== User Reputation System ==
247 +
248 +=== User Reputation System ===
249 +
226 226 **V1.0 Approach**: Simple manual role assignment
227 227 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
252 +
228 228 === Roles (Manual Assignment) ===
254 +
229 229 **reader** (default):
256 +
230 230 * View published claims and evidence
231 231 * Browse and search content
232 232 * No editing permissions
... ...
@@ -245,8 +245,11 @@
245 245 * System configuration
246 246 * Access to all features
247 247 * Founder-appointed initially
275 +
248 248 === Contribution Tracking (Simple) ===
277 +
249 249 **Basic metrics only**:
279 +
250 250 * `contributions_count`: Total number of contributions
251 251 * `created_at`: Account age
252 252 * `last_active`: Recent activity
... ...
@@ -255,19 +255,26 @@
255 255 * No automated privilege escalation
256 256 * No reputation decay
257 257 * No threshold-based promotions
288 +
258 258 === Promotion Process ===
290 +
259 259 **Manual review by moderators/admins**:
292 +
260 260 1. User demonstrates value through contributions
261 261 2. Moderator reviews user's contribution history
262 262 3. Moderator promotes user to contributor role
263 263 4. Admin promotes trusted contributors to moderator
264 264 **Criteria** (guidelines, not automated):
298 +
265 265 * Quality of contributions
266 266 * Consistency over time
267 267 * Collaborative behavior
268 268 * Understanding of project goals
303 +
269 269 === V2.0+ Evolution ===
305 +
270 270 **Add complex reputation when**:
307 +
271 271 * 100+ active contributors
272 272 * Manual role management becomes bottleneck
273 273 * Clear patterns of abuse emerge requiring automation
... ...
@@ -277,11 +277,16 @@
277 277 * Reputation decay for inactive users
278 278 * Track record scoring for contributors
279 279 See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
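The manual promotion steps in this hunk (a moderator promotes a user to contributor; an admin promotes trusted contributors to moderator) can be sketched as a small permission check. This is only an illustrative sketch, not part of the page: the `Role` enum, its ordering, and `can_promote` are hypothetical names, and allowing admins to perform the reader-to-contributor step is an assumption drawn from "Access to all features".

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles named on the page; the numeric ordering is an assumption."""
    READER = 0
    CONTRIBUTOR = 1
    MODERATOR = 2
    ADMIN = 3


def can_promote(actor: Role, current: Role, new: Role) -> bool:
    """Return True if `actor` may move a user from `current` to `new`.

    Only the two promotion steps listed in the Promotion Process are allowed;
    there is no automated escalation and no skipping of levels.
    """
    if current == Role.READER and new == Role.CONTRIBUTOR:
        # Step 3: a moderator (assumed: or an admin) promotes to contributor.
        return actor >= Role.MODERATOR
    if current == Role.CONTRIBUTOR and new == Role.MODERATOR:
        # Step 4: only an admin promotes a contributor to moderator.
        return actor == Role.ADMIN
    return False
```

A check like this would sit behind whatever manual action a moderator or admin takes; the decision itself stays with the human reviewer, consistent with the "guidelines, not automated" criteria.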
317 +
280 280 === 1.7 Edit ===
319 +
281 281 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
282 282 **Purpose**: Complete audit trail for all content changes
322 +
283 283 === Edit History Details ===
324 +
284 284 **What Gets Edited**:
326 +
285 285 * **Claims** (20% edited): assertion, domain, status, scores, analysis
286 286 * **Evidence** (10% edited): excerpt, relevance_score, supports
287 287 * **Scenarios** (5% edited): description, assumptions, confidence
... ...
@@ -298,12 +298,14 @@
298 298 * `MODERATION_ACTION`: Hide/unhide for abuse
299 299 * `REVERT`: Rollback to previous version
300 300 **Retention Policy** (5 years total):
343 +
301 301 1. **Hot storage** (3 months): PostgreSQL, instant access
302 302 2. **Warm storage** (2 years): Partitioned, slower queries
303 303 3. **Cold storage** (3 years): S3 compressed, download required
304 304 4. **Deletion**: After 5 years (except legal holds)
305 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
348 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
306 306 **Use Cases**:
350 +
307 307 * View claim history timeline
308 308 * Detect vandalism patterns
309 309 * Learn from user corrections (system improvement)
... ...
@@ -310,12 +310,17 @@
310 310 * Legal compliance (audit trail)
311 311 * Rollback capability
312 312 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
357 +
313 313 === 1.8 Flag ===
359 +
314 314 Fields: entity_id, reported_by, issue_type, status, resolution_note
361 +
315 315 === 1.9 QualityMetric ===
363 +
316 316 **Fields**: metric_type, category, value, target, timestamp
317 317 **Purpose**: Time-series quality tracking
318 318 **Usage**:
367 +
319 319 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
320 320 * **Quality dashboard**: Real-time display with trend charts
321 321 * **Alerting**: Automatic alerts when metrics exceed thresholds
... ...
@@ -322,10 +322,13 @@
322 322 * **A/B testing**: Compare control vs treatment metrics
323 323 * **Improvement validation**: Measure before/after changes
324 324 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
374 +
325 325 === 1.10 ErrorPattern ===
376 +
326 326 **Fields**: error_category, claim_id, description, root_cause, frequency, status
327 327 **Purpose**: Capture errors to trigger system improvements
328 328 **Usage**:
380 +
329 329 * **Error capture**: When users flag issues or system detects problems
330 330 * **Pattern analysis**: Weekly grouping by category and frequency
331 331 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ...
@@ -337,9 +337,13 @@
337 337 {{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
338 338
339 339 == 1.5 User Class Diagram ==
392 +
340 340 {{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
394 +
341 341 == 2. Versioning Strategy ==
396 +
342 342 **All Content Entities Are Versioned**:
398 +
343 343 * **Claim**: Every edit creates new version (V1→V2→V3...)
344 344 * **Evidence**: Changes tracked in edit history
345 345 * **Scenario**: Modifications versioned
... ...
@@ -360,68 +360,91 @@
360 360 Claim V2: "The sky is blue during daytime"
361 361 → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
362 362 ```
419 +
363 363 == 2.5. Storage vs Computation Strategy ==
421 +
364 364 **Critical architectural decision**: What to persist in databases vs compute dynamically?
365 365 **Trade-off**:
424 +
366 366 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
367 367 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
427 +
368 368 === Recommendation: Hybrid Approach ===
429 +
369 369 **STORE (in PostgreSQL):**
431 +
370 370 ==== Claims (Current State + History) ====
433 +
371 371 * **What**: assertion, domain, status, created_at, updated_at, version
372 372 * **Why**: Core entity, must be persistent
373 373 * **Also store**: confidence_score (computed once, then cached)
374 -* **Size**: ~1 KB per claim
437 +* **Size**: 1 KB per claim
375 375 * **Growth**: Linear with claims
376 376 * **Decision**: ✅ STORE - Essential
440 +
377 377 ==== Evidence (All Records) ====
442 +
378 378 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
379 379 * **Why**: Hard to re-gather, user contributions, reproducibility
380 -* **Size**: ~2 KB per evidence (with excerpt)
445 +* **Size**: 2 KB per evidence (with excerpt)
381 381 * **Growth**: 3-10 evidence per claim
382 382 * **Decision**: ✅ STORE - Essential for reproducibility
448 +
383 383 ==== Sources (Track Records) ====
450 +
384 384 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
385 385 * **Why**: Continuously updated, expensive to recompute
386 -* **Size**: ~500 bytes per source
453 +* **Size**: 500 bytes per source
387 387 * **Growth**: Slow (limited number of sources)
388 388 * **Decision**: ✅ STORE - Essential for quality
456 +
389 389 ==== Edit History (All Versions) ====
458 +
390 390 * **What**: before_state, after_state, user_id, reason, timestamp
391 391 * **Why**: Audit trail, legal requirement, reproducibility
392 -* **Size**: ~2 KB per edit
393 -* **Growth**: Linear with edits ( ~A portion of claims get edited)
461 +* **Size**: 2 KB per edit
462 +* **Growth**: Linear with edits (A portion of claims get edited)
394 394 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
395 395 * **Decision**: ✅ STORE - Essential for accountability
465 +
396 396 ==== Flags (User Reports) ====
467 +
397 397 * **What**: entity_id, reported_by, issue_type, description, status
398 398 * **Why**: Error detection, system improvement triggers
399 -* **Size**: ~500 bytes per flag
470 +* **Size**: 500 bytes per flag
400 400 * **Growth**: 5-high percentage of claims get flagged
401 401 * **Decision**: ✅ STORE - Essential for improvement
473 +
402 402 ==== ErrorPatterns (System Improvement) ====
475 +
403 403 * **What**: error_category, claim_id, description, root_cause, frequency, status
404 404 * **Why**: Learning loop, prevent recurring errors
405 -* **Size**: ~1 KB per pattern
478 +* **Size**: 1 KB per pattern
406 406 * **Growth**: Slow (limited patterns, many fixed)
407 407 * **Decision**: ✅ STORE - Essential for learning
481 +
408 408 ==== QualityMetrics (Time Series) ====
483 +
409 409 * **What**: metric_type, category, value, target, timestamp
410 410 * **Why**: Trend analysis, cannot recreate historical metrics
411 -* **Size**: ~200 bytes per metric
486 +* **Size**: 200 bytes per metric
412 412 * **Growth**: Hourly = 8,760 per year per metric type
413 413 * **Retention**: 2 years hot, then aggregate and archive
414 414 * **Decision**: ✅ STORE - Essential for monitoring
415 415 **STORE (Computed Once, Then Cached):**
491 +
416 416 ==== Analysis Summary ====
493 +
417 417 * **What**: Neutral text summary of claim analysis (200-500 words)
418 418 * **Computed**: Once by AKEL when claim first analyzed
419 419 * **Stored in**: Claim table (text field)
420 420 * **Recomputed**: Only when system significantly improves OR claim edited
421 421 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
422 -* **Size**: ~2 KB per claim
499 +* **Size**: 2 KB per claim
423 423 * **Decision**: ✅ STORE (cached) - Cost-effective
501 +
424 424 ==== Confidence Score ====
503 +
425 425 * **What**: 0-100 score of analysis confidence
426 426 * **Computed**: Once by AKEL
427 427 * **Stored in**: Claim table (integer field)
... ...
@@ -429,7 +429,9 @@
429 429 * **Why store**: Cheap to store, expensive to compute, users need it fast
430 430 * **Size**: 4 bytes per claim
431 431 * **Decision**: ✅ STORE (cached) - Performance critical
511 +
432 432 ==== Risk Score ====
513 +
433 433 * **What**: 0-100 score of claim risk level
434 434 * **Computed**: Once by AKEL
435 435 * **Stored in**: Claim table (integer field)
... ...
@@ -438,13 +438,17 @@
438 438 * **Size**: 4 bytes per claim
439 439 * **Decision**: ✅ STORE (cached) - Performance critical
440 440 **COMPUTE DYNAMICALLY (Never Store):**
441 -==== Scenarios ==== ⚠️ CRITICAL DECISION
522 +
523 +==== Scenarios ====
524 +
525 + ⚠️ CRITICAL DECISION
526 +
442 442 * **What**: 2-5 possible interpretations of claim with assumptions
443 443 * **Current design**: Stored in Scenario table
444 444 * **Alternative**: Compute on-demand when user views claim details
445 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average =~3 KB per claim
530 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
446 446 * **Compute cost**: $0.005-0.01 per request (LLM API call)
447 -* **Frequency**: Viewed in detail by ~20% of users
532 +* **Frequency**: Viewed in detail by 20% of users
448 448 * **Trade-off analysis**:
449 449 - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
450 450 - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ...
@@ -452,12 +452,17 @@
452 452 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
453 453 * **Decision**: ✅ STORE (hybrid approach below)
454 454 **Scenario Strategy** (APPROVED):
540 +
455 455 1. **Store scenarios** initially when claim analyzed
456 456 2. **Mark as stale** when system improves significantly
457 457 3. **Recompute on next view** if marked stale
458 458 4. **Cache for 30 days** if frequently accessed
459 459 5. **Result**: Best of both worlds - speed + freshness
460 -==== Verdict Synthesis ====
546 +
547 +==== Verdict Synthesis ====
548 +
549 +
550 +
461 461 * **What**: Final conclusion text synthesizing all scenarios
462 462 * **Compute cost**: $0.002-0.005 per request
463 463 * **Frequency**: Every time claim viewed
... ...
@@ -465,17 +465,23 @@
465 465 * **Speed**: 2-3 seconds (acceptable)
466 466 **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
467 467 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
558 +
468 468 ==== Search Results ====
560 +
469 469 * **What**: Lists of claims matching search query
470 470 * **Compute from**: Elasticsearch index
471 471 * **Cache**: 15 minutes in Redis for popular queries
472 472 * **Why not store permanently**: Constantly changing, infinite possible queries
565 +
473 473 ==== Aggregated Statistics ====
567 +
474 474 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
475 475 * **Compute from**: Database queries
476 476 * **Cache**: 1 hour in Redis
477 477 * **Why not store**: Can be derived, relatively cheap to compute
572 +
478 478 ==== User Reputation ====
574 +
479 479 * **What**: Score based on contributions
480 480 * **Current design**: Stored in User table
481 481 * **Alternative**: Compute from Edit table
... ...
@@ -485,37 +485,43 @@
485 485 * **Frequency**: Read on every user action
486 486 * **Compute cost**: Simple COUNT query, milliseconds
487 487 * **Decision**: ✅ STORE - Performance critical, read-heavy
584 +
488 488 === Summary Table ===
489 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
490 -|-----------|---------|---------|----------------|----------|-----------|
491 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
492 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
493 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
494 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
495 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
496 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
497 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
498 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
499 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
500 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
501 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
502 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
503 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
586 +
587 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\
588 +|-----|-|-||----|-----|\\
589 +| Claim core | ✅ | - | 1 KB | STORE | Essential |\\
590 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\
591 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\
592 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\
593 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\
594 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
595 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
596 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\
597 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\
598 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\
599 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\
600 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\
601 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\
504 504 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
505 -**Total storage per claim**: ~18 KB (without edits and flags)
603 +**Total storage per claim**: 18 KB (without edits and flags)
506 506 **For 1 million claims**:
507 -* **Storage**: ~18 GB (manageable)
508 -* **PostgreSQL**: ~$50/month (standard instance)
509 -* **Redis cache**: ~$20/month (1 GB instance)
510 -* **S3 archives**: ~$5/month (old edits)
511 -* **Total**: ~$75/month infrastructure
605 +
606 +* **Storage**: 18 GB (manageable)
607 +* **PostgreSQL**: $50/month (standard instance)
608 +* **Redis cache**: $20/month (1 GB instance)
609 +* **S3 archives**: $5/month (old edits)
610 +* **Total**: $75/month infrastructure
512 512 **LLM cost savings by caching**:
513 513 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
514 514 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
515 515 * Verdict stored: Save $0.003 per claim = $3K per 1M claims
516 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
615 +* **Total savings**: $35K per 1M claims vs recomputing every time
616 +
517 517 === Recomputation Triggers ===
618 +
518 518 **When to mark cached data as stale and recompute:**
620 +
519 519 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
520 520 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
521 521 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ...
@@ -522,11 +522,15 @@
522 522 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
523 523 5. **Controversy detected** (high flag rate) → Recompute: risk score
524 524 **Recomputation strategy**:
627 +
525 525 * **Eager**: Immediately recompute (for user edits)
526 526 * **Lazy**: Recompute on next view (for system improvements)
527 527 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
631 +
528 528 === Database Size Projection ===
633 +
529 529 **Year 1**: 10K claims
635 +
530 530 * Storage: 180 MB
531 531 * Cost: $10/month
532 532 **Year 3**: 100K claims
... ...
@@ -540,15 +540,21 @@
540 540 * Cost: $300/month
541 541 * Optimization: Archive old claims to S3 ($5/TB/month)
542 542 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
649 +
543 543 == 3. Key Simplifications ==
651 +
544 544 * **Two content states only**: Published, Hidden
545 545 * **Three user roles only**: Reader, Contributor, Moderator
546 546 * **No complex versioning**: Linear edit history
547 547 * **Reputation-based permissions**: Not role hierarchy
548 548 * **Source track records**: Continuous evaluation
657 +
549 549 == 3. What Gets Stored in the Database ==
659 +
550 550 === 3.1 Primary Storage (PostgreSQL) ===
661 +
551 551 **Claims Table**:
663 +
552 552 * Current state only (latest version)
553 553 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
554 554 **Evidence Table**:
... ...
@@ -575,11 +575,14 @@
575 575 **QualityMetric Table**:
576 576 * Time-series quality data
577 577 * Fields: id, metric_type, metric_category, value, target, timestamp
690 +
578 578 === 3.2 What's NOT Stored (Computed on-the-fly) ===
692 +
579 579 * **Verdicts**: Synthesized from evidence + scenarios when requested
580 580 * **Risk scores**: Recalculated based on current factors
581 581 * **Aggregated statistics**: Computed from base data
582 582 * **Search results**: Generated from Elasticsearch index
697 +
583 583 === 3.3 Cache Layer (Redis) ===
584 584
585 585 {{warning}}
... ...
@@ -587,24 +587,33 @@
587 587 {{/warning}}
588 588
589 589 **Cached for performance (Planned)**:
705 +
590 590 * Frequently accessed claims (TTL: 1 hour)
591 591 * Search results (TTL: 15 minutes)
592 592 * User sessions (TTL: 24 hours)
593 593 * Source track records (TTL: 1 hour)
710 +
594 594 === 3.4 File Storage (S3) ===
712 +
595 595 **Archived content**:
714 +
596 596 * Old edit history (>3 months)
597 597 * Evidence documents (archived copies)
598 598 * Database backups
599 599 * Export files
719 +
600 600 === 3.5 Search Index (Elasticsearch) ===
721 +
601 601 **Indexed for search**:
723 +
602 602 * Claim assertions (full-text)
603 603 * Evidence excerpts (full-text)
604 604 * Scenario descriptions (full-text)
605 605 * Source names (autocomplete)
606 606 Synchronized from PostgreSQL via change data capture or periodic sync.
729 +
607 607 == 4. Related Pages ==
608 -* [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
731 +
732 +* [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
609 609 * [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
610 610 * [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]