Changes for page Data Model
Last modified by Robert Schaub on 2025/12/24 21:46
Summary

Page properties (1 modified, 0 added, 0 removed)

Details

Content
= Data Model =

FactHarbor's data model is **simple, focused, and designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is roughly 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by ~70%.

**Additional cached fields in claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
  * Avoids joining the evidence table for listing/preview
  * Updated when evidence is added/removed
  * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
  * Avoids joining through evidence to sources
  * Updated when sources change
  * Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
  * Quick metric without counting rows
  * Updated when scenarios are added/removed
* **cache_updated_at** (TIMESTAMP): When the denormalized data was last refreshed
  * Helps invalidate stale caches
  * Triggers a background refresh if too old

**Update Strategy**:

* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)

...

* ✅ ~70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Need to keep caches in sync

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track the reliability of information sources over time

**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")

...

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage

Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**

**Key**: Automated source reliability tracking

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.

**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: direct feedback creates a circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:

{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    ...
    source.last_updated = now()
    source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:

{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    ...
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:

* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases

...

→ NYT score: 0.89 (trending up)
→ Blog X score: 0.48 (trending down)

=== 1.4 Scenario ===

**Purpose**: Different interpretations or contexts for evaluating claims

**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
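The extraction step can be sketched in a few lines of Python. This is illustrative only: the `extract_scenario` helper and the shape of the evidence dictionary are hypothetical, not the production AKEL pipeline, but the output fields follow the Scenario schema listed below.

```python
def extract_scenario(evidence: dict) -> dict:
    """Derive a scenario record from one piece of evidence (hypothetical helper)."""
    return {
        "claim_id": evidence["claim_id"],
        # Human-readable context the evidence was gathered in
        "description": evidence.get("setting", "Unspecified context"),
        # Stored as JSONB in the Scenario table
        "assumptions": evidence.get("context", {}),
        # Provenance: which evidence this scenario came from
        "extracted_from": evidence["id"],
    }

evidence = {
    "id": "ev-1",
    "claim_id": "c-1",
    "setting": "Clinical trials (healthy adults, original strain)",
    "context": {"population": "healthy adults", "variant": "original"},
}
scenario = extract_scenario(evidence)
print(scenario["extracted_from"])  # ev-1
```

Because the scenario carries an `extracted_from` reference, every interpretation remains traceable back to the evidence that justified it.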
**Relationship**: One-to-many with Claims (**simplified for V1.0**: a scenario belongs to a single claim)

**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to the evidence that this scenario was extracted from
* **created_at** (timestamp): When the scenario was created
* **updated_at** (timestamp): Last modification

**How Found**: Evidence search → Extract context → Create scenario → Link to claim

...

* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from specialist study

**V2.0 Evolution**: A many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
**Core Fields**:

* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When the verdict was created
* **updated_at** (timestamp): Last modification

**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.

**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.

**Example**:
For the claim "Exercise improves mental health" in the scenario "Clinical trials (healthy adults, structured programs)":

* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* The Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
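The update-plus-audit flow above can be sketched as follows. This is a minimal illustration using plain dictionaries for rows; `update_verdict` and the in-memory `edits` list are hypothetical stand-ins for the production API, but the recorded fields mirror the Edit schema in section 1.7.

```python
import json
from datetime import datetime, timezone

edits = []  # stand-in for the Edit table

def update_verdict(verdict: dict, changes: dict, user_id: str, reason: str) -> dict:
    """Apply changes to a verdict and log before/after states in the audit trail."""
    before = dict(verdict)
    after = {**verdict, **changes,
             "updated_at": datetime.now(timezone.utc).isoformat()}
    edits.append({
        "entity_type": "Verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,
        "before_state": json.dumps(before),
        "after_state": json.dumps(after),
        "edit_type": "CONTENT_EDIT",
        "reason": reason,
    })
    return after

verdict = {
    "id": "v-1", "scenario_id": "s-1",
    "likelihood_range": "0.40-0.65 (uncertain)",
    "uncertainty_factors": ["Small sample sizes", "Short-term studies only"],
}
# New evidence arrives: mutate the verdict, audit the change
verdict = update_verdict(
    verdict,
    {"likelihood_range": "0.70-0.85 (likely true)",
     "uncertainty_factors": ["Lifestyle confounds remain"]},
    user_id="u-42", reason="New RCT evidence added",
)
print(len(edits))  # 1
```

Note that the verdict row itself holds only the current state; history lives entirely in the Edit records, which is what keeps versioning uniform across entities.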
=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

=== User Reputation System ===

**V1.0 Approach**: Simple manual role assignment

**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple; add complexity when metrics prove it necessary.

=== Roles (Manual Assignment) ===

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

...

* System configuration
* Access to all features
* Founder-appointed initially

=== Contribution Tracking (Simple) ===

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

...

* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

=== Promotion Process ===

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews the user's contribution history
3. Moderator promotes the user to the contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

=== V2.0+ Evolution ===

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes a bottleneck
* Clear patterns of abuse emerge requiring automation
...

* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at

**Purpose**: Complete audit trail for all content changes

=== Edit History Details ===

**What Gets Edited**:

* **Claims** (~20% edited): assertion, domain, status, scores, analysis
* **Evidence** (~10% edited): excerpt, relevance_score, supports
* **Scenarios** (~5% edited): description, assumptions, confidence

...

* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to a previous version

**Retention Policy** (5 years total):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)

**Use Cases**:

* View a claim's history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, the retention policy, and use cases.

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp

**Purpose**: Time-series quality tracking

**Usage**:

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs. treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status

**Purpose**: Capture errors to trigger system improvements

**Usage**:

* **Error capture**: When users flag issues or the system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error-rate reduction over time

**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.11 Core Data Model ERD ==

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.12 User Class Diagram ==
{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates a new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

...

Claim V2: "The sky is blue during daytime"
→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}

== 2.5. Storage vs Computation Strategy ==

**Critical architectural decision**: What should be persisted in databases vs. computed dynamically?

**Trade-off**:

* **Store more**: Better reproducibility, faster, lower LLM costs | higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: ~1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: ~2 KB per evidence record (with excerpt)
* **Growth**: 3-10 evidence records per claim
* **Decision**: ✅ STORE - Essential for reproducibility
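As a sanity check on the trade-off analysis, the per-claim sizes quoted in this section can be added up directly; the averages for shared and optional data (source share, edit history, flags) are assumptions taken from the summary table later in this section.

```python
# Per-claim storage estimate in KB, using the sizes quoted in this section.
sizes_kb = {
    "claim_core": 1.0,        # ~1 KB claim row
    "evidence": 2.0 * 5,      # ~2 KB per evidence record, ~5 per claim
    "source_share": 0.5,      # ~500 B per source, amortized across claims
    "edit_history_avg": 0.4,  # 2 KB per edit x ~20% of claims edited
    "analysis_summary": 2.0,  # cached AKEL summary
    "scenarios": 1.0 * 3,     # ~1 KB per scenario, ~3 per claim
    "verdict": 1.0,           # cached verdict text
    "flags_avg": 0.05,        # 500 B per flag x ~10% flag rate
}
per_claim_kb = sum(sizes_kb.values())
print(round(per_claim_kb))  # 18
```

The rounded total matches the ~18 KB per-claim figure used for the 1M-claim projection, so the headline number is consistent with the individual estimates.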
==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: ~500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: ~2 KB per edit
* **Growth**: Linear with edits (about 20% of claims get edited)
* **Retention**: Hot storage 3 months → warm storage 2 years → archive to S3 for 3 years → delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: ~500 bytes per flag
* **Growth**: Roughly 5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: ~1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis; historical metrics cannot be recreated
* **Size**: ~200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of the claim analysis (200-500 words)
* **Computed**: Once by AKEL when the claim is first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when the system significantly improves OR the claim is edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: ~2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)

...
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)

...

* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios ====

⚠️ CRITICAL DECISION

* **What**: 2-5 possible interpretations of the claim with assumptions
* **Current design**: Stored in the Scenario table
* **Alternative**: Compute on demand when a user views claim details
* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by ~20% of users
* **Trade-off analysis**:
  - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
  - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs

...

* **Speed**: Computed = 5-8 seconds delay, stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED):

1. **Store scenarios** initially when the claim is analyzed
2. **Mark as stale** when the system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time the claim is viewed

...
* **Speed**: 2-3 seconds (acceptable)

**Alternative**: Store the "last verdict" as a cached field; recompute only if the claim is edited or marked stale

* **Recommendation**: ✅ STORE the cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching a search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in the User table
* **Alternative**: Compute from the Edit table

...
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
|-----------|---------|---------|----------------|----------|-----------|
| Claim core | ✅ | - | 1 KB | STORE | Essential |
| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
| Search results | - | ✅ | - | COMPUTE + 15 min cache | Dynamic |
| Aggregations | - | ✅ | - | COMPUTE + 1 hr cache | Derivable |

**Total storage per claim**: ~18 KB (without edits and flags)

**For 1 million claims**:

* **Storage**: ~18 GB (manageable)
* **PostgreSQL**: ~$50/month (standard instance)
* **Redis cache**: ~$20/month (1 GB instance)
* **S3 archives**: ~$5/month (old edits)
* **Total**: ~$75/month infrastructure

**LLM cost savings from caching**:

* Analysis summary stored: saves $0.03 per claim = $30K per 1M claims
* Scenarios stored: saves $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: saves $0.003 per claim = $3K per 1M claims
* **Total savings**: ~$35K per 1M claims vs. recomputing every time

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute:**

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:

* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)

=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims

...

* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable; LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not a role hierarchy
* **Source track records**: Continuous evaluation

== 4. What Gets Stored in the Database ==

=== 4.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:

...

**QualityMetric Table**:

* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

=== 4.2 What's NOT Stored (Computed on-the-fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested
* **Risk scores**: Recalculated based on current factors
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from the Elasticsearch index

=== 4.3 Cache Layer (Redis) ===

**Cached for performance**:

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)

=== 4.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

=== 4.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync.

== 5. Related Pages ==

* [[Architecture>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]