Changes for page Data Model
Last modified by Robert Schaub on 2025/12/24 21:46
Summary

Page properties (1 modified, 0 added, 0 removed)

Details

* Page properties
* Content
= Data Model =

FactHarbor's data model is **simple, focused, and designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is ~95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by ~70%.
**Additional cached fields in claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining the evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios are added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers a background refresh if too old

**Update Strategy**:

* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)
* ✅ ~70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Caches must be kept in sync

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track the reliability of information sources over time
**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")

...

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage.
Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
**Key**: Automated source reliability tracking

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates a circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:

{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability.
    Never triggered by individual claim analysis.
    """
    ...
    source.last_updated = now()
    source.save()
    # Job runs: Sunday 2 AM UTC
    # Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:

{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    ...
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:

* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases

...

```
→ NYT score: 0.89 (trending up)
→ Blog X score: 0.48 (trending down)
```

=== 1.4 Scenario ===

**Purpose**: Different interpretations or contexts for evaluating claims
**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
**Relationship**: One-to-many with Claims (**simplified for V1.0**: a scenario belongs to a single claim)
**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to the evidence this scenario was extracted from
* **created_at** (timestamp): When the scenario was created
* **updated_at** (timestamp): Last modification

**How Found**: Evidence search → Extract context → Create scenario → Link to claim

...

* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from specialist study

**V2.0 Evolution**: A many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
**Core Fields**:

* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When the verdict was created
* **updated_at** (timestamp): Last modification

**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.

**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.

**Example**:
For the claim "Exercise improves mental health" in the scenario "Clinical trials (healthy adults, structured programs)":

* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* The Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
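To make the audit model concrete, here is a minimal sketch of how a verdict update could be recorded through the Edit entity. The names (`Edit`, `record_verdict_update`, the `CONTENT_EDIT` edit type) are illustrative assumptions, not the actual FactHarbor API; the field layout follows the Edit entity described in section 1.7.

```python
# Hypothetical sketch: recording a verdict update as an Edit audit row.
# Edit / record_verdict_update / CONTENT_EDIT are illustrative names only.
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Edit:
    entity_type: str
    entity_id: str
    user_id: str
    before_state: str  # JSON snapshot before the change
    after_state: str   # JSON snapshot after the change
    edit_type: str
    reason: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_verdict_update(verdict_id, user_id, before, after, reason):
    """Snapshot before/after states so the change is fully auditable."""
    return Edit(
        entity_type="verdict",
        entity_id=verdict_id,
        user_id=user_id,
        before_state=json.dumps(before),
        after_state=json.dumps(after),
        edit_type="CONTENT_EDIT",
        reason=reason,
    )


edit = record_verdict_update(
    "verdict-123",
    "user-42",
    before={"likelihood_range": "0.40-0.65 (uncertain)",
            "uncertainty_factors": ["Small sample sizes", "Short-term studies only"]},
    after={"likelihood_range": "0.70-0.85 (likely true)",
           "uncertainty_factors": ["Lifestyle confounds remain"]},
    reason="New evidence: large longitudinal study added",
)
print(edit.entity_type, json.loads(edit.after_state)["likelihood_range"])
```

Because the full before/after snapshots live in the Edit row, rollback and history timelines need no extra version tables.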
=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

=== User Reputation System ===

**V1.0 Approach**: Simple manual role assignment
**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple; add complexity when metrics prove it necessary.

=== Roles (Manual Assignment) ===

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

...

* System configuration
* Access to all features
* Founder-appointed initially

=== Contribution Tracking (Simple) ===

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

...

* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

=== Promotion Process ===

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews the user's contribution history
3. Moderator promotes the user to the contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

=== V2.0+ Evolution ===

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes a bottleneck
* Clear patterns of abuse emerge requiring automation

...

* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes

=== Edit History Details ===

**What Gets Edited**:

* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence

...

* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to a previous version

**Retention Policy** (5 years total):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
**Use Cases**:

* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases.

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking
**Usage**:

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs. treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements
**Usage**:

* **Error capture**: When users flag issues or the system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
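The QualityMetric alerting described above can be sketched in a few lines. This is a minimal illustration, not the production alerting pipeline; `exceeds_target` and `check_metrics` are hypothetical helper names, and the sketch assumes error-rate-style metrics where a value above target is bad.

```python
# Hypothetical sketch: flagging QualityMetric records that exceed their target.
# Assumes error-rate-style metrics (higher value = worse quality).
from dataclasses import dataclass


@dataclass
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def exceeds_target(metric: QualityMetric) -> bool:
    """For error-rate style metrics, alert when value exceeds target."""
    return metric.value > metric.target


def check_metrics(metrics):
    """Return the metrics that should raise an alert."""
    return [m for m in metrics if exceeds_target(m)]


metrics = [
    QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17"),
    QualityMetric("ErrorRate", "Health", 0.08, 0.10, "2025-12-17"),
]
alerts = check_metrics(metrics)
print([m.category for m in alerts])  # → ['Politics']
```

A metric type where lower values are worse (e.g., average confidence) would need the comparison inverted, which is one reason to store `target` alongside each record.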
== 1.4 Core Data Model ERD ==

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.5 User Class Diagram ==

{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates a new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

...

```
Claim V1: "The sky is blue"
Claim V2: "The sky is blue during daytime"
→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```

== 2.5 Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs what to compute dynamically?
**Trade-off**:

* **Store more**: Better reproducibility, faster responses, lower LLM costs | Higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: ~1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: ~2 KB per evidence record (with excerpt)
* **Growth**: 3-10 evidence records per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: ~500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: ~2 KB per edit
* **Growth**: Linear with edits (~20% of claims get edited)
* **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 for 3 years → Delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: ~500 bytes per flag
* **Growth**: ~5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: ~1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis; historical metrics cannot be recreated
* **Size**: ~200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when the claim is first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when the system significantly improves OR the claim is edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: ~2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective
==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios ====

⚠️ **CRITICAL DECISION**

* **What**: 2-5 possible interpretations of a claim with assumptions
* **Current design**: Stored in the Scenario table
* **Alternative**: Compute on demand when a user views claim details
* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by ~20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Speed**: Computed = 5-8 seconds delay; Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED):

1. **Store scenarios** initially when the claim is analyzed
2. **Mark as stale** when the system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time a claim is viewed
* **Speed**: 2-3 seconds (acceptable)

**Alternative**: Store the "last verdict" as a cached field; recompute only if the claim is edited or marked stale

* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching a search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in the User table
* **Alternative**: Compute from the Edit table

...
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
|-----------|---------|---------|----------------|----------|-----------|
| Claim core | ✅ | - | 1 KB | STORE | Essential |
| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
| Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |

**Total storage per claim**: ~18 KB (without edits and flags)
**For 1 million claims**:

* **Storage**: ~18 GB (manageable)
* **PostgreSQL**: ~$50/month (standard instance)
* **Redis cache**: ~$20/month (1 GB instance)
* **S3 archives**: ~$5/month (old edits)
* **Total**: ~$75/month infrastructure

**LLM cost savings from caching**:

* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
* **Total savings**: ~$35K per 1M claims vs. recomputing every time

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute:**

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:

* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)

=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims

...

* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable, and the LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 4. What Gets Stored in the Database ==

=== 4.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:

...
**QualityMetric Table**:

* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

=== 4.2 What's NOT Stored (Computed On-the-Fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested
* **Risk scores**: Recalculated based on current factors
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from the Elasticsearch index

=== 4.3 Cache Layer (Redis) ===

**Cached for performance**:

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)

=== 4.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

=== 4.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync.

== 5. Related Pages ==

* [[Architecture>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]