Changes for page Data Model
Last modified by Robert Schaub on 2025/12/22 14:16
From version 1.3
edited by Robert Schaub
on 2025/12/22 14:16
Change comment:
Update document after refactoring.
Summary: Page properties (2 modified, 0 added, 0 removed)
Parent: changed from "Test.FactHarbor pre11 V0\.9\.70.Specification.WebHome" to "Test.FactHarbor.Specification.WebHome"
Content:
= Data Model =

FactHarbor's data model is **simple, focused, designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
**Additional cached fields in claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining the evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios are added/removed

...

* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Need to keep caches in sync (see the sketch below)
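Keeping these caches in sync is the main cost of the denormalization. A minimal sketch of the refresh logic, assuming evidence rows expose the fields listed above (the function name and dict shapes are illustrative, not the actual schema):

{{code language="python"}}
# Illustrative sketch only: rebuild the cached claim fields from the
# claim's evidence rows. Would run on every evidence add/remove.

def build_claim_caches(evidence):
    """evidence: list of dicts with 'excerpt', 'source_name', 'relevance_score'."""
    # evidence_summary: top 5 snippets by relevance, as a JSONB-ready list
    top5 = sorted(evidence, key=lambda e: e["relevance_score"], reverse=True)[:5]
    evidence_summary = [
        {"text": e["excerpt"], "source": e["source_name"], "relevance": e["relevance_score"]}
        for e in top5
    ]
    # source_names: unique names for quick display, no join at read time
    source_names = sorted({e["source_name"] for e in evidence})
    return evidence_summary, source_names
{{/code}}

The results would be written back to the claims row together with scenario_count, so list and preview pages never touch the evidence table.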
=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track reliability of information sources over time
**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")

...

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
**Key**: Automated source reliability tracking

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: direct feedback creates a circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:

{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update the source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:

{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ the source score (snapshot from the last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use the score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score
    # Generate the verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:
* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases

...

```
Monday-Saturday: Claims processed using these scores → All claims this week use NYT=0.87 → All claims this week use Blog X=0.52
Next Sunday 2 AM: Recalculate scores including this week's claims → NYT score: 0.89 (trending up) → Blog X score: 0.48 (trending down)
```

=== 1.4 Scenario ===

**Purpose**: Different interpretations or contexts for evaluating claims
**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
**Relationship**: One-to-many with Claims (**simplified for V1.0**: a scenario belongs to a single claim)
**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario

...

* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

=== User Reputation System ===

**V1.0 Approach**: Simple manual role assignment
**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove it necessary.

=== Roles (Manual Assignment) ===

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

...

* System configuration
* Access to all features
* Founder-appointed initially

=== Contribution Tracking (Simple) ===

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

...

* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

=== Promotion Process ===

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews the user's contribution history
3. Moderator promotes the user to contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

=== V2.0+ Evolution ===

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes a bottleneck
* Clear patterns of abuse emerge requiring automation

...
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes

=== Edit History Details ===

**What Gets Edited**:

* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence

...

* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to a previous version

**Retention Policy** (5 years total):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
**Use Cases**:

* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases.

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking
**Usage**:

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements
**Usage**:

* **Error capture**: When users flag issues or the system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error rate reduction over time

**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.11 Core Data Model ERD ==

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.12 User Class Diagram ==

{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
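To make the Edit entity concrete before the versioning discussion, here is a sketch of assembling an edit record. The field names follow section 1.7; the function itself, the `USER_EDIT` type value, and the example data are illustrative assumptions, and the actual persistence call is omitted:

{{code language="python"}}
import json
from datetime import datetime, timezone

# Illustrative sketch: assemble an Edit record (section 1.7) for a change.
def build_edit_record(entity_type, entity_id, user_id, before, after, edit_type, reason):
    return {
        "entity_type": entity_type,          # e.g. "claim"
        "entity_id": entity_id,
        "user_id": user_id,
        "before_state": json.dumps(before),  # full snapshot, ~2 KB typical
        "after_state": json.dumps(after),
        "edit_type": edit_type,              # e.g. "USER_EDIT", "REVERT"
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

# The example from the versioning section below:
edit = build_edit_record(
    "claim", "42", "u7",
    before={"assertion": "The sky is blue"},
    after={"assertion": "The sky is blue during daytime"},
    edit_type="USER_EDIT", reason="Added missing context",
)
{{/code}}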
== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates a new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

...

```
Claim V1: "The sky is blue" → User edits → Claim V2: "The sky is blue during daytime" → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```

== 2.5. Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs compute dynamically?
**Trade-off**:

* **Store more**: Better reproducibility, faster, lower LLM costs | higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: ~1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: ~2 KB per evidence record (with excerpt)
* **Growth**: 3-10 evidence records per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: ~500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: ~2 KB per edit
* **Growth**: Linear with edits (~20% of claims get edited)
* **Retention**: Hot storage 3 months → warm storage 2 years → archive to S3 for 3 years → delete after 5 years total (tier boundaries sketched below)
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: ~500 bytes per flag
* **Growth**: ~5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement
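The tiered retention chain for edit history (above) can be expressed as a simple age classifier; a real nightly job would iterate the EDIT table and move rows between tiers, which is elided here. All names are illustrative, and months are approximated as 30 days:

{{code language="python"}}
from datetime import timedelta

# Tier boundaries: hot 3 months -> warm 2 years -> cold (S3) until 5 years total
HOT = timedelta(days=90)
WARM = timedelta(days=90 + 2 * 365)
COLD = timedelta(days=5 * 365)

def retention_tier(age, legal_hold=False):
    """Return the storage tier for an edit record of the given age."""
    if age < HOT:
        return "hot"       # PostgreSQL, instant access
    if age < WARM:
        return "warm"      # partitioned, slower queries
    if age < COLD:
        return "cold"      # S3 compressed, download required
    return "hold" if legal_hold else "delete"   # deletion only after 5 years
{{/code}}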
==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: ~1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: ~200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when claim first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when system significantly improves OR claim edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: ~2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* ...
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* ...
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical (read path sketched below)
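A sketch of the "computed once, then cached" read path shared by these fields; the `stale` flag and the AKEL interface shown here are assumptions based on the recomputation triggers described later:

{{code language="python"}}
# Illustrative read path: scores live on the claim row; AKEL only runs
# when something marked the claim stale.

def get_claim_scores(claim, akel):
    if claim.get("stale", False):
        # Expensive path: one AKEL run, results cached back on the claim row
        summary, confidence, risk = akel.compute(claim)
        claim.update(analysis_summary=summary,
                     confidence_score=confidence,
                     risk_score=risk,
                     stale=False)
        # persist the claim row here
    # Cheap path: two integers straight from storage
    return claim["confidence_score"], claim["risk_score"]
{{/code}}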
**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios ====

⚠️ CRITICAL DECISION

* **What**: 2-5 possible interpretations of a claim with assumptions
* **Current design**: Stored in the Scenario table
* **Alternative**: Compute on demand when a user views claim details
* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by ~20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 seconds delay; stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED):

1. **Store scenarios** initially when the claim is analyzed
2. **Mark as stale** when the system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time a claim is viewed
* **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)

**Alternative**: Store the "last verdict" as a cached field, recompute only if the claim is edited or marked stale

* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching a search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in the User table
* **Alternative**: Compute from the Edit table
* ...
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy (see the sketch below)
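A sketch of why storing the counter wins for a read-heavy workload: reads stay constant-time, while the Edit table remains the source of truth for occasional reconciliation. Helper names and dict shapes are illustrative:

{{code language="python"}}
# Illustrative sketch: cheap stored counter plus periodic reconciliation.

def on_contribution(user):
    # Hot path: increment the stored counter, no COUNT query needed
    user["contributions_count"] = user.get("contributions_count", 0) + 1

def reconcile(user, count_edits_by_user):
    # Periodic job: heal any drift against the Edit table
    true_count = count_edits_by_user(user["id"])   # simple COUNT(*) query
    if user.get("contributions_count") != true_count:
        user["contributions_count"] = true_count
{{/code}}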
=== Summary Table ===

|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15 min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1 hr cache|Derivable

**Total storage per claim**: ~18 KB (without edits and flags)
**For 1 million claims**:

* **Storage**: ~18 GB (manageable)
* **PostgreSQL**: ~$50/month (standard instance)
* **Redis cache**: ~$20/month (1 GB instance)
* **S3 archives**: ~$5/month (old edits)
* **Total**: ~$75/month infrastructure

**LLM cost savings from caching**:

* Analysis summary stored: save $0.03 per claim = $30K per 1M claims
* Scenarios stored: save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: save $0.003 per claim = $3K per 1M claims
* **Total savings**: ~$35K per 1M claims vs recomputing every time

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute:**

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy** (see the sketch below):

* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)
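As referenced above, one way the five triggers might map onto these strategies; the event names and the `recompute` callback are invented for illustration, and the nightly batch pass over stale claims is omitted:

{{code language="python"}}
# Illustrative dispatch: triggers 1-3 and 5 recompute eagerly,
# trigger 4 just marks the claim stale for lazy recompute on next view.

EAGER = {
    "claim_edited": ("analysis", "scenarios", "verdict", "scores"),   # 1
    "evidence_added": ("scenarios", "verdict", "confidence_score"),   # 2
    "source_score_shift_gt_10": ("confidence_score", "verdict"),      # 3
    "controversy_detected": ("risk_score",),                          # 5
}

def on_event(event, claim, recompute):
    if event == "system_improvement_deployed":                        # 4
        claim["stale"] = True          # recomputed on next view
    elif event in EAGER:
        recompute(claim, parts=EAGER[event])
{{/code}}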
=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims

* Storage: 1.8 GB

...

* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable, and the LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 4. What Gets Stored in the Database ==

=== 4.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:

...

**QualityMetric Table**:

* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

=== 4.2 What's NOT Stored (Computed on-the-fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested (a cached copy may be stored, per section 2.5)
* **Risk scores**: Recalculated based on current factors
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from the Elasticsearch index

=== 4.3 Cache Layer (Redis) ===

**Cached for performance**:

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)

=== 4.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

=== 4.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync.

== 5. Related Pages ==

* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]