Changes for page Data Model

Last modified by Robert Schaub on 2026/02/08 08:32

From version 1.3
edited by Robert Schaub
on 2026/02/08 08:30
Change comment: Update document after refactoring.
To version 1.1
edited by Robert Schaub
on 2026/01/20 21:40
Change comment: Imported from XAR

Page properties
Parent
... ... @@ -1,1 +1,1 @@
1 -Archive.FactHarbor 2026\.02\.08.Specification.WebHome
1 +FactHarbor.Specification.WebHome
Content
... ... @@ -17,32 +17,26 @@
17 17  {{/warning}}
18 18  
19 19  FactHarbor's data model is **simple, focused, designed for automated processing**.
20 -
21 21  == 1. Core Entities ==
22 -
23 23  === 1.1 Claim ===
24 -
25 25  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
26 -
27 27  ==== Performance Optimization: Denormalized Fields ====
28 -
29 29  **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
30 30  **Additional cached fields in claims table**:
31 -
32 32  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
33 -* Avoids joining evidence table for listing/preview
34 -* Updated when evidence is added/removed
35 -* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
27 + * Avoids joining evidence table for listing/preview
28 + * Updated when evidence is added/removed
29 + * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
36 36  * **source_names** (TEXT[]): Array of source names for quick display
37 -* Avoids joining through evidence to sources
38 -* Updated when sources change
39 -* Format: `["New York Times", "Nature Journal", ...]`
31 + * Avoids joining through evidence to sources
32 + * Updated when sources change
33 + * Format: `["New York Times", "Nature Journal", ...]`
40 40  * **scenario_count** (INTEGER): Number of scenarios for this claim
41 -* Quick metric without counting rows
42 -* Updated when scenarios added/removed
35 + * Quick metric without counting rows
36 + * Updated when scenarios added/removed
43 43  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
44 -* Helps invalidate stale caches
45 -* Triggers background refresh if too old
38 + * Helps invalidate stale caches
39 + * Triggers background refresh if too old
46 46  **Update Strategy**:
47 47  * **Immediate**: Update on claim edit (user-facing)
48 48  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -51,18 +51,13 @@
51 51  * ✅ 70% fewer joins on common queries
52 52  * ✅ Much faster claim list/search pages
53 53  * ✅ Better user experience
54 -* ⚠️ Small storage increase (10%)
48 +* ⚠️ Small storage increase (~10%)
55 55  * ⚠️ Need to keep caches in sync
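The cached fields described above could be rebuilt along the following lines. This is a minimal sketch, not the project's implementation: the function names, the dict-based row shapes (`excerpt`, `source_name`, `relevance_score`), and the one-hour staleness limit (borrowed from the hourly background job) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def refresh_claim_cache(claim, evidence_rows, scenario_rows, now=None):
    """Rebuild the denormalized fields on a claim (hypothetical dict shapes)."""
    now = now or datetime.now(timezone.utc)
    # Keep only the five most relevant snippets, highest score first.
    top = sorted(evidence_rows, key=lambda e: e["relevance_score"], reverse=True)[:5]
    claim["evidence_summary"] = [
        {"text": e["excerpt"], "source": e["source_name"], "relevance": e["relevance_score"]}
        for e in top
    ]
    # Distinct source names in first-seen order.
    claim["source_names"] = list(dict.fromkeys(e["source_name"] for e in evidence_rows))
    claim["scenario_count"] = len(scenario_rows)
    claim["cache_updated_at"] = now
    return claim


def is_cache_stale(claim, now, max_age=timedelta(hours=1)):
    """True when the cached fields are older than max_age (assumed limit)."""
    return now - claim["cache_updated_at"] > max_age
```

Calling `refresh_claim_cache` inside the same transaction as the claim edit would correspond to the immediate path, while a periodic job could use `is_cache_stale` to pick claims for the deferred path.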
56 -
57 57  === 1.2 Evidence ===
58 -
59 59  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
60 -
61 61  === 1.3 Source ===
62 -
63 63  **Purpose**: Track reliability of information sources over time
64 64  **Fields**:
65 -
66 66  * **id** (UUID): Unique identifier
67 67  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
68 68  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -85,21 +85,17 @@
85 85  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
86 86  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
87 87  **Key**: Automated source reliability tracking
88 -
89 89  ==== Source Scoring Process (Separation of Concerns) ====
90 -
91 91  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
92 92  **The Problem**:
93 -
94 94  * Source scores should influence claim verdicts
95 95  * Claim verdicts should update source scores
96 96  * But: Direct feedback creates circular dependency and potential feedback loops
97 97  **The Solution**: Temporal separation
98 -
99 99  ==== Weekly Background Job (Source Scoring) ====
100 -
101 101  Runs independently of claim analysis:
102 -{{code language="python"}}def update_source_scores_weekly():
86 +{{code language="python"}}
87 +def update_source_scores_weekly():
103 103   """
104 104   Background job: Calculate source reliability
105 105   Never triggered by individual claim analysis
... ... @@ -119,12 +119,12 @@
119 119   source.last_updated = now()
120 120   source.save()
121 121   # Job runs: Sunday 2 AM UTC
122 - # Never during claim processing{{/code}}
123 -
107 + # Never during claim processing
108 +{{/code}}
124 124  ==== Real-Time Claim Analysis (AKEL) ====
125 -
126 126  Uses source scores but never updates them:
127 -{{code language="python"}}def analyze_claim(claim_text):
111 +{{code language="python"}}
112 +def analyze_claim(claim_text):
128 128   """
129 129   Real-time: Analyze claim using current source scores
130 130   READ source scores, never UPDATE them
... ... @@ -141,12 +141,10 @@
141 141   verdict = synthesize_verdict(evidence_list)
142 142   # NEVER update source scores here
143 143   # That happens in weekly background job
144 - return verdict{{/code}}
145 -
129 + return verdict
130 +{{/code}}
146 146  ==== Monthly Audit (Quality Assurance) ====
147 -
148 148  Moderator review of flagged source scores:
149 -
150 150  * Verify scores make sense
151 151  * Detect gaming attempts
152 152  * Identify systematic biases
... ... @@ -186,7 +186,6 @@
186 186   → NYT score: 0.89 (trending up)
187 187   → Blog X score: 0.48 (trending down)
188 188  ```
189 -
190 190  === 1.4 Scenario ===
191 191  
192 192  {{warning}}
... ... @@ -197,7 +197,6 @@
197 197  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
198 198  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
199 199  **Fields**:
200 -
201 201  * **id** (UUID): Unique identifier
202 202  * **claim_id** (UUID): Foreign key to claim (one-to-many)
203 203  * **description** (text): Human-readable description of the scenario
... ... @@ -218,7 +218,6 @@
218 218  **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
219 219  
220 220  **Core Fields**:
221 -
222 222  * **id** (UUID): Primary key
223 223  * **scenario_id** (UUID FK): The scenario being assessed
224 224  * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
... ... @@ -234,7 +234,6 @@
234 234  
235 235  **Example**:
236 236  For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
237 -
238 238  * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
239 239  * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
240 240  * Edit entity records the complete before/after change with timestamp and reason
... ... @@ -242,18 +242,12 @@
242 242  **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
243 243  
244 244  === 1.6 User ===
245 -
246 246  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
247 -
248 -=== User Reputation System ===
249 -
225 +=== User Reputation System ===
250 250  **V1.0 Approach**: Simple manual role assignment
251 251  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
252 -
253 253  === Roles (Manual Assignment) ===
254 -
255 255  **reader** (default):
256 -
257 257  * View published claims and evidence
258 258  * Browse and search content
259 259  * No editing permissions
... ... @@ -272,11 +272,8 @@
272 272  * System configuration
273 273  * Access to all features
274 274  * Founder-appointed initially
275 -
276 276  === Contribution Tracking (Simple) ===
277 -
278 278  **Basic metrics only**:
279 -
280 280  * `contributions_count`: Total number of contributions
281 281  * `created_at`: Account age
282 282  * `last_active`: Recent activity
... ... @@ -285,26 +285,19 @@
285 285  * No automated privilege escalation
286 286  * No reputation decay
287 287  * No threshold-based promotions
288 -
289 289  === Promotion Process ===
290 -
291 291  **Manual review by moderators/admins**:
292 -
293 293  1. User demonstrates value through contributions
294 294  2. Moderator reviews user's contribution history
295 295  3. Moderator promotes user to contributor role
296 296  4. Admin promotes trusted contributors to moderator
297 297  **Criteria** (guidelines, not automated):
298 -
299 299  * Quality of contributions
300 300  * Consistency over time
301 301  * Collaborative behavior
302 302  * Understanding of project goals
303 -
304 304  === V2.0+ Evolution ===
305 -
306 306  **Add complex reputation when**:
307 -
308 308  * 100+ active contributors
309 309  * Manual role management becomes bottleneck
310 310  * Clear patterns of abuse emerge requiring automation
... ... @@ -314,16 +314,11 @@
314 314  * Reputation decay for inactive users
315 315  * Track record scoring for contributors
316 316  See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
317 -
318 318  === 1.7 Edit ===
319 -
320 320  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
321 321  **Purpose**: Complete audit trail for all content changes
322 -
323 323  === Edit History Details ===
324 -
325 325  **What Gets Edited**:
326 -
327 327  * **Claims** (20% edited): assertion, domain, status, scores, analysis
328 328  * **Evidence** (10% edited): excerpt, relevance_score, supports
329 329  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -340,14 +340,12 @@
340 340  * `MODERATION_ACTION`: Hide/unhide for abuse
341 341  * `REVERT`: Rollback to previous version
342 342  **Retention Policy** (5 years total):
343 -
344 344  1. **Hot storage** (3 months): PostgreSQL, instant access
345 345  2. **Warm storage** (2 years): Partitioned, slower queries
346 346  3. **Cold storage** (3 years): S3 compressed, download required
347 347  4. **Deletion**: After 5 years (except legal holds)
348 -**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
305 +**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
349 349  **Use Cases**:
350 -
351 351  * View claim history timeline
352 352  * Detect vandalism patterns
353 353  * Learn from user corrections (system improvement)
... ... @@ -354,17 +354,12 @@
354 354  * Legal compliance (audit trail)
355 355  * Rollback capability
356 356  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
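One way to read the retention policy above is as a set of age cutoffs: hot until about 3 months, warm until 2 years, cold until 5 years, then deletion unless a legal hold applies. The sketch below encodes that reading. The cutoff interpretation, the 90-day approximation of three months, and the names `storage_tier` and `legal_hold` are assumptions for illustration, not part of the specification.

```python
from datetime import timedelta

HOT_LIMIT = timedelta(days=90)          # roughly 3 months (assumed 90 days)
WARM_LIMIT = timedelta(days=2 * 365)    # up to 2 years old
DELETE_AFTER = timedelta(days=5 * 365)  # 5 years total


def storage_tier(age: timedelta, legal_hold: bool = False) -> str:
    """Map the age of an edit record to a storage tier."""
    if age < HOT_LIMIT:
        return "hot"      # PostgreSQL, instant access
    if age < WARM_LIMIT:
        return "warm"     # partitioned, slower queries
    if age < DELETE_AFTER or legal_hold:
        return "cold"     # S3 compressed, download required
    return "delete"
```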
357 -
358 358  === 1.8 Flag ===
359 -
360 360  Fields: entity_id, reported_by, issue_type, status, resolution_note
361 -
362 362  === 1.9 QualityMetric ===
363 -
364 364  **Fields**: metric_type, category, value, target, timestamp
365 365  **Purpose**: Time-series quality tracking
366 366  **Usage**:
367 -
368 368  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
369 369  * **Quality dashboard**: Real-time display with trend charts
370 370  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -371,13 +371,10 @@
371 371  * **A/B testing**: Compare control vs treatment metrics
372 372  * **Improvement validation**: Measure before/after changes
373 373  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
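The alerting rule can be illustrated with the example record above. This is a sketch under the assumption that lower values are better (as for error rates); the `QualityMetric` dataclass and `exceeds_target` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass


@dataclass
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def exceeds_target(metric: QualityMetric) -> bool:
    """True when the measured value is above its target (lower is better)."""
    return metric.value > metric.target


# The example from the text: a 12% error rate against a 10% target.
politics = QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17")
```

For this record `exceeds_target(politics)` is true, so an alert would fire.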
374 -
375 375  === 1.10 ErrorPattern ===
376 -
377 377  **Fields**: error_category, claim_id, description, root_cause, frequency, status
378 378  **Purpose**: Capture errors to trigger system improvements
379 379  **Usage**:
380 -
381 381  * **Error capture**: When users flag issues or system detects problems
382 382  * **Pattern analysis**: Weekly grouping by category and frequency
383 383  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -389,13 +389,9 @@
389 389  {{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
390 390  
391 391  == 1.5 User Class Diagram ==
392 -
393 393  {{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
394 -
395 395  == 2. Versioning Strategy ==
396 -
397 397  **All Content Entities Are Versioned**:
398 -
399 399  * **Claim**: Every edit creates new version (V1→V2→V3...)
400 400  * **Evidence**: Changes tracked in edit history
401 401  * **Scenario**: Modifications versioned
... ... @@ -416,91 +416,68 @@
416 416  Claim V2: "The sky is blue during daytime"
417 417   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
418 418  ```
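The versioning example above could be recorded roughly as follows. This is a minimal in-memory sketch: the claim is a plain dict, `edit_log` stands in for the EDIT table, and `apply_claim_edit` is a hypothetical helper. Only the field names come from the Edit entity in section 1.7.

```python
from datetime import datetime, timezone


def apply_claim_edit(claim, new_assertion, user_id, edit_type, reason, edit_log):
    """Record an Edit entry for a claim change, then update the claim in place."""
    edit = {
        "entity_type": "claim",
        "entity_id": claim["id"],
        "user_id": user_id,
        "before_state": {"assertion": claim["assertion"]},
        "after_state": {"assertion": new_assertion},
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc),
    }
    edit_log.append(edit)
    claim["assertion"] = new_assertion
    claim["version"] += 1  # linear history: V1, V2, V3, ...
    return edit
```

Writing the Edit record before mutating the claim keeps the before state intact, which is what makes a later rollback possible.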
419 -
420 420  == 2.5. Storage vs Computation Strategy ==
421 -
422 422  **Critical architectural decision**: What to persist in databases vs compute dynamically?
423 423  **Trade-off**:
424 -
425 425  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
426 426  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
427 -
428 428  === Recommendation: Hybrid Approach ===
429 -
430 430  **STORE (in PostgreSQL):**
431 -
432 432  ==== Claims (Current State + History) ====
433 -
434 434  * **What**: assertion, domain, status, created_at, updated_at, version
435 435  * **Why**: Core entity, must be persistent
436 436  * **Also store**: confidence_score (computed once, then cached)
437 -* **Size**: 1 KB per claim
374 +* **Size**: ~1 KB per claim
438 438  * **Growth**: Linear with claims
439 439  * **Decision**: ✅ STORE - Essential
440 -
441 441  ==== Evidence (All Records) ====
442 -
443 443  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
444 444  * **Why**: Hard to re-gather, user contributions, reproducibility
445 -* **Size**: 2 KB per evidence (with excerpt)
380 +* **Size**: ~2 KB per evidence (with excerpt)
446 446  * **Growth**: 3-10 evidence per claim
447 447  * **Decision**: ✅ STORE - Essential for reproducibility
448 -
449 449  ==== Sources (Track Records) ====
450 -
451 451  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
452 452  * **Why**: Continuously updated, expensive to recompute
453 -* **Size**: 500 bytes per source
386 +* **Size**: ~500 bytes per source
454 454  * **Growth**: Slow (limited number of sources)
455 455  * **Decision**: ✅ STORE - Essential for quality
456 -
457 457  ==== Edit History (All Versions) ====
458 -
459 459  * **What**: before_state, after_state, user_id, reason, timestamp
460 460  * **Why**: Audit trail, legal requirement, reproducibility
461 -* **Size**: 2 KB per edit
462 -* **Growth**: Linear with edits (A portion of claims get edited)
392 +* **Size**: ~2 KB per edit
393 +* **Growth**: Linear with edits (a portion of claims get edited)
463 463  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
464 464  * **Decision**: ✅ STORE - Essential for accountability
465 -
466 466  ==== Flags (User Reports) ====
467 -
468 468  * **What**: entity_id, reported_by, issue_type, description, status
469 469  * **Why**: Error detection, system improvement triggers
470 -* **Size**: 500 bytes per flag
399 +* **Size**: ~500 bytes per flag
471 471  * **Growth**: A percentage of claims get flagged (the Summary Table assumes 10%)
472 472  * **Decision**: ✅ STORE - Essential for improvement
473 -
474 474  ==== ErrorPatterns (System Improvement) ====
475 -
476 476  * **What**: error_category, claim_id, description, root_cause, frequency, status
477 477  * **Why**: Learning loop, prevent recurring errors
478 -* **Size**: 1 KB per pattern
405 +* **Size**: ~1 KB per pattern
479 479  * **Growth**: Slow (limited patterns, many fixed)
480 480  * **Decision**: ✅ STORE - Essential for learning
481 -
482 482  ==== QualityMetrics (Time Series) ====
483 -
484 484  * **What**: metric_type, category, value, target, timestamp
485 485  * **Why**: Trend analysis, cannot recreate historical metrics
486 -* **Size**: 200 bytes per metric
411 +* **Size**: ~200 bytes per metric
487 487  * **Growth**: Hourly = 8,760 per year per metric type
488 488  * **Retention**: 2 years hot, then aggregate and archive
489 489  * **Decision**: ✅ STORE - Essential for monitoring
490 490  **STORE (Computed Once, Then Cached):**
491 -
492 492  ==== Analysis Summary ====
493 -
494 494  * **What**: Neutral text summary of claim analysis (200-500 words)
495 495  * **Computed**: Once by AKEL when claim first analyzed
496 496  * **Stored in**: Claim table (text field)
497 497  * **Recomputed**: Only when system significantly improves OR claim edited
498 498  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
499 -* **Size**: 2 KB per claim
422 +* **Size**: ~2 KB per claim
500 500  * **Decision**: ✅ STORE (cached) - Cost-effective
501 -
502 502  ==== Confidence Score ====
503 -
504 504  * **What**: 0-100 score of analysis confidence
505 505  * **Computed**: Once by AKEL
506 506  * **Stored in**: Claim table (integer field)
... ... @@ -508,9 +508,7 @@
508 508  * **Why store**: Cheap to store, expensive to compute, users need it fast
509 509  * **Size**: 4 bytes per claim
510 510  * **Decision**: ✅ STORE (cached) - Performance critical
511 -
512 512  ==== Risk Score ====
513 -
514 514  * **What**: 0-100 score of claim risk level
515 515  * **Computed**: Once by AKEL
516 516  * **Stored in**: Claim table (integer field)
... ... @@ -519,17 +519,13 @@
519 519  * **Size**: 4 bytes per claim
520 520  * **Decision**: ✅ STORE (cached) - Performance critical
521 521  **COMPUTE DYNAMICALLY (Never Store):**
522 -
523 -==== Scenarios ====
524 -
525 - ⚠️ CRITICAL DECISION
526 -
441 +==== Scenarios (⚠️ CRITICAL DECISION) ====
527 527  * **What**: 2-5 possible interpretations of claim with assumptions
528 528  * **Current design**: Stored in Scenario table
529 529  * **Alternative**: Compute on-demand when user views claim details
530 -* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
445 +* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
531 531  * **Compute cost**: $0.005-0.01 per request (LLM API call)
532 -* **Frequency**: Viewed in detail by 20% of users
447 +* **Frequency**: Viewed in detail by ~20% of users
533 533  * **Trade-off analysis**:
534 534   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
535 535   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -537,17 +537,12 @@
537 537  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
538 538  * **Decision**: ✅ STORE (hybrid approach below)
539 539  **Scenario Strategy** (APPROVED):
540 -
541 541  1. **Store scenarios** initially when claim analyzed
542 542  2. **Mark as stale** when system improves significantly
543 543  3. **Recompute on next view** if marked stale
544 544  4. **Cache for 30 days** if frequently accessed
545 545  5. **Result**: Best of both worlds - speed + freshness
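Steps 1 to 3 of the scenario strategy amount to a lazy recompute guarded by a stale flag. The sketch below shows that pattern only. The `scenarios_stale` key, the dict-based claim, and the injected `compute_scenarios` callable (standing in for the LLM call) are assumptions, and the 30-day cache from step 4 is left out.

```python
def get_scenarios(claim, compute_scenarios):
    """Return stored scenarios, recomputing only when missing or marked stale."""
    if claim.get("scenarios") is None or claim.get("scenarios_stale", False):
        # Expensive path: in practice this would be an LLM API call.
        claim["scenarios"] = compute_scenarios(claim["assertion"])
        claim["scenarios_stale"] = False
    return claim["scenarios"]


def mark_stale(claims):
    """Flag claims after a significant system improvement; nothing is recomputed yet."""
    for claim in claims:
        claim["scenarios_stale"] = True
```

With this shape, a system improvement costs nothing up front, and the recompute cost is paid only for claims that are actually viewed again.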
546 -
547 -==== Verdict Synthesis ====
548 -
549 -
550 -
460 +==== Verdict Synthesis ====
551 551  * **What**: Final conclusion text synthesizing all scenarios
552 552  * **Compute cost**: $0.002-0.005 per request
553 553  * **Frequency**: Every time claim viewed
... ... @@ -555,23 +555,17 @@
555 555  * **Speed**: 2-3 seconds (acceptable)
556 556  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
557 557  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
558 -
559 559  ==== Search Results ====
560 -
561 561  * **What**: Lists of claims matching search query
562 562  * **Compute from**: Elasticsearch index
563 563  * **Cache**: 15 minutes in Redis for popular queries
564 564  * **Why not store permanently**: Constantly changing, infinite possible queries
565 -
566 566  ==== Aggregated Statistics ====
567 -
568 568  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
569 569  * **Compute from**: Database queries
570 570  * **Cache**: 1 hour in Redis
571 571  * **Why not store**: Can be derived, relatively cheap to compute
572 -
573 573  ==== User Reputation ====
574 -
575 575  * **What**: Score based on contributions
576 576  * **Current design**: Stored in User table
577 577  * **Alternative**: Compute from Edit table
... ... @@ -581,43 +581,37 @@
581 581  * **Frequency**: Read on every user action
582 582  * **Compute cost**: Simple COUNT query, milliseconds
583 583  * **Decision**: ✅ STORE - Performance critical, read-heavy
584 -
585 585  === Summary Table ===
586 -
587 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\
588 -|-----|-|-||----|-----|\\
589 -| Claim core | ✅ | - | 1 KB | STORE | Essential |\\
590 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\
591 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\
592 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\
593 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\
594 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
595 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
596 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\
597 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\
598 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\
599 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\
600 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\
601 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\
489 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
490 +|-----------|---------|---------|----------------|----------|-----------|
491 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
492 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
493 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
494 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
495 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
496 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
497 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
498 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
499 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
500 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
501 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
502 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
503 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
602 602  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
603 -**Total storage per claim**: 18 KB (without edits and flags)
505 +**Total storage per claim**: ~18 KB (without edits and flags)
604 604  **For 1 million claims**:
605 -
606 -* **Storage**: 18 GB (manageable)
607 -* **PostgreSQL**: $50/month (standard instance)
608 -* **Redis cache**: $20/month (1 GB instance)
609 -* **S3 archives**: $5/month (old edits)
610 -* **Total**: $75/month infrastructure
507 +* **Storage**: ~18 GB (manageable)
508 +* **PostgreSQL**: ~$50/month (standard instance)
509 +* **Redis cache**: ~$20/month (1 GB instance)
510 +* **S3 archives**: ~$5/month (old edits)
511 +* **Total**: ~$75/month infrastructure
611 611  **LLM cost savings by caching**:
612 612  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
613 613  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
614 614  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
615 -* **Total savings**: $35K per 1M claims vs recomputing every time
616 -
516 +* **Total savings**: ~$35K per 1M claims vs recomputing every time
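The totals above can be checked with a few lines of arithmetic. This sketch uses only the per-claim figures quoted in this section. Decimal units (1 GB = 1,000,000 KB) are an assumption, and the variable names are illustrative.

```python
CLAIMS = 1_000_000

# LLM cost avoided by caching, in USD (per-claim figures from the list above).
savings_usd = {
    "analysis_summary": 0.03 * CLAIMS,
    "scenarios": 0.01 * 0.20 * CLAIMS,  # only ~20% of claims are viewed in detail
    "verdict": 0.003 * CLAIMS,
}
total_savings_usd = sum(savings_usd.values())

# Storage at ~18 KB per claim, in decimal units.
storage_gb = 18 * CLAIMS / 1_000_000

# Monthly infrastructure: PostgreSQL + Redis + S3 archives.
infra_usd_per_month = 50 + 20 + 5

print(f"LLM savings: ${total_savings_usd:,.0f}")
print(f"Storage: {storage_gb:.0f} GB, infrastructure: ${infra_usd_per_month}/month")
```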
617 617  === Recomputation Triggers ===
618 -
619 619  **When to mark cached data as stale and recompute:**
620 -
621 621  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
622 622  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
623 623  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -624,15 +624,11 @@
624 624  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
625 625  5. **Controversy detected** (high flag rate) → Recompute: risk score
626 626  **Recomputation strategy**:
627 -
628 628  * **Eager**: Immediately recompute (for user edits)
629 629  * **Lazy**: Recompute on next view (for system improvements)
630 630  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
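The trigger list above can be expressed as a lookup from event to the cached fields that need recomputing. This is a sketch under stated assumptions: the event names and `fields_to_recompute` are hypothetical, "scores" is taken to mean the confidence and risk scores, and `score_delta` is assumed to be in the same points unit as the 10-point threshold.

```python
RECOMPUTE = {
    "claim_edited": {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "evidence_added": {"scenarios", "verdict", "confidence_score"},
    "source_score_changed": {"confidence_score", "verdict"},
    "controversy_detected": {"risk_score"},
}

SOURCE_SCORE_THRESHOLD = 10  # points; smaller changes are ignored


def fields_to_recompute(event, score_delta=0.0):
    """Return the set of cached fields to recompute for a trigger event."""
    if event == "source_score_changed" and abs(score_delta) <= SOURCE_SCORE_THRESHOLD:
        return set()
    return set(RECOMPUTE.get(event, ()))
```

System improvements are not in the table because they only mark claims as stale; the recompute then happens lazily on the next view or in the nightly batch.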
631 -
632 632  === Database Size Projection ===
633 -
634 634  **Year 1**: 10K claims
635 -
636 636  * Storage: 180 MB
637 637  * Cost: $10/month
638 638  **Year 3**: 100K claims
... ... @@ -646,21 +646,15 @@
646 646  * Cost: $300/month
647 647  * Optimization: Archive old claims to S3 ($5/TB/month)
648 648  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
649 -
650 650  == 3. Key Simplifications ==
651 -
652 652  * **Two content states only**: Published, Hidden
653 653  * **Three user roles only**: Reader, Contributor, Moderator
654 654  * **No complex versioning**: Linear edit history
655 655  * **Reputation-based permissions**: Not role hierarchy
656 656  * **Source track records**: Continuous evaluation
657 -
658 658  == 3. What Gets Stored in the Database ==
659 -
660 660  === 3.1 Primary Storage (PostgreSQL) ===
661 -
662 662  **Claims Table**:
663 -
664 664  * Current state only (latest version)
665 665  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
666 666  **Evidence Table**:
... ... @@ -687,14 +687,11 @@
687 687  **QualityMetric Table**:
688 688  * Time-series quality data
689 689  * Fields: id, metric_type, metric_category, value, target, timestamp
690 -
691 691  === 3.2 What's NOT Stored (Computed on-the-fly) ===
692 -
693 693  * **Verdicts**: Synthesized from evidence + scenarios when requested
694 694  * **Risk scores**: Recalculated based on current factors
695 695  * **Aggregated statistics**: Computed from base data
696 696  * **Search results**: Generated from Elasticsearch index
697 -
698 698  === 3.3 Cache Layer (Redis) ===
699 699  
700 700  {{warning}}
... ... @@ -702,33 +702,24 @@
702 702  {{/warning}}
703 703  
704 704  **Cached for performance (Planned)**:
705 -
706 706  * Frequently accessed claims (TTL: 1 hour)
707 707  * Search results (TTL: 15 minutes)
708 708  * User sessions (TTL: 24 hours)
709 709  * Source track records (TTL: 1 hour)
710 -
711 711  === 3.4 File Storage (S3) ===
712 -
713 713  **Archived content**:
714 -
715 715  * Old edit history (>3 months)
716 716  * Evidence documents (archived copies)
717 717  * Database backups
718 718  * Export files
719 -
720 720  === 3.5 Search Index (Elasticsearch) ===
721 -
722 722  **Indexed for search**:
723 -
724 724  * Claim assertions (full-text)
725 725  * Evidence excerpts (full-text)
726 726  * Scenario descriptions (full-text)
727 727  * Source names (autocomplete)
728 728  Synchronized from PostgreSQL via change data capture or periodic sync.
729 -
730 730  == 4. Related Pages ==
731 -
732 -* [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
608 +* [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
733 733  * [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
734 734  * [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]