Changes for page Data Model

Last modified by Robert Schaub on 2026/02/08 08:32

From version 1.1
edited by Robert Schaub
on 2026/01/20 21:40
Change comment: Imported from XAR
To version 1.2
edited by Robert Schaub
on 2026/02/08 08:30
Change comment: Renamed back-links.

Page properties
Content
... ... @@ -17,26 +17,32 @@
17 17  {{/warning}}
18 18  
19 19  FactHarbor's data model is **simple, focused, and designed for automated processing**.
20 +
20 20  == 1. Core Entities ==
22 +
21 21  === 1.1 Claim ===
24 +
22 22  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
26 +
23 23  ==== Performance Optimization: Denormalized Fields ====
28 +
24 24  **Rationale**: The claims system is 95% reads and 5% writes. Denormalizing commonly read data removes about 70% of the joins on common queries, which speeds up claim list and search pages.
25 25  **Additional cached fields in claims table**:
31 +
26 26  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
27 - * Avoids joining evidence table for listing/preview
28 - * Updated when evidence is added/removed
29 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
33 +* Avoids joining evidence table for listing/preview
34 +* Updated when evidence is added/removed
35 +* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
30 30  * **source_names** (TEXT[]): Array of source names for quick display
31 - * Avoids joining through evidence to sources
32 - * Updated when sources change
33 - * Format: `["New York Times", "Nature Journal", ...]`
37 +* Avoids joining through evidence to sources
38 +* Updated when sources change
39 +* Format: `["New York Times", "Nature Journal", ...]`
34 34  * **scenario_count** (INTEGER): Number of scenarios for this claim
35 - * Quick metric without counting rows
36 - * Updated when scenarios added/removed
41 +* Quick metric without counting rows
42 +* Updated when scenarios added/removed
37 37  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
38 - * Helps invalidate stale caches
39 - * Triggers background refresh if too old
44 +* Helps invalidate stale caches
45 +* Triggers background refresh if too old
40 40  **Update Strategy**:
41 41  * **Immediate**: Update on claim edit (user-facing)
42 42  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -45,13 +45,18 @@
45 45  * ✅ 70% fewer joins on common queries
46 46  * ✅ Much faster claim list/search pages
47 47  * ✅ Better user experience
48 -* ⚠️ Small storage increase (~10%)
54 +* ⚠️ Small storage increase (10%)
49 49  * ⚠️ Need to keep caches in sync
56 +
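The cached fields above can be rebuilt from the normalized data in one pass. The following is a minimal sketch, not the project's implementation: `EvidenceRow`, `ClaimCache`, `refresh_claim_cache`, and `is_stale` are hypothetical names, and the one-hour staleness limit is an assumption borrowed from the hourly deferred update.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EvidenceRow:
    """Hypothetical in-memory stand-in for one row of the evidence table."""
    text: str
    source: str
    relevance: float


@dataclass
class ClaimCache:
    """The four denormalized fields listed above."""
    evidence_summary: list = field(default_factory=list)
    source_names: list = field(default_factory=list)
    scenario_count: int = 0
    cache_updated_at: Optional[datetime] = None


def refresh_claim_cache(evidence, scenario_count, now=None):
    """Rebuild all cached fields from the normalized rows."""
    now = now or datetime.now(timezone.utc)
    # Keep only the five most relevant snippets, highest score first.
    top = sorted(evidence, key=lambda e: e.relevance, reverse=True)[:5]
    summary = [
        {"text": e.text, "source": e.source, "relevance": e.relevance}
        for e in top
    ]
    # Remove duplicate source names while preserving first-seen order.
    names = list(dict.fromkeys(e.source for e in evidence))
    return ClaimCache(summary, names, scenario_count, now)


def is_stale(cache, now, max_age=timedelta(hours=1)):
    """True if the cache was never filled or is older than max_age (assumed)."""
    return cache.cache_updated_at is None or now - cache.cache_updated_at > max_age
```

Calling `refresh_claim_cache` from both the immediate and the deferred update path would keep the two strategies on a single code path.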
50 50  === 1.2 Evidence ===
58 +
51 51  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
60 +
52 52  === 1.3 Source ===
62 +
53 53  **Purpose**: Track reliability of information sources over time
54 54  **Fields**:
65 +
55 55  * **id** (UUID): Unique identifier
56 56  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
57 57  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -74,17 +74,21 @@
74 74  **See**: Source Track Record System documentation for complete details on calculation, updates, and usage
75 75  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
76 76  **Key**: Automated source reliability tracking
88 +
77 77  ==== Source Scoring Process (Separation of Concerns) ====
90 +
78 78  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
79 79  **The Problem**:
93 +
80 80  * Source scores should influence claim verdicts
81 81  * Claim verdicts should update source scores
82 82  * But direct feedback between the two creates a circular dependency and potential feedback loops
83 83  **The Solution**: Temporal separation
98 +
84 84  ==== Weekly Background Job (Source Scoring) ====
100 +
85 85  Runs independently of claim analysis:
86 -{{code language="python"}}
87 -def update_source_scores_weekly():
102 +{{code language="python"}}def update_source_scores_weekly():
88 88   """
89 89   Background job: Calculate source reliability
90 90   Never triggered by individual claim analysis
... ... @@ -104,12 +104,12 @@
104 104   source.last_updated = now()
105 105   source.save()
106 106   # Job runs: Sunday 2 AM UTC
107 - # Never during claim processing
108 -{{/code}}
122 + # Never during claim processing{{/code}}
123 +
109 109  ==== Real-Time Claim Analysis (AKEL) ====
125 +
110 110  Uses source scores but never updates them:
111 -{{code language="python"}}
112 -def analyze_claim(claim_text):
127 +{{code language="python"}}def analyze_claim(claim_text):
113 113   """
114 114   Real-time: Analyze claim using current source scores
115 115   READ source scores, never UPDATE them
... ... @@ -126,10 +126,12 @@
126 126   verdict = synthesize_verdict(evidence_list)
127 127   # NEVER update source scores here
128 128   # That happens in weekly background job
129 - return verdict
130 -{{/code}}
144 + return verdict{{/code}}
145 +
131 131  ==== Monthly Audit (Quality Assurance) ====
147 +
132 132  Moderator review of flagged source scores:
149 +
133 133  * Verify scores make sense
134 134  * Detect gaming attempts
135 135  * Identify systematic biases
... ... @@ -169,6 +169,7 @@
169 169   → NYT score: 0.89 (trending up)
170 170   → Blog X score: 0.48 (trending down)
171 171  ```
189 +
172 172  === 1.4 Scenario ===
173 173  
174 174  {{warning}}
... ... @@ -179,6 +179,7 @@
179 179  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
180 180  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
181 181  **Fields**:
200 +
182 182  * **id** (UUID): Unique identifier
183 183  * **claim_id** (UUID): Foreign key to claim (one-to-many)
184 184  * **description** (text): Human-readable description of the scenario
... ... @@ -199,6 +199,7 @@
199 199  **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
200 200  
201 201  **Core Fields**:
221 +
202 202  * **id** (UUID): Primary key
203 203  * **scenario_id** (UUID FK): The scenario being assessed
204 204  * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
... ... @@ -214,6 +214,7 @@
214 214  
215 215  **Example**:
216 216  For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
237 +
217 217  * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
218 218  * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
219 219  * Edit entity records the complete before/after change with timestamp and reason
... ... @@ -221,12 +221,18 @@
221 221  **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
222 222  
223 223  === 1.6 User ===
245 +
224 224  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
225 -=== User Reputation System ==
247 +
248 +=== User Reputation System ===
249 +
226 226  **V1.0 Approach**: Simple manual role assignment
227 227  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
252 +
228 228  === Roles (Manual Assignment) ===
254 +
229 229  **reader** (default):
256 +
230 230  * View published claims and evidence
231 231  * Browse and search content
232 232  * No editing permissions
... ... @@ -245,8 +245,11 @@
245 245  * System configuration
246 246  * Access to all features
247 247  * Founder-appointed initially
275 +
248 248  === Contribution Tracking (Simple) ===
277 +
249 249  **Basic metrics only**:
279 +
250 250  * `contributions_count`: Total number of contributions
251 251  * `created_at`: Account age
252 252  * `last_active`: Recent activity
... ... @@ -255,19 +255,26 @@
255 255  * No automated privilege escalation
256 256  * No reputation decay
257 257  * No threshold-based promotions
288 +
258 258  === Promotion Process ===
290 +
259 259  **Manual review by moderators/admins**:
292 +
260 260  1. User demonstrates value through contributions
261 261  2. Moderator reviews user's contribution history
262 262  3. Moderator promotes user to contributor role
263 263  4. Admin promotes trusted contributors to moderator
264 264  **Criteria** (guidelines, not automated):
298 +
265 265  * Quality of contributions
266 266  * Consistency over time
267 267  * Collaborative behavior
268 268  * Understanding of project goals
303 +
269 269  === V2.0+ Evolution ===
305 +
270 270  **Add complex reputation when**:
307 +
271 271  * 100+ active contributors
272 272  * Manual role management becomes bottleneck
273 273  * Clear patterns of abuse emerge requiring automation
... ... @@ -277,11 +277,16 @@
277 277  * Reputation decay for inactive users
278 278  * Track record scoring for contributors
279 279  See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
317 +
280 280  === 1.7 Edit ===
319 +
281 281  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
282 282  **Purpose**: Complete audit trail for all content changes
322 +
283 283  === Edit History Details ===
324 +
284 284  **What Gets Edited**:
326 +
285 285  * **Claims** (20% edited): assertion, domain, status, scores, analysis
286 286  * **Evidence** (10% edited): excerpt, relevance_score, supports
287 287  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -298,12 +298,14 @@
298 298  * `MODERATION_ACTION`: Hide/unhide for abuse
299 299  * `REVERT`: Rollback to previous version
300 300  **Retention Policy** (5 years total):
343 +
301 301  1. **Hot storage** (3 months): PostgreSQL, instant access
302 302  2. **Warm storage** (2 years): Partitioned, slower queries
303 303  3. **Cold storage** (3 years): S3 compressed, download required
304 304  4. **Deletion**: After 5 years (except legal holds)
305 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
348 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
306 306  **Use Cases**:
350 +
307 307  * View claim history timeline
308 308  * Detect vandalism patterns
309 309  * Learn from user corrections (system improvement)
... ... @@ -310,12 +310,17 @@
310 310  * Legal compliance (audit trail)
311 311  * Rollback capability
312 312  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
357 +
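The retention tiers and the storage estimate above can be expressed as a small sketch. The tier boundaries are assumptions: the page lists 3 months, 2 years, and 3 years per tier but a 5-year total, so the cold tier is capped at the 5-year deletion mark here. `retention_tier` and `edit_storage_bytes` are hypothetical names.

```python
from datetime import timedelta

# Assumed cumulative boundaries (see the note above about the 5-year cap).
HOT_LIMIT = timedelta(days=90)                      # about 3 months
WARM_LIMIT = HOT_LIMIT + timedelta(days=2 * 365)    # 2 further years
DELETE_AFTER = timedelta(days=5 * 365)              # 5 years total


def retention_tier(age, legal_hold=False):
    """Return the storage tier for an edit record of the given age."""
    if age < HOT_LIMIT:
        return "hot"       # PostgreSQL, instant access
    if age < WARM_LIMIT:
        return "warm"      # partitioned, slower queries
    if age < DELETE_AFTER or legal_hold:
        return "cold"      # S3 compressed; legal holds are never deleted
    return "delete"


def edit_storage_bytes(claims, edited_share=0.20, bytes_per_edit=2_000):
    """Estimate edit-history size: claims x share edited x bytes per edit."""
    return round(claims * edited_share * bytes_per_edit)
```

With the page's figures, `edit_storage_bytes(1_000_000)` gives 400,000,000 bytes, which matches the 400 MB estimate when 1 KB is treated as 1,000 bytes.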
313 313  === 1.8 Flag ===
359 +
314 314  Fields: entity_id, reported_by, issue_type, status, resolution_note
361 +
315 315  === 1.9 QualityMetric ===
363 +
316 316  **Fields**: metric_type, category, value, target, timestamp
317 317  **Purpose**: Time-series quality tracking
318 318  **Usage**:
367 +
319 319  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
320 320  * **Quality dashboard**: Real-time display with trend charts
321 321  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -322,10 +322,13 @@
322 322  * **A/B testing**: Compare control vs treatment metrics
323 323  * **Improvement validation**: Measure before/after changes
324 324  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
374 +
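As an illustration of the alerting use case, the sketch below models the example record and checks it against its target. The `QualityMetric` dataclass mirrors the listed fields; `breaches_target` and its `higher_is_worse` flag are assumptions, since the page does not say how the direction of a threshold is encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def breaches_target(metric, higher_is_worse=True):
    """True if the metric is on the wrong side of its target.

    higher_is_worse is an assumed flag: error rates alert when they rise
    above target, while scores such as confidence alert when they fall below.
    """
    if higher_is_worse:
        return metric.value > metric.target
    return metric.value < metric.target


# The example record from the page: 0.12 observed against a 0.10 target.
metric = QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17")
should_alert = breaches_target(metric)
```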
325 325  === 1.10 ErrorPattern ===
376 +
326 326  **Fields**: error_category, claim_id, description, root_cause, frequency, status
327 327  **Purpose**: Capture errors to trigger system improvements
328 328  **Usage**:
380 +
329 329  * **Error capture**: When users flag issues or system detects problems
330 330  * **Pattern analysis**: Weekly grouping by category and frequency
331 331  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -337,9 +337,13 @@
337 337  {{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
338 338  
339 339  == 1.5 User Class Diagram ==
392 +
340 340  {{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
394 +
341 341  == 2. Versioning Strategy ==
396 +
342 342  **All Content Entities Are Versioned**:
398 +
343 343  * **Claim**: Every edit creates new version (V1→V2→V3...)
344 344  * **Evidence**: Changes tracked in edit history
345 345  * **Scenario**: Modifications versioned
... ... @@ -360,68 +360,91 @@
360 360  Claim V2: "The sky is blue during daytime"
361 361   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
362 362  ```
419 +
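The versioning flow in the example above can be sketched as follows. This is an illustrative model only: `Claim` and `Edit` carry a subset of the documented fields, `apply_edit` and `revert_last_edit` are hypothetical helpers, and the `"CONTENT_EDIT"` edit type used in the usage note is a placeholder (only `REVERT` and `MODERATION_ACTION` appear in the visible part of this page).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Claim:
    id: str
    assertion: str
    version: int = 1


@dataclass
class Edit:
    entity_type: str
    entity_id: str
    user_id: str
    before_state: dict
    after_state: dict
    edit_type: str
    reason: str
    created_at: datetime


def apply_edit(claim, new_assertion, user_id, edit_type, reason, history):
    """Change the claim, bump its version, and append an audit record."""
    before = {"assertion": claim.assertion, "version": claim.version}
    claim.assertion = new_assertion
    claim.version += 1
    after = {"assertion": claim.assertion, "version": claim.version}
    edit = Edit("claim", claim.id, user_id, before, after,
                edit_type, reason, datetime.now(timezone.utc))
    history.append(edit)
    return edit


def revert_last_edit(claim, user_id, history):
    """Roll back by writing the previous text as a new version (linear history)."""
    previous = history[-1].before_state["assertion"]
    return apply_edit(claim, previous, user_id, "REVERT",
                      "Rollback to previous version", history)
```

For example, `apply_edit(claim, "The sky is blue during daytime", "u1", "CONTENT_EDIT", "Added context", history)` moves the claim from V1 to V2 and records both states; a later revert produces V3 rather than deleting V2.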
363 363  == 2.5. Storage vs Computation Strategy ==
421 +
364 364  **Critical architectural decision**: What to persist in databases vs compute dynamically?
365 365  **Trade-off**:
424 +
366 366  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
367 367  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
427 +
368 368  === Recommendation: Hybrid Approach ===
429 +
369 369  **STORE (in PostgreSQL):**
431 +
370 370  ==== Claims (Current State + History) ====
433 +
371 371  * **What**: assertion, domain, status, created_at, updated_at, version
372 372  * **Why**: Core entity, must be persistent
373 373  * **Also store**: confidence_score (computed once, then cached)
374 -* **Size**: ~1 KB per claim
437 +* **Size**: 1 KB per claim
375 375  * **Growth**: Linear with claims
376 376  * **Decision**: ✅ STORE - Essential
440 +
377 377  ==== Evidence (All Records) ====
442 +
378 378  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
379 379  * **Why**: Hard to re-gather, user contributions, reproducibility
380 -* **Size**: ~2 KB per evidence (with excerpt)
445 +* **Size**: 2 KB per evidence (with excerpt)
381 381  * **Growth**: 3-10 evidence per claim
382 382  * **Decision**: ✅ STORE - Essential for reproducibility
448 +
383 383  ==== Sources (Track Records) ====
450 +
384 384  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
385 385  * **Why**: Continuously updated, expensive to recompute
386 -* **Size**: ~500 bytes per source
453 +* **Size**: 500 bytes per source
387 387  * **Growth**: Slow (limited number of sources)
388 388  * **Decision**: ✅ STORE - Essential for quality
456 +
389 389  ==== Edit History (All Versions) ====
458 +
390 390  * **What**: before_state, after_state, user_id, reason, timestamp
391 391  * **Why**: Audit trail, legal requirement, reproducibility
392 -* **Size**: ~2 KB per edit
393 -* **Growth**: Linear with edits (~A portion of claims get edited)
461 +* **Size**: 2 KB per edit
462 +* **Growth**: Linear with edits (a portion of claims get edited; the Summary Table assumes 20%)
394 394  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
395 395  * **Decision**: ✅ STORE - Essential for accountability
465 +
396 396  ==== Flags (User Reports) ====
467 +
397 397  * **What**: entity_id, reported_by, issue_type, description, status
398 398  * **Why**: Error detection, system improvement triggers
399 -* **Size**: ~500 bytes per flag
470 +* **Size**: 500 bytes per flag
400 400  * **Growth**: A share of claims get flagged (the Summary Table assumes 10%)
401 401  * **Decision**: ✅ STORE - Essential for improvement
473 +
402 402  ==== ErrorPatterns (System Improvement) ====
475 +
403 403  * **What**: error_category, claim_id, description, root_cause, frequency, status
404 404  * **Why**: Learning loop, prevent recurring errors
405 -* **Size**: ~1 KB per pattern
478 +* **Size**: 1 KB per pattern
406 406  * **Growth**: Slow (limited patterns, many fixed)
407 407  * **Decision**: ✅ STORE - Essential for learning
481 +
408 408  ==== QualityMetrics (Time Series) ====
483 +
409 409  * **What**: metric_type, category, value, target, timestamp
410 410  * **Why**: Trend analysis, cannot recreate historical metrics
411 -* **Size**: ~200 bytes per metric
486 +* **Size**: 200 bytes per metric
412 412  * **Growth**: Hourly = 8,760 per year per metric type
413 413  * **Retention**: 2 years hot, then aggregate and archive
414 414  * **Decision**: ✅ STORE - Essential for monitoring
415 415  **STORE (Computed Once, Then Cached):**
491 +
416 416  ==== Analysis Summary ====
493 +
417 417  * **What**: Neutral text summary of claim analysis (200-500 words)
418 418  * **Computed**: Once by AKEL when claim first analyzed
419 419  * **Stored in**: Claim table (text field)
420 420  * **Recomputed**: Only when system significantly improves OR claim edited
421 421  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
422 -* **Size**: ~2 KB per claim
499 +* **Size**: 2 KB per claim
423 423  * **Decision**: ✅ STORE (cached) - Cost-effective
501 +
424 424  ==== Confidence Score ====
503 +
425 425  * **What**: 0-100 score of analysis confidence
426 426  * **Computed**: Once by AKEL
427 427  * **Stored in**: Claim table (integer field)
... ... @@ -429,7 +429,9 @@
429 429  * **Why store**: Cheap to store, expensive to compute, users need it fast
430 430  * **Size**: 4 bytes per claim
431 431  * **Decision**: ✅ STORE (cached) - Performance critical
511 +
432 432  ==== Risk Score ====
513 +
433 433  * **What**: 0-100 score of claim risk level
434 434  * **Computed**: Once by AKEL
435 435  * **Stored in**: Claim table (integer field)
... ... @@ -438,13 +438,17 @@
438 438  * **Size**: 4 bytes per claim
439 439  * **Decision**: ✅ STORE (cached) - Performance critical
440 440  **COMPUTE DYNAMICALLY (Never Store):**
441 -==== Scenarios ==== ⚠️ CRITICAL DECISION
522 +
523 +==== Scenarios ====
524 +
525 + ⚠️ CRITICAL DECISION
526 +
442 442  * **What**: 2-5 possible interpretations of claim with assumptions
443 443  * **Current design**: Stored in Scenario table
444 444  * **Alternative**: Compute on-demand when user views claim details
445 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
530 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
446 446  * **Compute cost**: $0.005-0.01 per request (LLM API call)
447 -* **Frequency**: Viewed in detail by ~20% of users
532 +* **Frequency**: Viewed in detail by 20% of users
448 448  * **Trade-off analysis**:
449 449   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
450 450   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -452,12 +452,17 @@
452 452  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
453 453  * **Decision**: ✅ STORE (hybrid approach below)
454 454  **Scenario Strategy** (APPROVED):
540 +
455 455  1. **Store scenarios** initially when claim analyzed
456 456  2. **Mark as stale** when system improves significantly
457 457  3. **Recompute on next view** if marked stale
458 458  4. **Cache for 30 days** if frequently accessed
459 459  5. **Result**: Best of both worlds - speed + freshness
460 -==== Verdict Synthesis ====
546 +
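Steps 1 to 3 of the scenario strategy (store, mark stale, recompute on next view) can be sketched with an in-memory stand-in for the Scenario table. `ScenarioStore` and its methods are hypothetical names, the `compute` callback stands in for the LLM extraction call, and the 30-day cache from step 4 is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioStore:
    """Hypothetical in-memory stand-in for the Scenario table."""
    scenarios: dict = field(default_factory=dict)  # claim_id -> list of scenarios
    stale: set = field(default_factory=set)        # claim_ids needing recompute

    def save(self, claim_id, scenarios):
        """Step 1: store scenarios when the claim is analyzed."""
        self.scenarios[claim_id] = scenarios
        self.stale.discard(claim_id)

    def mark_all_stale(self):
        """Step 2: flag every stored claim after a significant system improvement."""
        self.stale.update(self.scenarios)

    def get(self, claim_id, compute):
        """Step 3: return stored scenarios, recomputing only if missing or stale."""
        if claim_id not in self.scenarios or claim_id in self.stale:
            self.save(claim_id, compute(claim_id))
        return self.scenarios[claim_id]
```

The expensive `compute` call then runs at most once per claim between system improvements, which is where the cost advantage over computing on every view comes from.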
547 +==== Verdict Synthesis ====
548 +
549 +
550 +
461 461  * **What**: Final conclusion text synthesizing all scenarios
462 462  * **Compute cost**: $0.002-0.005 per request
463 463  * **Frequency**: Every time claim viewed
... ... @@ -465,17 +465,23 @@
465 465  * **Speed**: 2-3 seconds (acceptable)
466 466  * **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
467 467  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
558 +
468 468  ==== Search Results ====
560 +
469 469  * **What**: Lists of claims matching search query
470 470  * **Compute from**: Elasticsearch index
471 471  * **Cache**: 15 minutes in Redis for popular queries
472 472  * **Why not store permanently**: Constantly changing, infinite possible queries
565 +
473 473  ==== Aggregated Statistics ====
567 +
474 474  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
475 475  * **Compute from**: Database queries
476 476  * **Cache**: 1 hour in Redis
477 477  * **Why not store**: Can be derived, relatively cheap to compute
572 +
478 478  ==== User Reputation ====
574 +
479 479  * **What**: Score based on contributions
480 480  * **Current design**: Stored in User table
481 481  * **Alternative**: Compute from Edit table
... ... @@ -485,37 +485,43 @@
485 485  * **Frequency**: Read on every user action
486 486  * **Compute cost**: Simple COUNT query, milliseconds
487 487  * **Decision**: ✅ STORE - Performance critical, read-heavy
584 +
488 488  === Summary Table ===
489 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
490 -|-----------|---------|---------|----------------|----------|-----------|
491 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
492 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
493 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
494 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
495 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
496 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
497 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
498 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
499 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
500 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
501 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
502 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
503 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
586 +
587 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\
588 +|-----|-|-||----|-----|\\
589 +| Claim core | ✅ | - | 1 KB | STORE | Essential |\\
590 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\
591 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\
592 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\
593 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\
594 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
595 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
596 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\
597 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\
598 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\
599 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\
600 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\
601 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\
504 504  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
505 -**Total storage per claim**: ~18 KB (without edits and flags)
603 +**Total storage per claim**: 18 KB (without edits and flags)
506 506  **For 1 million claims**:
507 -* **Storage**: ~18 GB (manageable)
508 -* **PostgreSQL**: ~$50/month (standard instance)
509 -* **Redis cache**: ~$20/month (1 GB instance)
510 -* **S3 archives**: ~$5/month (old edits)
511 -* **Total**: ~$75/month infrastructure
605 +
606 +* **Storage**: 18 GB (manageable)
607 +* **PostgreSQL**: $50/month (standard instance)
608 +* **Redis cache**: $20/month (1 GB instance)
609 +* **S3 archives**: $5/month (old edits)
610 +* **Total**: $75/month infrastructure
512 512  **LLM cost savings by caching**:
513 513  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
514 514  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
515 515  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
516 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
615 +* **Total savings**: $35K per 1M claims vs recomputing every time
616 +
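The totals above follow from simple arithmetic, reproduced here as a check. All figures come from this page; the variable names are illustrative, 1 KB is treated as 1,000 bytes, and `Decimal` is used only to avoid floating-point rounding on the dollar amounts.

```python
from decimal import Decimal

CLAIMS = 1_000_000

# Storage: 18 KB per claim, converted from KB to GB (decimal units).
storage_gb = CLAIMS * 18 / 1_000_000

# Monthly infrastructure cost in dollars, per component.
infrastructure = {"postgresql": 50, "redis_cache": 20, "s3_archives": 5}
infrastructure_total = sum(infrastructure.values())

# LLM cost avoided by caching, in dollars per 1M claims.
savings = {
    "analysis_summary": Decimal("0.03") * CLAIMS,
    "scenarios": Decimal("0.01") * Decimal("0.20") * CLAIMS,  # 20% of claims viewed
    "verdict": Decimal("0.003") * CLAIMS,
}
total_savings = sum(savings.values())
```

Here `storage_gb` is 18, `infrastructure_total` is 75, and `total_savings` is 35,000, in line with the figures listed above.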
517 517  === Recomputation Triggers ===
618 +
518 518  **When to mark cached data as stale and recompute:**
620 +
519 519  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
520 520  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
521 521  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -522,11 +522,15 @@
522 522  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
523 523  5. **Controversy detected** (high flag rate) → Recompute: risk score
524 524  **Recomputation strategy**:
627 +
525 525  * **Eager**: Immediately recompute (for user edits)
526 526  * **Lazy**: Recompute on next view (for system improvements)
527 527  * **Batch**: Nightly re-evaluation of stale claims (if fewer than 1,000 are stale)
631 +
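The triggers and strategies above can be captured in a lookup table. This is a sketch under stated assumptions: the trigger keys and function names are hypothetical, a system improvement is assumed to invalidate all cached fields (the page only says affected claims are marked stale), and triggers without a stated strategy are reported as `"unspecified"` rather than guessed.

```python
ALL_FIELDS = frozenset(
    {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"}
)

# Trigger -> cached fields to recompute, following the numbered list above.
RECOMPUTE = {
    "claim_edited": ALL_FIELDS,
    "evidence_added": frozenset({"scenarios", "verdict", "confidence_score"}),
    "source_score_changed": frozenset({"confidence_score", "verdict"}),
    "system_improved": ALL_FIELDS,  # assumption: everything may be affected
    "controversy_detected": frozenset({"risk_score"}),
}

# Only the two strategies the page assigns explicitly.
STRATEGY = {"claim_edited": "eager", "system_improved": "lazy"}


def source_change_is_trigger(old_score, new_score, threshold=10):
    """True if a source track record moved by more than `threshold` points."""
    return abs(new_score - old_score) > threshold


def plan_recompute(trigger):
    """Return (fields to recompute, strategy) for a trigger key."""
    return RECOMPUTE[trigger], STRATEGY.get(trigger, "unspecified")
```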
528 528  === Database Size Projection ===
633 +
529 529  **Year 1**: 10K claims
635 +
530 530  * Storage: 180 MB
531 531  * Cost: $10/month
532 532  **Year 3**: 100K claims
... ... @@ -540,15 +540,21 @@
540 540  * Cost: $300/month
541 541  * Optimization: Archive old claims to S3 ($5/TB/month)
542 542  **Conclusion**: Storage costs are manageable, and the LLM cost savings from caching are substantial.
649 +
543 543  == 3. Key Simplifications ==
651 +
544 544  * **Two content states only**: Published, Hidden
545 545  * **Three user roles only**: Reader, Contributor, Moderator
546 546  * **No complex versioning**: Linear edit history
547 547  * **Reputation-based permissions**: Not role hierarchy
548 548  * **Source track records**: Continuous evaluation
657 +
549 549  == 3. What Gets Stored in the Database ==
659 +
550 550  === 3.1 Primary Storage (PostgreSQL) ===
661 +
551 551  **Claims Table**:
663 +
552 552  * Current state only (latest version)
553 553  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
554 554  **Evidence Table**:
... ... @@ -575,11 +575,14 @@
575 575  **QualityMetric Table**:
576 576  * Time-series quality data
577 577  * Fields: id, metric_type, metric_category, value, target, timestamp
690 +
578 578  === 3.2 What's NOT Stored (Computed on-the-fly) ===
692 +
579 579  * **Verdicts**: Synthesized from evidence + scenarios when requested
580 580  * **Risk scores**: Recalculated based on current factors
581 581  * **Aggregated statistics**: Computed from base data
582 582  * **Search results**: Generated from Elasticsearch index
697 +
583 583  === 3.3 Cache Layer (Redis) ===
584 584  
585 585  {{warning}}
... ... @@ -587,24 +587,33 @@
587 587  {{/warning}}
588 588  
589 589  **Cached for performance (Planned)**:
705 +
590 590  * Frequently accessed claims (TTL: 1 hour)
591 591  * Search results (TTL: 15 minutes)
592 592  * User sessions (TTL: 24 hours)
593 593  * Source track records (TTL: 1 hour)
710 +
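Since the Redis layer is still planned, the TTL policy above can be prototyped without a Redis client. The sketch below is an in-memory stand-in, not a Redis integration: `TTL_SECONDS`, `TTLCache`, and the `kind` keys are hypothetical names, and the clock is injectable so expiry can be tested deterministically.

```python
import time

# TTLs from the list above, in seconds.
TTL_SECONDS = {
    "claim": 60 * 60,         # frequently accessed claims: 1 hour
    "search": 15 * 60,        # search results: 15 minutes
    "session": 24 * 60 * 60,  # user sessions: 24 hours
    "source": 60 * 60,        # source track records: 1 hour
}


class TTLCache:
    """Minimal in-memory cache that expires entries by kind-specific TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # (kind, key) -> (value, expiry time)

    def set(self, kind, key, value):
        self._items[(kind, key)] = (value, self._clock() + TTL_SECONDS[kind])

    def get(self, kind, key):
        entry = self._items.get((kind, key))
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller falls back to the database.
            del self._items[(kind, key)]
            return None
        return value
```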
594 594  === 3.4 File Storage (S3) ===
712 +
595 595  **Archived content**:
714 +
596 596  * Old edit history (>3 months)
597 597  * Evidence documents (archived copies)
598 598  * Database backups
599 599  * Export files
719 +
600 600  === 3.5 Search Index (Elasticsearch) ===
721 +
601 601  **Indexed for search**:
723 +
602 602  * Claim assertions (full-text)
603 603  * Evidence excerpts (full-text)
604 604  * Scenario descriptions (full-text)
605 605  * Source names (autocomplete)
606 606  Synchronized from PostgreSQL via change data capture or periodic sync.
729 +
607 607  == 4. Related Pages ==
608 -* [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
731 +
732 +* [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
609 609  * [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
610 610  * [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]