Changes for page Data Model

Last modified by Robert Schaub on 2025/12/24 11:48

From version 1.1
edited by Robert Schaub
on 2025/12/24 10:21
Change comment: Imported from XAR
To version 1.2
edited by Robert Schaub
on 2025/12/24 11:47
Change comment: Renamed back-links.

Summary

Details

Page properties
Content
... ... @@ -1,25 +1,32 @@
1 1  = Data Model =
2 +
2 2  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 +
3 3  == 1. Core Entities ==
6 +
4 4  === 1.1 Claim ===
8 +
5 5  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 +
6 6  ==== Performance Optimization: Denormalized Fields ====
12 +
7 7  **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 8  **Additional cached fields in claims table**:
15 +
9 9  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
10 - * Avoids joining evidence table for listing/preview
11 - * Updated when evidence is added/removed
12 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 + * Avoids joining evidence table for listing/preview
18 + * Updated when evidence is added/removed
19 + * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
13 13  * **source_names** (TEXT[]): Array of source names for quick display
14 - * Avoids joining through evidence to sources
15 - * Updated when sources change
16 - * Format: `["New York Times", "Nature Journal", ...]`
21 + * Avoids joining through evidence to sources
22 + * Updated when sources change
23 + * Format: `["New York Times", "Nature Journal", ...]`
17 17  * **scenario_count** (INTEGER): Number of scenarios for this claim
18 - * Quick metric without counting rows
19 - * Updated when scenarios added/removed
25 + * Quick metric without counting rows
26 + * Updated when scenarios added/removed
20 20  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
21 - * Helps invalidate stale caches
22 - * Triggers background refresh if too old
28 + * Helps invalidate stale caches
29 + * Triggers background refresh if too old
23 23  **Update Strategy**:
24 24  * **Immediate**: Update on claim edit (user-facing)
25 25  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -28,13 +28,18 @@
28 28  * ✅ 70% fewer joins on common queries
29 29  * ✅ Much faster claim list/search pages
30 30  * ✅ Better user experience
31 -* ⚠️ Small storage increase (~10%)
38 +* ⚠️ Small storage increase (~10%)
32 32  * ⚠️ Need to keep caches in sync
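
The update strategy above can be made concrete with a short sketch. This is pseudocode in the same style as the scoring examples later on this page; all helper names (`get_claim`, `get_evidence_for_claim`, `count_scenarios`) are illustrative, not an existing API:

{{code language="python"}}
def refresh_claim_cache(claim_id):
    """Rebuild the denormalized fields for one claim."""
    claim = get_claim(claim_id)
    evidence = get_evidence_for_claim(claim_id)
    # evidence_summary (JSONB): top 5 snippets by relevance
    top5 = sorted(evidence, key=lambda e: e.relevance_score, reverse=True)[:5]
    claim.evidence_summary = [
        {"text": e.excerpt, "source": e.source_name, "relevance": e.relevance_score}
        for e in top5
    ]
    # source_names (TEXT[]): distinct names for quick display
    claim.source_names = sorted({e.source_name for e in evidence})
    # scenario_count (INTEGER): avoids COUNT(*) at read time
    claim.scenario_count = count_scenarios(claim_id)
    claim.cache_updated_at = now()
    claim.save()
    # Immediate path: run synchronously after a user edit.
    # Deferred path: the hourly background job re-runs this for
    # claims whose cache_updated_at is older than the threshold.
{{/code}}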
40 +
33 33  === 1.2 Evidence ===
42 +
34 34  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
44 +
35 35  === 1.3 Source ===
46 +
36 36  **Purpose**: Track reliability of information sources over time
37 37  **Fields**:
49 +
38 38  * **id** (UUID): Unique identifier
39 39  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
40 40  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -52,17 +52,21 @@
52 52  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
53 53  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
54 54  **Key**: Automated source reliability tracking
67 +
55 55  ==== Source Scoring Process (Separation of Concerns) ====
69 +
56 56  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
57 57  **The Problem**:
72 +
58 58  * Source scores should influence claim verdicts
59 59  * Claim verdicts should update source scores
60 60  * But: Direct feedback creates circular dependency and potential feedback loops
61 61  **The Solution**: Temporal separation
77 +
62 62  ==== Weekly Background Job (Source Scoring) ====
79 +
63 63  Runs independently of claim analysis:
64 -{{code language="python"}}
65 -def update_source_scores_weekly():
81 +{{code language="python"}}
def update_source_scores_weekly():
66 66   """
67 67   Background job: Calculate source reliability
68 68   Never triggered by individual claim analysis
... ... @@ -82,12 +82,12 @@
82 82   source.last_updated = now()
83 83   source.save()
84 84   # Job runs: Sunday 2 AM UTC
85 - # Never during claim processing
86 -{{/code}}
101 + # Never during claim processing
{{/code}}
102 +
87 87  ==== Real-Time Claim Analysis (AKEL) ====
104 +
88 88  Uses source scores but never updates them:
89 -{{code language="python"}}
90 -def analyze_claim(claim_text):
106 +{{code language="python"}}
def analyze_claim(claim_text):
91 91   """
92 92   Real-time: Analyze claim using current source scores
93 93   READ source scores, never UPDATE them
... ... @@ -104,10 +104,12 @@
104 104   verdict = synthesize_verdict(evidence_list)
105 105   # NEVER update source scores here
106 106   # That happens in weekly background job
107 - return verdict
108 -{{/code}}
123 + return verdict
{{/code}}
124 +
109 109  ==== Monthly Audit (Quality Assurance) ====
126 +
110 110  Moderator review of flagged source scores:
128 +
111 111  * Verify scores make sense
112 112  * Detect gaming attempts
113 113  * Identify systematic biases
... ... @@ -147,11 +147,14 @@
147 147   → NYT score: 0.89 (trending up)
148 148   → Blog X score: 0.48 (trending down)
149 149  ```
168 +
150 150  === 1.4 Scenario ===
170 +
151 151  **Purpose**: Different interpretations or contexts for evaluating claims
152 152  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
153 153  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
154 154  **Fields**:
175 +
155 155  * **id** (UUID): Unique identifier
156 156  * **claim_id** (UUID): Foreign key to claim (one-to-many)
157 157  * **description** (text): Human-readable description of the scenario
... ... @@ -172,6 +172,7 @@
172 172  **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
173 173  
174 174  **Core Fields**:
196 +
175 175  * **id** (UUID): Primary key
176 176  * **scenario_id** (UUID FK): The scenario being assessed
177 177  * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
... ... @@ -187,6 +187,7 @@
187 187  
188 188  **Example**:
189 189  For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
212 +
190 190  * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
191 191  * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
192 192  * Edit entity records the complete before/after change with timestamp and reason
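
As a sketch, the Edit record for the second step above might look as follows. The field names come from the Edit entity in section 1.7; the ids, the `EVIDENCE_UPDATE` type name, and the timestamp are purely illustrative:

{{code language="python"}}
edit = {
    "entity_type": "verdict",
    "entity_id": "<verdict uuid>",
    "user_id": None,  # system-initiated change (new evidence arrived)
    "before_state": {
        "likelihood_range": "0.40-0.65 (uncertain)",
        "uncertainty_factors": ["Small sample sizes", "Short-term studies only"],
    },
    "after_state": {
        "likelihood_range": "0.70-0.85 (likely true)",
        "uncertainty_factors": ["Lifestyle confounds remain"],
    },
    "edit_type": "EVIDENCE_UPDATE",  # illustrative type name
    "reason": "New evidence added to scenario",
    "created_at": "<timestamp>",
}
{{/code}}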
... ... @@ -194,12 +194,18 @@
194 194  **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
195 195  
196 196  === 1.6 User ===
220 +
197 197  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
198 -=== User Reputation System ==
222 +
223 +=== User Reputation System ===
224 +
199 199  **V1.0 Approach**: Simple manual role assignment
200 200  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
227 +
201 201  === Roles (Manual Assignment) ===
229 +
202 202  **reader** (default):
231 +
203 203  * View published claims and evidence
204 204  * Browse and search content
205 205  * No editing permissions
... ... @@ -218,8 +218,11 @@
218 218  * System configuration
219 219  * Access to all features
220 220  * Founder-appointed initially
250 +
221 221  === Contribution Tracking (Simple) ===
252 +
222 222  **Basic metrics only**:
254 +
223 223  * `contributions_count`: Total number of contributions
224 224  * `created_at`: Account age
225 225  * `last_active`: Recent activity
... ... @@ -228,19 +228,26 @@
228 228  * No automated privilege escalation
229 229  * No reputation decay
230 230  * No threshold-based promotions
263 +
231 231  === Promotion Process ===
265 +
232 232  **Manual review by moderators/admins**:
267 +
233 233  1. User demonstrates value through contributions
234 234  2. Moderator reviews user's contribution history
235 235  3. Moderator promotes user to contributor role
236 236  4. Admin promotes trusted contributors to moderator
237 237  **Criteria** (guidelines, not automated):
273 +
238 238  * Quality of contributions
239 239  * Consistency over time
240 240  * Collaborative behavior
241 241  * Understanding of project goals
278 +
242 242  === V2.0+ Evolution ===
280 +
243 243  **Add complex reputation when**:
282 +
244 244  * 100+ active contributors
245 245  * Manual role management becomes bottleneck
246 246  * Clear patterns of abuse emerge requiring automation
... ... @@ -250,11 +250,16 @@
250 250  * Reputation decay for inactive users
251 251  * Track record scoring for contributors
252 252  See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
292 +
253 253  === 1.7 Edit ===
294 +
254 254  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
255 255  **Purpose**: Complete audit trail for all content changes
297 +
256 256  === Edit History Details ===
299 +
257 257  **What Gets Edited**:
301 +
258 258  * **Claims** (20% edited): assertion, domain, status, scores, analysis
259 259  * **Evidence** (10% edited): excerpt, relevance_score, supports
260 260  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -271,12 +271,14 @@
271 271  * `MODERATION_ACTION`: Hide/unhide for abuse
272 272  * `REVERT`: Rollback to previous version
273 273  **Retention Policy** (5 years total):
318 +
274 274  1. **Hot storage** (3 months): PostgreSQL, instant access
275 275  2. **Warm storage** (2 years): Partitioned, slower queries
276 276  3. **Cold storage** (3 years): S3 compressed, download required
277 277  4. **Deletion**: After 5 years (except legal holds)
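
A minimal sketch of an age-based tiering job for this policy (the mover helpers and the `legal_hold` flag are assumptions, not an existing API):

{{code language="python"}}
from datetime import timedelta

HOT = timedelta(days=90)              # 1. PostgreSQL, instant access
WARM = HOT + timedelta(days=2 * 365)  # 2. partitioned, slower queries
TOTAL = timedelta(days=5 * 365)       # 4. delete after 5 years

def tier_edit(edit, age):
    if edit.legal_hold:
        return                        # legal holds are never deleted
    if age > TOTAL:
        delete_edit(edit)
    elif age > WARM:
        archive_to_s3(edit)           # 3. compressed, download required
    elif age > HOT:
        move_to_warm_partition(edit)
    # else: stays in hot storage
{{/code}}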
278 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
323 +**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
279 279  **Use Cases**:
325 +
280 280  * View claim history timeline
281 281  * Detect vandalism patterns
282 282  * Learn from user corrections (system improvement)
... ... @@ -283,12 +283,17 @@
283 283  * Legal compliance (audit trail)
284 284  * Rollback capability
285 285  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
332 +
286 286  === 1.8 Flag ===
334 +
287 287  Fields: entity_id, reported_by, issue_type, status, resolution_note
288 -=== 1.9 QualityMetric ===
336 +
337 +=== 1.9 QualityMetric ===
338 +
289 289  **Fields**: metric_type, category, value, target, timestamp
290 290  **Purpose**: Time-series quality tracking
291 291  **Usage**:
342 +
292 292  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
293 293  * **Quality dashboard**: Real-time display with trend charts
294 294  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -295,10 +295,13 @@
295 295  * **A/B testing**: Compare control vs treatment metrics
296 296  * **Improvement validation**: Measure before/after changes
297 297  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
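
A minimal sketch of the alerting check, reusing the example metric above (`alert_ops` is an assumed notification helper; note that for error-rate style metrics, lower is better):

{{code language="python"}}
def check_metric(metric):
    """Alert when a quality metric exceeds its target."""
    if metric["value"] > metric["target"]:
        alert_ops(
            f"{metric['type']} in {metric['category']}: "
            f"{metric['value']:.2f} exceeds target {metric['target']:.2f}"
        )

check_metric({"type": "ErrorRate", "category": "Politics",
              "value": 0.12, "target": 0.10, "timestamp": "2025-12-17"})
{{/code}}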
298 -=== 1.10 ErrorPattern ===
349 +
350 +=== 1.10 ErrorPattern ===
351 +
299 299  **Fields**: error_category, claim_id, description, root_cause, frequency, status
300 300  **Purpose**: Capture errors to trigger system improvements
301 301  **Usage**:
355 +
302 302  * **Error capture**: When users flag issues or system detects problems
303 303  * **Pattern analysis**: Weekly grouping by category and frequency
304 304  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
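
A minimal sketch of the weekly grouping step (assumes error rows arrive as dicts with an `error_category` key; the report shape is an assumption):

{{code language="python"}}
from collections import Counter

def weekly_pattern_report(errors):
    """Group the week's errors by category, ranked by frequency."""
    counts = Counter(e["error_category"] for e in errors)
    # The most frequent categories feed the
    # Analyze -> Fix -> Test -> Deploy -> Re-process -> Monitor loop.
    return counts.most_common(10)
{{/code}}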
... ... @@ -310,9 +310,13 @@
310 310  {{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
311 311  
312 312  == 1.5 User Class Diagram ==
367 +
313 313  {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
369 +
314 314  == 2. Versioning Strategy ==
371 +
315 315  **All Content Entities Are Versioned**:
373 +
316 316  * **Claim**: Every edit creates new version (V1→V2→V3...)
317 317  * **Evidence**: Changes tracked in edit history
318 318  * **Scenario**: Modifications versioned
... ... @@ -333,68 +333,91 @@
333 333  Claim V2: "The sky is blue during daytime"
334 334   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
335 335  ```
394 +
336 336  == 2.5. Storage vs Computation Strategy ==
396 +
337 337  **Critical architectural decision**: What to persist in databases vs compute dynamically?
338 338  **Trade-off**:
399 +
339 339  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
340 340  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
402 +
341 341  === Recommendation: Hybrid Approach ===
404 +
342 342  **STORE (in PostgreSQL):**
406 +
343 343  ==== Claims (Current State + History) ====
408 +
344 344  * **What**: assertion, domain, status, created_at, updated_at, version
345 345  * **Why**: Core entity, must be persistent
346 346  * **Also store**: confidence_score (computed once, then cached)
347 -* **Size**: ~1 KB per claim
412 +* **Size**: ~1 KB per claim
348 348  * **Growth**: Linear with claims
349 349  * **Decision**: ✅ STORE - Essential
415 +
350 350  ==== Evidence (All Records) ====
417 +
351 351  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
352 352  * **Why**: Hard to re-gather, user contributions, reproducibility
353 -* **Size**: ~2 KB per evidence (with excerpt)
420 +* **Size**: ~2 KB per evidence (with excerpt)
354 354  * **Growth**: 3-10 evidence per claim
355 355  * **Decision**: ✅ STORE - Essential for reproducibility
423 +
356 356  ==== Sources (Track Records) ====
425 +
357 357  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
358 358  * **Why**: Continuously updated, expensive to recompute
359 -* **Size**: ~500 bytes per source
428 +* **Size**: ~500 bytes per source
360 360  * **Growth**: Slow (limited number of sources)
361 361  * **Decision**: ✅ STORE - Essential for quality
431 +
362 362  ==== Edit History (All Versions) ====
433 +
363 363  * **What**: before_state, after_state, user_id, reason, timestamp
364 364  * **Why**: Audit trail, legal requirement, reproducibility
365 -* **Size**: ~2 KB per edit
366 -* **Growth**: Linear with edits (~A portion of claims get edited)
436 +* **Size**: ~2 KB per edit
437 +* **Growth**: Linear with edits (~20% of claims get edited)
367 367  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
368 368  * **Decision**: ✅ STORE - Essential for accountability
440 +
369 369  ==== Flags (User Reports) ====
442 +
370 370  * **What**: entity_id, reported_by, issue_type, description, status
371 371  * **Why**: Error detection, system improvement triggers
372 -* **Size**: ~500 bytes per flag
445 +* **Size**: ~500 bytes per flag
373 373  * **Growth**: ~5-10% of claims get flagged
374 374  * **Decision**: ✅ STORE - Essential for improvement
448 +
375 375  ==== ErrorPatterns (System Improvement) ====
450 +
376 376  * **What**: error_category, claim_id, description, root_cause, frequency, status
377 377  * **Why**: Learning loop, prevent recurring errors
378 -* **Size**: ~1 KB per pattern
453 +* **Size**: ~1 KB per pattern
379 379  * **Growth**: Slow (limited patterns, many fixed)
380 380  * **Decision**: ✅ STORE - Essential for learning
456 +
381 381  ==== QualityMetrics (Time Series) ====
458 +
382 382  * **What**: metric_type, category, value, target, timestamp
383 383  * **Why**: Trend analysis, cannot recreate historical metrics
384 -* **Size**: ~200 bytes per metric
461 +* **Size**: ~200 bytes per metric
385 385  * **Growth**: Hourly = 8,760 per year per metric type
386 386  * **Retention**: 2 years hot, then aggregate and archive
387 387  * **Decision**: ✅ STORE - Essential for monitoring
388 388  **STORE (Computed Once, Then Cached):**
466 +
389 389  ==== Analysis Summary ====
468 +
390 390  * **What**: Neutral text summary of claim analysis (200-500 words)
391 391  * **Computed**: Once by AKEL when claim first analyzed
392 392  * **Stored in**: Claim table (text field)
393 393  * **Recomputed**: Only when system significantly improves OR claim edited
394 394  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
395 -* **Size**: ~2 KB per claim
474 +* **Size**: ~2 KB per claim
396 396  * **Decision**: ✅ STORE (cached) - Cost-effective
476 +
397 397  ==== Confidence Score ====
478 +
398 398  * **What**: 0-100 score of analysis confidence
399 399  * **Computed**: Once by AKEL
400 400  * **Stored in**: Claim table (integer field)
... ... @@ -402,7 +402,9 @@
402 402  * **Why store**: Cheap to store, expensive to compute, users need it fast
403 403  * **Size**: 4 bytes per claim
404 404  * **Decision**: ✅ STORE (cached) - Performance critical
486 +
405 405  ==== Risk Score ====
488 +
406 406  * **What**: 0-100 score of claim risk level
407 407  * **Computed**: Once by AKEL
408 408  * **Stored in**: Claim table (integer field)
... ... @@ -411,13 +411,17 @@
411 411  * **Size**: 4 bytes per claim
412 412  * **Decision**: ✅ STORE (cached) - Performance critical
413 413  **COMPUTE DYNAMICALLY (Never Store):**
414 -==== Scenarios ==== ⚠️ CRITICAL DECISION
497 +
498 +==== Scenarios ====
499 +
500 +**⚠️ CRITICAL DECISION**
501 +
415 415  * **What**: 2-5 possible interpretations of claim with assumptions
416 416  * **Current design**: Stored in Scenario table
417 417  * **Alternative**: Compute on-demand when user views claim details
418 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
505 +* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
419 419  * **Compute cost**: $0.005-0.01 per request (LLM API call)
420 -* **Frequency**: Viewed in detail by ~20% of users
507 +* **Frequency**: Viewed in detail by ~20% of users
421 421  * **Trade-off analysis**:
422 422   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
423 423   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -425,12 +425,17 @@
425 425  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
426 426  * **Decision**: ✅ STORE (hybrid approach below)
427 427  **Scenario Strategy** (APPROVED):
515 +
428 428  1. **Store scenarios** initially when claim analyzed
429 429  2. **Mark as stale** when system improves significantly
430 430  3. **Recompute on next view** if marked stale
431 431  4. **Cache for 30 days** if frequently accessed
432 432  5. **Result**: Best of both worlds - speed + freshness
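
A minimal sketch of this hybrid flow, in the same pseudocode style as the AKEL examples above (the `scenarios_stale` flag and helper names are illustrative):

{{code language="python"}}
def get_scenarios(claim):
    scenarios = load_scenarios(claim.id)
    if scenarios and not claim.scenarios_stale:
        return scenarios                    # stored: instant access
    # Missing, or marked stale by a system improvement: recompute on view.
    scenarios = akel_extract_scenarios(claim)  # LLM call, ~$0.005-0.01
    save_scenarios(claim.id, scenarios)        # cached again (30 days if hot)
    claim.scenarios_stale = False
    claim.save()
    return scenarios
{{/code}}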
433 -==== Verdict Synthesis ====
521 +
522 +==== Verdict Synthesis ====
523 +
434 434  * **What**: Final conclusion text synthesizing all scenarios
435 435  * **Compute cost**: $0.002-0.005 per request
436 436  * **Frequency**: Every time claim viewed
... ... @@ -438,17 +438,23 @@
438 438  * **Speed**: 2-3 seconds (acceptable)
439 439  * **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
440 440  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
533 +
441 441  ==== Search Results ====
535 +
442 442  * **What**: Lists of claims matching search query
443 443  * **Compute from**: Elasticsearch index
444 444  * **Cache**: 15 minutes in Redis for popular queries
445 445  * **Why not store permanently**: Constantly changing, infinite possible queries
540 +
446 446  ==== Aggregated Statistics ====
542 +
447 447  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
448 448  * **Compute from**: Database queries
449 449  * **Cache**: 1 hour in Redis
450 450  * **Why not store**: Can be derived, relatively cheap to compute
547 +
451 451  ==== User Reputation ====
549 +
452 452  * **What**: Score based on contributions
453 453  * **Current design**: Stored in User table
454 454  * **Alternative**: Compute from Edit table
... ... @@ -458,37 +458,43 @@
458 458  * **Frequency**: Read on every user action
459 459  * **Compute cost**: Simple COUNT query, milliseconds
460 460  * **Decision**: ✅ STORE - Performance critical, read-heavy
559 +
461 461  === Summary Table ===
462 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
463 -|-----------|---------|---------|----------------|----------|-----------|
464 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
465 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
466 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
467 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
468 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
469 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
470 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
471 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
472 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
473 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
474 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
475 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
476 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
561 +
562 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
563 +|-----------|---------|---------|----------------|----------|-----------|
564 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
565 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
566 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
567 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
568 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
569 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
570 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
571 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
572 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
573 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
574 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
575 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
576 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
477 477  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
478 -**Total storage per claim**: ~18 KB (without edits and flags)
578 +**Total storage per claim**: ~18 KB (without edits and flags)
479 479  **For 1 million claims**:
480 -* **Storage**: ~18 GB (manageable)
481 -* **PostgreSQL**: ~$50/month (standard instance)
482 -* **Redis cache**: ~$20/month (1 GB instance)
483 -* **S3 archives**: ~$5/month (old edits)
484 -* **Total**: ~$75/month infrastructure
580 +
581 +* **Storage**: ~18 GB (manageable)
582 +* **PostgreSQL**: ~$50/month (standard instance)
583 +* **Redis cache**: ~$20/month (1 GB instance)
584 +* **S3 archives**: ~$5/month (old edits)
585 +* **Total**: ~$75/month infrastructure
485 485  **LLM cost savings by caching**:
486 486  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
487 487  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
488 488  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
489 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
590 +* **Total savings**: ~$35K per 1M claims vs recomputing every time
591 +
490 490  === Recomputation Triggers ===
593 +
491 491  **When to mark cached data as stale and recompute:**
595 +
492 492  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
493 493  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
494 494  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -495,11 +495,15 @@
495 495  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
496 496  5. **Controversy detected** (high flag rate) → Recompute: risk score
497 497  **Recomputation strategy**:
602 +
498 498  * **Eager**: Immediately recompute (for user edits)
499 499  * **Lazy**: Recompute on next view (for system improvements)
500 500  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
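
A minimal sketch of dispatching these three strategies (trigger names and helpers are illustrative):

{{code language="python"}}
def on_trigger(claim, trigger):
    if trigger == "user_edit":
        recompute_all(claim)          # eager: the editing user is waiting
    elif trigger == "system_improvement":
        claim.stale = True            # lazy: recompute on next view
        claim.save()
    else:
        stale_queue.add(claim.id)     # batch: nightly job (if <1000 claims)
{{/code}}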
606 +
501 501  === Database Size Projection ===
608 +
502 502  **Year 1**: 10K claims
610 +
503 503  * Storage: 180 MB
504 504  * Cost: $10/month
505 505  **Year 3**: 100K claims
... ... @@ -513,15 +513,21 @@
513 513  * Cost: $300/month
514 514  * Optimization: Archive old claims to S3 ($5/TB/month)
515 515  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
624 +
516 516  == 3. Key Simplifications ==
626 +
517 517  * **Two content states only**: Published, Hidden
518 518  * **Three user roles only**: Reader, Contributor, Moderator
519 519  * **No complex versioning**: Linear edit history
520 520  * **Reputation-based permissions**: Not role hierarchy
521 521  * **Source track records**: Continuous evaluation
632 +
522 522  == 4. What Gets Stored in the Database ==
634 +
523 523  === 3.1 Primary Storage (PostgreSQL) ===
636 +
524 524  **Claims Table**:
638 +
525 525  * Current state only (latest version)
526 526  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
527 527  **Evidence Table**:
... ... @@ -548,31 +548,44 @@
548 548  **QualityMetric Table**:
549 549  * Time-series quality data
550 550  * Fields: id, metric_type, metric_category, value, target, timestamp
665 +
551 551  === 3.2 What's NOT Stored (Computed on-the-fly) ===
667 +
552 552  * **Verdicts**: Synthesized from evidence + scenarios when requested
553 553  * **Risk scores**: Recalculated based on current factors
554 554  * **Aggregated statistics**: Computed from base data
555 555  * **Search results**: Generated from Elasticsearch index
672 +
556 556  === 3.3 Cache Layer (Redis) ===
674 +
557 557  **Cached for performance**:
676 +
558 558  * Frequently accessed claims (TTL: 1 hour)
559 559  * Search results (TTL: 15 minutes)
560 560  * User sessions (TTL: 24 hours)
561 561  * Source track records (TTL: 1 hour)
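
A minimal sketch of the read-through pattern for cached claims (standard redis-py calls; the key layout and the PostgreSQL loader are assumptions):

{{code language="python"}}
import json
import redis

r = redis.Redis()

def get_claim_cached(claim_id):
    key = f"claim:{claim_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit
    claim = load_claim_from_postgres(claim_id)  # assumed loader
    r.setex(key, 3600, json.dumps(claim))       # TTL: 1 hour
    return claim
{{/code}}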
681 +
562 562  === 3.4 File Storage (S3) ===
683 +
563 563  **Archived content**:
685 +
564 564  * Old edit history (>3 months)
565 565  * Evidence documents (archived copies)
566 566  * Database backups
567 567  * Export files
690 +
568 568  === 3.5 Search Index (Elasticsearch) ===
692 +
569 569  **Indexed for search**:
694 +
570 570  * Claim assertions (full-text)
571 571  * Evidence excerpts (full-text)
572 572  * Scenario descriptions (full-text)
573 573  * Source names (autocomplete)
574 574  Synchronized from PostgreSQL via change data capture or periodic sync.
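
A minimal sketch of the periodic-sync variant (uses the official Python client's bulk helper; the index layout and the change query are assumptions):

{{code language="python"}}
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def sync_claims(since):
    """Re-index claims changed since the last sync run."""
    actions = (
        {
            "_index": "claims",
            "_id": str(c.id),
            "_source": {"assertion": c.assertion, "domain": c.domain},
        }
        for c in claims_updated_since(since)  # assumed query helper
    )
    helpers.bulk(es, actions)
{{/code}}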
700 +
575 575  == 5. Related Pages ==
576 -* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
702 +
703 +* [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]]
577 577  * [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
578 578  * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]