Changes for page Data Model

Last modified by Robert Schaub on 2026/02/08 08:27

From version 2.1
edited by Robert Schaub
on 2025/12/18 12:54
Change comment: Imported from XAR
To version 4.2
edited by Robert Schaub
on 2025/12/21 13:38
Change comment: Renamed back-links.

Summary

Details

Page properties
Content
... ... @@ -1,25 +1,32 @@
1 1  = Data Model =
2 +
2 2  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 +
3 3  == 1. Core Entities ==
6 +
4 4  === 1.1 Claim ===
8 +
5 5  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 +
6 6  ==== Performance Optimization: Denormalized Fields ====
12 +
7 7  **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 8  **Additional cached fields in claims table**:
15 +
9 9  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
10 - * Avoids joining evidence table for listing/preview
11 - * Updated when evidence is added/removed
12 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 +* Avoids joining evidence table for listing/preview
18 +* Updated when evidence is added/removed
19 +* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
13 13  * **source_names** (TEXT[]): Array of source names for quick display
14 - * Avoids joining through evidence to sources
15 - * Updated when sources change
16 - * Format: `["New York Times", "Nature Journal", ...]`
21 +* Avoids joining through evidence to sources
22 +* Updated when sources change
23 +* Format: `["New York Times", "Nature Journal", ...]`
17 17  * **scenario_count** (INTEGER): Number of scenarios for this claim
18 - * Quick metric without counting rows
19 - * Updated when scenarios added/removed
25 +* Quick metric without counting rows
26 +* Updated when scenarios added/removed
20 20  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
21 - * Helps invalidate stale caches
22 - * Triggers background refresh if too old
28 +* Helps invalidate stale caches
29 +* Triggers background refresh if too old
23 23  **Update Strategy**:
24 24  * **Immediate**: Update on claim edit (user-facing)
25 25  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -28,13 +28,18 @@
28 28  * ✅ 70% fewer joins on common queries
29 29  * ✅ Much faster claim list/search pages
30 30  * ✅ Better user experience
31 -* ⚠️ Small storage increase (~10%)
38 +* ⚠️ Small storage increase (10%)
32 32  * ⚠️ Need to keep caches in sync
40 +
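The denormalization scheme above can be sketched in Python. This is a minimal illustration, not the actual implementation: `refresh_claim_cache` and the exact field shapes are hypothetical, but they follow the cached-column formats listed above (top-5 `evidence_summary`, `source_names` array, `scenario_count`, `cache_updated_at`).

```python
from datetime import datetime, timezone

def refresh_claim_cache(evidence_rows, scenario_count):
    """Rebuild the denormalized columns for one claim (illustrative sketch).

    evidence_rows: list of dicts with 'text', 'source', 'relevance'.
    Returns the cached fields as they would be written to the claims table.
    """
    # evidence_summary: top 5 most relevant snippets, highest relevance first
    top = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)[:5]
    return {
        "evidence_summary": [
            {"text": e["text"], "source": e["source"], "relevance": e["relevance"]}
            for e in top
        ],
        # source_names: deduplicated array for quick display without joins
        "source_names": sorted({e["source"] for e in evidence_rows}),
        # scenario_count: quick metric without counting rows at read time
        "scenario_count": scenario_count,
        # cache_updated_at: lets a background job detect stale caches
        "cache_updated_at": datetime.now(timezone.utc).isoformat(),
    }
```

In the "Immediate" path this would run inside the claim-edit transaction; the "Deferred" path would run the same function from the hourly background job.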
33 33  === 1.2 Evidence ===
42 +
34 34  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
44 +
35 35  === 1.3 Source ===
46 +
36 36  **Purpose**: Track reliability of information sources over time
37 37  **Fields**:
49 +
38 38  * **id** (UUID): Unique identifier
39 39  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
40 40  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -52,17 +52,21 @@
52 52  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
53 53  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
54 54  **Key**: Automated source reliability tracking
67 +
55 55  ==== Source Scoring Process (Separation of Concerns) ====
69 +
56 56  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
57 57  **The Problem**:
72 +
58 58  * Source scores should influence claim verdicts
59 59  * Claim verdicts should update source scores
60 60  * But: Direct feedback creates circular dependency and potential feedback loops
61 61  **The Solution**: Temporal separation
77 +
62 62  ==== Weekly Background Job (Source Scoring) ====
79 +
63 63  Runs independently of claim analysis:
64 -{{code language="python"}}
65 -def update_source_scores_weekly():
81 +{{code language="python"}}def update_source_scores_weekly():
66 66   """
67 67   Background job: Calculate source reliability
68 68   Never triggered by individual claim analysis
... ... @@ -82,12 +82,12 @@
82 82   source.last_updated = now()
83 83   source.save()
84 84   # Job runs: Sunday 2 AM UTC
85 - # Never during claim processing
86 -{{/code}}
101 + # Never during claim processing{{/code}}
102 +
87 87  ==== Real-Time Claim Analysis (AKEL) ====
104 +
88 88  Uses source scores but never updates them:
89 -{{code language="python"}}
90 -def analyze_claim(claim_text):
106 +{{code language="python"}}def analyze_claim(claim_text):
91 91   """
92 92   Real-time: Analyze claim using current source scores
93 93   READ source scores, never UPDATE them
... ... @@ -104,10 +104,12 @@
104 104   verdict = synthesize_verdict(evidence_list)
105 105   # NEVER update source scores here
106 106   # That happens in weekly background job
107 - return verdict
108 -{{/code}}
123 + return verdict{{/code}}
124 +
109 109  ==== Monthly Audit (Quality Assurance) ====
126 +
110 110  Moderator review of flagged source scores:
128 +
111 111  * Verify scores make sense
112 112  * Detect gaming attempts
113 113  * Identify systematic biases
... ... @@ -147,18 +147,19 @@
147 147   → NYT score: 0.89 (trending up)
148 148   → Blog X score: 0.48 (trending down)
149 149  ```
168 +
150 150  === 1.4 Scenario ===
170 +
151 151  **Purpose**: Different interpretations or contexts for evaluating claims
152 152  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
153 153  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
154 154  **Fields**:
175 +
155 155  * **id** (UUID): Unique identifier
156 156  * **claim_id** (UUID): Foreign key to claim (one-to-many)
157 157  * **description** (text): Human-readable description of the scenario
158 158  * **assumptions** (JSONB): Key assumptions that define this scenario context
159 159  * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
160 -* **verdict_summary** (text): Compiled verdict for this scenario
161 -* **confidence** (decimal 0-1): Confidence level for verdict in this scenario
162 162  * **created_at** (timestamp): When scenario was created
163 163  * **updated_at** (timestamp): Last modification
164 164  **How Found**: Evidence search → Extract context → Create scenario → Link to claim
... ... @@ -168,13 +168,48 @@
168 168  * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
169 169  * Scenario 3: "Immunocompromised patients" from specialist study
170 170  **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
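The "Evidence search → Extract context → Create scenario → Link to claim" flow can be sketched as a record constructor. `extract_scenario` is a hypothetical helper; the field names mirror the Scenario fields listed above.

```python
import uuid
from datetime import datetime, timezone

def extract_scenario(claim_id, evidence_id, description, assumptions):
    """Create a scenario record extracted from one piece of evidence.

    V1.0 simplification: each scenario belongs to exactly one claim.
    """
    now = datetime.now(timezone.utc).isoformat()
    return {
        "id": str(uuid.uuid4()),
        "claim_id": claim_id,              # one-to-many foreign key
        "description": description,        # human-readable context
        "assumptions": assumptions,        # stored as JSONB in PostgreSQL
        "extracted_from": evidence_id,     # evidence this context came from
        "created_at": now,
        "updated_at": now,
    }
```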
171 -=== 1.5 User ===
190 +
191 +=== 1.5 Verdict ===
192 +
193 +**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
194 +
195 +**Core Fields**:
196 +
197 +* **id** (UUID): Primary key
198 +* **scenario_id** (UUID FK): The scenario being assessed
199 +* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
200 +* **confidence** (decimal 0-1): How confident we are in this assessment
201 +* **explanation_summary** (text): Human-readable reasoning explaining the verdict
202 +* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
203 +* **created_at** (timestamp): When verdict was created
204 +* **updated_at** (timestamp): Last modification
205 +
206 +**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
207 +
208 +**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
209 +
210 +**Example**:
211 +For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
212 +
213 +* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
214 +* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
215 +* Edit entity records the complete before/after change with timestamp and reason
216 +
217 +**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
218 +
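The mutable-verdict-plus-Edit-log design can be sketched as follows. `update_verdict` and the in-memory `edit_log` list are illustrative stand-ins for the real persistence layer; the before/after shape matches the Edit entity fields (section 1.7).

```python
import copy

def update_verdict(verdict, changes, user_id, reason, edit_log):
    """Apply changes to a verdict and record a before/after Edit entry.

    Verdicts are mutable; history lives in the centralized Edit entity.
    """
    before = copy.deepcopy(verdict)        # snapshot before_state
    verdict.update(changes)
    edit_log.append({
        "entity_type": "verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,
        "before_state": before,
        "after_state": copy.deepcopy(verdict),
        "edit_type": "CONTENT_UPDATE",
        "reason": reason,
    })
    return verdict
```

Replaying the exercise example: the initial uncertain verdict is updated after new evidence, and the Edit entry preserves both states.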
219 +=== 1.6 User ===
220 +
172 172  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
173 -=== User Reputation System ==
222 +
223 +=== User Reputation System ===
224 +
174 174  **V1.0 Approach**: Simple manual role assignment
175 175  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
227 +
176 176  === Roles (Manual Assignment) ===
229 +
177 177  **reader** (default):
231 +
178 178  * View published claims and evidence
179 179  * Browse and search content
180 180  * No editing permissions
... ... @@ -193,8 +193,11 @@
193 193  * System configuration
194 194  * Access to all features
195 195  * Founder-appointed initially
250 +
196 196  === Contribution Tracking (Simple) ===
252 +
197 197  **Basic metrics only**:
254 +
198 198  * `contributions_count`: Total number of contributions
199 199  * `created_at`: Account age
200 200  * `last_active`: Recent activity
... ... @@ -203,19 +203,26 @@
203 203  * No automated privilege escalation
204 204  * No reputation decay
205 205  * No threshold-based promotions
263 +
206 206  === Promotion Process ===
265 +
207 207  **Manual review by moderators/admins**:
267 +
208 208  1. User demonstrates value through contributions
209 209  2. Moderator reviews user's contribution history
210 210  3. Moderator promotes user to contributor role
211 211  4. Admin promotes trusted contributors to moderator
212 212  **Criteria** (guidelines, not automated):
273 +
213 213  * Quality of contributions
214 214  * Consistency over time
215 215  * Collaborative behavior
216 216  * Understanding of project goals
278 +
217 217  === V2.0+ Evolution ===
280 +
218 218  **Add complex reputation when**:
282 +
219 219  * 100+ active contributors
220 220  * Manual role management becomes bottleneck
221 221  * Clear patterns of abuse emerge requiring automation
... ... @@ -224,12 +224,17 @@
224 224  * Threshold-based promotions
225 225  * Reputation decay for inactive users
226 226  * Track record scoring for contributors
227 -See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
228 -=== 1.6 Edit ===
291 +See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
292 +
293 +=== 1.7 Edit ===
294 +
229 229  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
230 230  **Purpose**: Complete audit trail for all content changes
297 +
231 231  === Edit History Details ===
299 +
232 232  **What Gets Edited**:
301 +
233 233  * **Claims** (20% edited): assertion, domain, status, scores, analysis
234 234  * **Evidence** (10% edited): excerpt, relevance_score, supports
235 235  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -246,12 +246,14 @@
246 246  * `MODERATION_ACTION`: Hide/unhide for abuse
247 247  * `REVERT`: Rollback to previous version
248 248  **Retention Policy** (5 years total):
318 +
249 249  1. **Hot storage** (3 months): PostgreSQL, instant access
250 250  2. **Warm storage** (2 years): Partitioned, slower queries
251 251  3. **Cold storage** (3 years): S3 compressed, download required
252 252  4. **Deletion**: After 5 years (except legal holds)
253 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
323 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
254 254  **Use Cases**:
325 +
255 255  * View claim history timeline
256 256  * Detect vandalism patterns
257 257  * Learn from user corrections (system improvement)
... ... @@ -258,12 +258,17 @@
258 258  * Legal compliance (audit trail)
259 259  * Rollback capability
260 260  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
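The four-tier retention policy above maps an edit's age to a storage tier. A minimal sketch, approximating 3 months as 90 days and years as 365 days; `storage_tier` is a hypothetical helper, not the production job.

```python
from datetime import datetime, timedelta, timezone

def storage_tier(edit_created_at, now=None):
    """Map an edit's age onto the 5-year retention tiers."""
    now = now or datetime.now(timezone.utc)
    age = now - edit_created_at
    if age <= timedelta(days=90):
        return "hot"        # PostgreSQL, instant access (3 months)
    if age <= timedelta(days=90 + 2 * 365):
        return "warm"       # partitioned, slower queries (2 years)
    if age <= timedelta(days=5 * 365):
        return "cold"       # S3 compressed, download required (3 years)
    return "delete"         # past 5-year retention (barring legal holds)
```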
261 -=== 1.7 Flag ===
332 +
333 +=== 1.8 Flag ===
334 +
262 262  Fields: entity_id, reported_by, issue_type, status, resolution_note
263 -=== 1.8 QualityMetric ===
336 +
337 +=== 1.9 QualityMetric ===
338 +
264 264  **Fields**: metric_type, category, value, target, timestamp
265 265  **Purpose**: Time-series quality tracking
266 266  **Usage**:
342 +
267 267  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
268 268  * **Quality dashboard**: Real-time display with trend charts
269 269  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -270,10 +270,13 @@
270 270  * **A/B testing**: Compare control vs treatment metrics
271 271  * **Improvement validation**: Measure before/after changes
272 272  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
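The alerting usage above can be sketched as a threshold check. `check_metric` is illustrative; it assumes (as in the example) a lower-is-better metric, so an alert fires when value exceeds target.

```python
def check_metric(metric):
    """Return an alert dict when a metric exceeds its target, else None.

    Assumes lower-is-better metrics (e.g. ErrorRate), so value > target
    triggers an alert.
    """
    if metric["value"] > metric["target"]:
        return {
            "metric_type": metric["type"],
            "category": metric["category"],
            "exceeded_by": round(metric["value"] - metric["target"], 4),
        }
    return None
```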
273 -=== 1.9 ErrorPattern ===
349 +
350 +=== 1.10 ErrorPattern ===
351 +
274 274  **Fields**: error_category, claim_id, description, root_cause, frequency, status
275 275  **Purpose**: Capture errors to trigger system improvements
276 276  **Usage**:
355 +
277 277  * **Error capture**: When users flag issues or system detects problems
278 278  * **Pattern analysis**: Weekly grouping by category and frequency
279 279  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -282,12 +282,16 @@
282 282  
283 283  == 1.4 Core Data Model ERD ==
284 284  
285 -{{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
364 +{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
286 286  
287 287  == 1.5 User Class Diagram ==
288 -{{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
367 +
368 +{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
369 +
289 289  == 2. Versioning Strategy ==
371 +
290 290  **All Content Entities Are Versioned**:
373 +
291 291  * **Claim**: Every edit creates new version (V1→V2→V3...)
292 292  * **Evidence**: Changes tracked in edit history
293 293  * **Scenario**: Modifications versioned
... ... @@ -308,68 +308,91 @@
308 308  Claim V2: "The sky is blue during daytime"
309 309   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
310 310  ```
394 +
311 311  == 2.5. Storage vs Computation Strategy ==
396 +
312 312  **Critical architectural decision**: What to persist in databases vs compute dynamically?
313 313  **Trade-off**:
399 +
314 314  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
315 315  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
402 +
316 316  === Recommendation: Hybrid Approach ===
404 +
317 317  **STORE (in PostgreSQL):**
406 +
318 318  ==== Claims (Current State + History) ====
408 +
319 319  * **What**: assertion, domain, status, created_at, updated_at, version
320 320  * **Why**: Core entity, must be persistent
321 321  * **Also store**: confidence_score (computed once, then cached)
322 -* **Size**: ~1 KB per claim
412 +* **Size**: 1 KB per claim
323 323  * **Growth**: Linear with claims
324 324  * **Decision**: ✅ STORE - Essential
415 +
325 325  ==== Evidence (All Records) ====
417 +
326 326  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
327 327  * **Why**: Hard to re-gather, user contributions, reproducibility
328 -* **Size**: ~2 KB per evidence (with excerpt)
420 +* **Size**: 2 KB per evidence (with excerpt)
329 329  * **Growth**: 3-10 evidence per claim
330 330  * **Decision**: ✅ STORE - Essential for reproducibility
423 +
331 331  ==== Sources (Track Records) ====
425 +
332 332  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
333 333  * **Why**: Continuously updated, expensive to recompute
334 -* **Size**: ~500 bytes per source
428 +* **Size**: 500 bytes per source
335 335  * **Growth**: Slow (limited number of sources)
336 336  * **Decision**: ✅ STORE - Essential for quality
431 +
337 337  ==== Edit History (All Versions) ====
433 +
338 338  * **What**: before_state, after_state, user_id, reason, timestamp
339 339  * **Why**: Audit trail, legal requirement, reproducibility
340 -* **Size**: ~2 KB per edit
341 -* **Growth**: Linear with edits (~A portion of claims get edited)
436 +* **Size**: 2 KB per edit
437 +* **Growth**: Linear with edits (roughly 20% of claims get edited)
342 342  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
343 343  * **Decision**: ✅ STORE - Essential for accountability
440 +
344 344  ==== Flags (User Reports) ====
442 +
345 345  * **What**: entity_id, reported_by, issue_type, description, status
346 346  * **Why**: Error detection, system improvement triggers
347 -* **Size**: ~500 bytes per flag
445 +* **Size**: 500 bytes per flag
348 348  * **Growth**: 5-10% of claims get flagged
349 349  * **Decision**: ✅ STORE - Essential for improvement
448 +
350 350  ==== ErrorPatterns (System Improvement) ====
450 +
351 351  * **What**: error_category, claim_id, description, root_cause, frequency, status
352 352  * **Why**: Learning loop, prevent recurring errors
353 -* **Size**: ~1 KB per pattern
453 +* **Size**: 1 KB per pattern
354 354  * **Growth**: Slow (limited patterns, many fixed)
355 355  * **Decision**: ✅ STORE - Essential for learning
456 +
356 356  ==== QualityMetrics (Time Series) ====
458 +
357 357  * **What**: metric_type, category, value, target, timestamp
358 358  * **Why**: Trend analysis, cannot recreate historical metrics
359 -* **Size**: ~200 bytes per metric
461 +* **Size**: 200 bytes per metric
360 360  * **Growth**: Hourly = 8,760 per year per metric type
361 361  * **Retention**: 2 years hot, then aggregate and archive
362 362  * **Decision**: ✅ STORE - Essential for monitoring
363 363  **STORE (Computed Once, Then Cached):**
466 +
364 364  ==== Analysis Summary ====
468 +
365 365  * **What**: Neutral text summary of claim analysis (200-500 words)
366 366  * **Computed**: Once by AKEL when claim first analyzed
367 367  * **Stored in**: Claim table (text field)
368 368  * **Recomputed**: Only when system significantly improves OR claim edited
369 369  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
370 -* **Size**: ~2 KB per claim
474 +* **Size**: 2 KB per claim
371 371  * **Decision**: ✅ STORE (cached) - Cost-effective
476 +
372 372  ==== Confidence Score ====
478 +
373 373  * **What**: 0-100 score of analysis confidence
374 374  * **Computed**: Once by AKEL
375 375  * **Stored in**: Claim table (integer field)
... ... @@ -377,7 +377,9 @@
377 377  * **Why store**: Cheap to store, expensive to compute, users need it fast
378 378  * **Size**: 4 bytes per claim
379 379  * **Decision**: ✅ STORE (cached) - Performance critical
486 +
380 380  ==== Risk Score ====
488 +
381 381  * **What**: 0-100 score of claim risk level
382 382  * **Computed**: Once by AKEL
383 383  * **Stored in**: Claim table (integer field)
... ... @@ -386,13 +386,17 @@
386 386  * **Size**: 4 bytes per claim
387 387  * **Decision**: ✅ STORE (cached) - Performance critical
388 388  **COMPUTE DYNAMICALLY (Never Store):**
389 -==== Scenarios ==== ⚠️ CRITICAL DECISION
497 +
498 +==== Scenarios ====
499 +
500 +**⚠️ CRITICAL DECISION**
501 +
390 390  * **What**: 2-5 possible interpretations of claim with assumptions
391 391  * **Current design**: Stored in Scenario table
392 392  * **Alternative**: Compute on-demand when user views claim details
393 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
505 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
394 394  * **Compute cost**: $0.005-0.01 per request (LLM API call)
395 -* **Frequency**: Viewed in detail by ~20% of users
507 +* **Frequency**: Viewed in detail by 20% of users
396 396  * **Trade-off analysis**:
397 397   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
398 398   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -400,12 +400,17 @@
400 400  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
401 401  * **Decision**: ✅ STORE (hybrid approach below)
402 402  **Scenario Strategy** (APPROVED):
515 +
403 403  1. **Store scenarios** initially when claim analyzed
404 404  2. **Mark as stale** when system improves significantly
405 405  3. **Recompute on next view** if marked stale
406 406  4. **Cache for 30 days** if frequently accessed
407 407  5. **Result**: Best of both worlds - speed + freshness
408 -==== Verdict Synthesis ====
521 +
522 +==== Verdict Synthesis ====
523 +
524 +
409 409  * **What**: Final conclusion text synthesizing all scenarios
410 410  * **Compute cost**: $0.002-0.005 per request
411 411  * **Frequency**: Every time claim viewed
... ... @@ -413,17 +413,23 @@
413 413  * **Speed**: 2-3 seconds (acceptable)
414 414  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
415 415  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
533 +
416 416  ==== Search Results ====
535 +
417 417  * **What**: Lists of claims matching search query
418 418  * **Compute from**: Elasticsearch index
419 419  * **Cache**: 15 minutes in Redis for popular queries
420 420  * **Why not store permanently**: Constantly changing, infinite possible queries
540 +
421 421  ==== Aggregated Statistics ====
542 +
422 422  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
423 423  * **Compute from**: Database queries
424 424  * **Cache**: 1 hour in Redis
425 425  * **Why not store**: Can be derived, relatively cheap to compute
547 +
426 426  ==== User Reputation ====
549 +
427 427  * **What**: Score based on contributions
428 428  * **Current design**: Stored in User table
429 429  * **Alternative**: Compute from Edit table
... ... @@ -433,37 +433,43 @@
433 433  * **Frequency**: Read on every user action
434 434  * **Compute cost**: Simple COUNT query, milliseconds
435 435  * **Decision**: ✅ STORE - Performance critical, read-heavy
559 +
436 436  === Summary Table ===
437 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
438 -|-----------|---------|---------|----------------|----------|-----------|
439 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
440 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
441 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
442 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
443 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
444 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
445 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
446 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
447 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
448 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
449 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
450 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
451 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
561 +
562 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
563 +|-----------|---------|---------|----------------|----------|-----------|
564 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
565 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
566 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
567 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
568 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
569 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
570 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
571 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
572 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
573 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
574 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
575 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
576 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
452 452  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
453 -**Total storage per claim**: ~18 KB (without edits and flags)
578 +**Total storage per claim**: 18 KB (without edits and flags)
454 454  **For 1 million claims**:
455 -* **Storage**: ~18 GB (manageable)
456 -* **PostgreSQL**: ~$50/month (standard instance)
457 -* **Redis cache**: ~$20/month (1 GB instance)
458 -* **S3 archives**: ~$5/month (old edits)
459 -* **Total**: ~$75/month infrastructure
580 +
581 +* **Storage**: 18 GB (manageable)
582 +* **PostgreSQL**: $50/month (standard instance)
583 +* **Redis cache**: $20/month (1 GB instance)
584 +* **S3 archives**: $5/month (old edits)
585 +* **Total**: $75/month infrastructure
460 460  **LLM cost savings by caching**:
461 461  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
462 462  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
463 463  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
464 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
590 +* **Total savings**: $35K per 1M claims vs recomputing every time
591 +
465 465  === Recomputation Triggers ===
593 +
466 466  **When to mark cached data as stale and recompute:**
595 +
467 467  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
468 468  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
469 469  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -470,11 +470,15 @@
470 470  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
471 471  5. **Controversy detected** (high flag rate) → Recompute: risk score
472 472  **Recomputation strategy**:
602 +
473 473  * **Eager**: Immediately recompute (for user edits)
474 474  * **Lazy**: Recompute on next view (for system improvements)
475 475  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
606 +
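The five triggers above amount to a mapping from events to stale artifacts. A minimal sketch; the trigger names and `stale` set are hypothetical, but the invalidation sets follow the list above, with lazy recomputation on next view.

```python
# Which cached artifacts each trigger invalidates (hypothetical names)
RECOMPUTE_ON = {
    "claim_edited":       {"analysis", "scenarios", "verdict", "scores"},
    "evidence_added":     {"scenarios", "verdict", "confidence_score"},
    "source_score_shift": {"confidence_score", "verdict"},  # >10-point change
    "system_improvement": {"analysis", "scenarios", "verdict", "scores"},
    "controversy":        {"risk_score"},                   # high flag rate
}

def mark_stale(claim, trigger):
    """Flag the cached artifacts a trigger invalidates.

    Eager/lazy/batch recomputation then drains the 'stale' set.
    """
    stale = claim.setdefault("stale", set())
    stale |= RECOMPUTE_ON[trigger]
    return claim
```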
476 476  === Database Size Projection ===
608 +
477 477  **Year 1**: 10K claims
610 +
478 478  * Storage: 180 MB
479 479  * Cost: $10/month
480 480  **Year 3**: 100K claims
... ... @@ -488,15 +488,21 @@
488 488  * Cost: $300/month
489 489  * Optimization: Archive old claims to S3 ($5/TB/month)
490 490  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
624 +
491 491  == 2.6. Key Simplifications ==
626 +
492 492  * **Two content states only**: Published, Hidden
493 493  * **Three user roles only**: Reader, Contributor, Moderator
494 494  * **No complex versioning**: Linear edit history
495 495  * **Reputation-based permissions**: Not role hierarchy
496 496  * **Source track records**: Continuous evaluation
632 +
497 497  == 3. What Gets Stored in the Database ==
634 +
498 498  === 3.1 Primary Storage (PostgreSQL) ===
636 +
499 499  **Claims Table**:
638 +
500 500  * Current state only (latest version)
501 501  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
502 502  **Evidence Table**:
... ... @@ -523,31 +523,44 @@
523 523  **QualityMetric Table**:
524 524  * Time-series quality data
525 525  * Fields: id, metric_type, metric_category, value, target, timestamp
665 +
526 526  === 3.2 What's NOT Stored (Computed on-the-fly) ===
667 +
527 527  * **Verdicts**: Re-synthesized from evidence + scenarios when marked stale (a cached copy serves most reads; see Section 2.5)
528 528  * **Risk scores**: Recalculated when recomputation triggers fire; otherwise served from the cached score
529 529  * **Aggregated statistics**: Computed from base data
530 530  * **Search results**: Generated from Elasticsearch index
672 +
531 531  === 3.3 Cache Layer (Redis) ===
674 +
532 532  **Cached for performance**:
676 +
533 533  * Frequently accessed claims (TTL: 1 hour)
534 534  * Search results (TTL: 15 minutes)
535 535  * User sessions (TTL: 24 hours)
536 536  * Source track records (TTL: 1 hour)
681 +
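The TTL scheme above can be sketched with an in-memory stand-in for Redis. `TtlCache` and the `TTL_SECONDS` keys are illustrative; production would use Redis `SET key value EX ttl`.

```python
import time

TTL_SECONDS = {
    "claim": 3600,          # frequently accessed claims: 1 hour
    "search": 900,          # search results: 15 minutes
    "session": 86400,       # user sessions: 24 hours
    "source_record": 3600,  # source track records: 1 hour
}

class TtlCache:
    """In-memory stand-in for the Redis layer (per-key TTL)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}

    def set(self, kind, key, value):
        self._store[(kind, key)] = (value, self._clock() + TTL_SECONDS[kind])

    def get(self, kind, key):
        entry = self._store.get((kind, key))
        if entry is None:
            return None
        value, expires = entry
        if self._clock() > expires:     # expired: evict and miss
            del self._store[(kind, key)]
            return None
        return value
```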
537 537  === 3.4 File Storage (S3) ===
683 +
538 538  **Archived content**:
685 +
539 539  * Old edit history (>3 months)
540 540  * Evidence documents (archived copies)
541 541  * Database backups
542 542  * Export files
690 +
543 543  === 3.5 Search Index (Elasticsearch) ===
692 +
544 544  **Indexed for search**:
694 +
545 545  * Claim assertions (full-text)
546 546  * Evidence excerpts (full-text)
547 547  * Scenario descriptions (full-text)
548 548  * Source names (autocomplete)
549 549  Synchronized from PostgreSQL via change data capture or periodic sync.
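The periodic-sync option can be sketched as a watermark loop over `updated_at`. `sync_to_index` is a hypothetical helper and the dict stands in for the Elasticsearch index; a CDC-based sync would react to the write-ahead log instead.

```python
def sync_to_index(rows, index, last_sync):
    """Push rows changed since the last watermark into the search index.

    rows: iterable of dicts with 'id', 'assertion', and a comparable
    'updated_at'. Returns the new watermark for the next run.
    """
    watermark = last_sync
    for row in rows:
        if row["updated_at"] > last_sync:
            index[row["id"]] = row["assertion"]   # upsert into the index
            watermark = max(watermark, row["updated_at"])
    return watermark
```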
700 +
550 550  == 4. Related Pages ==
551 -* [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
552 -* [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
553 -* [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]
702 +
703 +* [[Architecture>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Architecture.WebHome]]
704 +* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
705 +* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]