Changes for page Data Model

Last modified by Robert Schaub on 2025/12/22 14:16

From version 1.1
edited by Robert Schaub
on 2025/12/22 14:10
Change comment: Imported from XAR
To version 1.3
edited by Robert Schaub
on 2025/12/22 14:16
Change comment: Update document after refactoring.

Summary

Details

Page properties
Parent
... ... @@ -1,1 +1,1 @@
1 -Test.FactHarbor.Specification.WebHome
1 +Test.FactHarbor pre11 V0\.9\.70.Specification.WebHome
Content
... ... @@ -1,11 +1,18 @@
1 1  = Data Model =
2 +
2 2  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 +
3 3  == 1. Core Entities ==
6 +
4 4  === 1.1 Claim ===
8 +
5 5  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 +
6 6  ==== Performance Optimization: Denormalized Fields ====
12 +
7 7  **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 8  **Additional cached fields in claims table**:
15 +
9 9  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores * Avoids joining evidence table for listing/preview * Updated when evidence is added/removed * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
10 10  * **source_names** (TEXT[]): Array of source names for quick display * Avoids joining through evidence to sources * Updated when sources change * Format: `["New York Times", "Nature Journal", ...]`
11 11  * **scenario_count** (INTEGER): Number of scenarios for this claim * Quick metric without counting rows * Updated when scenarios added/removed
... ... @@ -18,13 +18,18 @@
18 18  * ✅ 70% fewer joins on common queries
19 19  * ✅ Much faster claim list/search pages
20 20  * ✅ Better user experience
21 -* ⚠️ Small storage increase (~10%)
28 +* ⚠️ Small storage increase (10%)
22 22  * ⚠️ Need to keep caches in sync
30 +
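A minimal sketch of the cache-sync step flagged above, assuming a hypothetical `refresh_claim_caches` helper and plain dict records (the real implementation would run in the same transaction that adds or removes evidence):

{{code language="python"}}
def refresh_claim_caches(claim, evidence_rows):
    """Rebuild the denormalized claim fields whenever evidence changes (illustrative)."""
    ranked = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)
    claim["evidence_summary"] = [
        {"text": e["excerpt"], "source": e["source_name"], "relevance": e["relevance"]}
        for e in ranked[:5]  # top 5 snippets, matching the format described above
    ]
    claim["source_names"] = sorted({e["source_name"] for e in evidence_rows})
    return claim

# Illustrative usage
claim = {"id": "c1", "assertion": "The sky is blue"}
evidence = [
    {"excerpt": "Rayleigh scattering ...", "source_name": "Nature Journal", "relevance": 0.95},
    {"excerpt": "Sky color explained ...", "source_name": "New York Times", "relevance": 0.80},
]
print(refresh_claim_caches(claim, evidence)["source_names"])
{{/code}}
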
23 23  === 1.2 Evidence ===
32 +
24 24  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
34 +
25 25  === 1.3 Source ===
36 +
26 26  **Purpose**: Track reliability of information sources over time
27 27  **Fields**:
39 +
28 28  * **id** (UUID): Unique identifier
29 29  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
30 30  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -42,24 +42,30 @@
42 42  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
43 43  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
44 44  **Key**: Automated source reliability tracking
57 +
45 45  ==== Source Scoring Process (Separation of Concerns) ====
59 +
46 46  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
47 47  **The Problem**:
* Source scores should influence claim verdicts
62 +
48 48  * Claim verdicts should update source scores
49 49  * But: Direct feedback creates circular dependency and potential feedback loops
50 50  **The Solution**: Temporal separation
66 +
51 51  ==== Weekly Background Job (Source Scoring) ====
68 +
52 52  Runs independently of claim analysis:
{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability
    Never triggered by individual claim analysis
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}
71 +
56 56  ==== Real-Time Claim Analysis (AKEL) ====
73 +
57 57  Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores
    READ source scores, never UPDATE them
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
{{/code}}
76 +
61 61  ==== Monthly Audit (Quality Assurance) ====
78 +
62 62  Moderator review of flagged source scores:
80 +
63 63  * Verify scores make sense
64 64  * Detect gaming attempts
65 65  * Identify systematic biases
... ... @@ -93,11 +93,14 @@
93 93  Monday-Saturday: Claims processed using these scores → All claims this week use NYT=0.87 → All claims this week use Blog X=0.52
94 94  Next Sunday 2 AM: Recalculate scores including this week's claims → NYT score: 0.89 (trending up) → Blog X score: 0.48 (trending down)
95 95  ```
114 +
96 96  === 1.4 Scenario ===
116 +
97 97  **Purpose**: Different interpretations or contexts for evaluating claims
98 98  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
99 99  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
100 100  **Fields**:
121 +
101 101  * **id** (UUID): Unique identifier
102 102  * **claim_id** (UUID): Foreign key to claim (one-to-many)
103 103  * **description** (text): Human-readable description of the scenario
... ... @@ -124,11 +124,16 @@
124 124  * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
125 125  * Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

126 126  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
127 -=== User Reputation System ==
148 +
149 +=== User Reputation System ===
150 +
128 128  **V1.0 Approach**: Simple manual role assignment
129 129  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
153 +
130 130  === Roles (Manual Assignment) ===
155 +
131 131  **reader** (default):
157 +
132 132  * View published claims and evidence
133 133  * Browse and search content
134 134  * No editing permissions
... ... @@ -147,8 +147,11 @@
147 147  * System configuration
148 148  * Access to all features
149 149  * Founder-appointed initially
176 +
150 150  === Contribution Tracking (Simple) ===
178 +
151 151  **Basic metrics only**:
180 +
152 152  * `contributions_count`: Total number of contributions
153 153  * `created_at`: Account age
154 154  * `last_active`: Recent activity
... ... @@ -157,19 +157,26 @@
157 157  * No automated privilege escalation
158 158  * No reputation decay
159 159  * No threshold-based promotions
189 +
160 160  === Promotion Process ===
191 +
161 161  **Manual review by moderators/admins**:
193 +
162 162  1. User demonstrates value through contributions
163 163  2. Moderator reviews user's contribution history
164 164  3. Moderator promotes user to contributor role
165 165  4. Admin promotes trusted contributors to moderator
166 166  **Criteria** (guidelines, not automated):
199 +
167 167  * Quality of contributions
168 168  * Consistency over time
169 169  * Collaborative behavior
170 170  * Understanding of project goals
204 +
171 171  === V2.0+ Evolution ===
206 +
172 172  **Add complex reputation when**:
208 +
173 173  * 100+ active contributors
174 174  * Manual role management becomes bottleneck
175 175  * Clear patterns of abuse emerge requiring automation
... ... @@ -179,11 +179,16 @@
179 179  * Reputation decay for inactive users
180 180  * Track record scoring for contributors
181 181  See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
218 +
182 182  === 1.7 Edit ===
220 +
183 183  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
184 184  **Purpose**: Complete audit trail for all content changes
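For illustration only, a single Edit row for a claim wording change might look like the record below; the field names follow the list above, while every value (including the `edit_type` label) is an invented example:

{{code language="python"}}
# Hypothetical Edit record; values are illustrative, not from the specification.
edit_record = {
    "entity_type": "Claim",
    "entity_id": "claim-uuid",
    "user_id": "user-uuid",
    "before_state": {"assertion": "The sky is blue"},
    "after_state": {"assertion": "The sky is blue during daytime"},
    "edit_type": "CONTENT_EDIT",  # label assumed for illustration
    "reason": "Clarified time-of-day context",
    "created_at": "2025-12-17T10:32:00Z",
}
{{/code}}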
223 +
185 185  === Edit History Details ===
225 +
186 186  **What Gets Edited**:
227 +
187 187  * **Claims** (20% edited): assertion, domain, status, scores, analysis
188 188  * **Evidence** (10% edited): excerpt, relevance_score, supports
189 189  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -200,12 +200,14 @@
200 200  * `MODERATION_ACTION`: Hide/unhide for abuse
201 201  * `REVERT`: Rollback to previous version
202 202  **Retention Policy** (5 years total):
244 +
203 203  1. **Hot storage** (3 months): PostgreSQL, instant access
204 204  2. **Warm storage** (2 years): Partitioned, slower queries
205 205  3. **Cold storage** (3 years): S3 compressed, download required
206 206  4. **Deletion**: After 5 years (except legal holds)
207 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
249 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
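A small sketch of how the retention tier could be chosen from an edit's age; the thresholds restate the policy above, while the function itself is illustrative rather than a prescribed implementation:

{{code language="python"}}
from datetime import datetime, timezone

def retention_tier(created_at, now):
    """Map an edit's age to the storage tier defined by the retention policy."""
    age_days = (now - created_at).days
    if age_days <= 90:                # hot: first 3 months
        return "hot (PostgreSQL)"
    if age_days <= 90 + 2 * 365:      # warm: next 2 years
        return "warm (partitioned)"
    if age_days <= 5 * 365:           # cold: remainder of the 5-year window
        return "cold (S3 compressed)"
    return "delete (unless legal hold)"

print(retention_tier(datetime(2023, 1, 1, tzinfo=timezone.utc),
                     datetime(2025, 12, 17, tzinfo=timezone.utc)))
{{/code}}
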
208 208  **Use Cases**:
251 +
209 209  * View claim history timeline
210 210  * Detect vandalism patterns
211 211  * Learn from user corrections (system improvement)
... ... @@ -212,12 +212,17 @@
212 212  * Legal compliance (audit trail)
213 213  * Rollback capability
214 214  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
258 +
215 215  === 1.8 Flag ===
260 +
216 216  Fields: entity_id, reported_by, issue_type, status, resolution_note
262 +
217 217  === 1.9 QualityMetric ===
264 +
218 218  **Fields**: metric_type, category, value, target, timestamp
219 219  **Purpose**: Time-series quality tracking
220 220  **Usage**:
268 +
221 221  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
222 222  * **Quality dashboard**: Real-time display with trend charts
223 223  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -224,10 +224,13 @@
224 224  * **A/B testing**: Compare control vs treatment metrics
225 225  * **Improvement validation**: Measure before/after changes
226 226  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
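As a rough illustration of the alerting described above (the threshold rule and function name are assumptions, not specified behavior), a stored metric can be checked against its target like this:

{{code language="python"}}
def check_metric(metric):
    """Return an alert message when a metric exceeds its target (illustrative rule)."""
    if metric["value"] > metric["target"]:
        return "ALERT: {type} in {category} is {value} (target {target})".format(**metric)
    return None

# Uses the example record from this section
metric = {"type": "ErrorRate", "category": "Politics", "value": 0.12,
          "target": 0.10, "timestamp": "2025-12-17"}
print(check_metric(metric))
{{/code}}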
275 +
227 227  === 1.10 ErrorPattern ===
277 +
228 228  **Fields**: error_category, claim_id, description, root_cause, frequency, status
229 229  **Purpose**: Capture errors to trigger system improvements
230 230  **Usage**:
281 +
231 231  * **Error capture**: When users flag issues or system detects problems
232 232  * **Pattern analysis**: Weekly grouping by category and frequency
233 233  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -234,8 +234,11 @@
234 234  * **Metrics**: Track error rate reduction over time
235 235  **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.4 Core Data Model ERD ==

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.5 User Class Diagram ==
236 236  {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
288 +
237 237  == 2. Versioning Strategy ==
290 +
238 238  **All Content Entities Are Versioned**:
292 +
239 239  * **Claim**: Every edit creates new version (V1→V2→V3...)
240 240  * **Evidence**: Changes tracked in edit history
241 241  * **Scenario**: Modifications versioned
... ... @@ -253,68 +253,91 @@
253 253  ```
254 254  Claim V1: "The sky is blue" → User edits → Claim V2: "The sky is blue during daytime" → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
255 255  ```
310 +
256 256  == 2.5. Storage vs Computation Strategy ==
312 +
257 257  **Critical architectural decision**: What to persist in databases vs compute dynamically?
258 258  **Trade-off**:
315 +
259 259  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
260 260  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
318 +
261 261  === Recommendation: Hybrid Approach ===
320 +
262 262  **STORE (in PostgreSQL):**
322 +
263 263  ==== Claims (Current State + History) ====
324 +
264 264  * **What**: assertion, domain, status, created_at, updated_at, version
265 265  * **Why**: Core entity, must be persistent
266 266  * **Also store**: confidence_score (computed once, then cached)
267 -* **Size**: ~1 KB per claim
328 +* **Size**: 1 KB per claim
268 268  * **Growth**: Linear with claims
269 269  * **Decision**: ✅ STORE - Essential
331 +
270 270  ==== Evidence (All Records) ====
333 +
271 271  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
272 272  * **Why**: Hard to re-gather, user contributions, reproducibility
273 -* **Size**: ~2 KB per evidence (with excerpt)
336 +* **Size**: 2 KB per evidence (with excerpt)
274 274  * **Growth**: 3-10 evidence per claim
275 275  * **Decision**: ✅ STORE - Essential for reproducibility
339 +
276 276  ==== Sources (Track Records) ====
341 +
277 277  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
278 278  * **Why**: Continuously updated, expensive to recompute
279 -* **Size**: ~500 bytes per source
344 +* **Size**: 500 bytes per source
280 280  * **Growth**: Slow (limited number of sources)
281 281  * **Decision**: ✅ STORE - Essential for quality
347 +
282 282  ==== Edit History (All Versions) ====
349 +
283 283  * **What**: before_state, after_state, user_id, reason, timestamp
284 284  * **Why**: Audit trail, legal requirement, reproducibility
285 -* **Size**: ~2 KB per edit
286 -* **Growth**: Linear with edits (~A portion of claims get edited)
352 +* **Size**: 2 KB per edit
353 +* **Growth**: Linear with edits (about 20% of claims get edited)
287 287  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
288 288  * **Decision**: ✅ STORE - Essential for accountability
356 +
289 289  ==== Flags (User Reports) ====
358 +
290 290  * **What**: entity_id, reported_by, issue_type, description, status
291 291  * **Why**: Error detection, system improvement triggers
292 -* **Size**: ~500 bytes per flag
361 +* **Size**: 500 bytes per flag
293 293  * **Growth**: 5-high percentage of claims get flagged
294 294  * **Decision**: ✅ STORE - Essential for improvement
364 +
295 295  ==== ErrorPatterns (System Improvement) ====
366 +
296 296  * **What**: error_category, claim_id, description, root_cause, frequency, status
297 297  * **Why**: Learning loop, prevent recurring errors
298 -* **Size**: ~1 KB per pattern
369 +* **Size**: 1 KB per pattern
299 299  * **Growth**: Slow (limited patterns, many fixed)
300 300  * **Decision**: ✅ STORE - Essential for learning
372 +
301 301  ==== QualityMetrics (Time Series) ====
374 +
302 302  * **What**: metric_type, category, value, target, timestamp
303 303  * **Why**: Trend analysis, cannot recreate historical metrics
304 -* **Size**: ~200 bytes per metric
377 +* **Size**: 200 bytes per metric
305 305  * **Growth**: Hourly = 8,760 per year per metric type
306 306  * **Retention**: 2 years hot, then aggregate and archive
307 307  * **Decision**: ✅ STORE - Essential for monitoring
308 308  **STORE (Computed Once, Then Cached):**
382 +
309 309  ==== Analysis Summary ====
384 +
310 310  * **What**: Neutral text summary of claim analysis (200-500 words)
311 311  * **Computed**: Once by AKEL when claim first analyzed
312 312  * **Stored in**: Claim table (text field)
313 313  * **Recomputed**: Only when system significantly improves OR claim edited
314 314  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
315 -* **Size**: ~2 KB per claim
390 +* **Size**: 2 KB per claim
316 316  * **Decision**: ✅ STORE (cached) - Cost-effective
392 +
317 317  ==== Confidence Score ====
394 +
318 318  * **What**: 0-100 score of analysis confidence
319 319  * **Computed**: Once by AKEL
320 320  * **Stored in**: Claim table (integer field)
... ... @@ -322,7 +322,9 @@
322 322  * **Why store**: Cheap to store, expensive to compute, users need it fast
323 323  * **Size**: 4 bytes per claim
324 324  * **Decision**: ✅ STORE (cached) - Performance critical
402 +
325 325  ==== Risk Score ====
404 +
326 326  * **What**: 0-100 score of claim risk level
327 327  * **Computed**: Once by AKEL
328 328  * **Stored in**: Claim table (integer field)
... ... @@ -331,24 +331,33 @@
331 331  * **Size**: 4 bytes per claim
332 332  * **Decision**: ✅ STORE (cached) - Performance critical
333 333  **COMPUTE DYNAMICALLY (Never Store):**
334 -==== Scenarios ==== ⚠️ CRITICAL DECISION
413 +
414 +==== Scenarios ====
415 +
416 + ⚠️ CRITICAL DECISION
417 +
335 335  * **What**: 2-5 possible interpretations of claim with assumptions
336 336  * **Current design**: Stored in Scenario table
337 337  * **Alternative**: Compute on-demand when user views claim details
338 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
421 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
339 339  * **Compute cost**: $0.005-0.01 per request (LLM API call)
340 -* **Frequency**: Viewed in detail by ~20% of users
423 +* **Frequency**: Viewed in detail by 20% of users
341 341  * **Trade-off analysis**: - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
342 342  * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
343 343  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
344 344  * **Decision**: ✅ STORE (hybrid approach below)
345 345  **Scenario Strategy** (APPROVED):
429 +
346 346  1. **Store scenarios** initially when claim analyzed
347 347  2. **Mark as stale** when system improves significantly
348 348  3. **Recompute on next view** if marked stale
349 349  4. **Cache for 30 days** if frequently accessed
350 350  5. **Result**: Best of both worlds - speed + freshness
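A minimal sketch of this approved strategy, using an in-memory claim record and a stub `generate_scenarios` that stands in for the AKEL/LLM call; names and structure are assumptions for illustration:

{{code language="python"}}
def generate_scenarios(claim):
    """Stand-in for the AKEL scenario extraction call (hypothetical)."""
    return [{"description": "Interpretation A", "assumptions": ["..."], "confidence": 0.7}]

def get_scenarios(claim):
    """Serve stored scenarios; recompute on next view only when marked stale."""
    if claim.get("scenarios") and not claim.get("scenarios_stale"):
        return claim["scenarios"]                   # stored path: instant
    claim["scenarios"] = generate_scenarios(claim)  # recompute path: one LLM call
    claim["scenarios_stale"] = False
    return claim["scenarios"]

def mark_scenarios_stale(claim):
    """Run when the system improves significantly; the next view triggers recomputation."""
    claim["scenarios_stale"] = True

# Illustrative flow
claim = {"id": "c1", "scenarios": None}
get_scenarios(claim)          # initial analysis stores scenarios
mark_scenarios_stale(claim)   # system improvement deployed
get_scenarios(claim)          # recomputed on next view
{{/code}}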
351 -==== Verdict Synthesis ==== * **What**: Final conclusion text synthesizing all scenarios
435 +
436 +==== Verdict Synthesis ====
437 +
438 +* **What**: Final conclusion text synthesizing all scenarios
439 +
352 352  * **Compute cost**: $0.002-0.005 per request
353 353  * **Frequency**: Every time claim viewed
354 354  * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
... ... @@ -355,17 +355,23 @@
355 355  * **Speed**: 2-3 seconds (acceptable)
356 356  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
357 357  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
446 +
358 358  ==== Search Results ====
448 +
359 359  * **What**: Lists of claims matching search query
360 360  * **Compute from**: Elasticsearch index
361 361  * **Cache**: 15 minutes in Redis for popular queries
362 362  * **Why not store permanently**: Constantly changing, infinite possible queries
453 +
363 363  ==== Aggregated Statistics ====
455 +
364 364  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
365 365  * **Compute from**: Database queries
366 366  * **Cache**: 1 hour in Redis
367 367  * **Why not store**: Can be derived, relatively cheap to compute
460 +
368 368  ==== User Reputation ====
462 +
369 369  * **What**: Score based on contributions
370 370  * **Current design**: Stored in User table
371 371  * **Alternative**: Compute from Edit table
... ... @@ -373,36 +373,42 @@
373 373  * **Frequency**: Read on every user action
374 374  * **Compute cost**: Simple COUNT query, milliseconds
375 375  * **Decision**: ✅ STORE - Performance critical, read-heavy
470 +
376 376  === Summary Table ===
377 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
378 -|-----------|---------|---------|----------------|----------|-----------|
379 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
380 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
381 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
382 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
383 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
384 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
385 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
386 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
387 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
388 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
389 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
390 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
391 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
472 +
473 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
474 +|-----------|---------|---------|----------------|----------|-----------|
475 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
476 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
477 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
478 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
479 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
480 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
481 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
482 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
483 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
484 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
485 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
486 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
487 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
392 392  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
393 -**Total storage per claim**: ~18 KB (without edits and flags)
489 +**Total storage per claim**: 18 KB (without edits and flags)
394 394  **For 1 million claims**:
395 -* **Storage**: ~18 GB (manageable)
396 -* **PostgreSQL**: ~$50/month (standard instance)
397 -* **Redis cache**: ~$20/month (1 GB instance)
398 -* **S3 archives**: ~$5/month (old edits)
399 -* **Total**: ~$75/month infrastructure
491 +
492 +* **Storage**: 18 GB (manageable)
493 +* **PostgreSQL**: $50/month (standard instance)
494 +* **Redis cache**: $20/month (1 GB instance)
495 +* **S3 archives**: $5/month (old edits)
496 +* **Total**: $75/month infrastructure
400 400  **LLM cost savings by caching**:
401 401  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
402 402  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
403 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
500 +* **Total savings**: $35K per 1M claims vs recomputing every time
501 +
404 404  === Recomputation Triggers ===
503 +
405 405  **When to mark cached data as stale and recompute:**
505 +
406 406  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
407 407  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
408 408  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -409,11 +409,15 @@
409 409  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
410 410  5. **Controversy detected** (high flag rate) → Recompute: risk score
411 411  **Recomputation strategy**:
512 +
412 412  * **Eager**: Immediately recompute (for user edits)
413 413  * **Lazy**: Recompute on next view (for system improvements)
414 414  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
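The triggers above can also be read as a lookup from trigger to the cached artifacts that become stale; the mapping below is illustrative and the trigger names are assumptions:

{{code language="python"}}
# Which cached artifacts to invalidate per trigger (illustrative constant).
RECOMPUTE_ON = {
    "claim_edited":            {"analysis", "scenarios", "verdict", "scores"},  # eager
    "evidence_added":          {"scenarios", "verdict", "confidence_score"},
    "source_score_shift_10pt": {"confidence_score", "verdict"},
    "system_improvement":      {"analysis", "scenarios", "verdict"},            # lazy, on next view
    "controversy_detected":    {"risk_score"},
}

def stale_fields(trigger):
    """Return the cached fields to mark stale for a given trigger."""
    return RECOMPUTE_ON.get(trigger, set())

print(stale_fields("evidence_added"))
{{/code}}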
516 +
415 415  === Database Size Projection ===
518 +
416 416  **Year 1**: 10K claims
520 +
417 417  * Storage: 180 MB
418 418  * Cost: $10/month
419 419  **Year 3**: 100K claims
* Storage: 1.8 GB
... ... @@ -425,15 +425,21 @@
425 425  * Cost: $300/month
426 426  * Optimization: Archive old claims to S3 ($5/TB/month)
427 427  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
532 +
428 428  == 3. Key Simplifications ==
534 +
429 429  * **Two content states only**: Published, Hidden
430 430  * **Three user roles only**: Reader, Contributor, Moderator
431 431  * **No complex versioning**: Linear edit history
432 432  * **Reputation-based permissions**: Not role hierarchy
433 433  * **Source track records**: Continuous evaluation
540 +
434 434  == 3. What Gets Stored in the Database ==
542 +
435 435  === 3.1 Primary Storage (PostgreSQL) ===
544 +
436 436  **Claims Table**:
546 +
437 437  * Current state only (latest version)
438 438  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
439 439  **Evidence Table**:
... ... @@ -460,31 +460,44 @@
460 460  **QualityMetric Table**:
461 461  * Time-series quality data
462 462  * Fields: id, metric_type, metric_category, value, target, timestamp
573 +
463 463  === 3.2 What's NOT Stored (Computed on-the-fly) ===
575 +
464 464  * **Verdicts**: Synthesized from evidence + scenarios when requested
465 465  * **Risk scores**: Recalculated based on current factors
466 466  * **Aggregated statistics**: Computed from base data
467 467  * **Search results**: Generated from Elasticsearch index
580 +
468 468  === 3.3 Cache Layer (Redis) ===
582 +
469 469  **Cached for performance**:
584 +
470 470  * Frequently accessed claims (TTL: 1 hour)
471 471  * Search results (TTL: 15 minutes)
472 472  * User sessions (TTL: 24 hours)
473 473  * Source track records (TTL: 1 hour)
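A short cache-aside sketch matching these TTLs, using redis-py; the library choice, key naming, and the stub PostgreSQL helper are assumptions rather than mandated parts of the design:

{{code language="python"}}
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_claim_from_postgres(claim_id):
    """Stand-in for the real PostgreSQL read (hypothetical)."""
    return {"id": claim_id, "assertion": "...", "status": "Published"}

def get_claim_cached(claim_id):
    """Cache-aside read: serve from Redis, fall back to PostgreSQL, cache for 1 hour."""
    key = "claim:" + claim_id
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    claim = load_claim_from_postgres(claim_id)
    r.setex(key, 3600, json.dumps(claim))  # TTL: 1 hour, per the list above
    return claim
{{/code}}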
589 +
474 474  === 3.4 File Storage (S3) ===
591 +
475 475  **Archived content**:
593 +
476 476  * Old edit history (>3 months)
477 477  * Evidence documents (archived copies)
478 478  * Database backups
479 479  * Export files
598 +
480 480  === 3.5 Search Index (Elasticsearch) ===
600 +
481 481  **Indexed for search**:
602 +
482 482  * Claim assertions (full-text)
483 483  * Evidence excerpts (full-text)
484 484  * Scenario descriptions (full-text)
485 485  * Source names (autocomplete)
486 486  Synchronized from PostgreSQL via change data capture or periodic sync.
608 +
487 487  == 4. Related Pages ==
488 -* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
610 +
611 +* [[Architecture>>Test.FactHarbor pre11 V0\.9\.70.Specification.Architecture.WebHome]]
489 489  * [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
490 490  * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]