Changes for page Data Model

Last modified by Robert Schaub on 2026/02/08 08:27

From version 3.1
edited by Robert Schaub
on 2025/12/19 14:41
Change comment: Imported from XAR
To version 4.3
edited by Robert Schaub
on 2025/12/21 13:38
Change comment: Renamed back-links.

Summary

Details

Page properties
Content
... ... @@ -1,25 +1,32 @@
1 1  = Data Model =
2 +
2 2  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 +
3 3  == 1. Core Entities ==
6 +
4 4  === 1.1 Claim ===
8 +
5 5  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 +
6 6  ==== Performance Optimization: Denormalized Fields ====
12 +
7 7  **Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 8  **Additional cached fields in claims table**:
15 +
9 9  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
10 - * Avoids joining evidence table for listing/preview
11 - * Updated when evidence is added/removed
12 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 +* Avoids joining evidence table for listing/preview
18 +* Updated when evidence is added/removed
19 +* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
13 13  * **source_names** (TEXT[]): Array of source names for quick display
14 - * Avoids joining through evidence to sources
15 - * Updated when sources change
16 - * Format: `["New York Times", "Nature Journal", ...]`
21 +* Avoids joining through evidence to sources
22 +* Updated when sources change
23 +* Format: `["New York Times", "Nature Journal", ...]`
17 17  * **scenario_count** (INTEGER): Number of scenarios for this claim
18 - * Quick metric without counting rows
19 - * Updated when scenarios added/removed
25 +* Quick metric without counting rows
26 +* Updated when scenarios added/removed
20 20  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
21 - * Helps invalidate stale caches
22 - * Triggers background refresh if too old
28 +* Helps invalidate stale caches
29 +* Triggers background refresh if too old
23 23  **Update Strategy**:
24 24  * **Immediate**: Update on claim edit (user-facing)
25 25  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -28,13 +28,18 @@
28 28  * ✅ 70% fewer joins on common queries
29 29  * ✅ Much faster claim list/search pages
30 30  * ✅ Better user experience
31 -* ⚠️ Small storage increase (~10%)
38 +* ⚠️ Small storage increase (10%)
32 32  * ⚠️ Need to keep caches in sync
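
The cache-sync trade-off above can be sketched as follows. This is a minimal illustration under assumed in-memory data shapes, not the implementation; `build_evidence_summary` and `is_cache_stale` are hypothetical helpers, while the field names (`evidence_summary`, `cache_updated_at`) and the hourly deferred refresh come from this section.

```python
from datetime import datetime, timedelta, timezone

CACHE_MAX_AGE = timedelta(hours=1)  # matches the hourly deferred update

def build_evidence_summary(evidence_rows, top_n=5):
    """Top-N most relevant snippets, in the evidence_summary (JSONB) format."""
    ranked = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)
    return [{"text": e["text"], "source": e["source"], "relevance": e["relevance"]}
            for e in ranked[:top_n]]

def is_cache_stale(cache_updated_at, now=None):
    """True when the denormalized data is older than the refresh window."""
    now = now or datetime.now(timezone.utc)
    return cache_updated_at is None or now - cache_updated_at > CACHE_MAX_AGE
```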
40 +
33 33  === 1.2 Evidence ===
42 +
34 34  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
44 +
35 35  === 1.3 Source ===
46 +
36 36  **Purpose**: Track reliability of information sources over time
37 37  **Fields**:
49 +
38 38  * **id** (UUID): Unique identifier
39 39  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
40 40  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -52,17 +52,21 @@
52 52  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
53 53  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
54 54  **Key**: Automated source reliability tracking
67 +
55 55  ==== Source Scoring Process (Separation of Concerns) ====
69 +
56 56  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
57 57  **The Problem**:
72 +
58 58  * Source scores should influence claim verdicts
59 59  * Claim verdicts should update source scores
60 60  * But: Direct feedback creates circular dependency and potential feedback loops
61 61  **The Solution**: Temporal separation
77 +
62 62  ==== Weekly Background Job (Source Scoring) ====
79 +
63 63  Runs independently of claim analysis:
64 -{{code language="python"}}
65 -def update_source_scores_weekly():
81 +{{code language="python"}}def update_source_scores_weekly():
66 66   """
67 67   Background job: Calculate source reliability
68 68   Never triggered by individual claim analysis
... ... @@ -82,12 +82,12 @@
82 82   source.last_updated = now()
83 83   source.save()
84 84   # Job runs: Sunday 2 AM UTC
85 - # Never during claim processing
86 -{{/code}}
101 + # Never during claim processing{{/code}}
102 +
87 87  ==== Real-Time Claim Analysis (AKEL) ====
104 +
88 88  Uses source scores but never updates them:
89 -{{code language="python"}}
90 -def analyze_claim(claim_text):
106 +{{code language="python"}}def analyze_claim(claim_text):
91 91   """
92 92   Real-time: Analyze claim using current source scores
93 93   READ source scores, never UPDATE them
... ... @@ -104,10 +104,12 @@
104 104   verdict = synthesize_verdict(evidence_list)
105 105   # NEVER update source scores here
106 106   # That happens in weekly background job
107 - return verdict
108 -{{/code}}
123 + return verdict{{/code}}
124 +
109 109  ==== Monthly Audit (Quality Assurance) ====
126 +
110 110  Moderator review of flagged source scores:
128 +
111 111  * Verify scores make sense
112 112  * Detect gaming attempts
113 113  * Identify systematic biases
... ... @@ -147,11 +147,14 @@
147 147   → NYT score: 0.89 (trending up)
148 148   → Blog X score: 0.48 (trending down)
149 149  ```
168 +
150 150  === 1.4 Scenario ===
170 +
151 151  **Purpose**: Different interpretations or contexts for evaluating claims
152 152  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
153 153  **Relationship**: One-to-many with Claims (**simplified for V1.0**: each scenario belongs to a single claim)
154 154  **Fields**:
175 +
155 155  * **id** (UUID): Unique identifier
156 156  * **claim_id** (UUID): Foreign key to claim (one-to-many)
157 157  * **description** (text): Human-readable description of the scenario
... ... @@ -172,33 +172,42 @@
172 172  **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
173 173  
174 174  **Core Fields**:
196 +
175 175  * **id** (UUID): Primary key
176 176  * **scenario_id** (UUID FK): The scenario being assessed
177 -* **created_at** (timestamp): When verdict was first created
178 -
179 -**Versioned via VERDICT_VERSION**: Verdicts evolve as new evidence emerges or analysis improves. Each version captures:
180 180  * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
181 181  * **confidence** (decimal 0-1): How confident we are in this assessment
182 182  * **explanation_summary** (text): Human-readable reasoning explaining the verdict
183 183  * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
184 -* **created_at** (timestamp): When this version was generated
203 +* **created_at** (timestamp): When verdict was created
204 +* **updated_at** (timestamp): Last modification
185 185  
186 -**Relationship**: Each Scenario has multiple Verdicts over time (as understanding evolves). Each Verdict has multiple versions.
206 +**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
187 187  
208 +**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
209 +
188 188  **Example**:
189 189  For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
190 -* Initial verdict (v1): likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
191 -* Updated verdict (v2): likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
192 192  
193 -**Key Design**: Separating Verdict from Scenario allows tracking how our understanding evolves without losing history.
213 +* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
214 +* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
215 +* Edit entity records the complete before/after change with timestamp and reason
194 194  
217 +**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
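
A mutable-verdict update with Edit logging might look like the sketch below. The dict/list stand-ins for the tables and the `CONTENT_UPDATE` edit-type value are assumptions; the field names follow sections 1.5 and 1.7.

```python
import json
from datetime import datetime, timezone

def update_verdict(verdict, changes, user_id, reason, edit_log):
    """Mutate a verdict in place and record before/after states in the Edit log."""
    before = dict(verdict)
    verdict.update(changes)
    verdict["updated_at"] = datetime.now(timezone.utc).isoformat()
    edit_log.append({
        "entity_type": "verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,
        "before_state": json.dumps(before),   # full pre-change snapshot
        "after_state": json.dumps(verdict),   # full post-change snapshot
        "edit_type": "CONTENT_UPDATE",        # assumed edit-type value
        "reason": reason,
        "created_at": verdict["updated_at"],
    })
    return verdict
```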
218 +
195 195  === 1.6 User ===
220 +
196 196  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
197 -=== User Reputation System ==
222 +
223 +=== User Reputation System ===
224 +
198 198  **V1.0 Approach**: Simple manual role assignment
199 199  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
227 +
200 200  === Roles (Manual Assignment) ===
229 +
201 201  **reader** (default):
231 +
202 202  * View published claims and evidence
203 203  * Browse and search content
204 204  * No editing permissions
... ... @@ -217,8 +217,11 @@
217 217  * System configuration
218 218  * Access to all features
219 219  * Founder-appointed initially
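
A minimal permission check for manual role assignment could look like this; the exact permission sets per role are illustrative assumptions, not a specification.

```python
# Hypothetical permission map for the V1.0 roles (admin included per the list above).
PERMISSIONS = {
    "reader":      {"view", "search"},
    "contributor": {"view", "search", "edit"},
    "moderator":   {"view", "search", "edit", "hide", "promote"},
    "admin":       {"view", "search", "edit", "hide", "promote", "configure"},
}

def can(role, action):
    """True if the role grants the action; unknown roles grant nothing."""
    return action in PERMISSIONS.get(role, set())
```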
250 +
220 220  === Contribution Tracking (Simple) ===
252 +
221 221  **Basic metrics only**:
254 +
222 222  * `contributions_count`: Total number of contributions
223 223  * `created_at`: Account age
224 224  * `last_active`: Recent activity
... ... @@ -227,19 +227,26 @@
227 227  * No automated privilege escalation
228 228  * No reputation decay
229 229  * No threshold-based promotions
263 +
230 230  === Promotion Process ===
265 +
231 231  **Manual review by moderators/admins**:
267 +
232 232  1. User demonstrates value through contributions
233 233  2. Moderator reviews user's contribution history
234 234  3. Moderator promotes user to contributor role
235 235  4. Admin promotes trusted contributors to moderator
236 236  **Criteria** (guidelines, not automated):
273 +
237 237  * Quality of contributions
238 238  * Consistency over time
239 239  * Collaborative behavior
240 240  * Understanding of project goals
278 +
241 241  === V2.0+ Evolution ===
280 +
242 242  **Add complex reputation when**:
282 +
243 243  * 100+ active contributors
244 244  * Manual role management becomes bottleneck
245 245  * Clear patterns of abuse emerge requiring automation
... ... @@ -249,11 +249,16 @@
249 249  * Reputation decay for inactive users
250 250  * Track record scoring for contributors
251 251  See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
292 +
252 252  === 1.7 Edit ===
294 +
253 253  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
254 254  **Purpose**: Complete audit trail for all content changes
297 +
255 255  === Edit History Details ===
299 +
256 256  **What Gets Edited**:
301 +
257 257  * **Claims** (20% edited): assertion, domain, status, scores, analysis
258 258  * **Evidence** (10% edited): excerpt, relevance_score, supports
259 259  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -270,12 +270,14 @@
270 270  * `MODERATION_ACTION`: Hide/unhide for abuse
271 271  * `REVERT`: Rollback to previous version
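
A `REVERT` can be sketched as restoring the `before_state` JSON captured by an edit record (a minimal illustration; `revert_entity` is a hypothetical helper name):

```python
import json

def revert_entity(entity, edit_record):
    """Roll an entity back to the before_state captured in an edit record."""
    restored = json.loads(edit_record["before_state"])
    entity.clear()
    entity.update(restored)
    return entity
```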
272 272  **Retention Policy** (5 years total):
318 +
273 273  1. **Hot storage** (3 months): PostgreSQL, instant access
274 274  2. **Warm storage** (2 years): Partitioned, slower queries
275 275  3. **Cold storage** (3 years): S3 compressed, download required
276 276  4. **Deletion**: After 5 years (except legal holds)
277 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
323 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
278 278  **Use Cases**:
325 +
279 279  * View claim history timeline
280 280  * Detect vandalism patterns
281 281  * Learn from user corrections (system improvement)
... ... @@ -282,12 +282,17 @@
282 282  * Legal compliance (audit trail)
283 283  * Rollback capability
284 284  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
332 +
285 285  === 1.8 Flag ===
334 +
286 286  Fields: entity_id, reported_by, issue_type, status, resolution_note
287 -=== 1.9 QualityMetric ===
336 +
337 +=== 1.9 QualityMetric ===
338 +
288 288  **Fields**: metric_type, category, value, target, timestamp
289 289  **Purpose**: Time-series quality tracking
290 290  **Usage**:
342 +
291 291  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
292 292  * **Quality dashboard**: Real-time display with trend charts
293 293  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -294,10 +294,13 @@
294 294  * **A/B testing**: Compare control vs treatment metrics
295 295  * **Improvement validation**: Measure before/after changes
296 296  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
297 -=== 1.10 ErrorPattern ===
349 +
350 +=== 1.10 ErrorPattern ===
351 +
298 298  **Fields**: error_category, claim_id, description, root_cause, frequency, status
299 299  **Purpose**: Capture errors to trigger system improvements
300 300  **Usage**:
355 +
301 301  * **Error capture**: When users flag issues or system detects problems
302 302  * **Pattern analysis**: Weekly grouping by category and frequency
303 303  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -309,9 +309,13 @@
309 309  {{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
310 310  
311 311  == 1.5 User Class Diagram ==
367 +
312 312  {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
369 +
313 313  == 2. Versioning Strategy ==
371 +
314 314  **All Content Entities Are Versioned**:
373 +
315 315  * **Claim**: Every edit creates new version (V1→V2→V3...)
316 316  * **Evidence**: Changes tracked in edit history
317 317  * **Scenario**: Modifications versioned
... ... @@ -332,68 +332,91 @@
332 332  Claim V2: "The sky is blue during daytime"
333 333   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
334 334  ```
394 +
335 335  == 2.5. Storage vs Computation Strategy ==
396 +
336 336  **Critical architectural decision**: What to persist in databases vs compute dynamically?
337 337  **Trade-off**:
399 +
338 338  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
339 339  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
402 +
340 340  === Recommendation: Hybrid Approach ===
404 +
341 341  **STORE (in PostgreSQL):**
406 +
342 342  ==== Claims (Current State + History) ====
408 +
343 343  * **What**: assertion, domain, status, created_at, updated_at, version
344 344  * **Why**: Core entity, must be persistent
345 345  * **Also store**: confidence_score (computed once, then cached)
346 -* **Size**: ~1 KB per claim
412 +* **Size**: 1 KB per claim
347 347  * **Growth**: Linear with claims
348 348  * **Decision**: ✅ STORE - Essential
415 +
349 349  ==== Evidence (All Records) ====
417 +
350 350  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
351 351  * **Why**: Hard to re-gather, user contributions, reproducibility
352 -* **Size**: ~2 KB per evidence (with excerpt)
420 +* **Size**: 2 KB per evidence (with excerpt)
353 353  * **Growth**: 3-10 evidence per claim
354 354  * **Decision**: ✅ STORE - Essential for reproducibility
423 +
355 355  ==== Sources (Track Records) ====
425 +
356 356  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
357 357  * **Why**: Continuously updated, expensive to recompute
358 -* **Size**: ~500 bytes per source
428 +* **Size**: 500 bytes per source
359 359  * **Growth**: Slow (limited number of sources)
360 360  * **Decision**: ✅ STORE - Essential for quality
431 +
361 361  ==== Edit History (All Versions) ====
433 +
362 362  * **What**: before_state, after_state, user_id, reason, timestamp
363 363  * **Why**: Audit trail, legal requirement, reproducibility
364 -* **Size**: ~2 KB per edit
365 -* **Growth**: Linear with edits (~A portion of claims get edited)
436 +* **Size**: 2 KB per edit
437 +* **Growth**: Linear with edits (about 20% of claims get edited)
366 366  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
367 367  * **Decision**: ✅ STORE - Essential for accountability
440 +
368 368  ==== Flags (User Reports) ====
442 +
369 369  * **What**: entity_id, reported_by, issue_type, description, status
370 370  * **Why**: Error detection, system improvement triggers
371 -* **Size**: ~500 bytes per flag
445 +* **Size**: 500 bytes per flag
372 372  * **Growth**: 5-10% of claims get flagged
373 373  * **Decision**: ✅ STORE - Essential for improvement
448 +
374 374  ==== ErrorPatterns (System Improvement) ====
450 +
375 375  * **What**: error_category, claim_id, description, root_cause, frequency, status
376 376  * **Why**: Learning loop, prevent recurring errors
377 -* **Size**: ~1 KB per pattern
453 +* **Size**: 1 KB per pattern
378 378  * **Growth**: Slow (limited patterns, many fixed)
379 379  * **Decision**: ✅ STORE - Essential for learning
456 +
380 380  ==== QualityMetrics (Time Series) ====
458 +
381 381  * **What**: metric_type, category, value, target, timestamp
382 382  * **Why**: Trend analysis, cannot recreate historical metrics
383 -* **Size**: ~200 bytes per metric
461 +* **Size**: 200 bytes per metric
384 384  * **Growth**: Hourly = 8,760 per year per metric type
385 385  * **Retention**: 2 years hot, then aggregate and archive
386 386  * **Decision**: ✅ STORE - Essential for monitoring
387 387  **STORE (Computed Once, Then Cached):**
466 +
388 388  ==== Analysis Summary ====
468 +
389 389  * **What**: Neutral text summary of claim analysis (200-500 words)
390 390  * **Computed**: Once by AKEL when claim first analyzed
391 391  * **Stored in**: Claim table (text field)
392 392  * **Recomputed**: Only when system significantly improves OR claim edited
393 393  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
394 -* **Size**: ~2 KB per claim
474 +* **Size**: 2 KB per claim
395 395  * **Decision**: ✅ STORE (cached) - Cost-effective
476 +
396 396  ==== Confidence Score ====
478 +
397 397  * **What**: 0-100 score of analysis confidence
398 398  * **Computed**: Once by AKEL
399 399  * **Stored in**: Claim table (integer field)
... ... @@ -401,7 +401,9 @@
401 401  * **Why store**: Cheap to store, expensive to compute, users need it fast
402 402  * **Size**: 4 bytes per claim
403 403  * **Decision**: ✅ STORE (cached) - Performance critical
486 +
404 404  ==== Risk Score ====
488 +
405 405  * **What**: 0-100 score of claim risk level
406 406  * **Computed**: Once by AKEL
407 407  * **Stored in**: Claim table (integer field)
... ... @@ -410,13 +410,17 @@
410 410  * **Size**: 4 bytes per claim
411 411  * **Decision**: ✅ STORE (cached) - Performance critical
412 412  **COMPUTE DYNAMICALLY (Never Store):**
413 -==== Scenarios ==== ⚠️ CRITICAL DECISION
497 +
498 +==== Scenarios ====
499 +
500 + ⚠️ CRITICAL DECISION
501 +
414 414  * **What**: 2-5 possible interpretations of claim with assumptions
415 415  * **Current design**: Stored in Scenario table
416 416  * **Alternative**: Compute on-demand when user views claim details
417 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
505 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
418 418  * **Compute cost**: $0.005-0.01 per request (LLM API call)
419 -* **Frequency**: Viewed in detail by ~20% of users
507 +* **Frequency**: Viewed in detail by 20% of users
420 420  * **Trade-off analysis**:
421 421   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
422 422   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -424,12 +424,17 @@
424 424  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
425 425  * **Decision**: ✅ STORE (hybrid approach below)
426 426  **Scenario Strategy** (APPROVED):
515 +
427 427  1. **Store scenarios** initially when claim analyzed
428 428  2. **Mark as stale** when system improves significantly
429 429  3. **Recompute on next view** if marked stale
430 430  4. **Cache for 30 days** if frequently accessed
431 431  5. **Result**: Best of both worlds - speed + freshness
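
The approved strategy above can be sketched as a lazy cache. `get_scenarios` and the field names `scenarios_stale`/`scenarios_cached_at` are hypothetical; only the 30-day TTL and the store/stale/recompute-on-view flow come from this section.

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # "Cache for 30 days if frequently accessed"

def get_scenarios(claim, recompute):
    """Serve stored scenarios unless stale or expired; recompute lazily on view.

    `claim` is a dict stand-in for the row; `recompute` is the expensive LLM call.
    """
    now = datetime.now(timezone.utc)
    stale = claim.get("scenarios_stale", False)
    expired = now - claim.get("scenarios_cached_at", now) > CACHE_TTL
    if claim.get("scenarios") is None or stale or expired:
        claim["scenarios"] = recompute(claim)
        claim["scenarios_stale"] = False
        claim["scenarios_cached_at"] = now
    return claim["scenarios"]
```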
432 -==== Verdict Synthesis ====
521 +
522 +==== Verdict Synthesis ====
523 +
433 433  * **What**: Final conclusion text synthesizing all scenarios
434 434  * **Compute cost**: $0.002-0.005 per request
435 435  * **Frequency**: Every time claim viewed
... ... @@ -437,17 +437,23 @@
437 437  * **Speed**: 2-3 seconds (acceptable)
438 438  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
439 439  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
533 +
440 440  ==== Search Results ====
535 +
441 441  * **What**: Lists of claims matching search query
442 442  * **Compute from**: Elasticsearch index
443 443  * **Cache**: 15 minutes in Redis for popular queries
444 444  * **Why not store permanently**: Constantly changing, infinite possible queries
540 +
445 445  ==== Aggregated Statistics ====
542 +
446 446  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
447 447  * **Compute from**: Database queries
448 448  * **Cache**: 1 hour in Redis
449 449  * **Why not store**: Can be derived, relatively cheap to compute
547 +
450 450  ==== User Reputation ====
549 +
451 451  * **What**: Score based on contributions
452 452  * **Current design**: Stored in User table
453 453  * **Alternative**: Compute from Edit table
... ... @@ -457,37 +457,43 @@
457 457  * **Frequency**: Read on every user action
458 458  * **Compute cost**: Simple COUNT query, milliseconds
459 459  * **Decision**: ✅ STORE - Performance critical, read-heavy
559 +
460 460  === Summary Table ===
461 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
462 -|-----------|---------|---------|----------------|----------|-----------|
463 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
464 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
465 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
466 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
467 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
468 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
469 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
470 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
471 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
472 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
473 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
474 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
475 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
561 +
562 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\
563 +|-|-|-|-|-|-|\\
564 +| Claim core | ✅ | - | 1 KB | STORE | Essential |\\
565 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\
566 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\
567 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\
568 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\
569 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
570 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
571 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\
572 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\
573 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\
574 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\
575 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\
576 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\
476 476  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
477 -**Total storage per claim**: ~18 KB (without edits and flags)
578 +**Total storage per claim**: 18 KB (without edits and flags)
478 478  **For 1 million claims**:
479 -* **Storage**: ~18 GB (manageable)
480 -* **PostgreSQL**: ~$50/month (standard instance)
481 -* **Redis cache**: ~$20/month (1 GB instance)
482 -* **S3 archives**: ~$5/month (old edits)
483 -* **Total**: ~$75/month infrastructure
580 +
581 +* **Storage**: 18 GB (manageable)
582 +* **PostgreSQL**: $50/month (standard instance)
583 +* **Redis cache**: $20/month (1 GB instance)
584 +* **S3 archives**: $5/month (old edits)
585 +* **Total**: $75/month infrastructure
484 484  **LLM cost savings by caching**:
485 485  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
486 486  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
487 487  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
488 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
590 +* **Total savings**: $35K per 1M claims vs recomputing every time
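
The savings figures above follow from a back-of-envelope calculation using the per-claim rates stated in this section:

```python
# Savings per 1M claims from caching instead of recomputing (rates as stated above).
CLAIMS = 1_000_000
savings = {
    "analysis_summary": 0.03 * CLAIMS,          # $0.03 saved per claim
    "scenarios": 0.01 * CLAIMS * 0.20,          # $0.01 per claim, 20% viewed in detail
    "verdict": 0.003 * CLAIMS,                  # $0.003 saved per claim
}
total = sum(savings.values())                   # ≈ $35K per 1M claims
```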
591 +
489 489  === Recomputation Triggers ===
593 +
490 490  **When to mark cached data as stale and recompute:**
595 +
491 491  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
492 492  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
493 493  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -494,11 +494,15 @@
494 494  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
495 495  5. **Controversy detected** (high flag rate) → Recompute: risk score
496 496  **Recomputation strategy**:
602 +
497 497  * **Eager**: Immediately recompute (for user edits)
498 498  * **Lazy**: Recompute on next view (for system improvements)
499 499  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
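
The five triggers and three strategies above might be encoded as a simple mapping; the trigger keys and the `mark_stale` helper are hypothetical names for illustration.

```python
# Hypothetical trigger → affected-cached-fields map, following the list above.
RECOMPUTE_ON = {
    "claim_edited":       {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "evidence_added":     {"scenarios", "verdict", "confidence_score"},
    "source_score_shift": {"confidence_score", "verdict"},
    "system_improvement": {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "controversy":        {"risk_score"},
}

EAGER_TRIGGERS = {"claim_edited"}  # user-facing edits recompute immediately

def mark_stale(claim, trigger):
    """Flag the cached fields a trigger invalidates; return the strategy."""
    stale = claim.setdefault("stale_fields", set())
    stale |= RECOMPUTE_ON[trigger]
    return "eager" if trigger in EAGER_TRIGGERS else "lazy"
```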
606 +
500 500  === Database Size Projection ===
608 +
501 501  **Year 1**: 10K claims
610 +
502 502  * Storage: 180 MB
503 503  * Cost: $10/month
504 504  **Year 3**: 100K claims
... ... @@ -512,15 +512,21 @@
512 512  * Cost: $300/month
513 513  * Optimization: Archive old claims to S3 ($5/TB/month)
514 514  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
624 +
515 515  == 3. Key Simplifications ==
626 +
516 516  * **Two content states only**: Published, Hidden
517 517  * **Three user roles only**: Reader, Contributor, Moderator
518 518  * **No complex versioning**: Linear edit history
519 519  * **Reputation-based permissions**: Not role hierarchy
520 520  * **Source track records**: Continuous evaluation
632 +
521 521  == 3. What Gets Stored in the Database ==
634 +
522 522  === 3.1 Primary Storage (PostgreSQL) ===
636 +
523 523  **Claims Table**:
638 +
524 524  * Current state only (latest version)
525 525  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
526 526  **Evidence Table**:
... ... @@ -547,31 +547,44 @@
547 547  **QualityMetric Table**:
548 548  * Time-series quality data
549 549  * Fields: id, metric_type, metric_category, value, target, timestamp
665 +
550 550  === 3.2 What's NOT Stored (Computed on-the-fly) ===
667 +
551 551  * **Verdicts**: Synthesized from evidence + scenarios when requested
552 552  * **Risk scores**: Recalculated based on current factors
553 553  * **Aggregated statistics**: Computed from base data
554 554  * **Search results**: Generated from Elasticsearch index
672 +
555 555  === 3.3 Cache Layer (Redis) ===
674 +
556 556  **Cached for performance**:
676 +
557 557  * Frequently accessed claims (TTL: 1 hour)
558 558  * Search results (TTL: 15 minutes)
559 559  * User sessions (TTL: 24 hours)
560 560  * Source track records (TTL: 1 hour)
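
The TTL policy above can be modeled with an in-memory stand-in for Redis (same set-with-TTL/get semantics); the `TTLCache` class is illustrative, not the production client.

```python
import time

# TTLs from the cache-layer list above, in seconds.
TTL = {"claim": 3600, "search": 900, "session": 86400, "source_score": 3600}

class TTLCache:
    """In-memory stand-in for the Redis layer: entries expire after their TTL."""
    def __init__(self, clock=time.monotonic):
        self._data, self._clock = {}, clock

    def set(self, key, value, kind):
        self._data[key] = (value, self._clock() + TTL[kind])

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self._clock() >= expires:
            del self._data[key]   # evict expired entry
            return None
        return value
```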
681 +
561 561  === 3.4 File Storage (S3) ===
683 +
562 562  **Archived content**:
685 +
563 563  * Old edit history (>3 months)
564 564  * Evidence documents (archived copies)
565 565  * Database backups
566 566  * Export files
690 +
567 567  === 3.5 Search Index (Elasticsearch) ===
692 +
568 568  **Indexed for search**:
694 +
569 569  * Claim assertions (full-text)
570 570  * Evidence excerpts (full-text)
571 571  * Scenario descriptions (full-text)
572 572  * Source names (autocomplete)
573 573  Synchronized from PostgreSQL via change data capture or periodic sync.
700 +
574 574  == 4. Related Pages ==
575 -* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
576 -* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
702 +
703 +* [[Architecture>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Architecture.WebHome]]
704 +* [[Requirements>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Requirements.WebHome]]
577 577  * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]