Changes for page Data Model

Last modified by Robert Schaub on 2025/12/24 21:46

From version 1.1
edited by Robert Schaub
on 2025/12/18 12:03
Change comment: Imported from XAR
To version 4.2
edited by Robert Schaub
on 2025/12/21 13:38
Change comment: Renamed back-links.

Summary

Details

Page properties
Content
... ... @@ -1,25 +1,32 @@
1 1  = Data Model =
2 +
2 2  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 +
3 3  == 1. Core Entities ==
6 +
4 4  === 1.1 Claim ===
8 +
5 5  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 +
6 6  ==== Performance Optimization: Denormalized Fields ====
12 +
7 7  **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 8  **Additional cached fields in claims table**:
15 +
9 9  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
10 - * Avoids joining evidence table for listing/preview
11 - * Updated when evidence is added/removed
12 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 +* Avoids joining evidence table for listing/preview
18 +* Updated when evidence is added/removed
19 +* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
13 13  * **source_names** (TEXT[]): Array of source names for quick display
14 - * Avoids joining through evidence to sources
15 - * Updated when sources change
16 - * Format: `["New York Times", "Nature Journal", ...]`
21 +* Avoids joining through evidence to sources
22 +* Updated when sources change
23 +* Format: `["New York Times", "Nature Journal", ...]`
17 17  * **scenario_count** (INTEGER): Number of scenarios for this claim
18 - * Quick metric without counting rows
19 - * Updated when scenarios added/removed
25 +* Quick metric without counting rows
26 +* Updated when scenarios added/removed
20 20  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
21 - * Helps invalidate stale caches
22 - * Triggers background refresh if too old
28 +* Helps invalidate stale caches
29 +* Triggers background refresh if too old
23 23  **Update Strategy**:
24 24  * **Immediate**: Update on claim edit (user-facing)
25 25  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -28,13 +28,18 @@
28 28  * ✅ 70% fewer joins on common queries
29 29  * ✅ Much faster claim list/search pages
30 30  * ✅ Better user experience
31 -* ⚠️ Small storage increase (~10%)
38 +* ⚠️ Small storage increase (10%)
32 32  * ⚠️ Need to keep caches in sync
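The cache-maintenance trade-off above can be sketched as a small helper that rebuilds the denormalized columns whenever evidence or scenarios change. This is an illustrative sketch only; the function name and dict shapes are assumptions, not the actual schema:

{{code language="python"}}
import json
from datetime import datetime, timezone

def build_claim_cache(evidence_rows, scenario_count):
    """Rebuild the denormalized claim columns described above.

    evidence_rows: list of dicts with "text", "source", "relevance".
    In production the result would feed an UPDATE on the claims table.
    """
    # evidence_summary: top 5 snippets by relevance, stored as JSONB
    top5 = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)[:5]
    return {
        "evidence_summary": json.dumps(
            [{"text": e["text"], "source": e["source"], "relevance": e["relevance"]}
             for e in top5]
        ),
        # source_names: deduplicated names for quick display
        "source_names": sorted({e["source"] for e in evidence_rows}),
        "scenario_count": scenario_count,
        "cache_updated_at": datetime.now(timezone.utc).isoformat(),
    }
{{/code}}

The same helper serves both update paths: called synchronously on claim edits (immediate) and from the hourly background job (deferred).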
40 +
33 33  === 1.2 Evidence ===
42 +
34 34  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
44 +
35 35  === 1.3 Source ===
46 +
36 36  **Purpose**: Track reliability of information sources over time
37 37  **Fields**:
49 +
38 38  * **id** (UUID): Unique identifier
39 39  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
40 40  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -52,17 +52,21 @@
52 52  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
53 53  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
54 54  **Key**: Automated source reliability tracking
67 +
55 55  ==== Source Scoring Process (Separation of Concerns) ====
69 +
56 56  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
57 57  **The Problem**:
72 +
58 58  * Source scores should influence claim verdicts
59 59  * Claim verdicts should update source scores
60 60  * But: Direct feedback creates circular dependency and potential feedback loops
61 61  **The Solution**: Temporal separation
77 +
62 62  ==== Weekly Background Job (Source Scoring) ====
79 +
63 63  Runs independently of claim analysis:
64 -{{code language="python"}}
65 -def update_source_scores_weekly():
81 +{{code language="python"}}def update_source_scores_weekly():
66 66   """
67 67   Background job: Calculate source reliability
68 68   Never triggered by individual claim analysis
... ... @@ -82,12 +82,12 @@
82 82   source.last_updated = now()
83 83   source.save()
84 84   # Job runs: Sunday 2 AM UTC
85 - # Never during claim processing
86 -{{/code}}
101 + # Never during claim processing{{/code}}
102 +
87 87  ==== Real-Time Claim Analysis (AKEL) ====
104 +
88 88  Uses source scores but never updates them:
89 -{{code language="python"}}
90 -def analyze_claim(claim_text):
106 +{{code language="python"}}def analyze_claim(claim_text):
91 91   """
92 92   Real-time: Analyze claim using current source scores
93 93   READ source scores, never UPDATE them
... ... @@ -104,10 +104,12 @@
104 104   verdict = synthesize_verdict(evidence_list)
105 105   # NEVER update source scores here
106 106   # That happens in weekly background job
107 - return verdict
108 -{{/code}}
123 + return verdict{{/code}}
124 +
109 109  ==== Monthly Audit (Quality Assurance) ====
126 +
110 110  Moderator review of flagged source scores:
128 +
111 111  * Verify scores make sense
112 112  * Detect gaming attempts
113 113  * Identify systematic biases
... ... @@ -147,18 +147,19 @@
147 147   → NYT score: 0.89 (trending up)
148 148   → Blog X score: 0.48 (trending down)
149 149  ```
168 +
150 150  === 1.4 Scenario ===
170 +
151 151  **Purpose**: Different interpretations or contexts for evaluating claims
152 152  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
153 153  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to a single claim)
154 154  **Fields**:
175 +
155 155  * **id** (UUID): Unique identifier
156 156  * **claim_id** (UUID): Foreign key to claim (one-to-many)
157 157  * **description** (text): Human-readable description of the scenario
158 158  * **assumptions** (JSONB): Key assumptions that define this scenario context
159 159  * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
160 -* **verdict_summary** (text): Compiled verdict for this scenario
161 -* **confidence** (decimal 0-1): Confidence level for verdict in this scenario
162 162  * **created_at** (timestamp): When scenario was created
163 163  * **updated_at** (timestamp): Last modification
164 164  **How Found**: Evidence search → Extract context → Create scenario → Link to claim
... ... @@ -168,13 +168,48 @@
168 168  * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
169 169  * Scenario 3: "Immunocompromised patients" from specialist study
170 170  **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
171 -=== 1.5 User ===
190 +
191 +=== 1.5 Verdict ===
192 +
193 +**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
194 +
195 +**Core Fields**:
196 +
197 +* **id** (UUID): Primary key
198 +* **scenario_id** (UUID FK): The scenario being assessed
199 +* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
200 +* **confidence** (decimal 0-1): How confident we are in this assessment
201 +* **explanation_summary** (text): Human-readable reasoning explaining the verdict
202 +* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
203 +* **created_at** (timestamp): When verdict was created
204 +* **updated_at** (timestamp): Last modification
205 +
206 +**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
207 +
208 +**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
209 +
210 +**Example**:
211 +For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
212 +
213 +* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
214 +* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
215 +* Edit entity records the complete before/after change with timestamp and reason
216 +
217 +**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
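A minimal sketch of the Verdict entity, with a revision helper that yields the before/after snapshots the Edit entity records. Field names follow the spec above; the Python types and the `revise` helper are assumptions for illustration:

{{code language="python"}}
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now():
    return datetime.now(timezone.utc)

@dataclass
class Verdict:
    # Core fields from section 1.5; Python types are assumptions
    scenario_id: str
    likelihood_range: str            # e.g. "0.40-0.65 (uncertain)"
    confidence: float                # 0-1
    explanation_summary: str
    uncertainty_factors: list = field(default_factory=list)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

    def revise(self, **changes):
        """Mutate the verdict in place and return (before, after)
        snapshots for the Edit entity (section 1.7)."""
        before = {k: getattr(self, k) for k in changes}
        for k, v in changes.items():
            setattr(self, k, v)
        self.updated_at = _now()
        return before, dict(changes)
{{/code}}

This mirrors the example above: revising the likelihood range after new evidence produces exactly the before/after pair the Edit entity logs.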
218 +
219 +=== 1.6 User ===
220 +
172 172  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
173 -=== User Reputation System ==
222 +
223 +=== User Reputation System ===
224 +
174 174  **V1.0 Approach**: Simple manual role assignment
175 175  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
227 +
176 176  === Roles (Manual Assignment) ===
229 +
177 177  **reader** (default):
231 +
178 178  * View published claims and evidence
179 179  * Browse and search content
180 180  * No editing permissions
... ... @@ -193,8 +193,11 @@
193 193  * System configuration
194 194  * Access to all features
195 195  * Founder-appointed initially
250 +
196 196  === Contribution Tracking (Simple) ===
252 +
197 197  **Basic metrics only**:
254 +
198 198  * `contributions_count`: Total number of contributions
199 199  * `created_at`: Account age
200 200  * `last_active`: Recent activity
... ... @@ -203,19 +203,26 @@
203 203  * No automated privilege escalation
204 204  * No reputation decay
205 205  * No threshold-based promotions
263 +
206 206  === Promotion Process ===
265 +
207 207  **Manual review by moderators/admins**:
267 +
208 208  1. User demonstrates value through contributions
209 209  2. Moderator reviews user's contribution history
210 210  3. Moderator promotes user to contributor role
211 211  4. Admin promotes trusted contributors to moderator
212 212  **Criteria** (guidelines, not automated):
273 +
213 213  * Quality of contributions
214 214  * Consistency over time
215 215  * Collaborative behavior
216 216  * Understanding of project goals
278 +
217 217  === V2.0+ Evolution ===
280 +
218 218  **Add complex reputation when**:
282 +
219 219  * 100+ active contributors
220 220  * Manual role management becomes bottleneck
221 221  * Clear patterns of abuse emerge requiring automation
... ... @@ -224,12 +224,17 @@
224 224  * Threshold-based promotions
225 225  * Reputation decay for inactive users
226 226  * Track record scoring for contributors
227 -See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
228 -=== 1.6 Edit ===
291 +See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
292 +
293 +=== 1.7 Edit ===
294 +
229 229  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
230 230  **Purpose**: Complete audit trail for all content changes
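Building an Edit row from before/after states might look like the sketch below. Field names follow the spec; JSON serialization and the `CONTENT_UPDATE` default are assumptions for illustration (the spec lists `MODERATION_ACTION` and `REVERT` among the edit types):

{{code language="python"}}
import json
from datetime import datetime, timezone

def record_edit(entity_type, entity_id, user_id, before_state, after_state,
                edit_type="CONTENT_UPDATE", reason=""):
    """Build an Edit row capturing full before/after snapshots.

    before_state/after_state are plain dicts; they are serialized to
    JSON as per the Edit entity's field definitions.
    """
    return {
        "entity_type": entity_type,
        "entity_id": entity_id,
        "user_id": user_id,
        "before_state": json.dumps(before_state),
        "after_state": json.dumps(after_state),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
{{/code}}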
297 +
231 231  === Edit History Details ===
299 +
232 232  **What Gets Edited**:
301 +
233 233  * **Claims** (20% edited): assertion, domain, status, scores, analysis
234 234  * **Evidence** (10% edited): excerpt, relevance_score, supports
235 235  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -246,12 +246,14 @@
246 246  * `MODERATION_ACTION`: Hide/unhide for abuse
247 247  * `REVERT`: Rollback to previous version
248 248  **Retention Policy** (5 years total):
318 +
249 249  1. **Hot storage** (3 months): PostgreSQL, instant access
250 250  2. **Warm storage** (2 years): Partitioned, slower queries
251 251  3. **Cold storage** (3 years): S3 compressed, download required
252 252  4. **Deletion**: After 5 years (except legal holds)
253 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
323 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
254 254  **Use Cases**:
325 +
255 255  * View claim history timeline
256 256  * Detect vandalism patterns
257 257  * Learn from user corrections (system improvement)
... ... @@ -258,12 +258,17 @@
258 258  * Legal compliance (audit trail)
259 259  * Rollback capability
260 260  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
261 -=== 1.7 Flag ===
332 +
333 +=== 1.8 Flag ===
334 +
262 262  Fields: entity_id, reported_by, issue_type, status, resolution_note
263 -=== 1.8 QualityMetric ===
336 +
337 +=== 1.9 QualityMetric ===
338 +
264 264  **Fields**: metric_type, category, value, target, timestamp
265 265  **Purpose**: Time-series quality tracking
266 266  **Usage**:
342 +
267 267  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
268 268  * **Quality dashboard**: Real-time display with trend charts
269 269  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -270,19 +270,31 @@
270 270  * **A/B testing**: Compare control vs treatment metrics
271 271  * **Improvement validation**: Measure before/after changes
272 272  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
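The alerting rule reduces to a threshold check per metric record. This sketch assumes higher-is-worse metrics such as ErrorRate; a higher-is-better metric would invert the comparison:

{{code language="python"}}
def check_metric(metric):
    """Return True when a QualityMetric breaches its target and
    should trigger an alert (assumes higher value = worse)."""
    return metric["value"] > metric["target"]

sample = {"type": "ErrorRate", "category": "Politics", "value": 0.12,
          "target": 0.10, "timestamp": "2025-12-17"}
{{/code}}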
273 -=== 1.9 ErrorPattern ===
349 +
350 +=== 1.10 ErrorPattern ===
351 +
274 274  **Fields**: error_category, claim_id, description, root_cause, frequency, status
275 275  **Purpose**: Capture errors to trigger system improvements
276 276  **Usage**:
355 +
277 277  * **Error capture**: When users flag issues or system detects problems
278 278  * **Pattern analysis**: Weekly grouping by category and frequency
279 279  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
280 280  * **Metrics**: Track error rate reduction over time
281 281  **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
361 +
362 +== 1.11 Core Data Model ERD ==
363 +
364 +{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
365 +
282 282  == 1.12 User Class Diagram ==
283 -{{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
367 +
368 +{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
369 +
284 284  == 2. Versioning Strategy ==
371 +
285 285  **All Content Entities Are Versioned**:
373 +
286 286  * **Claim**: Every edit creates new version (V1→V2→V3...)
287 287  * **Evidence**: Changes tracked in edit history
288 288  * **Scenario**: Modifications versioned
... ... @@ -303,68 +303,91 @@
303 303  Claim V2: "The sky is blue during daytime"
304 304   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
305 305  ```
394 +
306 306  == 2.5 Storage vs Computation Strategy ==
396 +
307 307  **Critical architectural decision**: What to persist in databases vs compute dynamically?
308 308  **Trade-off**:
399 +
309 309  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
310 310  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
402 +
311 311  === Recommendation: Hybrid Approach ===
404 +
312 312  **STORE (in PostgreSQL):**
406 +
313 313  ==== Claims (Current State + History) ====
408 +
314 314  * **What**: assertion, domain, status, created_at, updated_at, version
315 315  * **Why**: Core entity, must be persistent
316 316  * **Also store**: confidence_score (computed once, then cached)
317 -* **Size**: ~1 KB per claim
412 +* **Size**: 1 KB per claim
318 318  * **Growth**: Linear with claims
319 319  * **Decision**: ✅ STORE - Essential
415 +
320 320  ==== Evidence (All Records) ====
417 +
321 321  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
322 322  * **Why**: Hard to re-gather, user contributions, reproducibility
323 -* **Size**: ~2 KB per evidence (with excerpt)
420 +* **Size**: 2 KB per evidence (with excerpt)
324 324  * **Growth**: 3-10 evidence per claim
325 325  * **Decision**: ✅ STORE - Essential for reproducibility
423 +
326 326  ==== Sources (Track Records) ====
425 +
327 327  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
328 328  * **Why**: Continuously updated, expensive to recompute
329 -* **Size**: ~500 bytes per source
428 +* **Size**: 500 bytes per source
330 330  * **Growth**: Slow (limited number of sources)
331 331  * **Decision**: ✅ STORE - Essential for quality
431 +
332 332  ==== Edit History (All Versions) ====
433 +
333 333  * **What**: before_state, after_state, user_id, reason, timestamp
334 334  * **Why**: Audit trail, legal requirement, reproducibility
335 -* **Size**: ~2 KB per edit
336 -* **Growth**: Linear with edits (~A portion of claims get edited)
436 +* **Size**: 2 KB per edit
437 +* **Growth**: Linear with edits (a portion of claims get edited)
337 337  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
338 338  * **Decision**: ✅ STORE - Essential for accountability
440 +
339 339  ==== Flags (User Reports) ====
442 +
340 340  * **What**: entity_id, reported_by, issue_type, description, status
341 341  * **Why**: Error detection, system improvement triggers
342 -* **Size**: ~500 bytes per flag
445 +* **Size**: 500 bytes per flag
343 343  * **Growth**: 5% or more of claims get flagged
344 344  * **Decision**: ✅ STORE - Essential for improvement
448 +
345 345  ==== ErrorPatterns (System Improvement) ====
450 +
346 346  * **What**: error_category, claim_id, description, root_cause, frequency, status
347 347  * **Why**: Learning loop, prevent recurring errors
348 -* **Size**: ~1 KB per pattern
453 +* **Size**: 1 KB per pattern
349 349  * **Growth**: Slow (limited patterns, many fixed)
350 350  * **Decision**: ✅ STORE - Essential for learning
456 +
351 351  ==== QualityMetrics (Time Series) ====
458 +
352 352  * **What**: metric_type, category, value, target, timestamp
353 353  * **Why**: Trend analysis, cannot recreate historical metrics
354 -* **Size**: ~200 bytes per metric
461 +* **Size**: 200 bytes per metric
355 355  * **Growth**: Hourly = 8,760 per year per metric type
356 356  * **Retention**: 2 years hot, then aggregate and archive
357 357  * **Decision**: ✅ STORE - Essential for monitoring
358 358  **STORE (Computed Once, Then Cached):**
466 +
359 359  ==== Analysis Summary ====
468 +
360 360  * **What**: Neutral text summary of claim analysis (200-500 words)
361 361  * **Computed**: Once by AKEL when claim first analyzed
362 362  * **Stored in**: Claim table (text field)
363 363  * **Recomputed**: Only when system significantly improves OR claim edited
364 364  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
365 -* **Size**: ~2 KB per claim
474 +* **Size**: 2 KB per claim
366 366  * **Decision**: ✅ STORE (cached) - Cost-effective
476 +
367 367  ==== Confidence Score ====
478 +
368 368  * **What**: 0-100 score of analysis confidence
369 369  * **Computed**: Once by AKEL
370 370  * **Stored in**: Claim table (integer field)
... ... @@ -372,7 +372,9 @@
372 372  * **Why store**: Cheap to store, expensive to compute, users need it fast
373 373  * **Size**: 4 bytes per claim
374 374  * **Decision**: ✅ STORE (cached) - Performance critical
486 +
375 375  ==== Risk Score ====
488 +
376 376  * **What**: 0-100 score of claim risk level
377 377  * **Computed**: Once by AKEL
378 378  * **Stored in**: Claim table (integer field)
... ... @@ -381,13 +381,17 @@
381 381  * **Size**: 4 bytes per claim
382 382  * **Decision**: ✅ STORE (cached) - Performance critical
383 383  **COMPUTE DYNAMICALLY (Never Store):**
384 -==== Scenarios ==== ⚠️ CRITICAL DECISION
497 +
498 +==== Scenarios ====
499 +
500 +**⚠️ CRITICAL DECISION**
501 +
385 385  * **What**: 2-5 possible interpretations of claim with assumptions
386 386  * **Current design**: Stored in Scenario table
387 387  * **Alternative**: Compute on-demand when user views claim details
388 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
505 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
389 389  * **Compute cost**: $0.005-0.01 per request (LLM API call)
390 -* **Frequency**: Viewed in detail by ~20% of users
507 +* **Frequency**: Viewed in detail by 20% of users
391 391  * **Trade-off analysis**:
392 392   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
393 393   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -395,12 +395,17 @@
395 395  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
396 396  * **Decision**: ✅ STORE (hybrid approach below)
397 397  **Scenario Strategy** (APPROVED):
515 +
398 398  1. **Store scenarios** initially when claim analyzed
399 399  2. **Mark as stale** when system improves significantly
400 400  3. **Recompute on next view** if marked stale
401 401  4. **Cache for 30 days** if frequently accessed
402 402  5. **Result**: Best of both worlds - speed + freshness
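The approved strategy (store, mark stale, recompute on next view, cache 30 days) can be sketched as a lazy read path. The dict shape and names are illustrative, and `recompute` stands in for the expensive LLM call:

{{code language="python"}}
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # step 4: cache for 30 days

def get_scenarios(claim, recompute):
    """Serve stored scenarios, recomputing lazily when stale.

    `claim` is an illustrative dict with "scenarios",
    "scenarios_stale" and "scenarios_cached_at" keys; these names
    are assumptions, not the actual schema.
    """
    now = datetime.now(timezone.utc)
    expired = now - claim["scenarios_cached_at"] > CACHE_TTL
    if claim["scenarios_stale"] or expired:
        claim["scenarios"] = recompute(claim)   # step 3: recompute on next view
        claim["scenarios_stale"] = False
        claim["scenarios_cached_at"] = now
    return claim["scenarios"]
{{/code}}

Readers pay the LLM cost only on the first view after a system improvement; every other view is a plain table read.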
403 -==== Verdict Synthesis ====
521 +
522 +==== Verdict Synthesis ====
523 +
524 +
404 404  * **What**: Final conclusion text synthesizing all scenarios
405 405  * **Compute cost**: $0.002-0.005 per request
406 406  * **Frequency**: Every time claim viewed
... ... @@ -408,17 +408,23 @@
408 408  * **Speed**: 2-3 seconds (acceptable)
409 409  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
410 410  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
533 +
411 411  ==== Search Results ====
535 +
412 412  * **What**: Lists of claims matching search query
413 413  * **Compute from**: Elasticsearch index
414 414  * **Cache**: 15 minutes in Redis for popular queries
415 415  * **Why not store permanently**: Constantly changing, infinite possible queries
540 +
416 416  ==== Aggregated Statistics ====
542 +
417 417  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
418 418  * **Compute from**: Database queries
419 419  * **Cache**: 1 hour in Redis
420 420  * **Why not store**: Can be derived, relatively cheap to compute
547 +
421 421  ==== User Reputation ====
549 +
422 422  * **What**: Score based on contributions
423 423  * **Current design**: Stored in User table
424 424  * **Alternative**: Compute from Edit table
... ... @@ -428,37 +428,43 @@
428 428  * **Frequency**: Read on every user action
429 429  * **Compute cost**: Simple COUNT query, milliseconds
430 430  * **Decision**: ✅ STORE - Performance critical, read-heavy
559 +
431 431  === Summary Table ===
432 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
433 -|-----------|---------|---------|----------------|----------|-----------|
434 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
435 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
436 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
437 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
438 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
439 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
440 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
441 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
442 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
443 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
444 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
445 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
446 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
561 +
562 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
563 +|-----------|---------|---------|----------------|----------|-----------|
564 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
565 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
566 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
567 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
568 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
569 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
570 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
571 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
572 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
573 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
574 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
575 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
576 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
447 447  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
448 -**Total storage per claim**: ~18 KB (without edits and flags)
578 +**Total storage per claim**: 18 KB (without edits and flags)
449 449  **For 1 million claims**:
450 -* **Storage**: ~18 GB (manageable)
451 -* **PostgreSQL**: ~$50/month (standard instance)
452 -* **Redis cache**: ~$20/month (1 GB instance)
453 -* **S3 archives**: ~$5/month (old edits)
454 -* **Total**: ~$75/month infrastructure
580 +
581 +* **Storage**: 18 GB (manageable)
582 +* **PostgreSQL**: $50/month (standard instance)
583 +* **Redis cache**: $20/month (1 GB instance)
584 +* **S3 archives**: $5/month (old edits)
585 +* **Total**: $75/month infrastructure
455 455  **LLM cost savings by caching**:
456 456  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
457 457  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
458 458  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
459 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
590 +* **Total savings**: $35K per 1M claims vs recomputing every time
591 +
460 460  === Recomputation Triggers ===
593 +
461 461  **When to mark cached data as stale and recompute:**
595 +
462 462  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
463 463  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
464 464  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -465,11 +465,15 @@
465 465  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
466 466  5. **Controversy detected** (high flag rate) → Recompute: risk score
467 467  **Recomputation strategy**:
602 +
468 468  * **Eager**: Immediately recompute (for user edits)
469 469  * **Lazy**: Recompute on next view (for system improvements)
470 470  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
606 +
471 471  === Database Size Projection ===
608 +
472 472  **Year 1**: 10K claims
610 +
473 473  * Storage: 180 MB
474 474  * Cost: $10/month
475 475  **Year 3**: 100K claims
... ... @@ -483,15 +483,21 @@
483 483  * Cost: $300/month
484 484  * Optimization: Archive old claims to S3 ($5/TB/month)
485 485  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
624 +
486 486  == 2.6 Key Simplifications ==
626 +
487 487  * **Two content states only**: Published, Hidden
488 488  * **Three user roles only**: Reader, Contributor, Moderator
489 489  * **No complex versioning**: Linear edit history
490 490  * **Reputation-based permissions**: Not role hierarchy
491 491  * **Source track records**: Continuous evaluation
632 +
492 492  == 3. What Gets Stored in the Database ==
634 +
493 493  === 3.1 Primary Storage (PostgreSQL) ===
636 +
494 494  **Claims Table**:
638 +
495 495  * Current state only (latest version)
496 496  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
497 497  **Evidence Table**:
... ... @@ -518,31 +518,44 @@
518 518  **QualityMetric Table**:
519 519  * Time-series quality data
520 520  * Fields: id, metric_type, metric_category, value, target, timestamp
665 +
521 521  === 3.2 What's NOT Stored (Computed on-the-fly) ===
667 +
522 522  * **Verdicts**: Full synthesis recomputed from evidence + scenarios only when the cached verdict is marked stale
523 523  * **Risk scores**: Recalculated from current factors when a recomputation trigger fires (cached otherwise)
524 524  * **Aggregated statistics**: Computed from base data
525 525  * **Search results**: Generated from Elasticsearch index
672 +
526 526  === 3.3 Cache Layer (Redis) ===
674 +
527 527  **Cached for performance**:
676 +
528 528  * Frequently accessed claims (TTL: 1 hour)
529 529  * Search results (TTL: 15 minutes)
530 530  * User sessions (TTL: 24 hours)
531 531  * Source track records (TTL: 1 hour)
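The cache layer follows a standard cache-aside pattern. The sketch below uses only `get`/`setex` (both provided by the redis-py client), so a dict-backed stub works for testing; `db_fetch` and the key format are illustrative assumptions:

{{code language="python"}}
# TTLs from the list above, in seconds
TTLS = {"claim": 3600, "search": 900, "session": 86400, "track_record": 3600}

def get_claim(cache, db_fetch, claim_id):
    """Cache-aside read: try the cache first, fall back to the database.

    `cache` needs only get/setex; `db_fetch` stands in for the
    PostgreSQL query.
    """
    key = f"claim:{claim_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = db_fetch(claim_id)
    cache.setex(key, TTLS["claim"], value)  # expire after 1 hour
    return value
{{/code}}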
681 +
532 532  === 3.4 File Storage (S3) ===
683 +
533 533  **Archived content**:
685 +
534 534  * Old edit history (>3 months)
535 535  * Evidence documents (archived copies)
536 536  * Database backups
537 537  * Export files
690 +
538 538  === 3.5 Search Index (Elasticsearch) ===
692 +
539 539  **Indexed for search**:
694 +
540 540  * Claim assertions (full-text)
541 541  * Evidence excerpts (full-text)
542 542  * Scenario descriptions (full-text)
543 543  * Source names (autocomplete)
544 544  Synchronized from PostgreSQL via change data capture or periodic sync.
700 +
545 545  == 4. Related Pages ==
546 -* [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
547 -* [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
548 -* [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]
702 +
703 +* [[Architecture>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Architecture.WebHome]]
704 +* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
705 +* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]