1 = Data Model =
2 FactHarbor's data model is **simple, focused, and designed for automated processing**.
3 == 1. Core Entities ==
4 === 1.1 Claim ===
5 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
6 ==== Performance Optimization: Denormalized Fields ====
7 **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 **Additional cached fields in claims table**:
9 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
10 * **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
11 * **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios added/removed
12 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers background refresh if too old
13 **Update Strategy**:
14 * **Immediate**: Update on claim edit (user-facing)
15 * **Deferred**: Update via background job every hour (non-critical)
16 * **Invalidation**: Clear cache when source data changes significantly
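The following is a minimal sketch of how this refresh strategy could be wired up; `refresh_denormalized_fields`, the query helpers, and the job names are illustrative, not part of the schema.

{{code language="python"}}
def refresh_denormalized_fields(claim):
    """Rebuild the cached columns on the claims row from the normalized tables."""
    top = top_evidence(claim.id, limit=5)                      # hypothetical query helper
    claim.evidence_summary = [
        {"text": e.excerpt, "source": e.source_name, "relevance": e.relevance_score}
        for e in top
    ]
    claim.source_names = distinct_source_names(claim.id)       # hypothetical query helper
    claim.scenario_count = count_scenarios(claim.id)           # hypothetical query helper
    claim.cache_updated_at = now()
    claim.save()

def on_claim_edited(claim):
    # Immediate path: user-facing edits refresh the cache synchronously
    refresh_denormalized_fields(claim)

def hourly_cache_job():
    # Deferred path: background job refreshes caches older than one hour
    for claim in claims_with_stale_cache(older_than_hours=1):  # hypothetical query helper
        refresh_denormalized_fields(claim)
{{/code}}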
17 **Trade-offs**:
18 * ✅ 70% fewer joins on common queries
19 * ✅ Much faster claim list/search pages
20 * ✅ Better user experience
21 * ⚠️ Small storage increase (~10%)
22 * ⚠️ Need to keep caches in sync
23 === 1.2 Evidence ===
24 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
25 === 1.3 Source ===
26 **Purpose**: Track reliability of information sources over time
27 **Fields**:
28 * **id** (UUID): Unique identifier
29 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
30 * **domain** (text): Website domain (e.g., "nytimes.com")
31 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
32 * **track_record_score** (0-100): Overall reliability score
33 * **accuracy_history** (JSON): Historical accuracy data
34 * **correction_frequency** (float): How often source publishes corrections
35 * **last_updated** (timestamp): When track record last recalculated
36 **How It Works**:
37 * Initial score based on source type (70 for academic journals, 30 for unknown)
38 * Updated daily by background scheduler
39 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
40 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
41 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
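For readability, here is a minimal sketch of the formula and quality thresholds above; it assumes the five component inputs are already normalized to a 0-100 scale.

{{code language="python"}}
def compute_track_record_score(accuracy_rate, correction_policy, editorial_standards,
                               bias_transparency, longevity):
    """Weighted source reliability score (0-100), using the weights listed above."""
    return (0.50 * accuracy_rate
            + 0.20 * correction_policy
            + 0.15 * editorial_standards
            + 0.10 * bias_transparency
            + 0.05 * longevity)

def quality_label(score):
    """Map a 0-100 track record score to the quality thresholds above."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
{{/code}}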
42 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
43 Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
44 **Key**: Automated source reliability tracking
45 ==== Source Scoring Process (Separation of Concerns) ====
46 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
47 **The Problem**:
* Source scores should influence claim verdicts
48 * Claim verdicts should update source scores
49 * But: Direct feedback creates circular dependency and potential feedback loops
50 **The Solution**: Temporal separation
51 ==== Weekly Background Job (Source Scoring) ====
52 Runs independently of claim analysis:
53 {{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability
    Never triggered by individual claim analysis
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()

    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5

        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)

        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
55 {{/code}}
56 ==== Real-Time Claim Analysis (AKEL) ====
57 Uses source scores but never updates them:
58 {{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores
    READ source scores, never UPDATE them
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)

    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score

        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score

    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)

    # NEVER update source scores here
    # That happens in weekly background job

    return verdict
60 {{/code}}
61 ==== Monthly Audit (Quality Assurance) ====
62 Moderator review of flagged source scores:
63 * Verify scores make sense
64 * Detect gaming attempts
65 * Identify systematic biases
66 * Manual adjustments if needed
67 **Key Principles**:
68 ✅ **Scoring and analysis are temporally separated**
69 * Source scoring: Weekly batch job
70 * Claim analysis: Real-time processing
71 * Never update scores during analysis
72 ✅ **One-way data flow during processing**
73 * Claims READ source scores
74 * Claims NEVER WRITE source scores
75 * Updates happen in background only
76 ✅ **Predictable update cycle**
77 * Sources update every Sunday 2 AM
78 * Claims always use last week's scores
79 * No mid-week score changes
80 ✅ **Audit trail**
81 * Log all score changes
82 * Track score history
83 * Explainable calculations
84 **Benefits**:
85 * No circular dependencies
86 * Predictable behavior
87 * Easier to reason about
88 * Simpler testing
89 * Clear audit trail
90 **Example Timeline**:
91 ```
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
95 ```
96 === 1.4 Scenario ===
97 **Purpose**: Different interpretations or contexts for evaluating claims
98 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
99 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
100 **Fields**:
101 * **id** (UUID): Unique identifier
102 * **claim_id** (UUID): Foreign key to claim (one-to-many)
103 * **description** (text): Human-readable description of the scenario
104 * **assumptions** (JSONB): Key assumptions that define this scenario context
105 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
106 * **created_at** (timestamp): When scenario was created
107 * **updated_at** (timestamp): Last modification
108 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
109 **Example**: For claim "Vaccines reduce hospitalization":
110 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
111 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
112 * Scenario 3: "Immunocompromised patients" from specialist study
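A minimal sketch of the extraction flow ("Evidence search → Extract context → Create scenario → Link to claim"); `extract_context_from_evidence` stands in for an LLM-backed helper that pulls population, conditions, and other assumptions out of an evidence excerpt, and `gather_evidence` is the same helper used in the AKEL sketch above.

{{code language="python"}}
def extract_scenarios(claim):
    """Derive scenario records from the evidence gathered for a claim."""
    scenarios = []
    for evidence in gather_evidence(claim.assertion):
        context = extract_context_from_evidence(evidence)   # hypothetical LLM-backed helper
        scenarios.append({
            "claim_id": claim.id,
            "description": context["description"],          # e.g. "Clinical trials (healthy adults 18-65)"
            "assumptions": context["assumptions"],           # stored as JSONB
            "extracted_from": evidence.id,                   # link back to the source evidence
        })
    return scenarios
{{/code}}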
113 **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
=== 1.5 Verdict ===
**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
**Core Fields**:
114 * **id** (UUID): Primary key
115 * **scenario_id** (UUID FK): The scenario being assessed
116 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
117 * **confidence** (decimal 0-1): How confident we are in this assessment
118 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
119 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
120 * **created_at** (timestamp): When verdict was created
121 * **updated_at** (timestamp): Last modification
**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
**Example**:
122 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
123 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
124 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
125 * Edit entity records the complete before/after change with timestamp and reason
**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
=== 1.6 User ===
126 Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
127 === User Reputation System ===
128 **V1.0 Approach**: Simple manual role assignment
129 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
130 === Roles (Manual Assignment) ===
131 **reader** (default):
132 * View published claims and evidence
133 * Browse and search content
134 * No editing permissions
135 **contributor**:
136 * Submit new claims
137 * Suggest edits to existing content
138 * Add evidence
139 * Requires manual promotion by moderator/admin
140 **moderator**:
141 * Approve/reject contributor suggestions
142 * Flag inappropriate content
143 * Handle abuse reports
144 * Assigned by admins based on trust
145 **admin**:
146 * Manage users and roles
147 * System configuration
148 * Access to all features
149 * Founder-appointed initially
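For illustration only, the manual role model above could be expressed as a static permission map; the permission names are examples, not part of the data model.

{{code language="python"}}
# Illustrative permission map for the manually assigned roles described above.
ROLE_PERMISSIONS = {
    "reader":      {"view", "search"},
    "contributor": {"view", "search", "submit_claim", "suggest_edit", "add_evidence"},
    "moderator":   {"view", "search", "submit_claim", "suggest_edit", "add_evidence",
                    "approve_suggestion", "flag_content", "handle_abuse_report"},
    "admin":       {"*"},  # all permissions, plus user/role management
}

def has_permission(role, permission):
    allowed = ROLE_PERMISSIONS.get(role, set())
    return "*" in allowed or permission in allowed
{{/code}}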
150 === Contribution Tracking (Simple) ===
151 **Basic metrics only**:
152 * `contributions_count`: Total number of contributions
153 * `created_at`: Account age
154 * `last_active`: Recent activity
155 **No complex calculations**:
156 * No point systems
157 * No automated privilege escalation
158 * No reputation decay
159 * No threshold-based promotions
160 === Promotion Process ===
161 **Manual review by moderators/admins**:
162 1. User demonstrates value through contributions
163 2. Moderator reviews user's contribution history
164 3. Moderator promotes user to contributor role
165 4. Admin promotes trusted contributors to moderator
166 **Criteria** (guidelines, not automated):
167 * Quality of contributions
168 * Consistency over time
169 * Collaborative behavior
170 * Understanding of project goals
171 === V2.0+ Evolution ===
172 **Add complex reputation when**:
173 * 100+ active contributors
174 * Manual role management becomes bottleneck
175 * Clear patterns of abuse emerge requiring automation
176 **Future features may include**:
177 * Automated point calculations
178 * Threshold-based promotions
179 * Reputation decay for inactive users
180 * Track record scoring for contributors
181 See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
182 === 1.7 Edit ===
183 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
184 **Purpose**: Complete audit trail for all content changes
185 === Edit History Details ===
186 **What Gets Edited**:
187 * **Claims** (20% edited): assertion, domain, status, scores, analysis
188 * **Evidence** (10% edited): excerpt, relevance_score, supports
189 * **Scenarios** (5% edited): description, assumptions, confidence
190 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
191 **Who Edits**:
192 * **Contributors** (rep sufficient): Corrections, additions
193 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
194 * **Moderators**: Abuse handling, dispute resolution
195 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
196 **Edit Types**:
197 * `CONTENT_CORRECTION`: User fixes factual error
198 * `CLARIFICATION`: Improved wording
199 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
200 * `MODERATION_ACTION`: Hide/unhide for abuse
201 * `REVERT`: Rollback to previous version
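A minimal sketch of recording an edit against this schema; `to_dict`, `apply`, and `save_edit` are hypothetical helpers.

{{code language="python"}}
import json
from datetime import datetime, timezone

def record_edit(entity, new_state, user_id, edit_type, reason):
    """Store full before/after snapshots, then apply the change to the entity."""
    save_edit({                                        # hypothetical persistence helper
        "entity_type": type(entity).__name__,          # "Claim", "Evidence", "Scenario"
        "entity_id": entity.id,
        "user_id": user_id,                            # None for system (AKEL) edits
        "before_state": json.dumps(entity.to_dict()),
        "after_state": json.dumps(new_state),
        "edit_type": edit_type,                        # e.g. "CONTENT_CORRECTION"
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    entity.apply(new_state)                            # entity table keeps current state only
    entity.version += 1
{{/code}}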
202 **Retention Policy** (5 years total):
203 1. **Hot storage** (3 months): PostgreSQL, instant access
204 2. **Warm storage** (2 years): Partitioned, slower queries
205 3. **Cold storage** (3 years): S3 compressed, download required
206 4. **Deletion**: After 5 years (except legal holds)
207 **Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
208 **Use Cases**:
209 * View claim history timeline
210 * Detect vandalism patterns
211 * Learn from user corrections (system improvement)
212 * Legal compliance (audit trail)
213 * Rollback capability
214 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
215 === 1.8 Flag ===
216 Fields: entity_id, reported_by, issue_type, status, resolution_note
217 === 1.9 QualityMetric ===
218 **Fields**: metric_type, category, value, target, timestamp
219 **Purpose**: Time-series quality tracking
220 **Usage**:
221 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
222 * **Quality dashboard**: Real-time display with trend charts
223 * **Alerting**: Automatic alerts when metrics exceed thresholds
224 * **A/B testing**: Compare control vs treatment metrics
225 * **Improvement validation**: Measure before/after changes
226 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
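A minimal alerting sketch based on the example record above; `send_alert` is a placeholder notification helper, and the comparison assumes a "higher is worse" metric such as ErrorRate.

{{code language="python"}}
def check_quality_metric(metric):
    """Raise an alert when a metric misses its target."""
    if metric["value"] > metric["target"]:
        send_alert(                                   # placeholder notification helper
            f"{metric['type']} for {metric['category']} is {metric['value']:.2f} "
            f"(target {metric['target']:.2f}) at {metric['timestamp']}"
        )

check_quality_metric({"type": "ErrorRate", "category": "Politics",
                      "value": 0.12, "target": 0.10, "timestamp": "2025-12-17"})
{{/code}}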
227 === 1.10 ErrorPattern ===
228 **Fields**: error_category, claim_id, description, root_cause, frequency, status
229 **Purpose**: Capture errors to trigger system improvements
230 **Usage**:
231 * **Error capture**: When users flag issues or system detects problems
232 * **Pattern analysis**: Weekly grouping by category and frequency
233 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
234 * **Metrics**: Track error rate reduction over time
235 **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
=== 1.11 Core Data Model ERD ===
{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
=== 1.12 User Class Diagram ===
236 {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
237 == 2. Versioning Strategy ==
238 **All Content Entities Are Versioned**:
239 * **Claim**: Every edit creates new version (V1→V2→V3...)
240 * **Evidence**: Changes tracked in edit history
241 * **Scenario**: Modifications versioned
242 **How Versioning Works**:
243 * Entity table stores **current state only**
244 * Edit table stores **all historical states** (before_state, after_state as JSON)
245 * Version number increments with each edit
246 * Complete audit trail maintained forever
247 **Unversioned Entities** (current state only, no history):
248 * **Source**: Track record continuously updated (not versioned history, just current score)
249 * **User**: Account state (reputation accumulated, not versioned)
250 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
251 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
252 **Example**:
253 ```
Claim V1: "The sky is blue"
→ User edits →
Claim V2: "The sky is blue during daytime"
→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
255 ```
256 == 2.5. Storage vs Computation Strategy ==
257 **Critical architectural decision**: What to persist in databases vs compute dynamically?
258 **Trade-off**:
259 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
260 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
261 === Recommendation: Hybrid Approach ===
262 **STORE (in PostgreSQL):**
263 ==== Claims (Current State + History) ====
264 * **What**: assertion, domain, status, created_at, updated_at, version
265 * **Why**: Core entity, must be persistent
266 * **Also store**: confidence_score (computed once, then cached)
267 * **Size**: ~1 KB per claim
268 * **Growth**: Linear with claims
269 * **Decision**: ✅ STORE - Essential
270 ==== Evidence (All Records) ====
271 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
272 * **Why**: Hard to re-gather, user contributions, reproducibility
273 * **Size**: ~2 KB per evidence (with excerpt)
274 * **Growth**: 3-10 evidence per claim
275 * **Decision**: ✅ STORE - Essential for reproducibility
276 ==== Sources (Track Records) ====
277 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
278 * **Why**: Continuously updated, expensive to recompute
279 * **Size**: ~500 bytes per source
280 * **Growth**: Slow (limited number of sources)
281 * **Decision**: ✅ STORE - Essential for quality
282 ==== Edit History (All Versions) ====
283 * **What**: before_state, after_state, user_id, reason, timestamp
284 * **Why**: Audit trail, legal requirement, reproducibility
285 * **Size**: ~2 KB per edit
286 * **Growth**: Linear with edits (~20% of claims get edited)
287 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
288 * **Decision**: ✅ STORE - Essential for accountability
289 ==== Flags (User Reports) ====
290 * **What**: entity_id, reported_by, issue_type, description, status
291 * **Why**: Error detection, system improvement triggers
292 * **Size**: ~500 bytes per flag
293 * **Growth**: 5-10% of claims get flagged
294 * **Decision**: ✅ STORE - Essential for improvement
295 ==== ErrorPatterns (System Improvement) ====
296 * **What**: error_category, claim_id, description, root_cause, frequency, status
297 * **Why**: Learning loop, prevent recurring errors
298 * **Size**: ~1 KB per pattern
299 * **Growth**: Slow (limited patterns, many fixed)
300 * **Decision**: ✅ STORE - Essential for learning
301 ==== QualityMetrics (Time Series) ====
302 * **What**: metric_type, category, value, target, timestamp
303 * **Why**: Trend analysis, cannot recreate historical metrics
304 * **Size**: ~200 bytes per metric
305 * **Growth**: Hourly = 8,760 per year per metric type
306 * **Retention**: 2 years hot, then aggregate and archive
307 * **Decision**: ✅ STORE - Essential for monitoring
308 **STORE (Computed Once, Then Cached):**
309 ==== Analysis Summary ====
310 * **What**: Neutral text summary of claim analysis (200-500 words)
311 * **Computed**: Once by AKEL when claim first analyzed
312 * **Stored in**: Claim table (text field)
313 * **Recomputed**: Only when system significantly improves OR claim edited
314 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
315 * **Size**: ~2 KB per claim
316 * **Decision**: ✅ STORE (cached) - Cost-effective
317 ==== Confidence Score ====
318 * **What**: 0-100 score of analysis confidence
319 * **Computed**: Once by AKEL
320 * **Stored in**: Claim table (integer field)
321 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
322 * **Why store**: Cheap to store, expensive to compute, users need it fast
323 * **Size**: 4 bytes per claim
324 * **Decision**: ✅ STORE (cached) - Performance critical
325 ==== Risk Score ====
326 * **What**: 0-100 score of claim risk level
327 * **Computed**: Once by AKEL
328 * **Stored in**: Claim table (integer field)
329 * **Recomputed**: When domain changes, evidence changes, or controversy detected
330 * **Why store**: Same as confidence score
331 * **Size**: 4 bytes per claim
332 * **Decision**: ✅ STORE (cached) - Performance critical
333 **COMPUTE DYNAMICALLY (Never Store):**
334 ==== Scenarios ====
⚠️ **CRITICAL DECISION**
335 * **What**: 2-5 possible interpretations of claim with assumptions
336 * **Current design**: Stored in Scenario table
337 * **Alternative**: Compute on-demand when user views claim details
338 * **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
339 * **Compute cost**: $0.005-0.01 per request (LLM API call)
340 * **Frequency**: Viewed in detail by ~20% of users
341 * **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
342 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
343 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
344 * **Decision**: ✅ STORE (hybrid approach below)
345 **Scenario Strategy** (APPROVED):
346 1. **Store scenarios** initially when claim analyzed
347 2. **Mark as stale** when system improves significantly
348 3. **Recompute on next view** if marked stale
349 4. **Cache for 30 days** if frequently accessed
350 5. **Result**: Best of both worlds - speed + freshness
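A sketch of the resulting read path, assuming a per-claim `scenarios_stale` flag and hypothetical persistence helpers; `extract_scenarios` refers to the sketch in section 1.4.

{{code language="python"}}
def get_scenarios(claim):
    """Serve stored scenarios; recompute lazily when marked stale by a system improvement."""
    scenarios = load_scenarios(claim.id)           # hypothetical DB read
    if not scenarios or claim.scenarios_stale:
        scenarios = extract_scenarios(claim)       # LLM-backed recompute (see 1.4 sketch)
        save_scenarios(claim.id, scenarios)        # refresh the stored copy
        claim.scenarios_stale = False
        claim.cache_updated_at = now()
        claim.save()
    return scenarios
{{/code}}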
351 ==== Verdict Synthesis ====
* **What**: Final conclusion text synthesizing all scenarios
352 * **Compute cost**: $0.002-0.005 per request
353 * **Frequency**: Every time claim viewed
354 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
355 * **Speed**: 2-3 seconds (acceptable)
356 **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
357 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
358 ==== Search Results ====
359 * **What**: Lists of claims matching search query
360 * **Compute from**: Elasticsearch index
361 * **Cache**: 15 minutes in Redis for popular queries
362 * **Why not store permanently**: Constantly changing, infinite possible queries
363 ==== Aggregated Statistics ====
364 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
365 * **Compute from**: Database queries
366 * **Cache**: 1 hour in Redis
367 * **Why not store**: Can be derived, relatively cheap to compute
368 ==== User Reputation ====
369 * **What**: Score based on contributions
370 * **Current design**: Stored in User table
371 * **Alternative**: Compute from Edit table
372 * **Trade-off**:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
373 * **Frequency**: Read on every user action
374 * **Compute cost**: Simple COUNT query, milliseconds
375 * **Decision**: ✅ STORE - Performance critical, read-heavy
376 === Summary Table ===
377 |=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale|
379 | Claim core | ✅ | - | 1 KB | STORE | Essential |
380 | Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
381 | Sources | ✅ | - | 500 B (shared) | STORE | Track record |
382 | Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
383 | Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
384 | Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
385 | Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
386 | Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
387 | Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
388 | Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
389 | ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
390 | QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
391 | Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
392 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
393 **Total storage per claim**: ~18 KB (without edits and flags)
394 **For 1 million claims**:
395 * **Storage**: ~18 GB (manageable)
396 * **PostgreSQL**: ~$50/month (standard instance)
397 * **Redis cache**: ~$20/month (1 GB instance)
398 * **S3 archives**: ~$5/month (old edits)
399 * **Total**: ~$75/month infrastructure
400 **LLM cost savings by caching**:
401 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
402 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
403 * **Total savings**: ~$35K per 1M claims vs recomputing every time
404 === Recomputation Triggers ===
405 **When to mark cached data as stale and recompute:**
406 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
407 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
408 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
409 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
410 5. **Controversy detected** (high flag rate) → Recompute: risk score
411 **Recomputation strategy**:
412 * **Eager**: Immediately recompute (for user edits)
413 * **Lazy**: Recompute on next view (for system improvements)
414 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
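One possible encoding of these triggers as a lookup table; the event names and per-claim staleness set are illustrative.

{{code language="python"}}
# Which cached artifacts each event invalidates (per the trigger list above).
RECOMPUTE_ON = {
    "claim_edited":             {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},  # eager
    "evidence_added":           {"scenarios", "verdict", "confidence_score"},
    "source_score_shift_gt_10": {"confidence_score", "verdict"},
    "system_improvement":       {"analysis", "scenarios", "verdict"},   # lazy: recompute on next view
    "controversy_detected":     {"risk_score"},
}

def mark_stale(claim, event):
    """Flag cached artifacts for recomputation; the recompute itself is eager, lazy, or batched."""
    claim.stale_artifacts.update(RECOMPUTE_ON.get(event, set()))   # hypothetical per-claim set
    claim.save()
{{/code}}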
415 === Database Size Projection ===
416 **Year 1**: 10K claims
417 * Storage: 180 MB
418 * Cost: $10/month
419 **Year 3**: 100K claims
* Storage: 1.8 GB
420 * Cost: $30/month
421 **Year 5**: 1M claims
422 * Storage: 18 GB
* Cost: $75/month
423 **Year 10**: 10M claims
424 * Storage: 180 GB
425 * Cost: $300/month
426 * Optimization: Archive old claims to S3 ($5/TB/month)
427 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
428 == 3. Key Simplifications ==
429 * **Two content states only**: Published, Hidden
430 * **Three user roles only**: Reader, Contributor, Moderator
431 * **No complex versioning**: Linear edit history
432 * **Reputation-based permissions**: Not role hierarchy
433 * **Source track records**: Continuous evaluation
434 == 4. What Gets Stored in the Database ==
435 === 4.1 Primary Storage (PostgreSQL) ===
436 **Claims Table**:
437 * Current state only (latest version)
438 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
439 **Evidence Table**:
440 * All evidence records
441 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
442 **Scenario Table**:
443 * All scenarios for each claim
444 * Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
445 **Source Table**:
446 * Track record database (continuously updated)
447 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
448 **User Table**:
449 * All user accounts
450 * Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
451 **Edit Table**:
452 * Complete version history
453 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
454 **Flag Table**:
455 * User-reported issues
456 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
457 **ErrorPattern Table**:
458 * System improvement queue
459 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
460 **QualityMetric Table**:
461 * Time-series quality data
462 * Fields: id, metric_type, metric_category, value, target, timestamp
463 === 4.2 What's NOT Stored Permanently (Computed on Demand) ===
464 * **Verdicts**: Synthesized from evidence + scenarios, cached, and refreshed when marked stale (see section 2.5)
465 * **Risk scores**: Cached and recalculated when recomputation triggers fire (see section 2.5)
466 * **Aggregated statistics**: Computed from base data
467 * **Search results**: Generated from Elasticsearch index
468 === 4.3 Cache Layer (Redis) ===
469 **Cached for performance**:
470 * Frequently accessed claims (TTL: 1 hour)
471 * Search results (TTL: 15 minutes)
472 * User sessions (TTL: 24 hours)
473 * Source track records (TTL: 1 hour)
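A read-through cache sketch for the first item above, using the redis-py client; the key scheme and `load_claim_from_postgres` helper are illustrative.

{{code language="python"}}
import json
import redis

r = redis.Redis()          # assumes a local Redis instance; configure host/port in production
CLAIM_TTL_SECONDS = 3600   # 1 hour, matching the TTL listed above

def get_claim_cached(claim_id):
    """Return a claim from Redis if present, otherwise load from PostgreSQL and cache it."""
    key = f"claim:{claim_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    claim = load_claim_from_postgres(claim_id)         # hypothetical DB read returning a dict
    r.setex(key, CLAIM_TTL_SECONDS, json.dumps(claim))
    return claim
{{/code}}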
474 === 4.4 File Storage (S3) ===
475 **Archived content**:
476 * Old edit history (>3 months)
477 * Evidence documents (archived copies)
478 * Database backups
479 * Export files
480 === 4.5 Search Index (Elasticsearch) ===
481 **Indexed for search**:
482 * Claim assertions (full-text)
483 * Evidence excerpts (full-text)
484 * Scenario descriptions (full-text)
485 * Source names (autocomplete)
486 Synchronized from PostgreSQL via change data capture or periodic sync.
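A minimal periodic-sync sketch, assuming the elasticsearch-py 8.x client and a hypothetical `fetch_claims_updated_since` query; a change-data-capture pipeline would replace the polling loop.

{{code language="python"}}
from elasticsearch import Elasticsearch   # elasticsearch-py 8.x client

es = Elasticsearch("http://localhost:9200")   # placeholder endpoint

def sync_claims_since(last_sync_time):
    """Index claims whose rows changed since the previous sync run."""
    for claim in fetch_claims_updated_since(last_sync_time):   # hypothetical DB query
        es.index(
            index="claims",
            id=str(claim["id"]),
            document={
                "assertion": claim["assertion"],    # full-text field
                "domain": claim["domain"],
                "status": claim["status"],
                "updated_at": claim["updated_at"],
            },
        )
{{/code}}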
487 == 5. Related Pages ==
488 * [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
489 * [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
490 * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]