Wiki source code of Data Model

Version 1.1 by Robert Schaub on 2026/01/20 21:40

1 = Data Model =
2
3 {{warning}}
4 **Implementation Status (Updated January 2026)**
5
6 This specification describes the **target** normalized data model. Current implementation (v2.6.33) differs significantly:
7
8 * **Storage**: All data stored as **JSON blobs in SQLite**, not normalized PostgreSQL tables
9 * **Scenarios**: **Replaced with KeyFactors** - decomposition questions, not separate entities
10 * **Caching**: Redis cache **not implemented**; no claim caching yet
11 * **Source Scoring**: Uses external MBFC bundle, not internal track record calculation
12 * **User System**: Not implemented (no authentication in current version)
13
14 This specification remains valuable as the target architecture for future versions.
15
16 See `Docs/STATUS/Documentation_Inconsistencies.md` for full comparison.
17 {{/warning}}
18
19 FactHarbor's data model is **simple, focused, and designed for automated processing**.
20 == 1. Core Entities ==
21 === 1.1 Claim ===
22 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
23 ==== Performance Optimization: Denormalized Fields ====
24 **Rationale**: The claims system is about 95% reads and 5% writes. Denormalizing commonly displayed data removes roughly 70% of the joins on common queries, which speeds up claim list and search pages.
25 **Additional cached fields in claims table**:
26 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
27 * Avoids joining evidence table for listing/preview
28 * Updated when evidence is added/removed
29 * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
30 * **source_names** (TEXT[]): Array of source names for quick display
31 * Avoids joining through evidence to sources
32 * Updated when sources change
33 * Format: `["New York Times", "Nature Journal", ...]`
34 * **scenario_count** (INTEGER): Number of scenarios for this claim
35 * Quick metric without counting rows
36 * Updated when scenarios added/removed
37 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
38 * Helps invalidate stale caches
39 * Triggers background refresh if too old
40 **Update Strategy**:
41 * **Immediate**: Update on claim edit (user-facing)
42 * **Deferred**: Update via background job every hour (non-critical)
43 * **Invalidation**: Clear cache when source data changes significantly
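
The cached fields and the staleness check can be sketched as follows. This is a minimal illustration, assuming evidence rows arrive as dictionaries with `excerpt`, `source_name`, and `relevance_score` keys. The function names, the `source_name` key, and the one-hour threshold (mirroring the hourly deferred update) are assumptions, not part of the specification.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold, mirroring the hourly deferred refresh described above.
MAX_CACHE_AGE = timedelta(hours=1)


def build_evidence_summary(evidence_rows, top_n=5):
    """Return the top-N evidence snippets in the evidence_summary format."""
    ranked = sorted(evidence_rows, key=lambda e: e["relevance_score"], reverse=True)
    return [
        {"text": e["excerpt"], "source": e["source_name"], "relevance": e["relevance_score"]}
        for e in ranked[:top_n]
    ]


def build_source_names(evidence_rows):
    """Return distinct source names, preserving first-seen order."""
    return list(dict.fromkeys(e["source_name"] for e in evidence_rows))


def is_cache_stale(cache_updated_at, now=None, max_age=MAX_CACHE_AGE):
    """True when the denormalized fields are older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - cache_updated_at > max_age
```

A background job could call `is_cache_stale` on `cache_updated_at` and rebuild both fields only for claims where it returns `True`.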
44 **Trade-offs**:
45 * ✅ 70% fewer joins on common queries
46 * ✅ Much faster claim list/search pages
47 * ✅ Better user experience
48 * ⚠️ Small storage increase (~10%)
49 * ⚠️ Need to keep caches in sync
50 === 1.2 Evidence ===
51 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
52 === 1.3 Source ===
53 **Purpose**: Track reliability of information sources over time
54 **Fields**:
55 * **id** (UUID): Unique identifier
56 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
57 * **domain** (text): Website domain (e.g., "nytimes.com")
58 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
59 * **track_record_score** (0-100): Overall reliability score
60 * **accuracy_history** (JSON): Historical accuracy data
61 * **correction_frequency** (float): How often source publishes corrections
62 * **last_updated** (timestamp): When track record last recalculated
63 **How It Works**:
64 * Initial score based on source type (70 for academic journals, 30 for unknown)
65 * Updated weekly by background scheduler (Sunday 2 AM UTC, see the weekly background job below)
66 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
67 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
68 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
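
The weighting formula and the quality thresholds above can be sketched as a weighted sum followed by a band lookup. This is an illustration only: it assumes each component is already expressed on a 0-100 scale, and the function names are hypothetical.

```python
# Weights from the formula above; they sum to 1.0.
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

# (lower bound, label) pairs, checked from highest to lowest.
THRESHOLDS = [(90, "Exceptional"), (70, "Reliable"), (50, "Acceptable"), (30, "Questionable")]


def track_record_score(components):
    """Weighted sum of component scores, each assumed to be on a 0-100 scale."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def quality_label(score):
    """Map a 0-100 score to the quality band listed above."""
    for lower_bound, label in THRESHOLDS:
        if score >= lower_bound:
            return label
    return "Unreliable"
```

For example, component scores of 90, 80, 70, 60, and 100 (in the order of the formula) give 45 + 16 + 10.5 + 6 + 5 = 82.5, which falls in the Reliable band.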
69
70 {{info}}
71 **Current Implementation (v2.6.33):** Source reliability uses external **MBFC (Media Bias/Fact Check) bundle** instead of internal track record calculation. Scores are loaded from a configurable JSON file. See `Docs/ARCHITECTURE/Source_Reliability.md`.
72 {{/info}}
73
74 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
75 Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
76 **Key**: Automated source reliability tracking
77 ==== Source Scoring Process (Separation of Concerns) ====
78 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
79 **The Problem**:
80 * Source scores should influence claim verdicts
81 * Claim verdicts should update source scores
82 * But: Direct feedback creates circular dependency and potential feedback loops
83 **The Solution**: Temporal separation
84 ==== Weekly Background Job (Source Scoring) ====
85 Runs independently of claim analysis:
86 {{code language="python"}}
from datetime import datetime, timezone

# Helpers such as get_claims_from_past_week() and count_correct_verdicts_citing()
# are assumed to be provided by the data access layer.


def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics (neutral 0.5 when the source was not cited)
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = datetime.now(timezone.utc)
        source.save()


# Job runs: Sunday 2 AM UTC
# Never during claim processing
108 {{/code}}
109 ==== Real-Time Claim Analysis (AKEL) ====
110 Uses source scores but never updates them:
111 {{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score (0-100, normalized to 0-1) to weight evidence
        evidence.weighted_relevance = evidence.relevance_score * (source_score / 100)
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
130 {{/code}}
131 ==== Monthly Audit (Quality Assurance) ====
132 Moderator review of flagged source scores:
133 * Verify scores make sense
134 * Detect gaming attempts
135 * Identify systematic biases
136 * Manual adjustments if needed
137 **Key Principles**:
138 ✅ **Scoring and analysis are temporally separated**
139 * Source scoring: Weekly batch job
140 * Claim analysis: Real-time processing
141 * Never update scores during analysis
142 ✅ **One-way data flow during processing**
143 * Claims READ source scores
144 * Claims NEVER WRITE source scores
145 * Updates happen in background only
146 ✅ **Predictable update cycle**
147 * Sources update every Sunday 2 AM
148 * Claims always use last week's scores
149 * No mid-week score changes
150 ✅ **Audit trail**
151 * Log all score changes
152 * Track score history
153 * Explainable calculations
154 **Benefits**:
155 * No circular dependencies
156 * Predictable behavior
157 * Easier to reason about
158 * Simpler testing
159 * Clear audit trail
160 **Example Timeline**:
161 ```
162 Sunday 2 AM: Calculate source scores for past week
163 → NYT score: 87 (up from 85)
164 → Blog X score: 52 (down from 61)
165 Monday-Saturday: Claims processed using these scores
166 → All claims this week use NYT=87
167 → All claims this week use Blog X=52
168 Next Sunday 2 AM: Recalculate scores including this week's claims
169 → NYT score: 89 (trending up)
170 → Blog X score: 48 (trending down)
171 ```
172 === 1.4 Scenario ===
173
174 {{warning}}
175 **Implementation Change:** Scenarios were **replaced with KeyFactors** in the current implementation. KeyFactors are decomposition questions discovered during the understanding phase, not separate stored entities. See `Docs/ARCHITECTURE/KeyFactors_Design.md` for the design rationale.
176 {{/warning}}
177
178 **Purpose**: Different interpretations or contexts for evaluating claims
179 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
180 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
181 **Fields**:
182 * **id** (UUID): Unique identifier
183 * **claim_id** (UUID): Foreign key to claim (one-to-many)
184 * **description** (text): Human-readable description of the scenario
185 * **assumptions** (JSONB): Key assumptions that define this scenario context
186 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
187 * **created_at** (timestamp): When scenario was created
188 * **updated_at** (timestamp): Last modification
189 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
190 **Example**:
191 For claim "Vaccines reduce hospitalization":
192 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
193 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
194 * Scenario 3: "Immunocompromised patients" from specialist study
195 **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
196
197 === 1.5 Verdict ===
198
199 **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
200
201 **Core Fields**:
202 * **id** (UUID): Primary key
203 * **scenario_id** (UUID FK): The scenario being assessed
204 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
205 * **confidence** (decimal 0-1): How confident we are in this assessment
206 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
207 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
208 * **created_at** (timestamp): When verdict was created
209 * **updated_at** (timestamp): Last modification
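
Because `likelihood_range` is stored as text, consumers that need the numeric bounds have to parse it. The following is a minimal sketch, assuming the format shown in the field description (`"low-high (label)"`). The `LikelihoodRange` type and `parse_likelihood_range` function are hypothetical names.

```python
import re
from dataclasses import dataclass

# Matches strings such as "0.40-0.65 (uncertain)"; the label part is optional.
_RANGE_PATTERN = re.compile(r"^\s*(\d*\.?\d+)\s*-\s*(\d*\.?\d+)\s*(?:\((.*)\))?\s*$")


@dataclass(frozen=True)
class LikelihoodRange:
    low: float
    high: float
    label: str


def parse_likelihood_range(text):
    """Split a likelihood_range string into numeric bounds and a label."""
    match = _RANGE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognized likelihood_range: {text!r}")
    low, high = float(match.group(1)), float(match.group(2))
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError(f"Bounds must satisfy 0 <= low <= high <= 1: {text!r}")
    return LikelihoodRange(low, high, (match.group(3) or "").strip())
```

Validating the bounds at parse time catches malformed values early, which a free-text column cannot enforce on its own.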
210
211 **Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
212
213 **Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
214
215 **Example**:
216 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
217 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
218 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
219 * Edit entity records the complete before/after change with timestamp and reason
220
221 **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
222
223 === 1.6 User ===
224 Fields: username, email, **role** (Reader/Contributor/Moderator/Admin), **reputation**, contributions_count
225 === User Reputation System ===
226 **V1.0 Approach**: Simple manual role assignment
227 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
228 === Roles (Manual Assignment) ===
229 **reader** (default):
230 * View published claims and evidence
231 * Browse and search content
232 * No editing permissions
233 **contributor**:
234 * Submit new claims
235 * Suggest edits to existing content
236 * Add evidence
237 * Requires manual promotion by moderator/admin
238 **moderator**:
239 * Approve/reject contributor suggestions
240 * Flag inappropriate content
241 * Handle abuse reports
242 * Assigned by admins based on trust
243 **admin**:
244 * Manage users and roles
245 * System configuration
246 * Access to all features
247 * Founder-appointed initially
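
The role descriptions above can be expressed as a simple permission lookup. This is a sketch under stated assumptions: the permission names are hypothetical, and it assumes each role also holds the permissions of the roles below it, which the specification states explicitly only for admins ("access to all features").

```python
from enum import Enum


class Role(Enum):
    READER = "reader"
    CONTRIBUTOR = "contributor"
    MODERATOR = "moderator"
    ADMIN = "admin"


# Hypothetical permission names derived from the bullet lists above.
PERMISSIONS = {
    Role.READER: {"view_content", "search_content"},
    Role.CONTRIBUTOR: {"submit_claim", "suggest_edit", "add_evidence"},
    Role.MODERATOR: {"review_suggestion", "flag_content", "handle_abuse_report"},
    Role.ADMIN: {"manage_users", "configure_system"},
}

# Assumption: each role also holds the permissions of the roles before it.
ROLE_ORDER = [Role.READER, Role.CONTRIBUTOR, Role.MODERATOR, Role.ADMIN]


def can(role, permission):
    """Return True if the role, or any lower role, grants the permission."""
    allowed = set()
    for lower_role in ROLE_ORDER[: ROLE_ORDER.index(role) + 1]:
        allowed |= PERMISSIONS[lower_role]
    return permission in allowed
```

Keeping the mapping in one table makes manual role assignment the only moving part, in line with the "no automated privilege escalation" approach.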
248 === Contribution Tracking (Simple) ===
249 **Basic metrics only**:
250 * `contributions_count`: Total number of contributions
251 * `created_at`: Account age
252 * `last_active`: Recent activity
253 **No complex calculations**:
254 * No point systems
255 * No automated privilege escalation
256 * No reputation decay
257 * No threshold-based promotions
258 === Promotion Process ===
259 **Manual review by moderators/admins**:
260 1. User demonstrates value through contributions
261 2. Moderator reviews user's contribution history
262 3. Moderator promotes user to contributor role
263 4. Admin promotes trusted contributors to moderator
264 **Criteria** (guidelines, not automated):
265 * Quality of contributions
266 * Consistency over time
267 * Collaborative behavior
268 * Understanding of project goals
269 === V2.0+ Evolution ===
270 **Add complex reputation when**:
271 * 100+ active contributors
272 * Manual role management becomes bottleneck
273 * Clear patterns of abuse emerge requiring automation
274 **Future features may include**:
275 * Automated point calculations
276 * Threshold-based promotions
277 * Reputation decay for inactive users
278 * Track record scoring for contributors
279 See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
280 === 1.7 Edit ===
281 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
282 **Purpose**: Complete audit trail for all content changes
283 === Edit History Details ===
284 **What Gets Edited**:
285 * **Claims** (20% edited): assertion, domain, status, scores, analysis
286 * **Evidence** (10% edited): excerpt, relevance_score, supports
287 * **Scenarios** (5% edited): description, assumptions, confidence
288 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
289 **Who Edits**:
290 * **Contributors** (rep sufficient): Corrections, additions
291 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
292 * **Moderators**: Abuse handling, dispute resolution
293 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
294 **Edit Types**:
295 * `CONTENT_CORRECTION`: User fixes factual error
296 * `CLARIFICATION`: Improved wording
297 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
298 * `MODERATION_ACTION`: Hide/unhide for abuse
299 * `REVERT`: Rollback to previous version
300 **Retention Policy** (5 years total, tiers defined by record age):
301 1. **Hot storage** (first 3 months): PostgreSQL, instant access
302 2. **Warm storage** (until 2 years old): Partitioned, slower queries
303 3. **Cold storage** (from 2 to 5 years old, i.e. 3 years): S3 compressed, download required
304 4. **Deletion**: After 5 years (except legal holds)
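
The retention tiers can be sketched as a function of record age. This assumes the tiers are delimited by the age of the edit record (3 months, 2 years, 5 years), which is one reading of the policy above. Months and years are approximated in days, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed age boundaries; months and years are approximated in days.
HOT_LIMIT = timedelta(days=90)
WARM_LIMIT = timedelta(days=2 * 365)
DELETE_AFTER = timedelta(days=5 * 365)


def storage_tier(created_at, now=None, legal_hold=False):
    """Return 'hot', 'warm', 'cold', or 'delete' for an edit record."""
    now = now or datetime.now(timezone.utc)
    age = now - created_at
    if age < HOT_LIMIT:
        return "hot"
    if age < WARM_LIMIT:
        return "warm"
    # Records under legal hold stay in cold storage past the 5-year limit.
    if age < DELETE_AFTER or legal_hold:
        return "cold"
    return "delete"
```

A nightly job could apply this function to each edit's `created_at` and move or delete records whose tier has changed.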
305 **Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
306 **Use Cases**:
307 * View claim history timeline
308 * Detect vandalism patterns
309 * Learn from user corrections (system improvement)
310 * Legal compliance (audit trail)
311 * Rollback capability
312 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
313 === 1.8 Flag ===
314 Fields: entity_id, reported_by, issue_type, status, resolution_note
315 === 1.9 QualityMetric ===
316 **Fields**: metric_type, category, value, target, timestamp
317 **Purpose**: Time-series quality tracking
318 **Usage**:
319 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
320 * **Quality dashboard**: Real-time display with trend charts
321 * **Alerting**: Automatic alerts when metrics exceed thresholds
322 * **A/B testing**: Compare control vs treatment metrics
323 * **Improvement validation**: Measure before/after changes
324 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
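
The alerting rule can be illustrated with the example record above. This is a minimal sketch: the `needs_alert` function and its `higher_is_better` flag are assumptions, since the specification does not say how the direction of a metric is recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def needs_alert(metric, higher_is_better=False):
    """Return True when the metric misses its target."""
    if higher_is_better:
        return metric.value < metric.target
    return metric.value > metric.target


# The example record from above: error rate 0.12 against a target of 0.10.
example = QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17")
```

Here `needs_alert(example)` is true, because the error rate of 0.12 exceeds the 0.10 target.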
325 === 1.10 ErrorPattern ===
326 **Fields**: error_category, claim_id, description, root_cause, frequency, status
327 **Purpose**: Capture errors to trigger system improvements
328 **Usage**:
329 * **Error capture**: When users flag issues or system detects problems
330 * **Pattern analysis**: Weekly grouping by category and frequency
331 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
332 * **Metrics**: Track error rate reduction over time
333 **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
334
335 === 1.11 Core Data Model ERD ===
336
337 {{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
338
339 === 1.12 User Class Diagram ===
340 {{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
341 == 2. Versioning Strategy ==
342 **All Content Entities Are Versioned**:
343 * **Claim**: Every edit creates new version (V1→V2→V3...)
344 * **Evidence**: Changes tracked in edit history
345 * **Scenario**: Modifications versioned
346 **How Versioning Works**:
347 * Entity table stores **current state only**
348 * Edit table stores **all historical states** (before_state, after_state as JSON)
349 * Version number increments with each edit
350 * Complete audit trail maintained for the full retention period (5 years, see section 1.7)
351 **Unversioned Entities** (current state only, no history):
352 * **Source**: Track record continuously updated (not versioned history, just current score)
353 * **User**: Account state (reputation accumulated, not versioned)
354 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
355 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
356 **Example**:
357 ```
358 Claim V1: "The sky is blue"
359 → User edits →
360 Claim V2: "The sky is blue during daytime"
361 → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
362 ```
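
The versioning flow in the example can be sketched as one function that updates the current state and appends an Edit record. This is an illustration using plain dictionaries and an in-memory list. The `apply_edit` name is hypothetical, and a real implementation would write both changes in a single database transaction.

```python
import copy
from datetime import datetime, timezone


def apply_edit(claim, changes, edit_log, user_id, edit_type, reason):
    """Update a claim in place and append one Edit record with before/after states."""
    before_state = copy.deepcopy(claim)
    claim.update(changes)
    claim["version"] += 1
    edit_log.append({
        "entity_type": "Claim",
        "entity_id": claim["id"],
        "user_id": user_id,
        "before_state": before_state,
        "after_state": copy.deepcopy(claim),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return claim


claim = {"id": "c1", "assertion": "The sky is blue", "version": 1}
edit_log = []
apply_edit(
    claim,
    {"assertion": "The sky is blue during daytime"},
    edit_log,
    user_id="u42",
    edit_type="CLARIFICATION",
    reason="Added time-of-day qualifier",
)
```

After the call, the entity holds only the V2 state, while `edit_log` holds the full before/after snapshot needed for a later `REVERT`.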
363 == 2.5. Storage vs Computation Strategy ==
364 **Critical architectural decision**: What to persist in databases vs compute dynamically?
365 **Trade-off**:
366 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
367 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
368 === Recommendation: Hybrid Approach ===
369 **STORE (in PostgreSQL):**
370 ==== Claims (Current State + History) ====
371 * **What**: assertion, domain, status, created_at, updated_at, version
372 * **Why**: Core entity, must be persistent
373 * **Also store**: confidence_score (computed once, then cached)
374 * **Size**: ~1 KB per claim
375 * **Growth**: Linear with claims
376 * **Decision**: ✅ STORE - Essential
377 ==== Evidence (All Records) ====
378 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
379 * **Why**: Hard to re-gather, user contributions, reproducibility
380 * **Size**: ~2 KB per evidence (with excerpt)
381 * **Growth**: 3-10 evidence per claim
382 * **Decision**: ✅ STORE - Essential for reproducibility
383 ==== Sources (Track Records) ====
384 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
385 * **Why**: Continuously updated, expensive to recompute
386 * **Size**: ~500 bytes per source
387 * **Growth**: Slow (limited number of sources)
388 * **Decision**: ✅ STORE - Essential for quality
389 ==== Edit History (All Versions) ====
390 * **What**: before_state, after_state, user_id, reason, timestamp
391 * **Why**: Audit trail, legal requirement, reproducibility
392 * **Size**: ~2 KB per edit
393 * **Growth**: Linear with edits (~20% of claims get edited)
394 * **Retention**: Hot storage for the first 3 months → Warm storage until 2 years old → Archive to S3 for 3 more years → Delete after 5 years total
395 * **Decision**: ✅ STORE - Essential for accountability
396 ==== Flags (User Reports) ====
397 * **What**: entity_id, reported_by, issue_type, description, status
398 * **Why**: Error detection, system improvement triggers
399 * **Size**: ~500 bytes per flag
400 * **Growth**: About 10% of claims get flagged (the rate used in the summary table)
401 * **Decision**: ✅ STORE - Essential for improvement
402 ==== ErrorPatterns (System Improvement) ====
403 * **What**: error_category, claim_id, description, root_cause, frequency, status
404 * **Why**: Learning loop, prevent recurring errors
405 * **Size**: ~1 KB per pattern
406 * **Growth**: Slow (limited patterns, many fixed)
407 * **Decision**: ✅ STORE - Essential for learning
408 ==== QualityMetrics (Time Series) ====
409 * **What**: metric_type, category, value, target, timestamp
410 * **Why**: Trend analysis, cannot recreate historical metrics
411 * **Size**: ~200 bytes per metric
412 * **Growth**: Hourly = 8,760 per year per metric type
413 * **Retention**: 2 years hot, then aggregate and archive
414 * **Decision**: ✅ STORE - Essential for monitoring
415 **STORE (Computed Once, Then Cached):**
416 ==== Analysis Summary ====
417 * **What**: Neutral text summary of claim analysis (200-500 words)
418 * **Computed**: Once by AKEL when claim first analyzed
419 * **Stored in**: Claim table (text field)
420 * **Recomputed**: Only when system significantly improves OR claim edited
421 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
422 * **Size**: ~2 KB per claim
423 * **Decision**: ✅ STORE (cached) - Cost-effective
424 ==== Confidence Score ====
425 * **What**: 0-100 score of analysis confidence
426 * **Computed**: Once by AKEL
427 * **Stored in**: Claim table (integer field)
428 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
429 * **Why store**: Cheap to store, expensive to compute, users need it fast
430 * **Size**: 4 bytes per claim
431 * **Decision**: ✅ STORE (cached) - Performance critical
432 ==== Risk Score ====
433 * **What**: 0-100 score of claim risk level
434 * **Computed**: Once by AKEL
435 * **Stored in**: Claim table (integer field)
436 * **Recomputed**: When domain changes, evidence changes, or controversy detected
437 * **Why store**: Same as confidence score
438 * **Size**: 4 bytes per claim
439 * **Decision**: ✅ STORE (cached) - Performance critical
440 **COMPUTE DYNAMICALLY OR CACHE (Decided Case by Case):**
441 ==== Scenarios (⚠️ Critical Decision) ====
442 * **What**: 2-5 possible interpretations of claim with assumptions
443 * **Current design**: Stored in Scenario table
444 * **Alternative**: Compute on-demand when user views claim details
445 * **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
446 * **Compute cost**: $0.005-0.01 per request (LLM API call)
447 * **Frequency**: Viewed in detail by ~20% of users
448 * **Trade-off analysis**:
449 - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
450 - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
451 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
452 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
453 * **Decision**: ✅ STORE (hybrid approach below)
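
The trade-off arithmetic above can be reproduced with two small functions. This is a sketch using the figures from the list (3 KB per claim, 20% detail views, $0.01 per LLM call). It assumes decimal units and one scenario-generation call per detail view, and the function names are hypothetical.

```python
def storage_gb(claims, kb_per_claim=3):
    """Total scenario storage in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return claims * kb_per_claim / 1_000_000


def compute_cost_usd(claims, detail_view_rate=0.20, cost_per_call=0.01):
    """LLM cost if every detail view triggers one scenario generation call."""
    return claims * detail_view_rate * cost_per_call
```

For 1 million claims this gives about 3 GB of storage against roughly $2,000 in LLM calls, which is the gap that motivates storing scenarios.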
454 **Scenario Strategy** (APPROVED):
455 1. **Store scenarios** initially when claim analyzed
456 2. **Mark as stale** when system improves significantly
457 3. **Recompute on next view** if marked stale
458 4. **Cache for 30 days** if frequently accessed
459 5. **Result**: Best of both worlds - speed + freshness
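
The store, mark-stale, and recompute-on-view steps above can be sketched as a small cache object. This is an in-memory illustration only: `ScenarioCache`, `mark_stale`, and `get_scenarios` are hypothetical names, the `recompute` callback stands in for the LLM call, and the 30-day cache for frequently accessed claims is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioCache:
    """Stored scenarios for one claim plus a staleness marker."""
    scenarios: list = field(default_factory=list)
    stale: bool = True  # nothing stored yet, so the first view must compute


def mark_stale(cache):
    """Call when a significant system improvement is deployed."""
    cache.stale = True


def get_scenarios(cache, recompute):
    """Return stored scenarios, recomputing first if they are marked stale."""
    if cache.stale:
        cache.scenarios = recompute()
        cache.stale = False
    return cache.scenarios
```

Repeated views between improvements reuse the stored list, so the LLM is called at most once per claim per staleness event.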
460 ==== Verdict Synthesis ====
461 * **What**: Final conclusion text synthesizing all scenarios
462 * **Compute cost**: $0.002-0.005 per request
463 * **Frequency**: Every time claim viewed
464 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
465 * **Speed**: 2-3 seconds (acceptable)
466 * **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
467 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
468 ==== Search Results ====
469 * **What**: Lists of claims matching search query
470 * **Compute from**: Elasticsearch index
471 * **Cache**: 15 minutes in Redis for popular queries
472 * **Why not store permanently**: Constantly changing, infinite possible queries
473 ==== Aggregated Statistics ====
474 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
475 * **Compute from**: Database queries
476 * **Cache**: 1 hour in Redis
477 * **Why not store**: Can be derived, relatively cheap to compute
478 ==== User Reputation ====
479 * **What**: Score based on contributions
480 * **Current design**: Stored in User table
481 * **Alternative**: Compute from Edit table
482 * **Trade-off**:
483 - Stored: Fast, simple
484 - Computed: Always accurate, no denormalization
485 * **Frequency**: Read on every user action
486 * **Compute cost**: Simple COUNT query, milliseconds
487 * **Decision**: ✅ STORE - Performance critical, read-heavy
488 === Summary Table ===
489 | Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
490 |-----------|---------|---------|----------------|----------|-----------|
491 | Claim core | ✅ | - | 1 KB | STORE | Essential |
492 | Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
493 | Sources | ✅ | - | 500 B (shared) | STORE | Track record |
494 | Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
495 | Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
496 | Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
497 | Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
498 | Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
499 | Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
500 | Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
501 | ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
502 | QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
503 | Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
504 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
505 **Total storage per claim**: ~18 KB (without edits and flags)
506 **For 1 million claims**:
507 * **Storage**: ~18 GB (manageable)
508 * **PostgreSQL**: ~$50/month (standard instance)
509 * **Redis cache**: ~$20/month (1 GB instance)
510 * **S3 archives**: ~$5/month (old edits)
511 * **Total**: ~$75/month infrastructure
512 **LLM cost savings by caching**:
513 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
514 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
515 * Verdict stored: Save $0.003 per claim = $3K per 1M claims
516 * **Total savings**: ~$35K per 1M claims vs recomputing every time
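
The savings figures above follow from multiplying the per-claim saving by the share of claims affected. A short sketch of that arithmetic, using only the numbers from the list (the dictionary keys and function name are hypothetical):

```python
CLAIMS = 1_000_000

# (saving per claim in USD, fraction of claims affected), taken from the list above.
SAVINGS = {
    "analysis_summary": (0.03, 1.0),
    "scenarios": (0.01, 0.20),
    "verdict": (0.003, 1.0),
}


def savings_usd(claims=CLAIMS):
    """Return the estimated saving per cached item for the given claim count."""
    return {
        name: claims * per_claim * share
        for name, (per_claim, share) in SAVINGS.items()
    }


total = sum(savings_usd().values())
```

The three items come to about $30K, $2K, and $3K respectively, which is where the ~$35K total comes from.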
517 === Recomputation Triggers ===
518 **When to mark cached data as stale and recompute:**
519 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
520 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
521 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
522 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
523 5. **Controversy detected** (high flag rate) → Recompute: risk score
524 **Recomputation strategy**:
525 * **Eager**: Immediately recompute (for user edits)
526 * **Lazy**: Recompute on next view (for system improvements)
527 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
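
The trigger list above can be captured as a lookup table from event to the cached fields that need recomputing. This is a sketch under stated assumptions: the trigger and field names are hypothetical, and treating a system improvement as affecting all cached fields is an assumption, since the specification only says that affected claims are marked stale.

```python
ALL_FIELDS = frozenset({"analysis", "scenarios", "verdict", "confidence_score", "risk_score"})

# Hypothetical trigger names; field sets follow the numbered list above.
RECOMPUTE = {
    "claim_edited": ALL_FIELDS,
    "evidence_added": frozenset({"scenarios", "verdict", "confidence_score"}),
    "source_score_changed": frozenset({"confidence_score", "verdict"}),
    "system_improved": ALL_FIELDS,  # assumption: everything is marked stale
    "controversy_detected": frozenset({"risk_score"}),
}

# Source score changes of 10 points or fewer do not trigger recomputation.
SOURCE_SCORE_THRESHOLD = 10


def fields_to_recompute(trigger, source_score_delta=0):
    """Return the cached fields to mark stale for a given trigger."""
    if trigger == "source_score_changed" and abs(source_score_delta) <= SOURCE_SCORE_THRESHOLD:
        return frozenset()
    return RECOMPUTE[trigger]
```

Whether the returned fields are recomputed eagerly, lazily, or in the nightly batch is a separate decision, as described in the strategy list above.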
528 === Database Size Projection ===
529 **Year 1**: 10K claims
530 * Storage: 180 MB
531 * Cost: $10/month
532 **Year 3**: 100K claims
533 * Storage: 1.8 GB
534 * Cost: $30/month
535 **Year 5**: 1M claims
536 * Storage: 18 GB
537 * Cost: $75/month
538 **Year 10**: 10M claims
539 * Storage: 180 GB
540 * Cost: $300/month
541 * Optimization: Archive old claims to S3 ($5/TB/month)
542 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
543 == 3. Key Simplifications ==
544 * **Two content states only**: Published, Hidden
545 * **Four user roles only**: Reader, Contributor, Moderator, Admin
546 * **No complex versioning**: Linear edit history
547 * **Role-based permissions**: Roles are assigned manually; no automated reputation system in V1.0
548 * **Source track records**: Continuous evaluation
549 == 3. What Gets Stored in the Database ==
550 === 3.1 Primary Storage (PostgreSQL) ===
551 **Claims Table**:
552 * Current state only (latest version)
553 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
554 **Evidence Table**:
555 * All evidence records
556 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
557 **Scenario Table**:
558 * All scenarios for each claim
559 * Fields: id, claim_id, description, assumptions (JSONB), confidence, created_by, created_at
560 **Source Table**:
561 * Track record database (continuously updated)
562 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
563 **User Table**:
564 * All user accounts
565 * Fields: id, username, email, role (Reader/Contributor/Moderator/Admin), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
566 **Edit Table**:
567 * Complete version history
568 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
569 **Flag Table**:
570 * User-reported issues
571 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
572 **ErrorPattern Table**:
573 * System improvement queue
574 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
575 **QualityMetric Table**:
576 * Time-series quality data
577 * Fields: id, metric_type, metric_category, value, target, timestamp
578 === 3.2 What's NOT Stored (Computed on-the-fly) ===
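579 * **Verdicts**: Synthesized from evidence + scenarios; only a cached copy is kept, and it is recomputed when marked stale (see section 2.5)
580 * **Risk scores**: Recalculated from current factors when a recomputation trigger fires; the claims table holds only the cached value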
581 * **Aggregated statistics**: Computed from base data
582 * **Search results**: Generated from Elasticsearch index
583 === 3.3 Cache Layer (Redis) ===
584
585 {{warning}}
586 **Implementation Status:** Redis caching is **NOT YET IMPLEMENTED**. Current implementation has no caching layer.
587 {{/warning}}
588
589 **Cached for performance (Planned)**:
590 * Frequently accessed claims (TTL: 1 hour)
591 * Search results (TTL: 15 minutes)
592 * User sessions (TTL: 24 hours)
593 * Source track records (TTL: 1 hour)
594 === 3.4 File Storage (S3) ===
595 **Archived content**:
596 * Old edit history (cold storage tier, older than 2 years)
597 * Evidence documents (archived copies)
598 * Database backups
599 * Export files
600 === 3.5 Search Index (Elasticsearch) ===
601 **Indexed for search**:
602 * Claim assertions (full-text)
603 * Evidence excerpts (full-text)
604 * Scenario descriptions (full-text)
605 * Source names (autocomplete)
606 Synchronized from PostgreSQL via change data capture or periodic sync.
607 == 4. Related Pages ==
608 * [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
609 * [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
610 * [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]