Wiki source code of Data Model

Last modified by Robert Schaub on 2026/02/08 08:32

version	line-number	content
1.1	1	= Data Model =
	2
	3	{{warning}}
	4	Implementation Status (Updated January 2026)
	5
	6	This specification describes the target normalized data model. Current implementation (v2.6.33) differs significantly:
	7
	8	* Storage: All data stored as JSON blobs in SQLite, not normalized PostgreSQL tables
	9	* Scenarios: Replaced with KeyFactors - decomposition questions, not separate entities
	10	* Caching: Redis cache not implemented; no claim caching yet
	11	* Source Scoring: Uses external MBFC bundle, not internal track record calculation
	12	* User System: Not implemented (no authentication in current version)
	13
	14	This specification remains valuable as the target architecture for future versions.
	15
	16	See `Docs/STATUS/Documentation_Inconsistencies.md` for full comparison.
	17	{{/warning}}
	18
	19	FactHarbor's data model is simple, focused, and designed for automated processing.
1.2	20
1.1	21	== 1. Core Entities ==
1.2	22
1.1	23	=== 1.1 Claim ===
1.2	24
1.1	25	Fields: id, assertion, domain, status (Published/Hidden only), confidence_score, risk_score, completeness_score, version, views, edit_count
1.2	26
1.1	27	==== Performance Optimization: Denormalized Fields ====
1.2	28
1.1	29	Rationale: The claims system is read-heavy (about 95% reads, 5% writes). Denormalizing commonly displayed data is expected to cut joins on common queries by about 70%, which speeds up list and search pages.
	30	Additional cached fields in claims table:
1.2	31
1.1	32	* evidence_summary (JSONB): Top 5 most relevant evidence snippets with scores
1.2	33	** Avoids joining evidence table for listing/preview
	34	** Updated when evidence is added/removed
	35	** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
1.1	36	* source_names (TEXT[]): Array of source names for quick display
1.2	37	** Avoids joining through evidence to sources
	38	** Updated when sources change
	39	** Format: `["New York Times", "Nature Journal", ...]`
1.1	40	* scenario_count (INTEGER): Number of scenarios for this claim
1.2	41	** Quick metric without counting rows
	42	** Updated when scenarios are added/removed
1.1	43	* cache_updated_at (TIMESTAMP): When denormalized data was last refreshed
1.2	44	** Helps invalidate stale caches
	45	** Triggers background refresh if too old
1.1	46	Update Strategy:
	47	* Immediate: Update on claim edit (user-facing)
	48	* Deferred: Update via background job every hour (non-critical)
	49	* Invalidation: Clear cache when source data changes significantly
	50	Trade-offs:
	51	* ✅ 70% fewer joins on common queries
	52	* ✅ Much faster claim list/search pages
	53	* ✅ Better user experience
1.2	54	* ⚠️ Small storage increase (10%)
1.1	55	* ⚠️ Need to keep caches in sync
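
The refresh logic for the cached fields can be sketched as follows. This is a minimal illustration, not the implementation: the function names, the in-memory row shape, and the one-hour staleness limit (taken from the hourly deferred update above) are assumptions.

```python
from datetime import datetime, timedelta, timezone

MAX_SNIPPETS = 5                    # "Top 5" from the evidence_summary description
MAX_CACHE_AGE = timedelta(hours=1)  # assumption: mirrors the hourly deferred refresh


def build_evidence_summary(evidence_rows):
    """Return the most relevant snippets in the documented JSON shape."""
    ranked = sorted(evidence_rows, key=lambda row: row["relevance"], reverse=True)
    return [
        {"text": row["text"], "source": row["source"], "relevance": row["relevance"]}
        for row in ranked[:MAX_SNIPPETS]
    ]


def is_cache_stale(cache_updated_at, now=None):
    """True when the denormalized fields are older than the allowed age."""
    now = now or datetime.now(timezone.utc)
    return now - cache_updated_at > MAX_CACHE_AGE
```

A background job could call `is_cache_stale` on each claim and rebuild `evidence_summary` only for the rows that need it.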
1.2	56
1.1	57	=== 1.2 Evidence ===
1.2	58
1.1	59	Fields: claim_id, source_id, excerpt, url, relevance_score, supports
1.2	60
1.1	61	=== 1.3 Source ===
1.2	62
1.1	63	Purpose: Track reliability of information sources over time
	64	Fields:
1.2	65
1.1	66	* id (UUID): Unique identifier
	67	* name (text): Source name (e.g., "New York Times", "Nature Journal")
	68	* domain (text): Website domain (e.g., "nytimes.com")
	69	* type (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
	70	* track_record_score (0-100): Overall reliability score
	71	* accuracy_history (JSON): Historical accuracy data
	72	* correction_frequency (float): How often source publishes corrections
	73	* last_updated (timestamp): When track record last recalculated
	74	How It Works:
	75	* Initial score based on source type (70 for academic journals, 30 for unknown)
	76	* Updated weekly by a background scheduler (see Source Scoring Process below)
	77	* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
	78	* Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
	79	* Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
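
The formula and thresholds above can be expressed as a short sketch. It assumes each of the five components is already scored on a 0-100 scale; the function and constant names are illustrative and not part of the specification.

```python
# Weights from the formula above (they sum to 100%).
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

# Lower bounds of the quality bands, highest first.
QUALITY_BANDS = [
    (90, "Exceptional"),
    (70, "Reliable"),
    (50, "Acceptable"),
    (30, "Questionable"),
]


def track_record_score(components):
    """Weighted sum of the five components (each assumed to be 0-100)."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


def quality_label(score):
    """Map a 0-100 score to its quality band."""
    for lower_bound, label in QUALITY_BANDS:
        if score >= lower_bound:
            return label
    return "Unreliable"
```

For example, component scores of 90, 80, 70, 60, and 100 (in the order listed) give 45 + 16 + 10.5 + 6 + 5 = 82.5, which falls in the Reliable band.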
	80
	81	{{info}}
	82	Current Implementation (v2.6.33): Source reliability uses external MBFC (Media Bias/Fact Check) bundle instead of internal track record calculation. Scores are loaded from a configurable JSON file. See `Docs/ARCHITECTURE/Source_Reliability.md`.
	83	{{/info}}
	84
	85	See: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
	86	Fields: id, name, domain, track_record_score, accuracy_history, correction_frequency
	87	Key: Automated source reliability tracking
1.2	88
1.1	89	==== Source Scoring Process (Separation of Concerns) ====
1.2	90
1.1	91	Critical design principle: Prevent circular dependencies between source scoring and claim analysis.
	92	The Problem:
1.2	93
1.1	94	* Source scores should influence claim verdicts
	95	* Claim verdicts should update source scores
	96	* But: Direct feedback creates circular dependency and potential feedback loops
	97	The Solution: Temporal separation
1.2	98
1.1	99	==== Weekly Background Job (Source Scoring) ====
1.2	100
1.1	101	Runs independently of claim analysis:
{{code language="python"}}from datetime import datetime, timezone

# Helper functions (get_claims_from_past_week, get_all_sources, and so on)
# are assumed to be provided by the data layer.


def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics (0.5 is a neutral default for uncited sources)
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance; accuracy feeds the weighted score
        weighted_score = calculate_weighted_score(source, claims, accuracy)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = datetime.now(timezone.utc)
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing{{/code}}
	123
1.1	124	==== Real-Time Claim Analysis (AKEL) ====
1.2	125
1.1	126	Uses source scores but never updates them:
{{code language="python"}}def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence (track_record_score is 0-100, so scale to 0-1)
        evidence.weighted_relevance = evidence.relevance * (source_score / 100)
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here;
    # that happens in the weekly background job
    return verdict{{/code}}
	145
1.1	146	==== Monthly Audit (Quality Assurance) ====
1.2	147
1.1	148	Moderator review of flagged source scores:
1.2	149
1.1	150	* Verify scores make sense
	151	* Detect gaming attempts
	152	* Identify systematic biases
	153	* Manual adjustments if needed
	154	Key Principles:
	155	✅ Scoring and analysis are temporally separated
	156	* Source scoring: Weekly batch job
	157	* Claim analysis: Real-time processing
	158	* Never update scores during analysis
	159	✅ One-way data flow during processing
	160	* Claims READ source scores
	161	* Claims NEVER WRITE source scores
	162	* Updates happen in background only
	163	✅ Predictable update cycle
	164	* Sources update every Sunday 2 AM
	165	* Claims always use last week's scores
	166	* No mid-week score changes
	167	✅ Audit trail
	168	* Log all score changes
	169	* Track score history
	170	* Explainable calculations
	171	Benefits:
	172	* No circular dependencies
	173	* Predictable behavior
	174	* Easier to reason about
	175	* Simpler testing
	176	* Clear audit trail
	177	Example Timeline:
	178	```
	179	Sunday 2 AM: Calculate source scores for past week
	180	→ NYT score: 87 (up from 85)
	181	→ Blog X score: 52 (down from 61)
	182	Monday-Saturday: Claims processed using these scores
	183	→ All claims this week use NYT=87
	184	→ All claims this week use Blog X=52
	185	Next Sunday 2 AM: Recalculate scores including this week's claims
	186	→ NYT score: 89 (trending up)
	187	→ Blog X score: 48 (trending down)
	188	```
1.2	189
1.1	190	=== 1.4 Scenario ===
	191
	192	{{warning}}
	193	Implementation Change: Scenarios were replaced with KeyFactors in the current implementation. KeyFactors are decomposition questions discovered during the understanding phase, not separate stored entities. See `Docs/ARCHITECTURE/KeyFactors_Design.md` for the design rationale.
	194	{{/warning}}
	195
	196	Purpose: Different interpretations or contexts for evaluating claims
	197	Key Concept: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
	198	Relationship: One-to-many with Claims (simplified for V1.0: scenario belongs to single claim)
	199	Fields:
1.2	200
1.1	201	* id (UUID): Unique identifier
	202	* claim_id (UUID): Foreign key to claim (one-to-many)
	203	* description (text): Human-readable description of the scenario
	204	* assumptions (JSONB): Key assumptions that define this scenario context
	205	* extracted_from (UUID): Reference to evidence that this scenario was extracted from
	206	* created_at (timestamp): When scenario was created
	207	* updated_at (timestamp): Last modification
	208	How Found: Evidence search → Extract context → Create scenario → Link to claim
	209	Example:
	210	For claim "Vaccines reduce hospitalization":
	211	* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
	212	* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
	213	* Scenario 3: "Immunocompromised patients" from specialist study
	214	V2.0 Evolution: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
	215
	216	=== 1.5 Verdict ===
	217
	218	Purpose: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
	219
	220	Core Fields:
1.2	221
1.1	222	* id (UUID): Primary key
	223	* scenario_id (UUID FK): The scenario being assessed
	224	* likelihood_range (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
	225	* confidence (decimal 0-1): How confident we are in this assessment
	226	* explanation_summary (text): Human-readable reasoning explaining the verdict
	227	* uncertainty_factors (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
	228	* created_at (timestamp): When verdict was created
	229	* updated_at (timestamp): Last modification
	230
	231	Change Tracking: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
	232
	233	Relationship: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
	234
	235	Example:
	236	For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
1.2	237
1.1	238	* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
	239	* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
	240	* Edit entity records the complete before/after change with timestamp and reason
	241
	242	Key Design: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
	243
	244	=== 1.6 User ===
1.2	245
1.1	246	Fields: username, email, role (Reader/Contributor/Moderator), reputation, contributions_count
1.2	247
	248	==== User Reputation System ====
	249
1.1	250	V1.0 Approach: Simple manual role assignment
	251	Rationale: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
1.2	252
1.1	253	==== Roles (Manual Assignment) ====
1.2	254
1.1	255	reader (default):
1.2	256
1.1	257	* View published claims and evidence
	258	* Browse and search content
	259	* No editing permissions
	260	contributor:
	261	* Submit new claims
	262	* Suggest edits to existing content
	263	* Add evidence
	264	* Requires manual promotion by moderator/admin
	265	moderator:
	266	* Approve/reject contributor suggestions
	267	* Flag inappropriate content
	268	* Handle abuse reports
	269	* Assigned by admins based on trust
	270	admin:
	271	* Manage users and roles
	272	* System configuration
	273	* Access to all features
	274	* Founder-appointed initially
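
The role list above can be read as a simple permission lookup. The sketch below is illustrative only: the action names are hypothetical, and it assumes each role inherits the permissions of the role before it, which the specification does not state explicitly.

```python
# Action names are hypothetical labels for the capabilities listed above.
READER = {"view_published", "search"}
CONTRIBUTOR = READER | {"submit_claim", "suggest_edit", "add_evidence"}
MODERATOR = CONTRIBUTOR | {
    "review_suggestion",
    "flag_content",
    "handle_abuse_report",
    "promote_to_contributor",
}
ADMIN = MODERATOR | {"manage_users", "configure_system"}

ROLE_PERMISSIONS = {
    "reader": READER,
    "contributor": CONTRIBUTOR,
    "moderator": MODERATOR,
    "admin": ADMIN,
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Because roles are assigned manually, no reputation value is consulted here; a check such as `can("contributor", "add_evidence")` depends only on the stored role.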
1.2	275
1.1	276	==== Contribution Tracking (Simple) ====
1.2	277
1.1	278	Basic metrics only:
1.2	279
1.1	280	* `contributions_count`: Total number of contributions
	281	* `created_at`: Account age
	282	* `last_active`: Recent activity
	283	No complex calculations:
	284	* No point systems
	285	* No automated privilege escalation
	286	* No reputation decay
	287	* No threshold-based promotions
1.2	288
1.1	289	==== Promotion Process ====
1.2	290
1.1	291	Manual review by moderators/admins:
1.2	292
1.1	293	1. User demonstrates value through contributions
	294	2. Moderator reviews user's contribution history
	295	3. Moderator promotes user to contributor role
	296	4. Admin promotes trusted contributors to moderator
	297	Criteria (guidelines, not automated):
1.2	298
1.1	299	* Quality of contributions
	300	* Consistency over time
	301	* Collaborative behavior
	302	* Understanding of project goals
1.2	303
1.1	304	==== V2.0+ Evolution ====
1.2	305
1.1	306	Add complex reputation when:
1.2	307
1.1	308	* 100+ active contributors
	309	* Manual role management becomes bottleneck
	310	* Clear patterns of abuse emerge requiring automation
	311	Future features may include:
	312	* Automated point calculations
	313	* Threshold-based promotions
	314	* Reputation decay for inactive users
	315	* Track record scoring for contributors
	316	See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
1.2	317
1.1	318	=== 1.7 Edit ===
1.2	319
1.1	320	Fields: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
	321	Purpose: Complete audit trail for all content changes
1.2	322
1.1	323	==== Edit History Details ====
1.2	324
1.1	325	What Gets Edited:
1.2	326
1.1	327	* Claims (20% edited): assertion, domain, status, scores, analysis
	328	* Evidence (10% edited): excerpt, relevance_score, supports
	329	* Scenarios (5% edited): description, assumptions, confidence
	330	* Sources: NOT versioned (continuous updates, not editorial decisions)
	331	Who Edits:
	332	* Contributors (rep sufficient): Corrections, additions
	333	* Trusted Contributors (rep sufficient): Major improvements, approvals
	334	* Moderators: Abuse handling, dispute resolution
	335	* System (AKEL): Re-analysis, automated improvements (user_id = NULL)
	336	Edit Types:
	337	* `CONTENT_CORRECTION`: User fixes factual error
	338	* `CLARIFICATION`: Improved wording
	339	* `SYSTEM_REANALYSIS`: AKEL re-processed claim
	340	* `MODERATION_ACTION`: Hide/unhide for abuse
	341	* `REVERT`: Rollback to previous version
	342	Retention Policy (5 years total):
1.2	343
1.1	344	1. Hot storage (first 3 months): PostgreSQL, instant access
	345	2. Warm storage (3 months to 2 years): Partitioned, slower queries
	346	3. Cold storage (years 2 to 5): S3 compressed, download required
	347	4. Deletion: After 5 years (except legal holds)
1.2	348	Storage per 1M claims: 400 MB (20% edited × 2 KB per edit)
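
A sketch of how an edit's age could map to a retention tier. It assumes the tiers are cumulative age boundaries (3 months, 2 years, 5 years) so that they add up to the stated 5-year total, approximates 3 months as 90 days, and assumes edits under legal hold stay in cold storage; the function name is illustrative.

```python
from datetime import date

HOT_DAYS = 90              # assumption: "3 months" approximated as 90 days
WARM_DAYS = 2 * 365        # warm tier ends at 2 years of age
RETENTION_DAYS = 5 * 365   # deletion after 5 years total


def storage_tier(edit_date, today, legal_hold=False):
    """Return the retention tier for an edit created on edit_date."""
    age_days = (today - edit_date).days
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    if age_days <= RETENTION_DAYS or legal_hold:
        return "cold"
    return "delete"
```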
1.1	349	Use Cases:
1.2	350
1.1	351	* View claim history timeline
	352	* Detect vandalism patterns
	353	* Learn from user corrections (system improvement)
	354	* Legal compliance (audit trail)
	355	* Rollback capability
	356	See Edit History Documentation for complete details on what gets edited by whom, retention policy, and use cases
1.2	357
1.1	358	=== 1.8 Flag ===
1.2	359
1.1	360	Fields: entity_id, reported_by, issue_type, status, resolution_note
1.2	361
1.1	362	=== 1.9 QualityMetric ===
1.2	363
1.1	364	Fields: metric_type, category, value, target, timestamp
	365	Purpose: Time-series quality tracking
	366	Usage:
1.2	367
1.1	368	* Continuous monitoring: Hourly calculation of error rates, confidence scores, processing times
	369	* Quality dashboard: Real-time display with trend charts
	370	* Alerting: Automatic alerts when metrics exceed thresholds
	371	* A/B testing: Compare control vs treatment metrics
	372	* Improvement validation: Measure before/after changes
	373	Example: `{metric_type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
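
The alerting rule can be sketched as a filter over metric records. This assumes "higher is worse" metrics such as error rates; metrics where a lower value is the problem (for example confidence scores) would need the comparison reversed. The function name is illustrative.

```python
def metrics_over_target(metrics):
    """Return the metric records whose value exceeds their target."""
    return [metric for metric in metrics if metric["value"] > metric["target"]]
```

Applied to the example record above (value 0.12 against a target of 0.10), the record would be returned and could trigger an alert.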
1.2	374
1.1	375	=== 1.10 ErrorPattern ===
1.2	376
1.1	377	Fields: error_category, claim_id, description, root_cause, frequency, status
	378	Purpose: Capture errors to trigger system improvements
	379	Usage:
1.2	380
1.1	381	* Error capture: When users flag issues or system detects problems
	382	* Pattern analysis: Weekly grouping by category and frequency
	383	* Improvement workflow: Analyze → Fix → Test → Deploy → Re-process → Monitor
	384	* Metrics: Track error rate reduction over time
	385	Example: `{error_category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
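
The weekly pattern analysis can be sketched as a grouping by category. The function name and the in-memory record shape are assumptions; a real implementation would more likely aggregate in the database.

```python
from collections import Counter


def top_error_categories(error_patterns, limit=3):
    """Sum frequencies per error_category and return the most frequent ones."""
    totals = Counter()
    for pattern in error_patterns:
        totals[pattern["error_category"]] += pattern["frequency"]
    return totals.most_common(limit)
```

The result is a list of `(category, total_frequency)` pairs, highest first, which gives the improvement workflow a natural priority order.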
	386
	387	=== 1.11 Core Data Model ERD ===
	388
1.4	389	{{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
1.1	390
	391	=== 1.12 User Class Diagram ===
1.2	392
1.5	393	{{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.User Class Diagram.WebHome"/}}
1.2	394
1.1	395	== 2. Versioning Strategy ==
1.2	396
1.1	397	All Content Entities Are Versioned:
1.2	398
1.1	399	* Claim: Every edit creates new version (V1→V2→V3...)
	400	* Evidence: Changes tracked in edit history
	401	* Scenario: Modifications versioned
	402	How Versioning Works:
	403	* Entity table stores current state only
	404	* Edit table stores all historical states (before_state, after_state as JSON)
	405	* Version number increments with each edit
	406	* Complete audit trail maintained for the full retention period (see Retention Policy in section 1.7)
	407	Unversioned Entities (current state only, no history):
	408	* Source: Track record continuously updated (not versioned history, just current score)
	409	* User: Account state (reputation accumulated, not versioned)
	410	* QualityMetric: Time-series data (each record is a point in time, not a version)
	411	* ErrorPattern: System improvement queue (status tracked, not versioned)
	412	Example:
	413	```
	414	Claim V1: "The sky is blue"
	415	→ User edits →
	416	Claim V2: "The sky is blue during daytime"
	417	→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
	418	```
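
The versioning mechanism can be sketched with plain dictionaries standing in for the Claim and Edit tables. The function name and the in-memory `edit_log` list are assumptions made for illustration; field names follow the Edit entity in section 1.7.

```python
import copy
from datetime import datetime, timezone


def apply_edit(claim, changes, user_id, edit_type, reason, edit_log):
    """Update the claim in place and record before/after states in edit_log."""
    before_state = copy.deepcopy(claim)
    claim.update(changes)
    claim["version"] += 1  # version number increments with each edit
    edit_log.append({
        "entity_type": "claim",
        "entity_id": claim["id"],
        "user_id": user_id,
        "before_state": before_state,
        "after_state": copy.deepcopy(claim),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return claim
```

The entity holds only the current state; a revert amounts to applying the stored `before_state` as a new edit of type `REVERT`.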
1.2	419
1.1	420	== 2.5. Storage vs Computation Strategy ==
1.2	421
1.1	422	Critical architectural decision: What to persist in databases vs compute dynamically?
	423	Trade-off:
1.2	424
1.1	425	* Store more: Better reproducibility, faster access, and lower LLM costs, at the price of higher storage/maintenance costs
	426	* Compute more: Lower storage/maintenance costs, at the price of slower responses, higher LLM costs, and less reproducibility
1.2	427
1.1	428	=== Recommendation: Hybrid Approach ===
1.2	429
1.1	430	STORE (in PostgreSQL):
1.2	431
1.1	432	==== Claims (Current State + History) ====
1.2	433
1.1	434	* What: assertion, domain, status, created_at, updated_at, version
	435	* Why: Core entity, must be persistent
	436	* Also store: confidence_score (computed once, then cached)
1.2	437	* Size: 1 KB per claim
1.1	438	* Growth: Linear with claims
	439	* Decision: ✅ STORE - Essential
1.2	440
1.1	441	==== Evidence (All Records) ====
1.2	442
1.1	443	* What: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
	444	* Why: Hard to re-gather, user contributions, reproducibility
1.2	445	* Size: 2 KB per evidence (with excerpt)
1.1	446	* Growth: 3-10 evidence per claim
	447	* Decision: ✅ STORE - Essential for reproducibility
1.2	448
1.1	449	==== Sources (Track Records) ====
1.2	450
1.1	451	* What: name, domain, track_record_score, accuracy_history, correction_frequency
	452	* Why: Continuously updated, expensive to recompute
1.2	453	* Size: 500 bytes per source
1.1	454	* Growth: Slow (limited number of sources)
	455	* Decision: ✅ STORE - Essential for quality
1.2	456
1.1	457	==== Edit History (All Versions) ====
1.2	458
1.1	459	* What: before_state, after_state, user_id, reason, timestamp
	460	* Why: Audit trail, legal requirement, reproducibility
1.2	461	* Size: 2 KB per edit
	462	* Growth: Linear with edits (about 20% of claims get edited, per section 1.7)
1.1	463	* Retention: Hot storage (first 3 months) → Warm storage (until 2 years) → Archive to S3 (until 5 years) → Delete after 5 years total
	464	* Decision: ✅ STORE - Essential for accountability
1.2	465
1.1	466	==== Flags (User Reports) ====
1.2	467
1.1	468	* What: entity_id, reported_by, issue_type, description, status
	469	* Why: Error detection, system improvement triggers
1.2	470	* Size: 500 bytes per flag
1.1	471	* Growth: A share of claims get flagged (the summary table below assumes 10%)
	472	* Decision: ✅ STORE - Essential for improvement
1.2	473
1.1	474	==== ErrorPatterns (System Improvement) ====
1.2	475
1.1	476	* What: error_category, claim_id, description, root_cause, frequency, status
	477	* Why: Learning loop, prevent recurring errors
1.2	478	* Size: 1 KB per pattern
1.1	479	* Growth: Slow (limited patterns, many fixed)
	480	* Decision: ✅ STORE - Essential for learning
1.2	481
1.1	482	==== QualityMetrics (Time Series) ====
1.2	483
1.1	484	* What: metric_type, category, value, target, timestamp
	485	* Why: Trend analysis, cannot recreate historical metrics
1.2	486	* Size: 200 bytes per metric
1.1	487	* Growth: Hourly = 8,760 per year per metric type
	488	* Retention: 2 years hot, then aggregate and archive
	489	* Decision: ✅ STORE - Essential for monitoring
	490	STORE (Computed Once, Then Cached):
1.2	491
1.1	492	==== Analysis Summary ====
1.2	493
1.1	494	* What: Neutral text summary of claim analysis (200-500 words)
	495	* Computed: Once by AKEL when claim first analyzed
	496	* Stored in: Claim table (text field)
	497	* Recomputed: Only when system significantly improves OR claim edited
	498	* Why store: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
1.2	499	* Size: 2 KB per claim
1.1	500	* Decision: ✅ STORE (cached) - Cost-effective
1.2	501
1.1	502	==== Confidence Score ====
1.2	503
1.1	504	* What: 0-100 score of analysis confidence
	505	* Computed: Once by AKEL
	506	* Stored in: Claim table (integer field)
	507	* Recomputed: When evidence added, source track record changes significantly, or system improves
	508	* Why store: Cheap to store, expensive to compute, users need it fast
	509	* Size: 4 bytes per claim
	510	* Decision: ✅ STORE (cached) - Performance critical
1.2	511
1.1	512	==== Risk Score ====
1.2	513
1.1	514	* What: 0-100 score of claim risk level
	515	* Computed: Once by AKEL
	516	* Stored in: Claim table (integer field)
	517	* Recomputed: When domain changes, evidence changes, or controversy detected
	518	* Why store: Same as confidence score
	519	* Size: 4 bytes per claim
	520	* Decision: ✅ STORE (cached) - Performance critical
	521	COMPUTE DYNAMICALLY (or store a cached copy where noted):
1.2	522
	523	==== Scenarios ====
	524
	525	⚠️ CRITICAL DECISION
	526
1.1	527	* What: 2-5 possible interpretations of claim with assumptions
	528	* Current design: Stored in Scenario table
	529	* Alternative: Compute on-demand when user views claim details
1.2	530	* Storage cost: 1 KB per scenario × 3 scenarios average = 3 KB per claim
1.1	531	* Compute cost: $0.005-0.01 per request (LLM API call)
1.2	532	* Frequency: Viewed in detail by 20% of users
1.1	533	* Trade-off analysis:
	534	** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
	535	** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
	536	* Reproducibility: Scenarios may improve as AI improves (good to recompute)
	537	* Speed: Computed = 5-8 seconds delay, Stored = instant
	538	* Decision: ✅ STORE (hybrid approach below)
	539	Scenario Strategy (APPROVED):
1.2	540
1.1	541	1. Store scenarios initially when claim analyzed
	542	2. Mark as stale when system improves significantly
	543	3. Recompute on next view if marked stale
	544	4. Cache for 30 days if frequently accessed
	545	5. Result: Best of both worlds - speed + freshness
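
The store, mark-stale, recompute-on-view cycle can be sketched as a small cache class. The class name and the `compute` callback (standing in for the LLM call) are assumptions, and the 30-day cache window from step 4 is left out for brevity.

```python
class ScenarioCache:
    """Keeps stored scenarios and recomputes them lazily when marked stale."""

    def __init__(self, compute):
        self._compute = compute  # callable: claim_id -> list of scenarios
        self._stored = {}
        self._stale = set()

    def get(self, claim_id):
        """Return stored scenarios, recomputing only if missing or stale."""
        if claim_id not in self._stored or claim_id in self._stale:
            self._stored[claim_id] = self._compute(claim_id)
            self._stale.discard(claim_id)
        return self._stored[claim_id]

    def mark_stale(self, claim_ids):
        """Flag claims for recomputation on their next view."""
        self._stale.update(claim_ids)
```

Marking claims stale costs nothing up front: the expensive call happens only for claims that are actually viewed again.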
1.2	546
	547	==== Verdict Synthesis ====
	548
	549
	550
1.1	551	* What: Final conclusion text synthesizing all scenarios
	552	* Compute cost: $0.002-0.005 per request
	553	* Frequency: Every time claim viewed
	554	* Why not store: Changes as evidence/scenarios change, users want fresh analysis
	555	* Speed: 2-3 seconds (acceptable)
	556	* Alternative: Store "last verdict" as cached field, recompute only if claim edited or marked stale
	557	* Recommendation: ✅ STORE cached version, mark stale when changes occur
1.2	558
1.1	559	==== Search Results ====
1.2	560
1.1	561	* What: Lists of claims matching search query
	562	* Compute from: Elasticsearch index
	563	* Cache: 15 minutes in Redis for popular queries
	564	* Why not store permanently: Constantly changing, infinite possible queries
1.2	565
1.1	566	==== Aggregated Statistics ====
1.2	567
1.1	568	* What: "Total claims: 1,234,567", "Average confidence: 78%", etc.
	569	* Compute from: Database queries
	570	* Cache: 1 hour in Redis
	571	* Why not store: Can be derived, relatively cheap to compute
1.2	572
1.1	573	==== User Reputation ====
1.2	574
1.1	575	* What: Score based on contributions
	576	* Current design: Stored in User table
	577	* Alternative: Compute from Edit table
	578	* Trade-off:
	579	** Stored: Fast, simple
	580	** Computed: Always accurate, no denormalization
	581	* Frequency: Read on every user action
	582	* Compute cost: Simple COUNT query, milliseconds
	583	* Decision: ✅ STORE - Performance critical, read-heavy
1.2	584
1.1	585	=== Summary Table ===
1.2	586
	587	|= Data Type |= Storage |= Compute |= Size per Claim |= Decision |= Rationale
1.2	589	| Claim core | ✅ | - | 1 KB | STORE | Essential
	590	| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility
	591	| Sources | ✅ | - | 500 B (shared) | STORE | Track record
	592	| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit
	593	| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective
	594	| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access
	595	| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access
	596	| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed
	597	| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access
	598	| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement
	599	| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning
	600	| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending
	601	| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic
1.1	602	| Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable
1.2	603	Total storage per claim: 18 KB (without edits and flags)
1.1	604	For 1 million claims:
1.2	605
	606	* Storage: 18 GB (manageable)
	607	* PostgreSQL: $50/month (standard instance)
	608	* Redis cache: $20/month (1 GB instance)
	609	* S3 archives: $5/month (old edits)
	610	* Total: $75/month infrastructure
1.1	611	LLM cost savings by caching:
	612	* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
	613	* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
	614	* Verdict stored: Save $0.003 per claim = $3K per 1M claims
1.2	615	* Total savings: $35K per 1M claims vs recomputing every time
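
The figures above can be re-derived with a few lines of arithmetic. The variable names are illustrative; all numbers are the estimates quoted in this section.

```python
CLAIMS = 1_000_000
DETAIL_VIEW_RATE = 0.20  # share of claims whose scenarios are viewed in detail

# LLM cost avoided by storing results instead of recomputing them (USD).
llm_savings = {
    "analysis_summary": 0.03 * CLAIMS,
    "scenarios": 0.01 * DETAIL_VIEW_RATE * CLAIMS,
    "verdict": 0.003 * CLAIMS,
}
total_llm_savings = sum(llm_savings.values())

# Monthly infrastructure estimate (USD).
monthly_infrastructure = {"postgresql": 50, "redis": 20, "s3_archive": 5}
total_monthly_infrastructure = sum(monthly_infrastructure.values())

print(f"LLM savings per 1M claims: ${total_llm_savings:,.0f}")
print(f"Infrastructure per month: ${total_monthly_infrastructure}")
```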
	616
1.1	617	=== Recomputation Triggers ===
1.2	618
1.1	619	When to mark cached data as stale and recompute:
1.2	620
1.1	621	1. User edits claim → Recompute: all (analysis, scenarios, verdict, scores)
	622	2. Evidence added → Recompute: scenarios, verdict, confidence score
	623	3. Source track record changes >10 points → Recompute: confidence score, verdict
	624	4. System improvement deployed → Mark affected claims stale, recompute on next view
	625	5. Controversy detected (high flag rate) → Recompute: risk score
	626	Recomputation strategy:
1.2	627
1.1	628	* Eager: Immediately recompute (for user edits)
	629	* Lazy: Recompute on next view (for system improvements)
	630	* Batch: Nightly re-evaluation of stale claims (if <1000)
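
The trigger list can be sketched as a lookup from event to the cached fields that need recomputing. Event names and the function name are hypothetical; "scores" in trigger 1 is read here as confidence and risk score, and trigger 4 (system improvement) is omitted because it only marks claims stale for lazy recomputation.

```python
# Event names are hypothetical; field lists follow triggers 1, 2, 3 and 5 above.
RECOMPUTE_RULES = {
    "claim_edited": {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "evidence_added": {"scenarios", "verdict", "confidence_score"},
    "source_score_changed": {"confidence_score", "verdict"},
    "controversy_detected": {"risk_score"},
}
SOURCE_CHANGE_THRESHOLD = 10  # points on the 0-100 track record scale


def fields_to_recompute(event, old_score=None, new_score=None):
    """Return the cached fields that the given event invalidates."""
    if event == "source_score_changed":
        # Small score movements do not invalidate anything.
        if abs(new_score - old_score) <= SOURCE_CHANGE_THRESHOLD:
            return set()
    return set(RECOMPUTE_RULES.get(event, set()))
```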
1.2	631
1.1	632	=== Database Size Projection ===
1.2	633
1.1	634	Year 1: 10K claims
1.2	635
1.1	636	* Storage: 180 MB
	637	* Cost: $10/month
	638	Year 3: 100K claims
	639	* Storage: 1.8 GB
	640	* Cost: $30/month
	641	Year 5: 1M claims
	642	* Storage: 18 GB
	643	* Cost: $75/month
	644	Year 10: 10M claims
	645	* Storage: 180 GB
	646	* Cost: $300/month
	647	* Optimization: Archive old claims to S3 ($5/TB/month)
	648	Conclusion: Storage costs are manageable, LLM cost savings are substantial.
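
The storage projections follow directly from the per-claim estimate. A minimal sketch, assuming the 18 KB per claim figure from the summary above and decimal units (1 GB = 1,000,000 KB); the function name is illustrative.

```python
KB_PER_CLAIM = 18  # estimate from the summary table (without edits and flags)


def projected_storage_gb(claim_count, kb_per_claim=KB_PER_CLAIM):
    """Return the projected storage in GB (decimal units)."""
    return claim_count * kb_per_claim / 1_000_000


for year, claim_count in [(1, 10_000), (3, 100_000), (5, 1_000_000), (10, 10_000_000)]:
    print(f"Year {year}: {claim_count:,} claims, {projected_storage_gb(claim_count):g} GB")
```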
1.2	649
1.1	650	== 3. Key Simplifications ==
1.2	651
1.1	652	* Two content states only: Published, Hidden
	653	* Three user roles only: Reader, Contributor, Moderator
	654	* No complex versioning: Linear edit history
	655	* Reputation-based permissions: Not role hierarchy
	656	* Source track records: Continuous evaluation
1.2	657
1.1	658	== 3. What Gets Stored in the Database ==
1.2	659
1.1	660	=== 3.1 Primary Storage (PostgreSQL) ===
1.2	661
1.1	662	Claims Table:
1.2	663
1.1	664	* Current state only (latest version)
	665	* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
	666	Evidence Table:
	667	* All evidence records
	668	* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
	669	Scenario Table:
	670	* All scenarios for each claim
	671	* Fields: id, claim_id, description, assumptions (JSONB), confidence, created_by, created_at
	672	Source Table:
	673	* Track record database (continuously updated)
	674	* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
	675	User Table:
	676	* All user accounts
	677	* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
	678	Edit Table:
	679	* Complete version history
	680	* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
	681	Flag Table:
	682	* User-reported issues
	683	* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
	684	ErrorPattern Table:
	685	* System improvement queue
	686	* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
	687	QualityMetric Table:
	688	* Time-series quality data
	689	* Fields: id, metric_type, metric_category, value, target, timestamp
1.2	690
1.1	691	=== 3.2 What's NOT Stored (Computed on-the-fly) ===
1.2	692
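1.1	693	* Verdict synthesis text: Synthesized from evidence + scenarios; the last result is kept as a cached field and recomputed when marked stale (see section 2.5)
	694	* Risk scores: Recalculated when the underlying factors change; the latest value is cached on the claim (see section 2.5)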
	695	* Aggregated statistics: Computed from base data
	696	* Search results: Generated from Elasticsearch index
1.2	697
1.1	698	=== 3.3 Cache Layer (Redis) ===
	699
	700	{{warning}}
	701	Implementation Status: Redis caching is NOT YET IMPLEMENTED. Current implementation has no caching layer.
	702	{{/warning}}
	703
	704	Cached for performance (Planned):
1.2	705
1.1	706	* Frequently accessed claims (TTL: 1 hour)
	707	* Search results (TTL: 15 minutes)
	708	* User sessions (TTL: 24 hours)
	709	* Source track records (TTL: 1 hour)
1.2	710
1.1	711	=== 3.4 File Storage (S3) ===
1.2	712
1.1	713	Archived content:
1.2	714
1.1	715	* Old edit history (>3 months)
	716	* Evidence documents (archived copies)
	717	* Database backups
	718	* Export files
1.2	719
1.1	720	=== 3.5 Search Index (Elasticsearch) ===
1.2	721
1.1	722	Indexed for search:
1.2	723
1.1	724	* Claim assertions (full-text)
	725	* Evidence excerpts (full-text)
	726	* Scenario descriptions (full-text)
	727	* Source names (autocomplete)
	728	Synchronized from PostgreSQL via change data capture or periodic sync.
1.2	729
1.1	730	== 4. Related Pages ==
1.2	731
	732	* [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
1.6	733	* [[Requirements>>Archive.FactHarbor 2026\.02\.08.Specification.Requirements.WebHome]]
1.7	734	* [[Workflows>>Archive.FactHarbor 2026\.02\.08.Specification.Workflows.WebHome]]