= Data Model =

FactHarbor's data model is **simple, focused, and designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.

**Additional cached fields in the claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
* Avoids joining the evidence table for listing/preview
* Updated when evidence is added/removed
* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
* Avoids joining through evidence to sources
* Updated when sources change
* Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
* Quick metric without counting rows
* Updated when scenarios are added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
* Helps invalidate stale caches
* Triggers a background refresh if too old

**Update Strategy** (a refresh sketch follows the trade-offs list below):
* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)
* **Invalidation**: Clear the cache when source data changes significantly

**Trade-offs**:
* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (10%)
* ⚠️ Caches must be kept in sync
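
The sketch below illustrates the deferred refresh path described above. It is a minimal, illustrative example only: it assumes psycopg2 and the table names //claims//, //evidence//, //sources//, and //scenarios//, which are not fixed by this specification.

{{code language="python"}}
# Illustrative sketch of the deferred cache refresh (table and column names are assumptions).
from datetime import datetime, timezone

import psycopg2
from psycopg2.extras import Json

def refresh_claim_cache(conn, claim_id):
    """Rebuild the denormalized fields of one claim from its source rows."""
    with conn.cursor() as cur:
        # Top 5 evidence snippets by relevance, plus their source names
        cur.execute(
            """SELECT e.excerpt, s.name, e.relevance_score
                 FROM evidence e JOIN sources s ON s.id = e.source_id
                WHERE e.claim_id = %s
                ORDER BY e.relevance_score DESC
                LIMIT 5""",
            (claim_id,),
        )
        rows = cur.fetchall()
        summary = [{"text": t, "source": n, "relevance": r} for t, n, r in rows]
        names = sorted({n for _, n, _ in rows})

        cur.execute("SELECT count(*) FROM scenarios WHERE claim_id = %s", (claim_id,))
        scenario_count = cur.fetchone()[0]

        cur.execute(
            """UPDATE claims
                  SET evidence_summary = %s, source_names = %s,
                      scenario_count = %s, cache_updated_at = %s
                WHERE id = %s""",
            (Json(summary), names, scenario_count, datetime.now(timezone.utc), claim_id),
        )
    conn.commit()
{{/code}}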

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track the reliability of information sources over time

**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")
* **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
* **track_record_score** (0-100): Overall reliability score
* **accuracy_history** (JSON): Historical accuracy data
* **correction_frequency** (float): How often the source publishes corrections
* **last_updated** (timestamp): When the track record was last recalculated

**How It Works** (a scoring sketch follows below):
* Initial score is based on source type (70 for academic journals, 30 for unknown)
* Updated daily by a background scheduler
* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
* Track Record Check in the AKEL pipeline: adjusts evidence confidence based on source quality
* Quality thresholds: 90+ = Exceptional, 70-89 = Reliable, 50-69 = Acceptable, 30-49 = Questionable, <30 = Unreliable

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage.
Fields (summary): id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
**Key**: Automated source reliability tracking
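
A minimal sketch of the weighted formula and quality thresholds above. The component inputs (each assumed to be normalized to a 0-100 scale) and the example values are illustrative, not part of the specification.

{{code language="python"}}
# Illustrative only: combines the weighted components listed above.
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """components: dict of 0-100 values keyed as in WEIGHTS."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)

def quality_band(score):
    """Map a 0-100 score to the thresholds listed above."""
    if score >= 90: return "Exceptional"
    if score >= 70: return "Reliable"
    if score >= 50: return "Acceptable"
    if score >= 30: return "Questionable"
    return "Unreliable"

# Example: a well-established outlet
components = {"accuracy_rate": 92, "correction_policy": 85, "editorial_standards": 80,
              "bias_transparency": 70, "longevity": 95}
score = track_record_score(components)   # 86.75
print(score, quality_band(score))        # 86.75 Reliable
{{/code}}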

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.

**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But direct feedback creates a circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:
{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability
    Never triggered by individual claim analysis
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores
    READ source scores, never UPDATE them
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:

* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases
* Make manual adjustments if needed

**Key Principles**:

✅ **Scoring and analysis are temporally separated**
* Source scoring: Weekly batch job
* Claim analysis: Real-time processing
* Never update scores during analysis

✅ **One-way data flow during processing**
* Claims READ source scores
* Claims NEVER WRITE source scores
* Updates happen in the background only

✅ **Predictable update cycle**
* Sources update every Sunday at 2 AM UTC
* Claims always use last week's scores
* No mid-week score changes

✅ **Audit trail**
* Log all score changes
* Track score history
* Explainable calculations

**Benefits**:

* No circular dependencies
* Predictable behavior
* Easier to reason about
* Simpler testing
* Clear audit trail

**Example Timeline**:

{{code}}
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
{{/code}}

=== 1.4 Scenario ===

**Purpose**: Different interpretations or contexts for evaluating claims

**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.

**Relationship**: One-to-many with Claims (**simplified for V1.0**: a scenario belongs to a single claim)

**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to the evidence this scenario was extracted from
* **created_at** (timestamp): When the scenario was created
* **updated_at** (timestamp): Last modification

**How Found**: Evidence search → Extract context → Create scenario → Link to claim

**Example** (a data sketch follows the list):
For the claim "Vaccines reduce hospitalization":
* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from a trial paper
* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from a specialist study
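
A minimal data sketch of Scenario 1 from the example above, using a plain dataclass with the fields listed in this section; the concrete assumption keys and UUIDs are illustrative, not prescribed.

{{code language="python"}}
# Illustrative sketch; the assumption keys and UUID values are examples only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4

@dataclass
class Scenario:
    claim_id: UUID                      # FK to the claim being evaluated
    description: str                    # human-readable context
    assumptions: dict                   # JSONB: key assumptions defining the context
    extracted_from: UUID                # FK to the evidence it was extracted from
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Scenario 1 for "Vaccines reduce hospitalization"
scenario_1 = Scenario(
    claim_id=uuid4(),                   # would be the real claim id
    description="Clinical trials (healthy adults 18-65, original strain)",
    assumptions={"population": "healthy adults 18-65",
                 "variant": "original strain",
                 "setting": "randomized controlled trial"},
    extracted_from=uuid4(),             # would be the trial-paper evidence id
)
{{/code}}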
**V2.0 Evolution**: A many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:

* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When the verdict was created
* **updated_at** (timestamp): Last modification

**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.

**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.

**Example** (an update sketch follows below):
For the claim "Exercise improves mental health" in the scenario "Clinical trials (healthy adults, structured programs)":

* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* The Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
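
A minimal sketch of the update-and-log flow described above. The dict-based records and the `save_edit` helper are assumptions for illustration, not the production API.

{{code language="python"}}
# Illustrative sketch; record shapes and the save_edit helper are assumptions.
import json
from datetime import datetime, timezone

def update_verdict(verdict, changes, user_id, reason, edit_type, save_edit):
    """Apply changes to a verdict dict and log before/after states via the Edit entity."""
    before = dict(verdict)                       # snapshot of the current state
    verdict.update(changes)
    verdict["updated_at"] = datetime.now(timezone.utc).isoformat()
    save_edit({
        "entity_type": "Verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,                      # None for system (AKEL) edits
        "before_state": json.dumps(before),
        "after_state": json.dumps(verdict),
        "edit_type": edit_type,                  # e.g. SYSTEM_REANALYSIS (see section 1.7)
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return verdict

# Example: the evidence update from the bullets above
# update_verdict(v, {"likelihood_range": "0.70-0.85 (likely true)",
#                    "uncertainty_factors": ["Lifestyle confounds remain"]},
#                user_id=None, reason="New evidence added",
#                edit_type="SYSTEM_REANALYSIS", save_edit=print)
{{/code}}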

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

=== User Reputation System ===

**V1.0 Approach**: Simple manual role assignment

**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove it necessary.

=== Roles (Manual Assignment) ===

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

**contributor**:
* Submit new claims
* Suggest edits to existing content
* Add evidence
* Requires manual promotion by a moderator/admin

**moderator**:
* Approve/reject contributor suggestions
* Flag inappropriate content
* Handle abuse reports
* Assigned by admins based on trust

**admin**:
* Manage users and roles
* System configuration
* Access to all features
* Founder-appointed initially

=== Contribution Tracking (Simple) ===

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

**No complex calculations**:
* No point systems
* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

=== Promotion Process ===

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews the user's contribution history
3. Moderator promotes the user to the contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

=== V2.0+ Evolution ===

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes a bottleneck
* Clear patterns of abuse emerge requiring automation

**Future features may include**:
* Automated point calculations
* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes

=== Edit History Details ===

**What Gets Edited**:

* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence
* **Sources**: NOT versioned (continuous updates, not editorial decisions)

**Who Edits**:
* **Contributors** (sufficient reputation): Corrections, additions
* **Trusted Contributors** (sufficient reputation): Major improvements, approvals
* **Moderators**: Abuse handling, dispute resolution
* **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)

**Edit Types**:
* `CONTENT_CORRECTION`: User fixes a factual error
* `CLARIFICATION`: Improved wording
* `SYSTEM_REANALYSIS`: AKEL re-processed the claim
* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to a previous version

**Retention Policy** (5 years total; an archival sketch appears at the end of this section):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)

**Use Cases**:

* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, the retention policy, and use cases.
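
A minimal sketch of the hot → warm → cold rotation above, assuming psycopg2 and boto3; the table name //edits//, the bucket name, and the ~2.25-year cutoff (3 months hot + 2 years warm) are assumptions, not part of the specification.

{{code language="python"}}
# Illustrative archival job for the retention tiers above (names are assumptions).
import gzip
import json
from datetime import datetime, timedelta, timezone

import boto3
import psycopg2

def archive_cold_edits(conn, bucket="factharbor-edit-archive"):
    """Move edits older than ~3 months hot + 2 years warm from PostgreSQL to S3 (compressed)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=820)
    with conn.cursor() as cur:
        cur.execute("SELECT row_to_json(e) FROM edits e WHERE created_at < %s", (cutoff,))
        rows = [r[0] for r in cur.fetchall()]
        if not rows:
            return 0
        body = gzip.compress(json.dumps(rows, default=str).encode())
        key = f"edits/{cutoff.date()}.json.gz"
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
        cur.execute("DELETE FROM edits WHERE created_at < %s", (cutoff,))
    conn.commit()
    return len(rows)
{{/code}}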

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking

**Usage** (a recording sketch follows below):

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs. treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
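
A minimal sketch of hourly metric recording with threshold alerting. The `db.insert` wrapper and `notify` callback are assumptions; it also assumes "higher than target" is the alert condition, as in the ErrorRate example above.

{{code language="python"}}
# Illustrative sketch; db.insert and notify are assumed helpers.
from datetime import datetime, timezone

def record_metric(db, notify, metric_type, category, value, target):
    """Insert one QualityMetric row and alert when the value breaches its target."""
    db.insert("quality_metrics", {
        "metric_type": metric_type,
        "category": category,
        "value": value,
        "target": target,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    # For error-rate style metrics, a value above target is a breach.
    if value > target:
        notify(f"{metric_type}/{category}: {value:.2f} exceeds target {target:.2f}")

# The ErrorRate example above (0.12 vs target 0.10) would trigger an alert.
{{/code}}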

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements

**Usage**:

* **Error capture**: When users flag issues or the system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error-rate reduction over time

**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.11 Core Data Model ERD ==

{{include reference="Test.FactHarbor V0\.9\.100.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.12 User Class Diagram ==

{{include reference="Test.FactHarbor V0\.9\.100.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates a new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

**How Versioning Works**:
* The entity table stores the **current state only**
* The Edit table stores **all historical states** (before_state, after_state as JSON)
* The version number increments with each edit
* Complete audit trail maintained forever

**Unversioned Entities** (current state only, no history):
* **Source**: Track record continuously updated (no versioned history, just the current score)
* **User**: Account state (reputation accumulated, not versioned)
* **QualityMetric**: Time-series data (each record is a point in time, not a version)
* **ErrorPattern**: System improvement queue (status tracked, not versioned)

**Example**:

{{code}}
Claim V1: "The sky is blue"
  → User edits →
Claim V2: "The sky is blue during daytime"
  → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
{{/code}}

== 2.5 Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs. compute dynamically?

**Trade-off**:

* **Store more**: Better reproducibility, faster, lower LLM costs | higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: 1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: 2 KB per evidence record (with excerpt)
* **Growth**: 3-10 evidence records per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: 500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: 2 KB per edit
* **Growth**: Linear with edits (about 20% of claims get edited)
* **Retention**: Hot storage 3 months → warm storage 2 years → archive to S3 3 years → delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: 500 bytes per flag
* **Growth**: A small percentage of claims get flagged (the summary table assumes 10%)
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: 1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: 200 bytes per metric
* **Growth**: Hourly = 8,760 records per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of the claim analysis (200-500 words)
* **Computed**: Once by AKEL when the claim is first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when the system significantly improves OR the claim is edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: 2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When evidence is added, the source track record changes significantly, or the system improves
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When the domain changes, evidence changes, or controversy is detected
* **Why store**: Same as confidence score
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios ====

⚠️ CRITICAL DECISION

* **What**: 2-5 possible interpretations of a claim, with assumptions
* **Current design**: Stored in the Scenario table
* **Alternative**: Compute on demand when a user views claim details
* **Storage cost**: 1 KB per scenario × 3 scenarios on average = 3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by 20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 second delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED; read-path sketch below):

1. **Store scenarios** initially when the claim is analyzed
2. **Mark as stale** when the system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness
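
A minimal sketch of the approved read path: return stored scenarios unless they are marked stale (or the 30-day window has passed), in which case regenerate and persist them. The loader/regenerator/saver helpers are assumptions.

{{code language="python"}}
# Illustrative sketch; loader/regenerator/saver are assumed helpers.
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)

def get_scenarios(claim_id, load_scenarios, regenerate_scenarios, save_scenarios):
    """Return stored scenarios, regenerating them only when stale or expired."""
    scenarios, is_stale, cached_at = load_scenarios(claim_id)
    expired = cached_at is None or datetime.now(timezone.utc) - cached_at > CACHE_TTL
    if is_stale or expired:
        scenarios = regenerate_scenarios(claim_id)   # AKEL / LLM call
        save_scenarios(claim_id, scenarios)          # persist fresh versions, clear stale flag
    return scenarios
{{/code}}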

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time a claim is viewed
* **Why not store**: Changes as evidence/scenarios change; users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)

**Alternative**: Store the "last verdict" as a cached field, recompute only if the claim is edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching a search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in the User table
* **Alternative**: Compute from the Edit table
* **Trade-off**:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15 min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1 hr cache|Derivable

**Total storage per claim**: 18 KB (without edits and flags)

**For 1 million claims**:

* **Storage**: 18 GB (manageable)
* **PostgreSQL**: $50/month (standard instance)
* **Redis cache**: $20/month (1 GB instance)
* **S3 archives**: $5/month (old edits)
* **Total**: $75/month infrastructure

**LLM cost savings by caching**:
* Analysis summary stored: saves $0.03 per claim = $30K per 1M claims
* Scenarios stored: saves $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: saves $0.003 per claim = $3K per 1M claims
* **Total savings**: $35K per 1M claims vs. recomputing every time

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute** (see the sketch after this list):

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:

* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)
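
A minimal sketch mapping the triggers above to the eager/lazy strategies. The event names and the `mark_stale`/`recompute_now` helpers are assumptions, not a fixed API.

{{code language="python"}}
# Illustrative sketch of the trigger table above (event names and helpers are assumptions).
STALE_TARGETS = {
    "claim_edited":         {"analysis", "scenarios", "verdict", "confidence", "risk"},
    "evidence_added":       {"scenarios", "verdict", "confidence"},
    "source_score_shift":   {"confidence", "verdict"},   # track record moved >10 points
    "system_improvement":   {"analysis", "scenarios", "verdict"},
    "controversy_detected": {"risk"},
}

def handle_event(event, claim_id, mark_stale, recompute_now):
    """Eager recomputation for user edits, lazy (mark stale) for everything else."""
    targets = STALE_TARGETS.get(event, set())
    if event == "claim_edited":
        recompute_now(claim_id, targets)      # eager path
    else:
        mark_stale(claim_id, targets)         # lazy: recomputed on next view or nightly batch
{{/code}}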

=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims
* Storage: 1.8 GB
* Cost: $30/month

**Year 5**: 1M claims
* Storage: 18 GB
* Cost: $75/month

**Year 10**: 10M claims
* Storage: 180 GB
* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable; LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 4. What Gets Stored in the Database ==

=== 4.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:
* All evidence records
* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived

**Scenario Table**:
* All scenarios for each claim
* Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at

**Source Table**:
* Track record database (continuously updated)
* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count

**User Table**:
* All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted

**Edit Table**:
* Complete version history
* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at

**Flag Table**:
* User-reported issues
* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at

**ErrorPattern Table**:
* System improvement queue
* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at

**QualityMetric Table**:
* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

=== 4.2 What's NOT Stored (Computed on-the-fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested (a cached copy is kept and marked stale; see section 2.5)
* **Risk scores**: Recalculated when recomputation triggers fire (cached in the Claim table; see section 2.5)
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from the Elasticsearch index

=== 4.3 Cache Layer (Redis) ===

**Cached for performance** (a sketch follows below):

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)
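
A minimal sketch of the TTLs above using redis-py; the key naming scheme and the `loader` callback are assumptions.

{{code language="python"}}
# Illustrative sketch of the Redis TTLs above (key names are assumptions).
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

TTLS = {
    "claim": 3600,        # frequently accessed claims: 1 hour
    "search": 900,        # search results: 15 minutes
    "session": 86400,     # user sessions: 24 hours
    "source": 3600,       # source track records: 1 hour
}

def cache_get_or_load(kind, key, loader):
    """Return a cached JSON value, loading and caching it on a miss."""
    redis_key = f"{kind}:{key}"
    cached = r.get(redis_key)
    if cached is not None:
        return json.loads(cached)
    value = loader(key)                      # e.g., read from PostgreSQL
    r.setex(redis_key, TTLS[kind], json.dumps(value, default=str))
    return value
{{/code}}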

=== 4.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

=== 4.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync (a sync sketch follows below).
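
A minimal sketch of the periodic-sync option, assuming psycopg2 and the official Elasticsearch Python client; the index and column names are illustrative, and change data capture would replace this approach in production.

{{code language="python"}}
# Illustrative periodic sync from PostgreSQL to Elasticsearch (names are assumptions).
import psycopg2
from elasticsearch import Elasticsearch, helpers

def sync_claims(pg_dsn, es_url, since):
    """Re-index claims changed since the last sync timestamp."""
    es = Elasticsearch(es_url)
    with psycopg2.connect(pg_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, assertion, domain, updated_at FROM claims WHERE updated_at > %s",
            (since,),
        )
        actions = (
            {
                "_index": "claims",
                "_id": str(claim_id),
                "_source": {"assertion": assertion, "domain": domain,
                            "updated_at": updated_at.isoformat()},
            }
            for claim_id, assertion, domain, updated_at in cur
        )
        helpers.bulk(es, actions)
{{/code}}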

== 5. Related Pages ==

* [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor V0\.9\.100.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor V0\.9\.100.Specification.Workflows.WebHome]]