= Data Model =
FactHarbor's data model is **simple, focused, and designed for automated processing**.
== 1. Core Entities ==
=== 1.1 Claim ===
Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
==== Performance Optimization: Denormalized Fields ====
**Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
**Additional cached fields in claims table**:
* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
* Avoids joining evidence table for listing/preview
* Updated when evidence is added/removed
* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
* Avoids joining through evidence to sources
* Updated when sources change
* Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
* Quick metric without counting rows
* Updated when scenarios added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
* Helps invalidate stale caches
* Triggers background refresh if too old
**Update Strategy**:
* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical; sketched below)
* **Invalidation**: Clear cache when source data changes significantly
**Trade-offs**:
* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Need to keep caches in sync
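A minimal sketch of the deferred refresh path, assuming hypothetical helpers (`fetch_top_evidence`, `fetch_source_names`, `count_scenarios`) and an ORM-style `claim` object; the production update logic may differ:
{{code language="python"}}
from datetime import datetime, timedelta

CACHE_MAX_AGE = timedelta(hours=1)  # matches the hourly background job

def refresh_claim_cache(claim):
    """Rebuild the denormalized fields on a claim row."""
    # Top 5 evidence snippets with relevance scores (JSONB column)
    claim.evidence_summary = fetch_top_evidence(claim.id, limit=5)
    # Flat array of source names for quick display (TEXT[] column)
    claim.source_names = fetch_source_names(claim.id)
    # Cheap counter instead of COUNT(*) at read time
    claim.scenario_count = count_scenarios(claim.id)
    claim.cache_updated_at = datetime.utcnow()
    claim.save()

def refresh_stale_caches(claims):
    """Background job: refresh caches older than the maximum age."""
    now = datetime.utcnow()
    for claim in claims:
        if claim.cache_updated_at is None or now - claim.cache_updated_at > CACHE_MAX_AGE:
            refresh_claim_cache(claim)
{{/code}}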
=== 1.2 Evidence ===
Fields: claim_id, source_id, excerpt, url, relevance_score, supports
=== 1.3 Source ===
**Purpose**: Track reliability of information sources over time
**Fields**:
* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")
* **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
* **track_record_score** (0-100): Overall reliability score
* **accuracy_history** (JSON): Historical accuracy data
* **correction_frequency** (float): How often source publishes corrections
* **last_updated** (timestamp): When track record last recalculated
**How It Works**:
* Initial score based on source type (70 for academic journals, 30 for unknown)
* Updated daily by background scheduler
* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%) (see the sketch after this list)
* Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
* Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
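A minimal sketch of the weighting formula above, assuming each component is already normalized to a 0-100 value; how the individual components are measured is out of scope here:
{{code language="python"}}
TRACK_RECORD_WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """Weighted 0-100 reliability score from per-component 0-100 values."""
    return sum(components[name] * weight for name, weight in TRACK_RECORD_WEIGHTS.items())

def quality_band(score):
    """Map a score to the quality thresholds used by the Track Record Check."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
{{/code}}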
**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage.
**Key**: Automated source reliability tracking
==== Source Scoring Process (Separation of Concerns) ====
**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
**The Problem**:
* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates circular dependency and potential feedback loops
**The Solution**: Temporal separation
==== Weekly Background Job (Source Scoring) ====
Runs independently of claim analysis:
{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}
==== Real-Time Claim Analysis (AKEL) ====
Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence (track_record_score is 0-100, so normalize)
        evidence.weighted_relevance = evidence.relevance * (source_score / 100)
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
{{/code}}
==== Monthly Audit (Quality Assurance) ====
Moderator review of flagged source scores:
* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases
* Manual adjustments if needed
**Key Principles**:
✅ **Scoring and analysis are temporally separated**
* Source scoring: Weekly batch job
* Claim analysis: Real-time processing
* Never update scores during analysis
✅ **One-way data flow during processing**
* Claims READ source scores
* Claims NEVER WRITE source scores
* Updates happen in background only
✅ **Predictable update cycle**
* Sources update every Sunday 2 AM
* Claims always use last week's scores
* No mid-week score changes
✅ **Audit trail**
* Log all score changes
* Track score history
* Explainable calculations
**Benefits**:
* No circular dependencies
* Predictable behavior
* Easier to reason about
* Simpler testing
* Clear audit trail
**Example Timeline**:
```
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 87 (up from 85)
  → Blog X score: 52 (down from 61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=87
  → All claims this week use Blog X=52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 89 (trending up)
  → Blog X score: 48 (trending down)
```
=== 1.4 Scenario ===
**Purpose**: Different interpretations or contexts for evaluating claims
**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
**Relationship**: One-to-many from Claim to Scenario (**simplified for V1.0**: each scenario belongs to a single claim)
**Fields**:
* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
* **created_at** (timestamp): When scenario was created
* **updated_at** (timestamp): Last modification
**How Found**: Evidence search → Extract context → Create scenario → Link to claim (sketched below)
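A minimal sketch of that pipeline, assuming hypothetical helpers (`search_evidence`, `extract_context`); the real extraction happens inside the AKEL pipeline and may differ:
{{code language="python"}}
import uuid
from datetime import datetime

def extract_scenarios(claim):
    """Evidence search -> extract context -> create scenario -> link to claim."""
    scenarios = []
    for evidence in search_evidence(claim.assertion):
        # Pull the context that defines this scenario (population, conditions, ...)
        context = extract_context(evidence)
        scenarios.append({
            "id": str(uuid.uuid4()),
            "claim_id": claim.id,                   # one-to-many: belongs to one claim
            "description": context["description"],
            "assumptions": context["assumptions"],  # stored as JSONB
            "extracted_from": evidence.id,          # evidence this scenario came from
            "created_at": datetime.utcnow(),
        })
    return scenarios
{{/code}}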
**Example**:
For claim "Vaccines reduce hospitalization":
* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from specialist study
**V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:
* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **created_at** (timestamp): When verdict was first created

**Versioned via VERDICT_VERSION**: Verdicts evolve as new evidence emerges or analysis improves. Each version captures:
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When this version was generated

**Relationship**: Each Scenario has multiple Verdicts over time (as understanding evolves). Each Verdict has multiple versions.

**Example**:
For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
* Initial verdict (v1): likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* Updated verdict (v2): likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]

**Key Design**: Separating Verdict from Scenario allows tracking how our understanding evolves without losing history.
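A minimal sketch of appending a new verdict version without losing history, assuming hypothetical `verdict` and version records shaped like the fields above:
{{code language="python"}}
from datetime import datetime

def add_verdict_version(verdict, likelihood_range, confidence,
                        explanation_summary, uncertainty_factors):
    """Append a new VERDICT_VERSION entry; earlier versions stay untouched."""
    version = {
        "verdict_id": verdict.id,
        "version": len(verdict.versions) + 1,        # v1, v2, v3, ...
        "likelihood_range": likelihood_range,        # e.g. "0.70-0.85 (likely true)"
        "confidence": confidence,                    # decimal 0-1
        "explanation_summary": explanation_summary,
        "uncertainty_factors": uncertainty_factors,  # text array
        "created_at": datetime.utcnow(),
    }
    verdict.versions.append(version)
    return version
{{/code}}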

=== 1.6 User ===
Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
=== User Reputation System ===
**V1.0 Approach**: Simple manual role assignment
**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
=== Roles (Manual Assignment) ===
**reader** (default):
* View published claims and evidence
* Browse and search content
* No editing permissions
**contributor**:
* Submit new claims
* Suggest edits to existing content
* Add evidence
* Requires manual promotion by moderator/admin
**moderator**:
* Approve/reject contributor suggestions
* Flag inappropriate content
* Handle abuse reports
* Assigned by admins based on trust
**admin**:
* Manage users and roles
* System configuration
* Access to all features
* Founder-appointed initially
=== Contribution Tracking (Simple) ===
**Basic metrics only**:
* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity
**No complex calculations**:
* No point systems
* No automated privilege escalation
* No reputation decay
* No threshold-based promotions
=== Promotion Process ===
**Manual review by moderators/admins**:
1. User demonstrates value through contributions
2. Moderator reviews user's contribution history
3. Moderator promotes user to contributor role
4. Admin promotes trusted contributors to moderator
**Criteria** (guidelines, not automated):
* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals
=== V2.0+ Evolution ===
**Add complex reputation when**:
* 100+ active contributors
* Manual role management becomes bottleneck
* Clear patterns of abuse emerge requiring automation
**Future features may include**:
* Automated point calculations
* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors
See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
=== 1.7 Edit ===
**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes
=== Edit History Details ===
**What Gets Edited**:
* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence
* **Sources**: NOT versioned (continuous updates, not editorial decisions)
**Who Edits**:
* **Contributors** (rep sufficient): Corrections, additions
* **Trusted Contributors** (rep sufficient): Major improvements, approvals
* **Moderators**: Abuse handling, dispute resolution
* **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
**Edit Types**:
* `CONTENT_CORRECTION`: User fixes factual error
* `CLARIFICATION`: Improved wording
* `SYSTEM_REANALYSIS`: AKEL re-processed claim
* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to previous version
**Retention Policy** (5 years total):
1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)
**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
**Use Cases**:
* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability
See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases.
=== 1.8 Flag ===
Fields: entity_id, reported_by, issue_type, status, resolution_note
=== 1.9 QualityMetric ===
**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking
**Usage**:
* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs treatment metrics
* **Improvement validation**: Measure before/after changes
**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
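A minimal sketch of the alerting check, assuming a hypothetical `send_alert` callable and metrics where exceeding the target is bad (as with error rates):
{{code language="python"}}
def check_quality_metric(metric, send_alert):
    """Alert when a recorded metric exceeds its target threshold."""
    if metric["value"] > metric["target"]:
        send_alert(
            f"{metric['type']} for {metric['category']} is {metric['value']:.2f}, "
            f"above target {metric['target']:.2f} (as of {metric['timestamp']})"
        )
{{/code}}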
=== 1.10 ErrorPattern ===
**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements
**Usage**:
* **Error capture**: When users flag issues or system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error rate reduction over time
**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.11 Core Data Model ERD ==

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.12 User Class Diagram ==
{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
== 2. Versioning Strategy ==
**All Content Entities Are Versioned**:
* **Claim**: Every edit creates new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned
**How Versioning Works**:
* Entity table stores **current state only**
* Edit table stores **all historical states** (before_state, after_state as JSON)
* Version number increments with each edit
* Complete audit trail maintained for the full retention period (see 1.7 Edit)
**Unversioned Entities** (current state only, no history):
* **Source**: Track record continuously updated (not versioned history, just current score)
* **User**: Account state (reputation accumulated, not versioned)
* **QualityMetric**: Time-series data (each record is a point in time, not a version)
* **ErrorPattern**: System improvement queue (status tracked, not versioned)
**Example**:
```
Claim V1: "The sky is blue"
→ User edits →
Claim V2: "The sky is blue during daytime"
→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```
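A minimal sketch of recording one such edit, assuming a hypothetical `save_edit` persistence helper; field names follow the Edit entity in 1.7:
{{code language="python"}}
import json
from datetime import datetime

def apply_claim_edit(claim, new_assertion, user_id, edit_type, reason, save_edit):
    """Update the claim's current state and record before/after in the edit history."""
    before = {"assertion": claim.assertion, "version": claim.version}
    claim.assertion = new_assertion
    claim.version += 1                      # V1 -> V2 -> V3 ...
    after = {"assertion": claim.assertion, "version": claim.version}
    save_edit({
        "entity_type": "Claim",
        "entity_id": claim.id,
        "user_id": user_id,                 # NULL for system (AKEL) edits
        "before_state": json.dumps(before),
        "after_state": json.dumps(after),
        "edit_type": edit_type,             # e.g. "CONTENT_CORRECTION"
        "reason": reason,
        "created_at": datetime.utcnow(),
    })
    claim.save()
{{/code}}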
== 2.5. Storage vs Computation Strategy ==
**Critical architectural decision**: What to persist in databases vs compute dynamically?
**Trade-off**:
* **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
=== Recommendation: Hybrid Approach ===
**STORE (in PostgreSQL):**
==== Claims (Current State + History) ====
* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: ~1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential
==== Evidence (All Records) ====
* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: ~2 KB per evidence (with excerpt)
* **Growth**: 3-10 evidence per claim
* **Decision**: ✅ STORE - Essential for reproducibility
==== Sources (Track Records) ====
* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: ~500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality
==== Edit History (All Versions) ====
* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: ~2 KB per edit
* **Growth**: Linear with edits (~20% of claims get edited)
* **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability
==== Flags (User Reports) ====
* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: ~500 bytes per flag
* **Growth**: ~5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement
==== ErrorPatterns (System Improvement) ====
* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: ~1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning
==== QualityMetrics (Time Series) ====
* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: ~200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring
**STORE (Computed Once, Then Cached):**
==== Analysis Summary ====
* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when claim first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when system significantly improves OR claim edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: ~2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective
==== Confidence Score ====
* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When evidence added, source track record changes significantly, or system improves
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical
==== Risk Score ====
* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When domain changes, evidence changes, or controversy detected
* **Why store**: Same as confidence score
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical
**COMPUTE DYNAMICALLY (Never Store):**
==== Scenarios (⚠️ Critical Decision) ====
* **What**: 2-5 possible interpretations of claim with assumptions
* **Current design**: Stored in Scenario table
* **Alternative**: Compute on-demand when user views claim details
* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by ~20% of users
* **Trade-off analysis**:
- IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
- IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 seconds delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)
**Scenario Strategy** (APPROVED; see the sketch below):
1. **Store scenarios** initially when claim analyzed
2. **Mark as stale** when system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness
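A minimal sketch of steps 2-3 (stale marking and lazy recomputation), assuming hypothetical helpers (`recompute_scenarios`, `load_scenarios`); the 30-day cache layer of step 4 is omitted for brevity:
{{code language="python"}}
from datetime import datetime

def get_scenarios_for_view(claim):
    """Serve stored scenarios, recomputing them on view only when marked stale."""
    if claim.scenarios_stale:
        # System improved since the last analysis: refresh lazily on this view
        scenarios = recompute_scenarios(claim)   # LLM call, roughly $0.005-0.01
        claim.scenarios_stale = False
        claim.scenarios_refreshed_at = datetime.utcnow()
        claim.save()
        return scenarios
    return load_scenarios(claim.id)              # stored path: instant, no LLM cost
{{/code}}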
==== Verdict Synthesis ====
* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time claim viewed
* **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)
**Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur
==== Search Results ====
* **What**: Lists of claims matching search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries
==== Aggregated Statistics ====
* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute
==== User Reputation ====
* **What**: Score based on contributions
* **Current design**: Stored in User table
* **Alternative**: Compute from Edit table
* **Trade-off**:
- Stored: Fast, simple
- Computed: Always accurate, no denormalization
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy
=== Summary Table ===
|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1hr cache|Derivable
**Total storage per claim**: ~18 KB (without edits and flags)
**For 1 million claims**:
* **Storage**: ~18 GB (manageable)
* **PostgreSQL**: ~$50/month (standard instance)
* **Redis cache**: ~$20/month (1 GB instance)
* **S3 archives**: ~$5/month (old edits)
* **Total**: ~$75/month infrastructure
**LLM cost savings by caching**:
* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
* **Total savings**: ~$35K per 1M claims vs recomputing every time
=== Recomputation Triggers ===
**When to mark cached data as stale and recompute** (a trigger-to-artifact mapping is sketched below):
1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score
**Recomputation strategy**:
* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)
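A minimal sketch of that trigger-to-artifact mapping, assuming a hypothetical `mark_stale` helper; whether the refresh then runs eagerly, lazily, or in the nightly batch is decided separately:
{{code language="python"}}
RECOMPUTE_ON = {
    "claim_edited":         {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "evidence_added":       {"scenarios", "verdict", "confidence_score"},
    "source_score_changed": {"confidence_score", "verdict"},   # track record moves >10 points
    "system_improvement":   {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "controversy_detected": {"risk_score"},
}

def handle_trigger(claim_id, trigger, mark_stale):
    """Mark the cached artifacts affected by this trigger as stale for one claim."""
    for artifact in RECOMPUTE_ON[trigger]:
        mark_stale(claim_id, artifact)
{{/code}}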
=== Database Size Projection ===
**Year 1**: 10K claims
* Storage: 180 MB
* Cost: $10/month
**Year 3**: 100K claims
* Storage: 1.8 GB
* Cost: $30/month
**Year 5**: 1M claims
* Storage: 18 GB
* Cost: $75/month
**Year 10**: 10M claims
* Storage: 180 GB
* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)
**Conclusion**: Storage costs are manageable, and the LLM cost savings are substantial.
== 2.6. Key Simplifications ==
* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation
== 3. What Gets Stored in the Database ==
=== 3.1 Primary Storage (PostgreSQL) ===
**Claims Table**:
* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
**Evidence Table**:
* All evidence records
* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
**Scenario Table**:
* All scenarios for each claim
* Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
**Source Table**:
* Track record database (continuously updated)
* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
**User Table**:
* All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
**Edit Table**:
* Complete version history
* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Flag Table**:
* User-reported issues
* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
**ErrorPattern Table**:
* System improvement queue
* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
**QualityMetric Table**:
* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp
=== 3.2 What's NOT Stored (Computed on-the-fly) ===
* **Verdicts**: Synthesized from evidence + scenarios when requested
* **Risk scores**: Recalculated based on current factors
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from Elasticsearch index
=== 3.3 Cache Layer (Redis) ===
**Cached for performance** (read-through pattern sketched below):
* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)
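A minimal read-through caching sketch for the first item, assuming the redis-py client and a hypothetical `load_claim_from_db` helper; the TTL follows the list above:
{{code language="python"}}
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CLAIM_TTL_SECONDS = 3600  # frequently accessed claims: 1 hour

def get_claim_cached(claim_id):
    """Return a claim dict from Redis, falling back to PostgreSQL on a cache miss."""
    key = f"claim:{claim_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    claim = load_claim_from_db(claim_id)                    # miss: read primary storage
    cache.setex(key, CLAIM_TTL_SECONDS, json.dumps(claim))  # cache with 1-hour TTL
    return claim
{{/code}}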
=== 3.4 File Storage (S3) ===
**Archived content**:
* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files
=== 3.5 Search Index (Elasticsearch) ===
**Indexed for search**:
* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)
Synchronized from PostgreSQL via change data capture or periodic sync.
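A minimal sketch of the periodic-sync variant, assuming the official Elasticsearch Python client and hypothetical `claims_updated_since` / `last_sync_time` helpers; change data capture would replace the polling loop:
{{code language="python"}}
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def sync_claims_to_index():
    """Index claim assertions updated since the last sync run."""
    for claim in claims_updated_since(last_sync_time()):
        es.index(
            index="claims",
            id=str(claim.id),
            document={
                "assertion": claim.assertion,   # full-text searchable
                "domain": claim.domain,
                "status": claim.status,
            },
        )
{{/code}}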
== 4. Related Pages ==
* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]