Wiki source code of Data Model

Last modified by Robert Schaub on 2025/12/22 14:33

version	line-number	content
1.1	1	= Data Model =
1.2	2
1.1	3	FactHarbor's data model is simple, focused, and designed for automated processing.
1.2	4
1.1	5	== 1. Core Entities ==
1.2	6
1.1	7	=== 1.1 Claim ===
1.2	8
1.1	9	Fields: id, assertion, domain, status (Published/Hidden only), confidence_score, risk_score, completeness_score, version, views, edit_count
1.2	10
1.1	11	==== Performance Optimization: Denormalized Fields ====
1.2	12
1.1	13	Rationale: The claims system is 95% reads and 5% writes. Denormalizing commonly read data cuts joins on common queries by about 70% and speeds up those queries.
	14	Additional cached fields in claims table:
1.2	15
1.1	16	* evidence_summary (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
	17	* source_names (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
	18	* scenario_count (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios added/removed
	19	* cache_updated_at (TIMESTAMP): When denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers background refresh if too old
	20	Update Strategy:
	21	* Immediate: Update on claim edit (user-facing)
	22	* Deferred: Update via background job every hour (non-critical)
	23	* Invalidation: Clear cache when source data changes significantly
	24	Trade-offs:
	25	* ✅ 70% fewer joins on common queries
	26	* ✅ Much faster claim list/search pages
	27	* ✅ Better user experience
1.2	28	* ⚠️ Small storage increase (10%)
1.1	29	* ⚠️ Need to keep caches in sync
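
The cached fields above can be rebuilt from the normalized rows in one pass. The following is a minimal sketch, not the project's implementation: `EvidenceRow`, `ClaimCache`, `refresh_cache`, and `is_stale` are hypothetical names, and the one-hour maximum age is an assumption taken from the hourly deferred update.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EvidenceRow:  # hypothetical stand-in for a row of the evidence table
    text: str
    source: str
    relevance: float


@dataclass
class ClaimCache:  # the denormalized fields listed above
    evidence_summary: list = field(default_factory=list)
    source_names: list = field(default_factory=list)
    scenario_count: int = 0
    cache_updated_at: Optional[datetime] = None


def refresh_cache(evidence, scenario_count, now):
    """Rebuild all cached fields from the normalized evidence rows."""
    top = sorted(evidence, key=lambda e: e.relevance, reverse=True)[:5]
    return ClaimCache(
        evidence_summary=[
            {"text": e.text, "source": e.source, "relevance": e.relevance}
            for e in top
        ],
        source_names=sorted({e.source for e in evidence}),
        scenario_count=scenario_count,
        cache_updated_at=now,
    )


def is_stale(cache, now, max_age=timedelta(hours=1)):
    """True when the cache was never filled or is older than max_age (assumed 1 hour)."""
    return cache.cache_updated_at is None or now - cache.cache_updated_at > max_age
```

A background job could call `is_stale` per claim and run `refresh_cache` only for the claims that need it.
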
1.2	30
1.1	31	=== 1.2 Evidence ===
1.2	32
1.1	33	Fields: claim_id, source_id, excerpt, url, relevance_score, supports
1.2	34
1.1	35	=== 1.3 Source ===
1.2	36
1.1	37	Purpose: Track reliability of information sources over time
	38	Fields:
1.2	39
1.1	40	* id (UUID): Unique identifier
	41	* name (text): Source name (e.g., "New York Times", "Nature Journal")
	42	* domain (text): Website domain (e.g., "nytimes.com")
	43	* type (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
	44	* track_record_score (0-100): Overall reliability score
	45	* accuracy_history (JSON): Historical accuracy data
	46	* correction_frequency (float): How often source publishes corrections
	47	* last_updated (timestamp): When track record last recalculated
	48	How It Works:
	49	* Initial score based on source type (70 for academic journals, 30 for unknown)
	50	* Updated weekly by a background scheduler (Sunday 2 AM UTC, see Source Scoring Process below)
	51	* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
	52	* Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
	53	* Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
	54	See: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
	55	Fields: id, name, domain, track_record_score, accuracy_history, correction_frequency
	56	Key: Automated source reliability tracking
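
As an illustration of the weighting formula and the quality thresholds listed above, here is a small sketch. It assumes each component is itself a 0-100 sub-score, which the specification does not state; `track_record_score` and `quality_tier` are hypothetical helper names.

```python
# Weights from the formula above, in percent (they sum to 100).
WEIGHTS = {
    "accuracy_rate": 50,
    "correction_policy": 20,
    "editorial_standards": 15,
    "bias_transparency": 10,
    "longevity": 5,
}


def track_record_score(components):
    """Weighted 0-100 score; assumes every component is a 0-100 sub-score."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError("missing components: %s" % sorted(missing))
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


def quality_tier(score):
    """Map a 0-100 score to the quality thresholds listed above."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"


example = {
    "accuracy_rate": 90,
    "correction_policy": 80,
    "editorial_standards": 70,
    "bias_transparency": 60,
    "longevity": 100,
}
score = track_record_score(example)  # 45 + 16 + 10.5 + 6 + 5
print(score, quality_tier(score))
```
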
1.2	57
1.1	58	==== Source Scoring Process (Separation of Concerns) ====
1.2	59
1.1	60	Critical design principle: Prevent circular dependencies between source scoring and claim analysis.
	61	The Problem:
1.2	62
1.1	63	* Source scores should influence claim verdicts
* Claim verdicts should update source scores
	64	* But: Direct feedback creates circular dependency and potential feedback loops
	65	The Solution: Temporal separation
1.2	66
1.1	67	==== Weekly Background Job (Source Scoring) ====
1.2	68
1.1	69	Runs independently of claim analysis:
1.2	70	{{code language="python"}}
from datetime import datetime, timezone


def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Helpers such as get_claims_from_past_week() are assumed to exist elsewhere.
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()

    for source in get_all_sources():
        # Calculate accuracy metrics (neutral 0.5 when the source was not cited)
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5

        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)

        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = datetime.now(timezone.utc)
        source.save()


# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}
	71
1.1	72	==== Real-Time Claim Analysis (AKEL) ====
1.2	73
1.1	74	Uses source scores but never updates them:
1.2	75	{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Helpers such as gather_evidence() are assumed to exist elsewhere.
    # Gather evidence
    evidence_list = gather_evidence(claim_text)

    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score

        # Use score to weight evidence (relevance_score is the Evidence field)
        evidence.weighted_relevance = evidence.relevance_score * source_score

    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)

    # NEVER update source scores here.
    # That happens in the weekly background job.
    return verdict
{{/code}}
	76
1.1	77	==== Monthly Audit (Quality Assurance) ====
1.2	78
1.1	79	Moderator review of flagged source scores:
1.2	80
1.1	81	* Verify scores make sense
	82	* Detect gaming attempts
	83	* Identify systematic biases
	84	* Manual adjustments if needed
	85	Key Principles:
	86	✅ Scoring and analysis are temporally separated
	87	* Source scoring: Weekly batch job
	88	* Claim analysis: Real-time processing
	89	* Never update scores during analysis
	90	✅ One-way data flow during processing
	91	* Claims READ source scores
	92	* Claims NEVER WRITE source scores
	93	* Updates happen in background only
	94	✅ Predictable update cycle
	95	* Sources update every Sunday 2 AM
	96	* Claims always use last week's scores
	97	* No mid-week score changes
	98	✅ Audit trail
	99	* Log all score changes
	100	* Track score history
	101	* Explainable calculations
	102	Benefits:
	103	* No circular dependencies
	104	* Predictable behavior
	105	* Easier to reason about
	106	* Simpler testing
	107	* Clear audit trail
	108	Example Timeline:
	109	```
	110	Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)
	111	Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52
	112	Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
	113	```
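
The fixed weekly cadence in the timeline can be computed with the standard library alone. This is a sketch under the stated schedule (Sunday 2 AM UTC); `next_scoring_run` is a hypothetical helper, not part of the specification, and it expects a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def next_scoring_run(now):
    """Return the next Sunday 02:00 UTC strictly after `now` (timezone-aware)."""
    now = now.astimezone(timezone.utc)
    days_ahead = (6 - now.weekday()) % 7  # Monday is 0, Sunday is 6
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    if candidate <= now:
        # It is already Sunday at or after 02:00, so wait for next week.
        candidate += timedelta(days=7)
    return candidate


wednesday = datetime(2025, 12, 17, 10, 0, tzinfo=timezone.utc)
print(next_scoring_run(wednesday).isoformat())
```

Every claim analyzed before the returned instant reads the scores written by the previous run, which is what keeps scoring and analysis temporally separated.
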
1.2	114
1.1	115	=== 1.4 Scenario ===
1.2	116
1.1	117	Purpose: Different interpretations or contexts for evaluating claims
	118	Key Concept: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
	119	Relationship: One-to-many with Claims (simplified for V1.0: scenario belongs to single claim)
	120	Fields:
1.2	121
1.1	122	* id (UUID): Unique identifier
	123	* claim_id (UUID): Foreign key to claim (one-to-many)
	124	* description (text): Human-readable description of the scenario
	125	* assumptions (JSONB): Key assumptions that define this scenario context
	126	* extracted_from (UUID): Reference to evidence that this scenario was extracted from
	127	* created_at (timestamp): When scenario was created
	128	* updated_at (timestamp): Last modification
	129	How Found: Evidence search → Extract context → Create scenario → Link to claim
	130	Example: For claim "Vaccines reduce hospitalization":
	131	* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
	132	* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
	133	* Scenario 3: "Immunocompromised patients" from specialist study
	134	V2.0 Evolution: A many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

Purpose: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
Core Fields:
	135	* id (UUID): Primary key
	136	* scenario_id (UUID FK): The scenario being assessed
	137	* likelihood_range (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
	138	* confidence (decimal 0-1): How confident we are in this assessment
	139	* explanation_summary (text): Human-readable reasoning explaining the verdict
	140	* uncertainty_factors (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
	141	* created_at (timestamp): When verdict was created
	142	* updated_at (timestamp): Last modification
Change Tracking: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
Relationship: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
Example:
	143	For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
	144	* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
	145	* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
	146	* Edit entity records the complete before/after change with timestamp and reason
Key Design: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

	147	Fields: username, email, role (Reader/Contributor/Moderator), reputation, contributions_count
1.2	148
	149	==== User Reputation System ====
	150
1.1	151	V1.0 Approach: Simple manual role assignment
	152	Rationale: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
1.2	153
1.1	154	==== Roles (Manual Assignment) ====
1.2	155
1.1	156	reader (default):
1.2	157
1.1	158	* View published claims and evidence
	159	* Browse and search content
	160	* No editing permissions
	161	contributor:
	162	* Submit new claims
	163	* Suggest edits to existing content
	164	* Add evidence
	165	* Requires manual promotion by moderator/admin
	166	moderator:
	167	* Approve/reject contributor suggestions
	168	* Flag inappropriate content
	169	* Handle abuse reports
	170	* Assigned by admins based on trust
	171	admin:
	172	* Manage users and roles
	173	* System configuration
	174	* Access to all features
	175	* Founder-appointed initially
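
The role list above maps naturally onto a static permission table. The sketch below is illustrative only: the action names are hypothetical, and it assumes each role also keeps the permissions of the roles before it, which the specification states explicitly only for admin ("access to all features").

```python
from enum import Enum


class Role(Enum):
    READER = "reader"
    CONTRIBUTOR = "contributor"
    MODERATOR = "moderator"
    ADMIN = "admin"


# Permissions each role adds on top of the previous one (assumed cumulative).
_ADDED = [
    (Role.READER, {"view_published", "search"}),
    (Role.CONTRIBUTOR, {"submit_claim", "suggest_edit", "add_evidence"}),
    (Role.MODERATOR, {"review_suggestion", "flag_content", "handle_abuse_report"}),
    (Role.ADMIN, {"manage_users", "configure_system"}),
]

PERMISSIONS = {}
_granted = set()
for _role, _added in _ADDED:
    _granted = _granted | _added
    PERMISSIONS[_role] = frozenset(_granted)


def can(role, action):
    """True when `role` is allowed to perform `action`."""
    return action in PERMISSIONS[role]


print(can(Role.CONTRIBUTOR, "add_evidence"), can(Role.READER, "submit_claim"))
```

Because roles are assigned manually in V1.0, a table like this never needs to be recomputed from contribution metrics.
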
1.2	176
1.1	177	==== Contribution Tracking (Simple) ====
1.2	178
1.1	179	Basic metrics only:
1.2	180
1.1	181	* `contributions_count`: Total number of contributions
	182	* `created_at`: Account age
	183	* `last_active`: Recent activity
	184	No complex calculations:
	185	* No point systems
	186	* No automated privilege escalation
	187	* No reputation decay
	188	* No threshold-based promotions
1.2	189
1.1	190	==== Promotion Process ====
1.2	191
1.1	192	Manual review by moderators/admins:
1.2	193
1.1	194	1. User demonstrates value through contributions
	195	2. Moderator reviews user's contribution history
	196	3. Moderator promotes user to contributor role
	197	4. Admin promotes trusted contributors to moderator
	198	Criteria (guidelines, not automated):
1.2	199
1.1	200	* Quality of contributions
	201	* Consistency over time
	202	* Collaborative behavior
	203	* Understanding of project goals
1.2	204
1.1	205	==== V2.0+ Evolution ====
1.2	206
1.1	207	Add complex reputation when:
1.2	208
1.1	209	* 100+ active contributors
	210	* Manual role management becomes bottleneck
	211	* Clear patterns of abuse emerge requiring automation
	212	Future features may include:
	213	* Automated point calculations
	214	* Threshold-based promotions
	215	* Reputation decay for inactive users
	216	* Track record scoring for contributors
	217	See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
1.2	218
1.1	219	=== 1.7 Edit ===
1.2	220
1.1	221	Fields: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
	222	Purpose: Complete audit trail for all content changes
1.2	223
1.1	224	==== Edit History Details ====
1.2	225
1.1	226	What Gets Edited:
1.2	227
1.1	228	* Claims (20% edited): assertion, domain, status, scores, analysis
	229	* Evidence (10% edited): excerpt, relevance_score, supports
	230	* Scenarios (5% edited): description, assumptions, confidence
	231	* Sources: NOT versioned (continuous updates, not editorial decisions)
	232	Who Edits:
	233	* Contributors (rep sufficient): Corrections, additions
	234	* Trusted Contributors (rep sufficient): Major improvements, approvals
	235	* Moderators: Abuse handling, dispute resolution
	236	* System (AKEL): Re-analysis, automated improvements (user_id = NULL)
	237	Edit Types:
	238	* `CONTENT_CORRECTION`: User fixes factual error
	239	* `CLARIFICATION`: Improved wording
	240	* `SYSTEM_REANALYSIS`: AKEL re-processed claim
	241	* `MODERATION_ACTION`: Hide/unhide for abuse
	242	* `REVERT`: Rollback to previous version
	243	Retention Policy (5 years total):
1.2	244
1.1	245	1. Hot storage (3 months): PostgreSQL, instant access
	246	2. Warm storage (2 years): Partitioned, slower queries
	247	3. Cold storage (3 years): S3 compressed, download required
	248	4. Deletion: After 5 years (except legal holds)
1.2	249	Storage per 1M claims: 400 MB (20% edited × 2 KB per edit)
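
The retention tiers and the storage estimate can be sketched as follows. The tier boundaries are one possible reading of the policy (hot for the first 3 months, warm until the edit is 2 years old, cold until deletion at 5 years), and keeping edits under legal hold in cold storage is also an assumption; `retention_tier` and `edit_storage_mb` are hypothetical helpers.

```python
from datetime import timedelta

HOT_LIMIT = timedelta(days=90)         # about 3 months
WARM_LIMIT = timedelta(days=2 * 365)   # assumed: warm until the edit is 2 years old
DELETE_AFTER = timedelta(days=5 * 365)


def retention_tier(age, legal_hold=False):
    """Return the storage tier for an edit of the given age."""
    if age < HOT_LIMIT:
        return "hot"
    if age < WARM_LIMIT:
        return "warm"
    if age < DELETE_AFTER or legal_hold:
        return "cold"  # assumed: legal holds stay in cold storage
    return "delete"


def edit_storage_mb(claims, edited_percent=20, kb_per_edit=2):
    """Edit-history size in MB (1 MB = 1000 KB), one edit per edited claim."""
    return claims * edited_percent * kb_per_edit / 100 / 1000


print(retention_tier(timedelta(days=400)), edit_storage_mb(1_000_000))
```
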
1.1	250	Use Cases:
1.2	251
1.1	252	* View claim history timeline
	253	* Detect vandalism patterns
	254	* Learn from user corrections (system improvement)
	255	* Legal compliance (audit trail)
	256	* Rollback capability
	257	See Edit History Documentation for complete details on what gets edited by whom, retention policy, and use cases
1.2	258
1.1	259	=== 1.8 Flag ===
1.2	260
1.1	261	Fields: entity_id, reported_by, issue_type, status, resolution_note
1.2	262
1.1	263	=== 1.9 QualityMetric ===
1.2	264
1.1	265	Fields: metric_type, category, value, target, timestamp
	266	Purpose: Time-series quality tracking
	267	Usage:
1.2	268
1.1	269	* Continuous monitoring: Hourly calculation of error rates, confidence scores, processing times
	270	* Quality dashboard: Real-time display with trend charts
	271	* Alerting: Automatic alerts when metrics exceed thresholds
	272	* A/B testing: Compare control vs treatment metrics
	273	* Improvement validation: Measure before/after changes
	274	Example: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
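
A threshold alert on such a record could look like the sketch below. It assumes a lower-is-better metric such as an error rate; `QualityMetric` mirrors the field list above, while `exceeds_target` and its `tolerance` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def exceeds_target(metric, tolerance=0.0):
    """True when a lower-is-better metric is above its target (plus tolerance)."""
    return metric.value > metric.target + tolerance


politics_errors = QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17")
if exceeds_target(politics_errors):
    print("ALERT: %s/%s at %.2f, target %.2f" % (
        politics_errors.metric_type,
        politics_errors.category,
        politics_errors.value,
        politics_errors.target,
    ))
```
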
1.2	275
1.1	276	=== 1.10 ErrorPattern ===
1.2	277
1.1	278	Fields: error_category, claim_id, description, root_cause, frequency, status
	279	Purpose: Capture errors to trigger system improvements
	280	Usage:
1.2	281
1.1	282	* Error capture: When users flag issues or system detects problems
	283	* Pattern analysis: Weekly grouping by category and frequency
	284	* Improvement workflow: Analyze → Fix → Test → Deploy → Re-process → Monitor
	285	* Metrics: Track error rate reduction over time
1.4	286	Example: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.4 Core Data Model ERD ==

{{include reference="Test.FactHarbor pre12 V0\.9\.70.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.5 User Class Diagram ==

1.5	287	{{include reference="Test.FactHarbor pre12 V0\.9\.70.Specification.Diagrams.User Class Diagram.WebHome"/}}
1.2	288
1.1	289	== 2. Versioning Strategy ==
1.2	290
1.1	291	All Content Entities Are Versioned:
1.2	292
1.1	293	* Claim: Every edit creates new version (V1→V2→V3...)
	294	* Evidence: Changes tracked in edit history
	295	* Scenario: Modifications versioned
	296	How Versioning Works:
	297	* Entity table stores current state only
	298	* Edit table stores all historical states (before_state, after_state as JSON)
	299	* Version number increments with each edit
	300	* Complete audit trail maintained for the full retention period (5 years, see 1.7 Edit)
	301	Unversioned Entities (current state only, no history):
	302	* Source: Track record continuously updated (not versioned history, just current score)
	303	* User: Account state (reputation accumulated, not versioned)
	304	* QualityMetric: Time-series data (each record is a point in time, not a version)
	305	* ErrorPattern: System improvement queue (status tracked, not versioned)
	306	Example:
	307	```
	308	Claim V1: "The sky is blue"
  → User edits
  → Claim V2: "The sky is blue during daytime"
  → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
	309	```
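
The versioning flow (current state in the entity table, full history in the Edit table) can be sketched with plain dictionaries. This is an illustration under simplifying assumptions: entities are dicts, the Edit table is a list, and `apply_edit` is a hypothetical helper.

```python
import copy
from datetime import datetime, timezone


def apply_edit(entity, changes, edit_log, *, entity_type, user_id, edit_type, reason):
    """Update the current state in place and append one Edit record."""
    before = copy.deepcopy(entity)
    entity.update(changes)
    entity["version"] = before.get("version", 1) + 1  # version increments per edit
    edit_log.append({
        "entity_type": entity_type,
        "entity_id": entity["id"],
        "user_id": user_id,  # None would stand for a system (AKEL) edit
        "before_state": before,
        "after_state": copy.deepcopy(entity),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc),
    })
    return entity


claim = {"id": "c1", "assertion": "The sky is blue", "version": 1}
edits = []
apply_edit(
    claim,
    {"assertion": "The sky is blue during daytime"},
    edits,
    entity_type="claim",
    user_id="u42",
    edit_type="CLARIFICATION",
    reason="Added time-of-day qualifier",
)
print(claim["version"], edits[0]["before_state"]["assertion"])
```

A `REVERT` then amounts to applying the `before_state` of an earlier record as a new edit, which keeps the history linear.
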
1.2	310
1.1	311	== 2.5. Storage vs Computation Strategy ==
1.2	312
1.1	313	Critical architectural decision: What to persist in databases vs compute dynamically?
	314	Trade-off:
1.2	315
1.1	316	* Store more: Better reproducibility, faster, lower LLM costs, at the price of higher storage/maintenance costs
	317	* Compute more: Lower storage/maintenance costs, at the price of slower responses, higher LLM costs, and less reproducibility
1.2	318
1.1	319	=== Recommendation: Hybrid Approach ===
1.2	320
1.1	321	STORE (in PostgreSQL):
1.2	322
1.1	323	==== Claims (Current State + History) ====
1.2	324
1.1	325	* What: assertion, domain, status, created_at, updated_at, version
	326	* Why: Core entity, must be persistent
	327	* Also store: confidence_score (computed once, then cached)
1.2	328	* Size: 1 KB per claim
1.1	329	* Growth: Linear with claims
	330	* Decision: ✅ STORE - Essential
1.2	331
1.1	332	==== Evidence (All Records) ====
1.2	333
1.1	334	* What: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
	335	* Why: Hard to re-gather, user contributions, reproducibility
1.2	336	* Size: 2 KB per evidence (with excerpt)
1.1	337	* Growth: 3-10 evidence per claim
	338	* Decision: ✅ STORE - Essential for reproducibility
1.2	339
1.1	340	==== Sources (Track Records) ====
1.2	341
1.1	342	* What: name, domain, track_record_score, accuracy_history, correction_frequency
	343	* Why: Continuously updated, expensive to recompute
1.2	344	* Size: 500 bytes per source
1.1	345	* Growth: Slow (limited number of sources)
	346	* Decision: ✅ STORE - Essential for quality
1.2	347
1.1	348	==== Edit History (All Versions) ====
1.2	349
1.1	350	* What: before_state, after_state, user_id, reason, timestamp
	351	* Why: Audit trail, legal requirement, reproducibility
1.2	352	* Size: 2 KB per edit
	353	* Growth: Linear with edits (about 20% of claims get edited)
1.1	354	* Retention: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
	355	* Decision: ✅ STORE - Essential for accountability
1.2	356
1.1	357	==== Flags (User Reports) ====
1.2	358
1.1	359	* What: entity_id, reported_by, issue_type, description, status
	360	* Why: Error detection, system improvement triggers
1.2	361	* Size: 500 bytes per flag
1.1	362	* Growth: A share of claims get flagged (the summary table below assumes 10%)
	363	* Decision: ✅ STORE - Essential for improvement
1.2	364
1.1	365	==== ErrorPatterns (System Improvement) ====
1.2	366
1.1	367	* What: error_category, claim_id, description, root_cause, frequency, status
	368	* Why: Learning loop, prevent recurring errors
1.2	369	* Size: 1 KB per pattern
1.1	370	* Growth: Slow (limited patterns, many fixed)
	371	* Decision: ✅ STORE - Essential for learning
1.2	372
1.1	373	==== QualityMetrics (Time Series) ====
1.2	374
1.1	375	* What: metric_type, category, value, target, timestamp
	376	* Why: Trend analysis, cannot recreate historical metrics
1.2	377	* Size: 200 bytes per metric
1.1	378	* Growth: Hourly = 8,760 per year per metric type
	379	* Retention: 2 years hot, then aggregate and archive
	380	* Decision: ✅ STORE - Essential for monitoring
	381	STORE (Computed Once, Then Cached):
1.2	382
1.1	383	==== Analysis Summary ====
1.2	384
1.1	385	* What: Neutral text summary of claim analysis (200-500 words)
	386	* Computed: Once by AKEL when claim first analyzed
	387	* Stored in: Claim table (text field)
	388	* Recomputed: Only when system significantly improves OR claim edited
	389	* Why store: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
1.2	390	* Size: 2 KB per claim
1.1	391	* Decision: ✅ STORE (cached) - Cost-effective
1.2	392
1.1	393	==== Confidence Score ====
1.2	394
1.1	395	* What: 0-100 score of analysis confidence
	396	* Computed: Once by AKEL
	397	* Stored in: Claim table (integer field)
	398	* Recomputed: When evidence added, source track record changes significantly, or system improves
	399	* Why store: Cheap to store, expensive to compute, users need it fast
	400	* Size: 4 bytes per claim
	401	* Decision: ✅ STORE (cached) - Performance critical
1.2	402
1.1	403	==== Risk Score ====
1.2	404
1.1	405	* What: 0-100 score of claim risk level
	406	* Computed: Once by AKEL
	407	* Stored in: Claim table (integer field)
	408	* Recomputed: When domain changes, evidence changes, or controversy detected
	409	* Why store: Same as confidence score
	410	* Size: 4 bytes per claim
	411	* Decision: ✅ STORE (cached) - Performance critical
	412	COMPUTE DYNAMICALLY (Never Store):
1.2	413
	414	==== Scenarios ====
	415
	416	⚠️ CRITICAL DECISION
	417
1.1	418	* What: 2-5 possible interpretations of claim with assumptions
	419	* Current design: Stored in Scenario table
	420	* Alternative: Compute on-demand when user views claim details
1.2	421	* Storage cost: 1 KB per scenario × 3 scenarios average = 3 KB per claim
1.1	422	* Compute cost: $0.005-0.01 per request (LLM API call)
1.2	423	* Frequency: Viewed in detail by 20% of users
1.1	424	* Trade-off analysis: - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
	425	* Reproducibility: Scenarios may improve as AI improves (good to recompute)
	426	* Speed: Computed = 5-8 seconds delay, Stored = instant
	427	* Decision: ✅ STORE (hybrid approach below)
	428	Scenario Strategy (APPROVED):
1.2	429
1.1	430	1. Store scenarios initially when claim analyzed
	431	2. Mark as stale when system improves significantly
	432	3. Recompute on next view if marked stale
	433	4. Cache for 30 days if frequently accessed
	434	5. Result: Best of both worlds - speed + freshness
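
Steps 1 to 3 of this strategy amount to a stale flag checked at read time. The sketch below assumes a claim is a dictionary with `scenarios` and `scenarios_stale` keys and that `recompute` wraps the LLM call; all of these names are hypothetical, and the 30-day cache from step 4 is left out.

```python
def get_scenarios(claim, recompute):
    """Return stored scenarios, recomputing only when missing or marked stale."""
    if claim.get("scenarios") is None or claim.get("scenarios_stale", False):
        claim["scenarios"] = recompute(claim)
        claim["scenarios_stale"] = False
    return claim["scenarios"]


def mark_stale(claim):
    """Called when the system improves significantly (step 2)."""
    claim["scenarios_stale"] = True


calls = []


def fake_recompute(claim):
    # Stand-in for the expensive scenario extraction.
    calls.append(claim["id"])
    return ["scenario generated on call %d" % len(calls)]


claim = {"id": "c1"}
get_scenarios(claim, fake_recompute)  # first analysis: computed and stored
get_scenarios(claim, fake_recompute)  # served from storage, no recompute
mark_stale(claim)
get_scenarios(claim, fake_recompute)  # recomputed on next view
print(len(calls))
```
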
1.2	435
	436	==== Verdict Synthesis ====
	437
	438	* What: Final conclusion text synthesizing all scenarios
	439
1.1	440	* Compute cost: $0.002-0.005 per request
	441	* Frequency: Every time claim viewed
	442	* Why not store: Changes as evidence/scenarios change, users want fresh analysis
	443	* Speed: 2-3 seconds (acceptable)
	444	* Alternative: Store "last verdict" as a cached field and recompute only if the claim is edited or marked stale
	445	* Recommendation: ✅ STORE cached version, mark stale when changes occur
1.2	446
1.1	447	==== Search Results ====
1.2	448
1.1	449	* What: Lists of claims matching search query
	450	* Compute from: Elasticsearch index
	451	* Cache: 15 minutes in Redis for popular queries
	452	* Why not store permanently: Constantly changing, infinite possible queries
1.2	453
1.1	454	==== Aggregated Statistics ====
1.2	455
1.1	456	* What: "Total claims: 1,234,567", "Average confidence: 78%", etc.
	457	* Compute from: Database queries
	458	* Cache: 1 hour in Redis
	459	* Why not store: Can be derived, relatively cheap to compute
1.2	460
1.1	461	==== User Reputation ====
1.2	462
1.1	463	* What: Score based on contributions
	464	* Current design: Stored in User table
	465	* Alternative: Compute from Edit table
	466	* Trade-off:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
	467	* Frequency: Read on every user action
	468	* Compute cost: Simple COUNT query, milliseconds
	469	* Decision: ✅ STORE - Performance critical, read-heavy
1.2	470
1.1	471	=== Summary Table ===
1.2	472
	473	|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
1.2	475	|Claim core|✅|-|1 KB|STORE|Essential
	476	|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
	477	|Sources|✅|-|500 B (shared)|STORE|Track record
	478	|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
	479	|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
	480	|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
	481	|Risk score|✅|Once|4 B|STORE (cached)|Fast access
	482	|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
	483	|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
	484	|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
	485	|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
	486	|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
	487	|Search results|-|✅|-|COMPUTE + 15min cache|Dynamic
1.1	488	|Aggregations|-|✅|-|COMPUTE + 1hr cache|Derivable
1.2	489	Total storage per claim: 18 KB (without edits and flags)
1.1	490	For 1 million claims:
1.2	491
	492	* Storage: 18 GB (manageable)
	493	* PostgreSQL: $50/month (standard instance)
	494	* Redis cache: $20/month (1 GB instance)
	495	* S3 archives: $5/month (old edits)
	496	* Total: $75/month infrastructure
1.1	497	LLM cost savings by caching:
	498	* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
	499	* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
1.2	500	* Total savings: $35K per 1M claims vs recomputing every time
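
The savings figures can be checked with exact decimal arithmetic. This sketch only restates the numbers above and, like them, counts one avoided regeneration per claim (and per viewed claim for scenarios).

```python
from decimal import Decimal

CLAIMS = 1_000_000

savings = {
    # $0.03 saved per stored analysis summary
    "analysis_summary": Decimal("0.03") * CLAIMS,
    # $0.01 saved per claim whose scenarios are viewed (20% of claims)
    "scenarios": Decimal("0.01") * CLAIMS * Decimal("0.20"),
    # $0.003 saved per stored verdict
    "verdict": Decimal("0.003") * CLAIMS,
}
total = sum(savings.values())

for name, amount in savings.items():
    print("%-17s $%s" % (name, format(amount, ",.0f")))
print("%-17s $%s" % ("total", format(total, ",.0f")))
```
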
	501
1.1	502	=== Recomputation Triggers ===
1.2	503
1.1	504	When to mark cached data as stale and recompute:
1.2	505
1.1	506	1. User edits claim → Recompute: all (analysis, scenarios, verdict, scores)
	507	2. Evidence added → Recompute: scenarios, verdict, confidence score
	508	3. Source track record changes >10 points → Recompute: confidence score, verdict
	509	4. System improvement deployed → Mark affected claims stale, recompute on next view
	510	5. Controversy detected (high flag rate) → Recompute: risk score
	511	Recomputation strategy:
1.2	512
1.1	513	* Eager: Immediately recompute (for user edits)
	514	* Lazy: Recompute on next view (for system improvements)
	515	* Batch: Nightly re-evaluation of stale claims (if <1000)
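
The trigger list can be expressed as a lookup table. The sketch below is one possible encoding: the event names are hypothetical, "scores" is read as confidence and risk score, a deployed system improvement is assumed to invalidate every cached field, and "batch" as the default strategy for events without a stated strategy is an assumption.

```python
ALL_FIELDS = frozenset(
    {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"}
)

TRIGGERS = {
    "claim_edited": ALL_FIELDS,
    "evidence_added": frozenset({"scenarios", "verdict", "confidence_score"}),
    "source_score_changed": frozenset({"confidence_score", "verdict"}),
    "system_improved": ALL_FIELDS,  # assumed: all cached fields go stale
    "controversy_detected": frozenset({"risk_score"}),
}

# Strategies named in the text; other events fall back to "batch" (assumption).
STRATEGY = {"claim_edited": "eager", "system_improved": "lazy"}


def plan_recompute(event, score_delta=0):
    """Return (fields to recompute, strategy) for a trigger event."""
    if event == "source_score_changed" and abs(score_delta) <= 10:
        return frozenset(), None  # change of 10 points or less: nothing to do
    return TRIGGERS[event], STRATEGY.get(event, "batch")


print(plan_recompute("evidence_added"))
```
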
1.2	516
1.1	517	=== Database Size Projection ===
1.2	518
1.1	519	Year 1: 10K claims
1.2	520
1.1	521	* Storage: 180 MB
	522	* Cost: $10/month
	523	Year 3: 100K claims
* Storage: 1.8 GB
	524	* Cost: $30/month
	525	Year 5: 1M claims
	526	* Storage: 18 GB
* Cost: $75/month
	527	Year 10: 10M claims
	528	* Storage: 180 GB
	529	* Cost: $300/month
	530	* Optimization: Archive old claims to S3 ($5/TB/month)
	531	Conclusion: Storage costs are manageable, LLM cost savings are substantial.
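
The storage column of this projection follows directly from the 18 KB per claim estimate. A short sketch, assuming decimal units (1 GB = 1,000,000 KB); `storage_gb` is a hypothetical helper and the monthly costs are not modeled.

```python
KB_PER_CLAIM = 18  # per-claim total from the summary table, without edits and flags


def storage_gb(claims):
    """Projected storage in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return claims * KB_PER_CLAIM / 1_000_000


for year, claims in [(1, 10_000), (3, 100_000), (5, 1_000_000), (10, 10_000_000)]:
    print("Year %2d: %10d claims, %6.2f GB" % (year, claims, storage_gb(claims)))
```
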
1.2	532
1.1	533	== 3. Key Simplifications ==
1.2	534
1.1	535	* Two content states only: Published, Hidden
	536	* Three user roles only: Reader, Contributor, Moderator
	537	* No complex versioning: Linear edit history
	538	* Reputation-based permissions: Not role hierarchy
	539	* Source track records: Continuous evaluation
1.2	540
1.1	541	== 3. What Gets Stored in the Database ==
1.2	542
1.1	543	=== 3.1 Primary Storage (PostgreSQL) ===
1.2	544
1.1	545	Claims Table:
1.2	546
1.1	547	* Current state only (latest version)
	548	* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
	549	Evidence Table:
	550	* All evidence records
	551	* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
	552	Scenario Table:
	553	* All scenarios for each claim
	554	* Fields: id, claim_id, description, assumptions (JSONB), confidence, created_by, created_at
	555	Source Table:
	556	* Track record database (continuously updated)
	557	* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
	558	User Table:
	559	* All user accounts
	560	* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
	561	Edit Table:
	562	* Complete version history
	563	* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
	564	Flag Table:
	565	* User-reported issues
	566	* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
	567	ErrorPattern Table:
	568	* System improvement queue
	569	* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
	570	QualityMetric Table:
	571	* Time-series quality data
	572	* Fields: id, metric_type, metric_category, value, target, timestamp
1.2	573
1.1	574	=== 3.2 What's NOT Stored (Computed on-the-fly) ===
1.2	575
1.1	576	* Verdicts: Synthesized from evidence + scenarios when requested
	577	* Risk scores: Recalculated based on current factors
	578	* Aggregated statistics: Computed from base data
	579	* Search results: Generated from Elasticsearch index
1.2	580
1.1	581	=== 3.3 Cache Layer (Redis) ===
1.2	582
1.1	583	Cached for performance:
1.2	584
1.1	585	* Frequently accessed claims (TTL: 1 hour)
	586	* Search results (TTL: 15 minutes)
	587	* User sessions (TTL: 24 hours)
	588	* Source track records (TTL: 1 hour)
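
The TTLs above can be illustrated with a small in-memory stand-in for the Redis layer. This is not Redis client code: `TTLCache` and the `kind` labels are hypothetical, and the injectable clock exists only to make expiry easy to demonstrate.

```python
import time

# TTLs from the list above, in seconds.
TTL_SECONDS = {
    "claim": 60 * 60,         # frequently accessed claims: 1 hour
    "search": 15 * 60,        # search results: 15 minutes
    "session": 24 * 60 * 60,  # user sessions: 24 hours
    "source": 60 * 60,        # source track records: 1 hour
}


class TTLCache:
    """Minimal in-memory cache with per-kind expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def set(self, kind, key, value):
        self._data[(kind, key)] = (value, self._clock() + TTL_SECONDS[kind])

    def get(self, kind, key):
        item = self._data.get((kind, key))
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[(kind, key)]  # expired: drop the entry and report a miss
            return None
        return value


fake_now = [0.0]
cache = TTLCache(clock=lambda: fake_now[0])
cache.set("search", "vaccines", ["claim-1", "claim-2"])
fake_now[0] = 14 * 60
print(cache.get("search", "vaccines"))
```
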
1.2	589
1.1	590	=== 3.4 File Storage (S3) ===
1.2	591
1.1	592	Archived content:
1.2	593
1.1	594	* Old edit history (>3 months)
	595	* Evidence documents (archived copies)
	596	* Database backups
	597	* Export files
1.2	598
1.1	599	=== 3.5 Search Index (Elasticsearch) ===
1.2	600
1.1	601	Indexed for search:
1.2	602
1.1	603	* Claim assertions (full-text)
	604	* Evidence excerpts (full-text)
	605	* Scenario descriptions (full-text)
	606	* Source names (autocomplete)
	607	Synchronized from PostgreSQL via change data capture or periodic sync.
1.2	608
1.1	609	== 4. Related Pages ==
1.2	610
	611	* [[Architecture>>Test.FactHarbor pre12 V0\.9\.70.Specification.Architecture.WebHome]]
1.6	612	* [[Requirements>>Test.FactHarbor pre12 V0\.9\.70.Specification.Requirements.WebHome]]
1.7	613	* [[Workflows>>Test.FactHarbor pre12 V0\.9\.70.Specification.Workflows.WebHome]]
