Wiki source code of Data Model

Last modified by Robert Schaub on 2026/02/08 08:32

= Data Model =

{{warning}}
**Implementation Status (Updated January 2026)**

This specification describes the **target** normalized data model. The current implementation (v2.6.33) differs significantly:

* **Storage**: All data is stored as **JSON blobs in SQLite**, not in normalized PostgreSQL tables
* **Scenarios**: **Replaced with KeyFactors**, which are decomposition questions rather than separate entities
* **Caching**: Redis cache **not implemented**; no claim caching yet
* **Source Scoring**: Uses the external MBFC bundle, not an internal track record calculation
* **User System**: Not implemented (no authentication in the current version)

This specification remains valuable as the target architecture for future versions.

See `Docs/STATUS/Documentation_Inconsistencies.md` for the full comparison.
{{/warning}}

FactHarbor's data model is **simple, focused, and designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is 95% reads and 5% writes. Denormalizing commonly read data removes about 70% of the joins on common queries, which speeds up claim list and search pages.

**Additional cached fields in the claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining the evidence table for listing/preview
** Updated when evidence is added or removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios are added or removed
* **cache_updated_at** (TIMESTAMP): When the denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers a background refresh if too old

**Update Strategy**:

* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)
* **Invalidation**: Clear cache when source data changes significantly

**Trade-offs**:

* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (10%)
* ⚠️ Need to keep caches in sync
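
The cached fields described above could be rebuilt along the following lines. This is a minimal sketch, not the implementation: the `EvidenceRow` type, the `build_claim_cache` and `is_cache_stale` helpers, and the one-hour staleness window (chosen to mirror the hourly deferred job) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class EvidenceRow:
    """Hypothetical flattened view of an evidence record joined to its source."""
    text: str
    source: str
    relevance: float


def build_claim_cache(evidence, scenario_count, now):
    """Rebuild the denormalized fields stored on the claims row."""
    # Keep only the five most relevant snippets for list/preview pages
    top = sorted(evidence, key=lambda e: e.relevance, reverse=True)[:5]
    # dict.fromkeys removes duplicates while keeping first-seen order
    source_names = list(dict.fromkeys(e.source for e in evidence))
    return {
        "evidence_summary": [
            {"text": e.text, "source": e.source, "relevance": e.relevance}
            for e in top
        ],
        "source_names": source_names,
        "scenario_count": scenario_count,
        "cache_updated_at": now,
    }


def is_cache_stale(cache_updated_at, now, max_age=timedelta(hours=1)):
    """True when the cached fields are older than the assumed refresh window."""
    return now - cache_updated_at > max_age


# Hypothetical sample data: seven evidence rows spread over three sources
rows = [EvidenceRow(f"snippet {i}", f"Source {i % 3}", i / 10) for i in range(7)]
cache = build_claim_cache(
    rows, scenario_count=3, now=datetime(2026, 1, 1, tzinfo=timezone.utc)
)
```

A background job could call `is_cache_stale` on `cache_updated_at` and rebuild only the rows that are too old.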

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track reliability of information sources over time

**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")
* **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
* **track_record_score** (0-100): Overall reliability score
* **accuracy_history** (JSON): Historical accuracy data
* **correction_frequency** (float): How often source publishes corrections
* **last_updated** (timestamp): When track record last recalculated

**How It Works**:

* Initial score based on source type (70 for academic journals, 30 for unknown)
* Updated weekly by a background job (see Source Scoring Process below)
* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
* Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
* Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
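
The weighted formula and the quality thresholds above can be sketched as follows. The function names are hypothetical, and the sketch assumes each component is itself scored on a 0-100 scale, which the specification does not state explicitly.

```python
# Component weights in percent, as listed in the formula above.
WEIGHTS = {
    "accuracy_rate": 50,
    "correction_policy": 20,
    "editorial_standards": 15,
    "bias_transparency": 10,
    "longevity": 5,
}

# Lower bound of each quality band, highest first.
QUALITY_BANDS = [
    (90, "Exceptional"),
    (70, "Reliable"),
    (50, "Acceptable"),
    (30, "Questionable"),
]


def track_record_score(components):
    """Weighted average of component scores (assumed to be on a 0-100 scale)."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


def quality_band(score):
    """Map a 0-100 score to its quality label."""
    for lower_bound, label in QUALITY_BANDS:
        if score >= lower_bound:
            return label
    return "Unreliable"


# Hypothetical component scores for one source
example = {
    "accuracy_rate": 90,
    "correction_policy": 80,
    "editorial_standards": 70,
    "bias_transparency": 60,
    "longevity": 50,
}
score = track_record_score(example)
```

Because the weights sum to 100, the result stays on the same 0-100 scale as `track_record_score`.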

{{info}}
**Current Implementation (v2.6.33):** Source reliability uses external **MBFC (Media Bias/Fact Check) bundle** instead of internal track record calculation. Scores are loaded from a configurable JSON file. See `Docs/ARCHITECTURE/Source_Reliability.md`.
{{/info}}

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
**Key**: Automated source reliability tracking

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.

**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:
{{code language="python"}}
from datetime import datetime, timezone

# The helper functions called below (get_claims_from_past_week, get_all_sources,
# count_correct_verdicts_citing, count_total_citations, calculate_weighted_score)
# are assumed to be provided by the data-access layer.


def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics; fall back to a neutral 0.5 when uncited
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = datetime.now(timezone.utc)
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence (relevance_score is the Evidence field)
        evidence.weighted_relevance = evidence.relevance_score * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:

* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases
* Manual adjustments if needed

**Key Principles**:

✅ **Scoring and analysis are temporally separated**

* Source scoring: Weekly batch job
* Claim analysis: Real-time processing
* Never update scores during analysis

✅ **One-way data flow during processing**

* Claims READ source scores
* Claims NEVER WRITE source scores
* Updates happen in background only

✅ **Predictable update cycle**

* Sources update every Sunday 2 AM
* Claims always use last week's scores
* No mid-week score changes

✅ **Audit trail**

* Log all score changes
* Track score history
* Explainable calculations

**Benefits**:

* No circular dependencies
* Predictable behavior
* Easier to reason about
* Simpler testing
* Clear audit trail

**Example Timeline**:

```
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
```

=== 1.4 Scenario ===

{{warning}}
**Implementation Change:** Scenarios were **replaced with KeyFactors** in the current implementation. KeyFactors are decomposition questions discovered during the understanding phase, not separate stored entities. See `Docs/ARCHITECTURE/KeyFactors_Design.md` for the design rationale.
{{/warning}}

**Purpose**: Different interpretations or contexts for evaluating claims
**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
**Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)

**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
* **created_at** (timestamp): When scenario was created
* **updated_at** (timestamp): Last modification

**How Found**: Evidence search → Extract context → Create scenario → Link to claim

**Example**:
For claim "Vaccines reduce hospitalization":

* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from specialist study

**V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:

* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When verdict was created
* **updated_at** (timestamp): Last modification

**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.

**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.

**Example**:
For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":

* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator/Admin), **reputation**, contributions_count

==== User Reputation System ====

**V1.0 Approach**: Simple manual role assignment
**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.

==== Roles (Manual Assignment) ====

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

**contributor**:

* Submit new claims
* Suggest edits to existing content
* Add evidence
* Requires manual promotion by moderator/admin

**moderator**:

* Approve/reject contributor suggestions
* Flag inappropriate content
* Handle abuse reports
* Assigned by admins based on trust

**admin**:

* Manage users and roles
* System configuration
* Access to all features
* Founder-appointed initially
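
A role check for the four roles above might look like the sketch below. The action names are illustrative, and the sketch assumes that each role inherits the permissions of the roles beneath it; the specification states this explicitly only for admins ("Access to all features").

```python
from enum import IntEnum


class Role(IntEnum):
    """Roles ordered from least to most privileged."""
    READER = 1
    CONTRIBUTOR = 2
    MODERATOR = 3
    ADMIN = 4


# Minimum role required for each action (action names are hypothetical).
REQUIRED_ROLE = {
    "view_content": Role.READER,
    "search_content": Role.READER,
    "submit_claim": Role.CONTRIBUTOR,
    "suggest_edit": Role.CONTRIBUTOR,
    "add_evidence": Role.CONTRIBUTOR,
    "review_suggestion": Role.MODERATOR,
    "flag_content": Role.MODERATOR,
    "handle_abuse_report": Role.MODERATOR,
    "manage_users": Role.ADMIN,
    "configure_system": Role.ADMIN,
}


def can(role, action):
    """Return True if `role` may perform `action`; unknown actions are denied."""
    required = REQUIRED_ROLE.get(action)
    return required is not None and role >= required
```

Denying unknown actions by default keeps the check safe when a new action is added before its required role is configured.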

==== Contribution Tracking (Simple) ====

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

**No complex calculations**:

* No point systems
* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

==== Promotion Process ====

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews user's contribution history
3. Moderator promotes user to contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

==== V2.0+ Evolution ====

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes bottleneck
* Clear patterns of abuse emerge requiring automation

**Future features may include**:

* Automated point calculations
* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes

==== Edit History Details ====

**What Gets Edited**:

* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence
* **Sources**: NOT versioned (continuous updates, not editorial decisions)

**Who Edits**:

* **Contributors** (rep sufficient): Corrections, additions
* **Trusted Contributors** (rep sufficient): Major improvements, approvals
* **Moderators**: Abuse handling, dispute resolution
* **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)

**Edit Types**:

* `CONTENT_CORRECTION`: User fixes factual error
* `CLARIFICATION`: Improved wording
* `SYSTEM_REANALYSIS`: AKEL re-processed claim
* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to previous version

**Retention Policy** (5 years total):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
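
The retention tiers and the storage estimate can be sketched as below. One assumption is worth stating loudly: the tier durations are read here as cumulative age boundaries (hot until about 3 months, warm until 2 years, cold until 5 years), which is one way to reconcile them with the 5-year total. The function names and the one-edit-per-edited-claim simplification are also illustrative.

```python
from datetime import datetime, timedelta, timezone

# Age boundaries (assumption: each tier ends at a cumulative record age).
HOT_UNTIL = timedelta(days=90)          # about 3 months
WARM_UNTIL = timedelta(days=2 * 365)    # 2 years
DELETE_AFTER = timedelta(days=5 * 365)  # 5 years


def storage_tier(created_at, now, legal_hold=False):
    """Return the retention tier for an edit record of the given age."""
    age = now - created_at
    if age < HOT_UNTIL:
        return "hot"
    if age < WARM_UNTIL:
        return "warm"
    # Records under legal hold stay in cold storage past the deletion age
    if age < DELETE_AFTER or legal_hold:
        return "cold"
    return "delete"


def edit_storage_mb(claims, edited_fraction=0.20, kb_per_edit=2):
    """Edit-history size in MB (decimal units), assuming one edit per edited claim."""
    return claims * edited_fraction * kb_per_edit / 1000
```

With the defaults, `edit_storage_mb(1_000_000)` reproduces the 400 MB figure: 1,000,000 claims × 20% × 2 KB.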

**Use Cases**:

* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases.

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking

**Usage**:

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{metric_type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
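
The alerting use case can be illustrated with a small sketch. It assumes a "lower is better" metric such as an error rate, where an alert fires when the value is above the target; the `QualityMetric` dataclass mirrors the fields above, while the helper names and the second sample record are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def exceeds_target(metric):
    """True when the measured value is above its target (lower is better)."""
    return metric.value > metric.target


def alerts(metrics):
    """Return one message for every metric that exceeds its target."""
    return [
        f"{m.metric_type}/{m.category}: {m.value:.2f} > target {m.target:.2f}"
        for m in metrics
        if exceeds_target(m)
    ]


sample = [
    QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17"),
    # Hypothetical second record that stays within its target
    QualityMetric("ErrorRate", "Science", 0.05, 0.10, "2025-12-17"),
]
```

Metrics where higher is better (for example confidence scores) would need the comparison reversed.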

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements

**Usage**:

* **Error capture**: When users flag issues or system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error rate reduction over time

**Example**: `{error_category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
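
The weekly pattern analysis (grouping by category and frequency) could be sketched as follows. The function name and the sample reports, including the `MissedContext` category, are hypothetical.

```python
from collections import Counter


def rank_error_categories(reports):
    """Count error reports per category, most frequent first.

    Ties are ordered alphabetically so the ranking is deterministic.
    """
    counts = Counter(report["error_category"] for report in reports)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical week of captured errors
week = [
    {"error_category": "WrongSource", "claim_id": "c1"},
    {"error_category": "WrongSource", "claim_id": "c2"},
    {"error_category": "MissedContext", "claim_id": "c3"},
]
ranking = rank_error_categories(week)
```

The top entries of the ranking are the natural starting point for the Analyze step of the improvement workflow.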

=== 1.11 Core Data Model ERD ===

{{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

=== 1.12 User Class Diagram ===

{{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

**How Versioning Works**:

* Entity table stores **current state only**
* Edit table stores **all historical states** (before_state, after_state as JSON)
* Version number increments with each edit
* Complete audit trail maintained for the full retention period (5 years, see section 1.7)

**Unversioned Entities** (current state only, no history):

* **Source**: Track record continuously updated (not versioned history, just current score)
* **User**: Account state (reputation accumulated, not versioned)
* **QualityMetric**: Time-series data (each record is a point in time, not a version)
* **ErrorPattern**: System improvement queue (status tracked, not versioned)

**Example**:

```
Claim V1: "The sky is blue"
  → User edits →
Claim V2: "The sky is blue during daytime"
  → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```
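
The versioning mechanics can be sketched with plain dictionaries standing in for the entity table and the Edit table. The `apply_edit` helper and the sample identifiers are assumptions for illustration; only the Edit field names come from section 1.7.

```python
import copy
from datetime import datetime, timezone


def apply_edit(claim, edit_log, changes, user_id, edit_type, reason):
    """Apply `changes` to `claim` in place and append one Edit record."""
    before = copy.deepcopy(claim)
    claim.update(changes)
    # The entity keeps only its current state; the version counter increments
    claim["version"] = before["version"] + 1
    edit_log.append({
        "entity_type": "Claim",
        "entity_id": claim["id"],
        "user_id": user_id,  # None for system (AKEL) edits
        "before_state": before,
        "after_state": copy.deepcopy(claim),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc),
    })
    return claim


claim = {"id": "claim-1", "assertion": "The sky is blue", "version": 1}
edits = []
apply_edit(
    claim,
    edits,
    changes={"assertion": "The sky is blue during daytime"},
    user_id="user-42",
    edit_type="CLARIFICATION",
    reason="Narrowed the assertion",
)
```

A `REVERT` would be another call to `apply_edit` whose `changes` are taken from an earlier `before_state`, so the history stays linear.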

== 2.5. Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs compute dynamically?

**Trade-off**:

* **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: 1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: 2 KB per evidence (with excerpt)
* **Growth**: 3-10 evidence per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: 500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: 2 KB per edit
* **Growth**: Linear with edits (about 20% of claims are edited)
* **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: 500 bytes per flag
* **Growth**: Roughly 5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: 1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: 200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when claim first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when system significantly improves OR claim edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: 2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When evidence added, source track record changes significantly, or system improves
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When domain changes, evidence changes, or controversy detected
* **Why store**: Same as confidence score
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Evaluated Case by Case):**

==== Scenarios ====

⚠️ CRITICAL DECISION

* **What**: 2-5 possible interpretations of claim with assumptions
* **Current design**: Stored in Scenario table
* **Alternative**: Compute on-demand when user views claim details
* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by 20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 seconds delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED):

1. **Store scenarios** initially when claim analyzed
2. **Mark as stale** when system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness
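
Steps 1 to 3 of the scenario strategy can be sketched with an in-memory stand-in for the Scenario table. The `ScenarioStore` class and the `fake_generate` function (which replaces the LLM call) are hypothetical, and the 30-day cache from step 4 is left out to keep the sketch short.

```python
class ScenarioStore:
    """In-memory stand-in for the Scenario table with a stale flag per claim."""

    def __init__(self, generate):
        self._generate = generate  # expensive call (an LLM request in practice)
        self._scenarios = {}       # claim_id -> list of scenarios
        self._stale = set()        # claim_ids whose scenarios need recomputing

    def analyze(self, claim_id):
        """Step 1: compute and store scenarios when the claim is analyzed."""
        self._scenarios[claim_id] = self._generate(claim_id)
        self._stale.discard(claim_id)

    def mark_all_stale(self):
        """Step 2: flag every stored claim after a significant system improvement."""
        self._stale.update(self._scenarios)

    def view(self, claim_id):
        """Step 3: recompute only if missing or stale, otherwise serve stored data."""
        if claim_id not in self._scenarios or claim_id in self._stale:
            self.analyze(claim_id)
        return self._scenarios[claim_id]


calls = []


def fake_generate(claim_id):
    """Record each expensive call so the example can show how many were made."""
    calls.append(claim_id)
    return [f"scenario {len(calls)} for {claim_id}"]


store = ScenarioStore(fake_generate)
store.analyze("claim-1")
store.view("claim-1")            # served from storage, no new call
store.mark_all_stale()
latest = store.view("claim-1")   # recomputed once
store.view("claim-1")            # served from storage again
```

Only two generation calls are made across the four accesses, which is the point of marking data stale rather than recomputing on every view.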

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time claim viewed
* **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)
* **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute
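
The compute-then-cache pattern used for search results and aggregated statistics can be illustrated without Redis. The `TTLCache` class below is a hypothetical in-memory stand-in; only the two time-to-live values come from the text above.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so tests can control time
        self._entries = {}    # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, recomputing it once expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


SEARCH_TTL = 15 * 60  # 15 minutes, as stated for search results
STATS_TTL = 60 * 60   # 1 hour, as stated for aggregated statistics
```

A real deployment would get the same effect by setting an expiry on the cache key, so that stale entries disappear without explicit invalidation.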

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in User table
* **Alternative**: Compute from Edit table
* **Trade-off**:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1hr cache|Derivable

**Total storage per claim**: 18 KB (without edits and flags)

**For 1 million claims**:

* **Storage**: 18 GB (manageable)
* **PostgreSQL**: $50/month (standard instance)
* **Redis cache**: $20/month (1 GB instance)
* **S3 archives**: $5/month (old edits)
* **Total**: $75/month infrastructure

**LLM cost savings by caching**:

* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
* **Total savings**: $35K per 1M claims vs recomputing every time
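
The savings figures can be checked with a few lines of arithmetic. The sketch works in integer mills (tenths of a cent) to avoid floating-point rounding. It assumes one avoided LLM call per claim for the analysis summary and the verdict, and one per viewed claim for scenarios; all names are illustrative.

```python
CLAIMS = 1_000_000

# Per-claim LLM cost avoided by storing each artifact, in mills (1/1000 USD).
SAVINGS_MILLS = {
    "analysis_summary": 30,  # $0.03 per claim
    "verdict": 3,            # $0.003 per claim
}
SCENARIO_COST_MILLS = 10     # $0.01 per scenario request
SCENARIO_VIEW_PERCENT = 20   # share of claims whose scenarios are viewed


def savings_usd(claims=CLAIMS):
    """Return LLM savings in whole dollars per artifact, plus the total."""
    result = {name: claims * mills // 1000 for name, mills in SAVINGS_MILLS.items()}
    result["scenarios"] = (
        claims * SCENARIO_VIEW_PERCENT * SCENARIO_COST_MILLS // (100 * 1000)
    )
    result["total"] = sum(result.values())
    return result


# Monthly infrastructure cost for 1 million claims, in USD
MONTHLY_INFRA_USD = {"postgresql": 50, "redis": 20, "s3_archives": 5}
```

For 1 million claims this gives $30,000 + $3,000 + $2,000 = $35,000 in avoided LLM cost, against $75 per month of infrastructure.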
616
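The savings figures above follow from per-claim arithmetic. The following is a minimal sketch, assuming the "× 20% views" factor means scenarios are only requested for 20% of claims. The names `SAVINGS` and `llm_savings` are illustrative and not part of the specification.

```python
# Per-claim LLM savings in USD, taken from the list above.
# "view_fraction" models the "x 20% views" factor (an assumed interpretation).
SAVINGS = {
    "analysis_summary": {"per_claim": 0.03, "view_fraction": 1.0},
    "scenarios": {"per_claim": 0.01, "view_fraction": 0.2},
    "verdict": {"per_claim": 0.003, "view_fraction": 1.0},
}


def llm_savings(claim_count: int) -> dict:
    """Return the estimated savings per cached item plus the total, in USD."""
    per_item = {
        name: round(s["per_claim"] * s["view_fraction"] * claim_count, 2)
        for name, s in SAVINGS.items()
    }
    per_item["total"] = round(sum(per_item.values()), 2)
    return per_item


savings = llm_savings(1_000_000)
for name, amount in savings.items():
    print(f"{name}: ${amount:,.0f}")
```

Changing the `view_fraction` for scenarios is the quickest way to see how sensitive the total is to that assumption.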
=== Recomputation Triggers ===

**When to mark cached data as stale and recompute:**

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:

* **Eager**: Recompute immediately (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if fewer than 1,000 are stale)

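The trigger rules above can be expressed as a lookup table. This is a sketch under stated assumptions, not the project's implementation: the trigger keys, `plan_recompute`, and the field names are hypothetical. "Scores" is taken to mean the confidence and risk scores. A system improvement is assumed to invalidate all cached fields. Triggers with no strategy named in the text default to lazy here.

```python
from enum import Enum


class Strategy(Enum):
    EAGER = "eager"  # recompute immediately
    LAZY = "lazy"    # recompute on next view
    BATCH = "batch"  # nightly re-evaluation


ALL_FIELDS = frozenset(
    {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"}
)

# Which cached fields each trigger invalidates (keys are hypothetical names).
RECOMPUTE_FIELDS = {
    "user_edit": ALL_FIELDS,
    "evidence_added": frozenset({"scenarios", "verdict", "confidence_score"}),
    "source_track_record_change": frozenset({"confidence_score", "verdict"}),
    "system_improvement": ALL_FIELDS,  # assumption: the text does not list fields
    "controversy_detected": frozenset({"risk_score"}),
}

# Only these two strategies are stated in the text; the rest is an assumption.
STRATEGY = {"user_edit": Strategy.EAGER, "system_improvement": Strategy.LAZY}

TRACK_RECORD_THRESHOLD = 10  # points, from trigger 3


def plan_recompute(trigger, track_record_delta=0.0):
    """Return (fields to recompute, strategy) for a trigger, or (empty, None)."""
    if trigger not in RECOMPUTE_FIELDS:
        raise ValueError(f"unknown trigger: {trigger}")
    if (
        trigger == "source_track_record_change"
        and abs(track_record_delta) <= TRACK_RECORD_THRESHOLD
    ):
        return frozenset(), None  # change too small to invalidate anything
    return RECOMPUTE_FIELDS[trigger], STRATEGY.get(trigger, Strategy.LAZY)
```

For example, `plan_recompute("source_track_record_change", 12)` returns the confidence score and verdict fields, while a change of 5 points returns nothing to recompute.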
=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims

* Storage: 1.8 GB
* Cost: $30/month

**Year 5**: 1M claims

* Storage: 18 GB
* Cost: $75/month

**Year 10**: 10M claims

* Storage: 180 GB
* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable, and the LLM cost savings from caching are substantial.

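The storage figures in this projection are the claim count multiplied by the 18 KB per-claim estimate. A short sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes) as the rounded figures imply; `projected_storage_gb` is an illustrative name:

```python
BYTES_PER_CLAIM = 18_000  # 18 KB per claim, without edits and flags

# Claim counts per milestone, as listed in the projection.
MILESTONES = {1: 10_000, 3: 100_000, 5: 1_000_000, 10: 10_000_000}


def projected_storage_gb(claim_count: int) -> float:
    """Estimated storage in GB for a given number of claims."""
    return claim_count * BYTES_PER_CLAIM / 1e9


for year, claims in MILESTONES.items():
    print(f"Year {year}: {claims:,} claims, {projected_storage_gb(claims):g} GB")
```

The monthly cost figures are not derived here, because the text gives them as estimates rather than as a formula.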
== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 3. What Gets Stored in the Database ==

=== 3.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:

* All evidence records
* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived

**Scenario Table**:

* All scenarios for each claim
* Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at

**Source Table**:

* Track record database (continuously updated)
* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count

**User Table**:

* All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted

**Edit Table**:

* Complete version history
* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at

**Flag Table**:

* User-reported issues
* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at

**ErrorPattern Table**:

* System improvement queue
* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at

**QualityMetric Table**:

* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

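To make the field lists concrete, here is a sketch of the Claims and Evidence records as Python dataclasses. The field names come from the lists above. The Python types, the defaults, and the `ClaimStatus` enum are assumptions, since the specification does not give column types here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


class ClaimStatus(Enum):
    PUBLISHED = "Published"
    HIDDEN = "Hidden"


@dataclass
class Claim:
    """Current state of a claim (latest version only)."""
    id: int
    assertion: str
    domain: str
    status: ClaimStatus
    confidence_score: float
    risk_score: float
    completeness_score: float
    version: int = 1
    created_at: datetime = field(default_factory=_utc_now)
    updated_at: datetime = field(default_factory=_utc_now)


@dataclass
class Evidence:
    """One evidence record, linked to a claim and a source by id."""
    id: int
    claim_id: int
    source_id: int
    excerpt: str
    url: str
    relevance_score: float
    supports: bool  # assumed boolean: True if the excerpt supports the claim
    extracted_at: datetime = field(default_factory=_utc_now)
    archived: bool = False


claim = Claim(1, "Example assertion", "science", ClaimStatus.PUBLISHED, 0.8, 0.2, 0.9)
evidence = Evidence(1, claim.id, 7, "Example excerpt", "https://example.org", 0.95, True)
```

The remaining tables follow the same pattern: one record type per table, with foreign keys held as plain ids.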
=== 3.2 What's NOT Stored (Computed on-the-fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested; per the Summary Table, the result is cached and recomputed only when stale
* **Risk scores**: Recalculated from current factors; the latest value is cached in the claims table (risk_score)
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from the Elasticsearch index

=== 3.3 Cache Layer (Redis) ===

{{warning}}
**Implementation Status:** Redis caching is **NOT YET IMPLEMENTED**. The current implementation has no caching layer.
{{/warning}}

**Cached for performance (Planned)**:

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)

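The planned TTLs can be illustrated with a small in-memory cache that stands in for Redis. This is only a sketch of the expiry behaviour, with hypothetical names (`TTL_SECONDS`, `TTLCache`, and the kind keys). It does not use a Redis client, and nothing like it exists in the current implementation.

```python
import time

# Planned TTLs from the list above, in seconds (keys are hypothetical).
TTL_SECONDS = {
    "claim": 60 * 60,         # frequently accessed claims: 1 hour
    "search": 15 * 60,        # search results: 15 minutes
    "session": 24 * 60 * 60,  # user sessions: 24 hours
    "source": 60 * 60,        # source track records: 1 hour
}


class TTLCache:
    """In-memory stand-in for the planned cache; entries expire per kind."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}

    def set(self, kind, key, value):
        expires_at = self._clock() + TTL_SECONDS[kind]
        self._entries[(kind, key)] = (value, expires_at)

    def get(self, kind, key):
        entry = self._entries.get((kind, key))
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[(kind, key)]  # drop the expired entry
            return None
        return value
```

With Redis, the same effect would normally come from setting an expiry on each key when it is written, so no manual eviction code would be needed.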
=== 3.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

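Selecting edit history for archiving reduces to an age check. The sketch below assumes "3 months" means 90 days and represents each edit as a dictionary with a `created_at` timestamp. Both choices, and the name `split_for_archive`, are illustrative.

```python
from datetime import datetime, timedelta, timezone

ARCHIVE_AFTER = timedelta(days=90)  # assumption: "3 months" treated as 90 days


def split_for_archive(edits, now):
    """Split edits into (keep in database, move to archive) by age."""
    keep, archive = [], []
    for edit in edits:
        if now - edit["created_at"] > ARCHIVE_AFTER:
            archive.append(edit)
        else:
            keep.append(edit)
    return keep, archive


now = datetime(2026, 2, 8, tzinfo=timezone.utc)
edits = [
    {"id": 1, "created_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "created_at": datetime(2025, 10, 1, tzinfo=timezone.utc)},
]
keep, archive = split_for_archive(edits, now)
```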
=== 3.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync.

== 4. Related Pages ==

* [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
* [[Requirements>>Archive.FactHarbor 2026\.02\.08.Specification.Requirements.WebHome]]
* [[Workflows>>Archive.FactHarbor 2026\.02\.08.Specification.Workflows.WebHome]]