Wiki source code of Data Model

Last modified by Robert Schaub on 2025/12/22 14:33

Robert Schaub 1.1 1 = Data Model =
Robert Schaub 1.2 2
Robert Schaub 1.1 3 FactHarbor's data model is **simple, focused, designed for automated processing**.
Robert Schaub 1.2 4
Robert Schaub 1.1 5 == 1. Core Entities ==
Robert Schaub 1.2 6
Robert Schaub 1.1 7 === 1.1 Claim ===
Robert Schaub 1.2 8
Robert Schaub 1.1 9 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
Robert Schaub 1.2 10
Robert Schaub 1.1 11 ==== Performance Optimization: Denormalized Fields ====
Robert Schaub 1.2 12
**Rationale**: The claims system is read-heavy (about 95% reads, 5% writes). Denormalizing commonly displayed data removes roughly 70% of the joins on common queries, which speeds up listing and search pages.
14 **Additional cached fields in claims table**:
Robert Schaub 1.2 15
* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers background refresh if too old
20 **Update Strategy**:
21 * **Immediate**: Update on claim edit (user-facing)
22 * **Deferred**: Update via background job every hour (non-critical)
23 * **Invalidation**: Clear cache when source data changes significantly
24 **Trade-offs**:
25 * ✅ 70% fewer joins on common queries
26 * ✅ Much faster claim list/search pages
27 * ✅ Better user experience
Robert Schaub 1.2 28 * ⚠️ Small storage increase (10%)
Robert Schaub 1.1 29 * ⚠️ Need to keep caches in sync
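
The cached fields and the hourly refresh described above can be sketched as follows. This is a minimal illustration, not the FactHarbor implementation: the function names `build_claim_cache` and `is_cache_stale`, the in-memory row shape, and the one-hour staleness limit (taken from the deferred hourly job) are assumptions.

```python
from datetime import datetime, timezone


def build_claim_cache(evidence_rows, scenario_count, now=None):
    """Build the denormalized cache fields for one claim.

    evidence_rows: list of dicts with "text", "source" and "relevance" keys
    (a hypothetical in-memory shape mirroring the evidence_summary format).
    """
    now = now or datetime.now(timezone.utc)
    # Keep the 5 most relevant snippets, highest relevance first
    top = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)[:5]
    # Unique source names, in first-seen order
    source_names = list(dict.fromkeys(e["source"] for e in evidence_rows))
    return {
        "evidence_summary": top,
        "source_names": source_names,
        "scenario_count": scenario_count,
        "cache_updated_at": now,
    }


def is_cache_stale(cache_updated_at, max_age_hours=1, now=None):
    """Return True when the cache is older than the refresh interval."""
    now = now or datetime.now(timezone.utc)
    return (now - cache_updated_at).total_seconds() > max_age_hours * 3600
```

A background job could call `is_cache_stale` on `cache_updated_at` and rebuild only the claims whose cache has aged out.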
Robert Schaub 1.2 30
Robert Schaub 1.1 31 === 1.2 Evidence ===
Robert Schaub 1.2 32
Robert Schaub 1.1 33 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
Robert Schaub 1.2 34
Robert Schaub 1.1 35 === 1.3 Source ===
Robert Schaub 1.2 36
Robert Schaub 1.1 37 **Purpose**: Track reliability of information sources over time
38 **Fields**:
Robert Schaub 1.2 39
Robert Schaub 1.1 40 * **id** (UUID): Unique identifier
41 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
42 * **domain** (text): Website domain (e.g., "nytimes.com")
43 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
44 * **track_record_score** (0-100): Overall reliability score
45 * **accuracy_history** (JSON): Historical accuracy data
46 * **correction_frequency** (float): How often source publishes corrections
47 * **last_updated** (timestamp): When track record last recalculated
48 **How It Works**:
49 * Initial score based on source type (70 for academic journals, 30 for unknown)
* Updated weekly by a background job (Sunday 2 AM UTC, see Source Scoring Process below)
51 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
52 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
53 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
54 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
55 Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
56 **Key**: Automated source reliability tracking
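
The weighting formula and quality thresholds listed above can be written out as a short sketch. It assumes every component is already expressed on a 0-100 scale, which the specification does not state explicitly; the names `WEIGHTS`, `track_record_score` and `quality_tier` are illustrative only.

```python
# Component weights from the formula above (they sum to 1.0)
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}


def track_record_score(components):
    """Weighted sum of component scores (each assumed to be 0-100)."""
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


def quality_tier(score):
    """Map a 0-100 track record score to its quality label."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
```

Under this reading, a new academic journal starting at 70 would be labelled "Reliable" and an unknown source starting at 30 would be labelled "Questionable".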
Robert Schaub 1.2 57
Robert Schaub 1.1 58 ==== Source Scoring Process (Separation of Concerns) ====
Robert Schaub 1.2 59
Robert Schaub 1.1 60 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates circular dependency and potential feedback loops
65 **The Solution**: Temporal separation
Robert Schaub 1.2 66
Robert Schaub 1.1 67 ==== Weekly Background Job (Source Scoring) ====
Robert Schaub 1.2 68
Robert Schaub 1.1 69 Runs independently of claim analysis:
{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()

    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5

        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)

        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}
71
Robert Schaub 1.1 72 ==== Real-Time Claim Analysis (AKEL) ====
Robert Schaub 1.2 73
Robert Schaub 1.1 74 Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)

    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score

        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score

    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)

    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
{{/code}}
76
Robert Schaub 1.1 77 ==== Monthly Audit (Quality Assurance) ====
Robert Schaub 1.2 78
Robert Schaub 1.1 79 Moderator review of flagged source scores:
Robert Schaub 1.2 80
Robert Schaub 1.1 81 * Verify scores make sense
82 * Detect gaming attempts
83 * Identify systematic biases
84 * Manual adjustments if needed
85 **Key Principles**:
86 ✅ **Scoring and analysis are temporally separated**
87 * Source scoring: Weekly batch job
88 * Claim analysis: Real-time processing
89 * Never update scores during analysis
90 ✅ **One-way data flow during processing**
91 * Claims READ source scores
92 * Claims NEVER WRITE source scores
93 * Updates happen in background only
94 ✅ **Predictable update cycle**
95 * Sources update every Sunday 2 AM
96 * Claims always use last week's scores
97 * No mid-week score changes
98 ✅ **Audit trail**
99 * Log all score changes
100 * Track score history
101 * Explainable calculations
102 **Benefits**:
103 * No circular dependencies
104 * Predictable behavior
105 * Easier to reason about
106 * Simpler testing
107 * Clear audit trail
108 **Example Timeline**:
109 ```
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)

Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52

Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
113 ```
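
The predictable update cycle in the timeline can be made concrete with a small helper that answers "which weekly snapshot applies at this moment?". This is a sketch under the stated schedule (Sunday 2 AM UTC); the function name `last_score_update` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def last_score_update(now):
    """Return the most recent Sunday 02:00 UTC at or before `now`.

    `now` must be timezone-aware.
    """
    now = now.astimezone(timezone.utc)
    # datetime.weekday(): Monday=0 ... Sunday=6
    days_since_sunday = (now.weekday() + 1) % 7
    candidate = (now - timedelta(days=days_since_sunday)).replace(
        hour=2, minute=0, second=0, microsecond=0
    )
    if candidate > now:
        # It is Sunday, but the 02:00 job has not run yet
        candidate -= timedelta(days=7)
    return candidate
```

Any claim analyzed between two runs resolves to the same snapshot time, which is what keeps scores stable within a week.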
Robert Schaub 1.2 114
Robert Schaub 1.1 115 === 1.4 Scenario ===
Robert Schaub 1.2 116
Robert Schaub 1.1 117 **Purpose**: Different interpretations or contexts for evaluating claims
118 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
119 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
120 **Fields**:
Robert Schaub 1.2 121
Robert Schaub 1.1 122 * **id** (UUID): Unique identifier
123 * **claim_id** (UUID): Foreign key to claim (one-to-many)
124 * **description** (text): Human-readable description of the scenario
125 * **assumptions** (JSONB): Key assumptions that define this scenario context
126 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
127 * **created_at** (timestamp): When scenario was created
128 * **updated_at** (timestamp): Last modification
129 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
130 **Example**: For claim "Vaccines reduce hospitalization":
131 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
132 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
133 * Scenario 3: "Immunocompromised patients" from specialist study
**V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
**Core Fields**:
135 * **id** (UUID): Primary key
136 * **scenario_id** (UUID FK): The scenario being assessed
137 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
138 * **confidence** (decimal 0-1): How confident we are in this assessment
139 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
140 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
141 * **created_at** (timestamp): When verdict was created
* **updated_at** (timestamp): Last modification
**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
**Example**:
143 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
144 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
145 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* Edit entity records the complete before/after change with timestamp and reason
**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator/Admin), **reputation**, contributions_count
Robert Schaub 1.2 148
==== User Reputation System ====
150
Robert Schaub 1.1 151 **V1.0 Approach**: Simple manual role assignment
152 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
Robert Schaub 1.2 153
==== Roles (Manual Assignment) ====
Robert Schaub 1.2 155
Robert Schaub 1.1 156 **reader** (default):
Robert Schaub 1.2 157
Robert Schaub 1.1 158 * View published claims and evidence
159 * Browse and search content
160 * No editing permissions
161 **contributor**:
162 * Submit new claims
163 * Suggest edits to existing content
164 * Add evidence
165 * Requires manual promotion by moderator/admin
166 **moderator**:
167 * Approve/reject contributor suggestions
168 * Flag inappropriate content
169 * Handle abuse reports
170 * Assigned by admins based on trust
171 **admin**:
172 * Manage users and roles
173 * System configuration
174 * Access to all features
175 * Founder-appointed initially
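
The role list above maps naturally onto a static permission table. The sketch below is illustrative: the action identifiers are hypothetical (the specification lists capabilities, not names), and it assumes each role also holds the permissions of the roles below it, which the text states only for admins ("access to all features").

```python
# Hypothetical action names derived from the role descriptions above
READER = {"view_published", "search"}
CONTRIBUTOR = READER | {"submit_claim", "suggest_edit", "add_evidence"}
MODERATOR = CONTRIBUTOR | {"review_suggestion", "flag_content", "handle_abuse_report"}
ADMIN = MODERATOR | {"manage_users", "configure_system"}

ROLE_PERMISSIONS = {
    "reader": READER,
    "contributor": CONTRIBUTOR,
    "moderator": MODERATOR,
    "admin": ADMIN,
}


def can(role, action):
    """Return True if `role` may perform `action`; unknown roles get no access."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

Because roles are assigned manually in V1.0, a lookup like this needs no reputation input at all.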
Robert Schaub 1.2 176
==== Contribution Tracking (Simple) ====
Robert Schaub 1.2 178
Robert Schaub 1.1 179 **Basic metrics only**:
Robert Schaub 1.2 180
Robert Schaub 1.1 181 * `contributions_count`: Total number of contributions
182 * `created_at`: Account age
183 * `last_active`: Recent activity
184 **No complex calculations**:
185 * No point systems
186 * No automated privilege escalation
187 * No reputation decay
188 * No threshold-based promotions
Robert Schaub 1.2 189
==== Promotion Process ====
Robert Schaub 1.2 191
Robert Schaub 1.1 192 **Manual review by moderators/admins**:
Robert Schaub 1.2 193
Robert Schaub 1.1 194 1. User demonstrates value through contributions
195 2. Moderator reviews user's contribution history
196 3. Moderator promotes user to contributor role
197 4. Admin promotes trusted contributors to moderator
198 **Criteria** (guidelines, not automated):
Robert Schaub 1.2 199
Robert Schaub 1.1 200 * Quality of contributions
201 * Consistency over time
202 * Collaborative behavior
203 * Understanding of project goals
Robert Schaub 1.2 204
==== V2.0+ Evolution ====
Robert Schaub 1.2 206
Robert Schaub 1.1 207 **Add complex reputation when**:
Robert Schaub 1.2 208
Robert Schaub 1.1 209 * 100+ active contributors
210 * Manual role management becomes bottleneck
211 * Clear patterns of abuse emerge requiring automation
212 **Future features may include**:
213 * Automated point calculations
214 * Threshold-based promotions
215 * Reputation decay for inactive users
216 * Track record scoring for contributors
217 See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
Robert Schaub 1.2 218
Robert Schaub 1.1 219 === 1.7 Edit ===
Robert Schaub 1.2 220
Robert Schaub 1.1 221 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
222 **Purpose**: Complete audit trail for all content changes
Robert Schaub 1.2 223
==== Edit History Details ====
Robert Schaub 1.2 225
Robert Schaub 1.1 226 **What Gets Edited**:
Robert Schaub 1.2 227
Robert Schaub 1.1 228 * **Claims** (20% edited): assertion, domain, status, scores, analysis
229 * **Evidence** (10% edited): excerpt, relevance_score, supports
230 * **Scenarios** (5% edited): description, assumptions, confidence
231 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
232 **Who Edits**:
233 * **Contributors** (rep sufficient): Corrections, additions
234 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
235 * **Moderators**: Abuse handling, dispute resolution
236 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
237 **Edit Types**:
238 * `CONTENT_CORRECTION`: User fixes factual error
239 * `CLARIFICATION`: Improved wording
240 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
241 * `MODERATION_ACTION`: Hide/unhide for abuse
242 * `REVERT`: Rollback to previous version
243 **Retention Policy** (5 years total):
Robert Schaub 1.2 244
Robert Schaub 1.1 245 1. **Hot storage** (3 months): PostgreSQL, instant access
246 2. **Warm storage** (2 years): Partitioned, slower queries
247 3. **Cold storage** (3 years): S3 compressed, download required
248 4. **Deletion**: After 5 years (except legal holds)
Robert Schaub 1.2 249 **Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
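
The retention tiers can be expressed as a function of an edit's age. The listed durations (3 months + 2 years + 3 years) add up to slightly more than the 5-year total, so this sketch makes an assumption: the hot and warm limits are cumulative, and the cold tier simply ends at the 5-year deletion point. Month and year lengths are approximated in days, and all names are hypothetical.

```python
from datetime import timedelta

HOT_LIMIT = timedelta(days=90)                 # about 3 months in PostgreSQL
WARM_LIMIT = HOT_LIMIT + timedelta(days=730)   # plus 2 years in partitions
DELETE_AFTER = timedelta(days=5 * 365)         # 5 years total


def storage_tier(age, legal_hold=False):
    """Return the storage tier for an edit record of the given age."""
    if age < HOT_LIMIT:
        return "hot"
    if age < WARM_LIMIT:
        return "warm"
    if age < DELETE_AFTER or legal_hold:
        # Records under legal hold stay in cold storage past 5 years
        return "cold"
    return "delete"
```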
Robert Schaub 1.1 250 **Use Cases**:
Robert Schaub 1.2 251
Robert Schaub 1.1 252 * View claim history timeline
253 * Detect vandalism patterns
254 * Learn from user corrections (system improvement)
255 * Legal compliance (audit trail)
256 * Rollback capability
257 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
Robert Schaub 1.2 258
Robert Schaub 1.1 259 === 1.8 Flag ===
Robert Schaub 1.2 260
Robert Schaub 1.1 261 Fields: entity_id, reported_by, issue_type, status, resolution_note
Robert Schaub 1.2 262
Robert Schaub 1.1 263 === 1.9 QualityMetric ===
Robert Schaub 1.2 264
Robert Schaub 1.1 265 **Fields**: metric_type, category, value, target, timestamp
266 **Purpose**: Time-series quality tracking
267 **Usage**:
Robert Schaub 1.2 268
Robert Schaub 1.1 269 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
270 * **Quality dashboard**: Real-time display with trend charts
271 * **Alerting**: Automatic alerts when metrics exceed thresholds
272 * **A/B testing**: Compare control vs treatment metrics
273 * **Improvement validation**: Measure before/after changes
274 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
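
As an illustration of the alerting use case, the example record above can be checked against its target. This is a sketch only: the dataclass mirrors the listed fields, while `needs_alert` and its `higher_is_worse` flag are assumptions (an error rate alerts when it rises above target, whereas a confidence score would alert when it falls below).

```python
from dataclasses import dataclass


@dataclass
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str


def needs_alert(metric, higher_is_worse=True):
    """Return True when the metric is on the wrong side of its target."""
    if higher_is_worse:
        return metric.value > metric.target
    return metric.value < metric.target


# The example record from the text: error rate 0.12 against a 0.10 target
example = QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17")
```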
Robert Schaub 1.2 275
Robert Schaub 1.1 276 === 1.10 ErrorPattern ===
Robert Schaub 1.2 277
Robert Schaub 1.1 278 **Fields**: error_category, claim_id, description, root_cause, frequency, status
279 **Purpose**: Capture errors to trigger system improvements
280 **Usage**:
Robert Schaub 1.2 281
Robert Schaub 1.1 282 * **Error capture**: When users flag issues or system detects problems
283 * **Pattern analysis**: Weekly grouping by category and frequency
284 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
285 * **Metrics**: Track error rate reduction over time
**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

=== 1.11 Core Data Model ERD ===

{{include reference="Test.FactHarbor pre12 V0\.9\.70.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

=== 1.12 User Class Diagram ===

Robert Schaub 1.5 287 {{include reference="Test.FactHarbor pre12 V0\.9\.70.Specification.Diagrams.User Class Diagram.WebHome"/}}
Robert Schaub 1.2 288
Robert Schaub 1.1 289 == 2. Versioning Strategy ==
Robert Schaub 1.2 290
Robert Schaub 1.1 291 **All Content Entities Are Versioned**:
Robert Schaub 1.2 292
Robert Schaub 1.1 293 * **Claim**: Every edit creates new version (V1→V2→V3...)
294 * **Evidence**: Changes tracked in edit history
295 * **Scenario**: Modifications versioned
296 **How Versioning Works**:
297 * Entity table stores **current state only**
298 * Edit table stores **all historical states** (before_state, after_state as JSON)
299 * Version number increments with each edit
300 * Complete audit trail maintained forever
301 **Unversioned Entities** (current state only, no history):
302 * **Source**: Track record continuously updated (not versioned history, just current score)
303 * **User**: Account state (reputation accumulated, not versioned)
304 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
305 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
306 **Example**:
307 ```
308 Claim V1: "The sky is blue" → User edits → Claim V2: "The sky is blue during daytime" → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
309 ```
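
The versioning rules above (the entity holds the current state, the Edit table holds before/after snapshots, the version number increments) can be sketched with plain dictionaries. The function `apply_edit` and the in-memory `edit_log` list are stand-ins for the real persistence layer, which is not specified here.

```python
import copy
from datetime import datetime, timezone


def apply_edit(entity, changes, edit_log, entity_type, user_id, edit_type, reason):
    """Update `entity` in place and append one Edit record to `edit_log`."""
    before = copy.deepcopy(entity)
    entity.update(changes)
    entity["version"] = before["version"] + 1
    edit_log.append({
        "entity_type": entity_type,
        "entity_id": entity["id"],
        "user_id": user_id,  # None for system (AKEL) edits
        "before_state": before,
        "after_state": copy.deepcopy(entity),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc),
    })
    return entity


claim = {"id": 1, "assertion": "The sky is blue", "version": 1}
edit_log = []
apply_edit(claim, {"assertion": "The sky is blue during daytime"},
           edit_log, "claim", 42, "CLARIFICATION", "Added time qualifier")
```

A `REVERT` edit would follow the same path, with `changes` taken from an earlier `before_state`.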
Robert Schaub 1.2 310
Robert Schaub 1.1 311 == 2.5. Storage vs Computation Strategy ==
Robert Schaub 1.2 312
Robert Schaub 1.1 313 **Critical architectural decision**: What to persist in databases vs compute dynamically?
314 **Trade-off**:
Robert Schaub 1.2 315
Robert Schaub 1.1 316 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
317 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
Robert Schaub 1.2 318
Robert Schaub 1.1 319 === Recommendation: Hybrid Approach ===
Robert Schaub 1.2 320
Robert Schaub 1.1 321 **STORE (in PostgreSQL):**
Robert Schaub 1.2 322
Robert Schaub 1.1 323 ==== Claims (Current State + History) ====
Robert Schaub 1.2 324
Robert Schaub 1.1 325 * **What**: assertion, domain, status, created_at, updated_at, version
326 * **Why**: Core entity, must be persistent
327 * **Also store**: confidence_score (computed once, then cached)
Robert Schaub 1.2 328 * **Size**: 1 KB per claim
Robert Schaub 1.1 329 * **Growth**: Linear with claims
330 * **Decision**: ✅ STORE - Essential
Robert Schaub 1.2 331
Robert Schaub 1.1 332 ==== Evidence (All Records) ====
Robert Schaub 1.2 333
Robert Schaub 1.1 334 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
335 * **Why**: Hard to re-gather, user contributions, reproducibility
Robert Schaub 1.2 336 * **Size**: 2 KB per evidence (with excerpt)
Robert Schaub 1.1 337 * **Growth**: 3-10 evidence per claim
338 * **Decision**: ✅ STORE - Essential for reproducibility
Robert Schaub 1.2 339
Robert Schaub 1.1 340 ==== Sources (Track Records) ====
Robert Schaub 1.2 341
Robert Schaub 1.1 342 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
343 * **Why**: Continuously updated, expensive to recompute
Robert Schaub 1.2 344 * **Size**: 500 bytes per source
Robert Schaub 1.1 345 * **Growth**: Slow (limited number of sources)
346 * **Decision**: ✅ STORE - Essential for quality
Robert Schaub 1.2 347
Robert Schaub 1.1 348 ==== Edit History (All Versions) ====
Robert Schaub 1.2 349
Robert Schaub 1.1 350 * **What**: before_state, after_state, user_id, reason, timestamp
351 * **Why**: Audit trail, legal requirement, reproducibility
Robert Schaub 1.2 352 * **Size**: 2 KB per edit
* **Growth**: Linear with edits (about 20% of claims get edited)
Robert Schaub 1.1 354 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
355 * **Decision**: ✅ STORE - Essential for accountability
Robert Schaub 1.2 356
Robert Schaub 1.1 357 ==== Flags (User Reports) ====
Robert Schaub 1.2 358
Robert Schaub 1.1 359 * **What**: entity_id, reported_by, issue_type, description, status
360 * **Why**: Error detection, system improvement triggers
Robert Schaub 1.2 361 * **Size**: 500 bytes per flag
* **Growth**: 5-10% of claims get flagged
363 * **Decision**: ✅ STORE - Essential for improvement
Robert Schaub 1.2 364
Robert Schaub 1.1 365 ==== ErrorPatterns (System Improvement) ====
Robert Schaub 1.2 366
Robert Schaub 1.1 367 * **What**: error_category, claim_id, description, root_cause, frequency, status
368 * **Why**: Learning loop, prevent recurring errors
Robert Schaub 1.2 369 * **Size**: 1 KB per pattern
Robert Schaub 1.1 370 * **Growth**: Slow (limited patterns, many fixed)
371 * **Decision**: ✅ STORE - Essential for learning
Robert Schaub 1.2 372
Robert Schaub 1.1 373 ==== QualityMetrics (Time Series) ====
Robert Schaub 1.2 374
Robert Schaub 1.1 375 * **What**: metric_type, category, value, target, timestamp
376 * **Why**: Trend analysis, cannot recreate historical metrics
Robert Schaub 1.2 377 * **Size**: 200 bytes per metric
Robert Schaub 1.1 378 * **Growth**: Hourly = 8,760 per year per metric type
379 * **Retention**: 2 years hot, then aggregate and archive
380 * **Decision**: ✅ STORE - Essential for monitoring
381 **STORE (Computed Once, Then Cached):**
Robert Schaub 1.2 382
Robert Schaub 1.1 383 ==== Analysis Summary ====
Robert Schaub 1.2 384
Robert Schaub 1.1 385 * **What**: Neutral text summary of claim analysis (200-500 words)
386 * **Computed**: Once by AKEL when claim first analyzed
387 * **Stored in**: Claim table (text field)
388 * **Recomputed**: Only when system significantly improves OR claim edited
389 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
Robert Schaub 1.2 390 * **Size**: 2 KB per claim
Robert Schaub 1.1 391 * **Decision**: ✅ STORE (cached) - Cost-effective
Robert Schaub 1.2 392
Robert Schaub 1.1 393 ==== Confidence Score ====
Robert Schaub 1.2 394
Robert Schaub 1.1 395 * **What**: 0-100 score of analysis confidence
396 * **Computed**: Once by AKEL
397 * **Stored in**: Claim table (integer field)
398 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
399 * **Why store**: Cheap to store, expensive to compute, users need it fast
400 * **Size**: 4 bytes per claim
401 * **Decision**: ✅ STORE (cached) - Performance critical
Robert Schaub 1.2 402
Robert Schaub 1.1 403 ==== Risk Score ====
Robert Schaub 1.2 404
Robert Schaub 1.1 405 * **What**: 0-100 score of claim risk level
406 * **Computed**: Once by AKEL
407 * **Stored in**: Claim table (integer field)
408 * **Recomputed**: When domain changes, evidence changes, or controversy detected
409 * **Why store**: Same as confidence score
410 * **Size**: 4 bytes per claim
411 * **Decision**: ✅ STORE (cached) - Performance critical
412 **COMPUTE DYNAMICALLY (Never Store):**
Robert Schaub 1.2 413
414 ==== Scenarios ====
415
416 ⚠️ CRITICAL DECISION
417
Robert Schaub 1.1 418 * **What**: 2-5 possible interpretations of claim with assumptions
419 * **Current design**: Stored in Scenario table
420 * **Alternative**: Compute on-demand when user views claim details
Robert Schaub 1.2 421 * **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
Robert Schaub 1.1 422 * **Compute cost**: $0.005-0.01 per request (LLM API call)
Robert Schaub 1.2 423 * **Frequency**: Viewed in detail by 20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
425 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
426 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
427 * **Decision**: ✅ STORE (hybrid approach below)
428 **Scenario Strategy** (APPROVED):
Robert Schaub 1.2 429
Robert Schaub 1.1 430 1. **Store scenarios** initially when claim analyzed
431 2. **Mark as stale** when system improves significantly
432 3. **Recompute on next view** if marked stale
433 4. **Cache for 30 days** if frequently accessed
434 5. **Result**: Best of both worlds - speed + freshness
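
Steps 1-3 of the scenario strategy amount to a lazy recompute-on-read pattern, sketched below. The `scenarios_stale` flag and the `recompute` callback (standing in for the LLM call) are hypothetical names; the 30-day cache from step 4 is not modeled.

```python
def mark_stale(claims):
    """Flag stored scenarios as stale, e.g. after a system improvement."""
    for claim in claims:
        claim["scenarios_stale"] = True


def get_scenarios(claim, recompute):
    """Return stored scenarios, recomputing only if missing or stale."""
    if claim.get("scenarios") is None or claim.get("scenarios_stale", False):
        claim["scenarios"] = recompute(claim)
        claim["scenarios_stale"] = False
    return claim["scenarios"]
```

Most views are served from storage; only the first view after a stale mark pays the recompute cost and delay.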
Robert Schaub 1.2 435
436 ==== Verdict Synthesis ====
437
* **What**: Final conclusion text synthesizing all scenarios
439
Robert Schaub 1.1 440 * **Compute cost**: $0.002-0.005 per request
441 * **Frequency**: Every time claim viewed
442 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
443 * **Speed**: 2-3 seconds (acceptable)
444 **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
445 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
Robert Schaub 1.2 446
Robert Schaub 1.1 447 ==== Search Results ====
Robert Schaub 1.2 448
Robert Schaub 1.1 449 * **What**: Lists of claims matching search query
450 * **Compute from**: Elasticsearch index
451 * **Cache**: 15 minutes in Redis for popular queries
452 * **Why not store permanently**: Constantly changing, infinite possible queries
Robert Schaub 1.2 453
Robert Schaub 1.1 454 ==== Aggregated Statistics ====
Robert Schaub 1.2 455
Robert Schaub 1.1 456 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
457 * **Compute from**: Database queries
458 * **Cache**: 1 hour in Redis
459 * **Why not store**: Can be derived, relatively cheap to compute
Robert Schaub 1.2 460
Robert Schaub 1.1 461 ==== User Reputation ====
Robert Schaub 1.2 462
Robert Schaub 1.1 463 * **What**: Score based on contributions
464 * **Current design**: Stored in User table
465 * **Alternative**: Compute from Edit table
* **Trade-off**:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
467 * **Frequency**: Read on every user action
468 * **Compute cost**: Simple COUNT query, milliseconds
469 * **Decision**: ✅ STORE - Performance critical, read-heavy
Robert Schaub 1.2 470
Robert Schaub 1.1 471 === Summary Table ===
Robert Schaub 1.2 472
|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1hr cache|Derivable

Robert Schaub 1.2 489 **Total storage per claim**: 18 KB (without edits and flags)
Robert Schaub 1.1 490 **For 1 million claims**:
Robert Schaub 1.2 491
492 * **Storage**: 18 GB (manageable)
493 * **PostgreSQL**: $50/month (standard instance)
494 * **Redis cache**: $20/month (1 GB instance)
495 * **S3 archives**: $5/month (old edits)
496 * **Total**: $75/month infrastructure
Robert Schaub 1.1 497 **LLM cost savings by caching**:
498 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
Robert Schaub 1.2 500 * **Total savings**: $35K per 1M claims vs recomputing every time
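
The savings figures above follow from simple arithmetic, reproduced here as a check. The per-item costs and the 20% detail-view rate come from the list; the function name `llm_savings` is illustrative.

```python
def llm_savings(claims, detail_view_rate=0.20):
    """Estimated LLM cost (USD) avoided by caching, per item and in total."""
    savings = {
        "analysis_summary": 0.03 * claims,
        "scenarios": 0.01 * detail_view_rate * claims,
        "verdict": 0.003 * claims,
    }
    savings["total"] = sum(savings.values())
    return savings


per_million = llm_savings(1_000_000)
```

For one million claims the three items come to about $30K, $2K and $3K, which is where the $35K total comes from.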
501
Robert Schaub 1.1 502 === Recomputation Triggers ===
Robert Schaub 1.2 503
Robert Schaub 1.1 504 **When to mark cached data as stale and recompute:**
Robert Schaub 1.2 505
Robert Schaub 1.1 506 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
507 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
508 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
509 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
510 5. **Controversy detected** (high flag rate) → Recompute: risk score
511 **Recomputation strategy**:
Robert Schaub 1.2 512
Robert Schaub 1.1 513 * **Eager**: Immediately recompute (for user edits)
514 * **Lazy**: Recompute on next view (for system improvements)
515 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
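
The trigger list can be read as a lookup from event to the cached fields it invalidates. The sketch below uses hypothetical trigger and field names. It also assumes that "scores" in trigger 1 means both confidence and risk score, and that a system improvement invalidates every cached field, since the text does not list fields for that case.

```python
ALL_FIELDS = {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"}


def fields_to_recompute(trigger, score_delta=0.0):
    """Return the set of cached fields to recompute for a trigger."""
    if trigger == "claim_edited":
        return set(ALL_FIELDS)
    if trigger == "evidence_added":
        return {"scenarios", "verdict", "confidence_score"}
    if trigger == "source_score_changed":
        # Only changes of more than 10 points count
        if abs(score_delta) > 10:
            return {"confidence_score", "verdict"}
        return set()
    if trigger == "system_improved":
        return set(ALL_FIELDS)  # assumption: all cached fields are affected
    if trigger == "controversy_detected":
        return {"risk_score"}
    raise ValueError(f"unknown trigger: {trigger}")
```

The eager, lazy or batch choice then decides when the returned fields are actually recomputed.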
Robert Schaub 1.2 516
Robert Schaub 1.1 517 === Database Size Projection ===
Robert Schaub 1.2 518
Robert Schaub 1.1 519 **Year 1**: 10K claims
Robert Schaub 1.2 520
Robert Schaub 1.1 521 * Storage: 180 MB
522 * Cost: $10/month
**Year 3**: 100K claims
* Storage: 1.8 GB
* Cost: $30/month
**Year 5**: 1M claims
* Storage: 18 GB
* Cost: $75/month
527 **Year 10**: 10M claims
528 * Storage: 180 GB
529 * Cost: $300/month
530 * Optimization: Archive old claims to S3 ($5/TB/month)
531 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
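
The storage figures in the projection all follow from the 18 KB per claim estimate. A short check, assuming decimal units (1 GB = 1,000,000 KB) and a hypothetical helper name:

```python
KB_PER_CLAIM = 18  # per-claim estimate from the summary table


def projected_storage_gb(claims, kb_per_claim=KB_PER_CLAIM):
    """Projected storage in GB, using decimal units (1 GB = 1,000,000 KB)."""
    return claims * kb_per_claim / 1_000_000


projection = {n: projected_storage_gb(n) for n in (10_000, 100_000, 1_000_000, 10_000_000)}
```

This gives about 0.18 GB (180 MB), 1.8 GB, 18 GB and 180 GB for years 1, 3, 5 and 10.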
Robert Schaub 1.2 532
Robert Schaub 1.1 533 == 3. Key Simplifications ==
Robert Schaub 1.2 534
Robert Schaub 1.1 535 * **Two content states only**: Published, Hidden
* **Four user roles only**: Reader, Contributor, Moderator, Admin
537 * **No complex versioning**: Linear edit history
538 * **Reputation-based permissions**: Not role hierarchy
539 * **Source track records**: Continuous evaluation
Robert Schaub 1.2 540
Robert Schaub 1.1 541 == 3. What Gets Stored in the Database ==
Robert Schaub 1.2 542
Robert Schaub 1.1 543 === 3.1 Primary Storage (PostgreSQL) ===
Robert Schaub 1.2 544
Robert Schaub 1.1 545 **Claims Table**:
Robert Schaub 1.2 546
Robert Schaub 1.1 547 * Current state only (latest version)
548 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
549 **Evidence Table**:
550 * All evidence records
551 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
552 **Scenario Table**:
553 * All scenarios for each claim
554 * Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
555 **Source Table**:
556 * Track record database (continuously updated)
557 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
558 **User Table**:
559 * All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator/Admin), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
561 **Edit Table**:
562 * Complete version history
563 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
564 **Flag Table**:
565 * User-reported issues
566 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
567 **ErrorPattern Table**:
568 * System improvement queue
569 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
570 **QualityMetric Table**:
571 * Time-series quality data
572 * Fields: id, metric_type, metric_category, value, target, timestamp
Robert Schaub 1.2 573
Robert Schaub 1.1 574 === 3.2 What's NOT Stored (Computed on-the-fly) ===
Robert Schaub 1.2 575
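* **Verdicts**: Synthesized from evidence + scenarios; only the last result is kept as a cached field and recomputed when marked stale (see section 2.5)
* **Risk scores**: Recalculated from current factors when their inputs change; only the latest value is cached on the claim (see section 2.5)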
578 * **Aggregated statistics**: Computed from base data
579 * **Search results**: Generated from Elasticsearch index
Robert Schaub 1.2 580
Robert Schaub 1.1 581 === 3.3 Cache Layer (Redis) ===
Robert Schaub 1.2 582
Robert Schaub 1.1 583 **Cached for performance**:
Robert Schaub 1.2 584
Robert Schaub 1.1 585 * Frequently accessed claims (TTL: 1 hour)
586 * Search results (TTL: 15 minutes)
587 * User sessions (TTL: 24 hours)
588 * Source track records (TTL: 1 hour)
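
The TTL values above can live in a single configuration table, sketched below. The key prefixes and helper names are assumptions. The resulting seconds could be used, for instance, as the expiry of a Redis `SET ... EX` command.

```python
from datetime import timedelta

# TTLs from the list above; the key names are hypothetical
CACHE_TTL = {
    "claim": timedelta(hours=1),
    "search": timedelta(minutes=15),
    "session": timedelta(hours=24),
    "source": timedelta(hours=1),
}


def ttl_seconds(kind):
    """TTL for a cache entry kind, in whole seconds."""
    return int(CACHE_TTL[kind].total_seconds())


def cache_key(kind, identifier):
    """Build a namespaced cache key such as 'claim:42'."""
    return f"{kind}:{identifier}"
```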
Robert Schaub 1.2 589
Robert Schaub 1.1 590 === 3.4 File Storage (S3) ===
Robert Schaub 1.2 591
Robert Schaub 1.1 592 **Archived content**:
Robert Schaub 1.2 593
Robert Schaub 1.1 594 * Old edit history (>3 months)
595 * Evidence documents (archived copies)
596 * Database backups
597 * Export files
Robert Schaub 1.2 598
Robert Schaub 1.1 599 === 3.5 Search Index (Elasticsearch) ===
Robert Schaub 1.2 600
Robert Schaub 1.1 601 **Indexed for search**:
Robert Schaub 1.2 602
Robert Schaub 1.1 603 * Claim assertions (full-text)
604 * Evidence excerpts (full-text)
605 * Scenario descriptions (full-text)
606 * Source names (autocomplete)
607 Synchronized from PostgreSQL via change data capture or periodic sync.
Robert Schaub 1.2 608
Robert Schaub 1.1 609 == 4. Related Pages ==
Robert Schaub 1.2 610
611 * [[Architecture>>Test.FactHarbor pre12 V0\.9\.70.Specification.Architecture.WebHome]]
Robert Schaub 1.6 612 * [[Requirements>>Test.FactHarbor pre12 V0\.9\.70.Specification.Requirements.WebHome]]
Robert Schaub 1.7 613 * [[Workflows>>Test.FactHarbor pre12 V0\.9\.70.Specification.Workflows.WebHome]]