= Data Model =

FactHarbor's data model is **simple, focused, and designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.

**Additional cached fields in the claims table**:

* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
* Avoids joining the evidence table for listing/preview
* Updated when evidence is added/removed
* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
* Avoids joining through evidence to sources
* Updated when sources change
* Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
* Quick metric without counting rows
* Updated when scenarios are added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
* Helps invalidate stale caches
* Triggers a background refresh if too old

**Update Strategy** (a refresh sketch follows the trade-offs list below):
* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)
* **Invalidation**: Clear the cache when source data changes significantly

**Trade-offs**:
* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (10%)
* ⚠️ Caches must be kept in sync
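
The sketch below illustrates the deferred refresh path described above. It is a minimal, illustrative example only: it assumes psycopg2 and the table names //claims//, //evidence//, //sources//, and //scenarios//, which are not fixed by this specification.

{{code language="python"}}
# Illustrative sketch of the deferred cache refresh (table and column names are assumptions).
from datetime import datetime, timezone

import psycopg2
from psycopg2.extras import Json

def refresh_claim_cache(conn, claim_id):
    """Rebuild the denormalized fields of one claim from its source rows."""
    with conn.cursor() as cur:
        # Top 5 evidence snippets by relevance, plus their source names
        cur.execute(
            """SELECT e.excerpt, s.name, e.relevance_score
                 FROM evidence e JOIN sources s ON s.id = e.source_id
                WHERE e.claim_id = %s
                ORDER BY e.relevance_score DESC
                LIMIT 5""",
            (claim_id,),
        )
        rows = cur.fetchall()
        summary = [{"text": t, "source": n, "relevance": r} for t, n, r in rows]
        names = sorted({n for _, n, _ in rows})

        cur.execute("SELECT count(*) FROM scenarios WHERE claim_id = %s", (claim_id,))
        scenario_count = cur.fetchone()[0]

        cur.execute(
            """UPDATE claims
                  SET evidence_summary = %s, source_names = %s,
                      scenario_count = %s, cache_updated_at = %s
                WHERE id = %s""",
            (Json(summary), names, scenario_count, datetime.now(timezone.utc), claim_id),
        )
    conn.commit()
{{/code}}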

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track the reliability of information sources over time

**Fields**:

* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")
* **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
* **track_record_score** (0-100): Overall reliability score
* **accuracy_history** (JSON): Historical accuracy data
* **correction_frequency** (float): How often the source publishes corrections
* **last_updated** (timestamp): When the track record was last recalculated

**How It Works** (a scoring sketch follows below):
* Initial score is based on source type (70 for academic journals, 30 for unknown)
* Updated daily by a background scheduler
* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
* Track Record Check in the AKEL pipeline: adjusts evidence confidence based on source quality
* Quality thresholds: 90+ = Exceptional, 70-89 = Reliable, 50-69 = Acceptable, 30-49 = Questionable, <30 = Unreliable

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage.
Fields (summary): id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
**Key**: Automated source reliability tracking
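
A minimal sketch of the weighted formula and quality thresholds above. The component inputs (each assumed to be normalized to a 0-100 scale) and the example values are illustrative, not part of the specification.

{{code language="python"}}
# Illustrative only: combines the weighted components listed above.
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """components: dict of 0-100 values keyed as in WEIGHTS."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)

def quality_band(score):
    """Map a 0-100 score to the thresholds listed above."""
    if score >= 90: return "Exceptional"
    if score >= 70: return "Reliable"
    if score >= 50: return "Acceptable"
    if score >= 30: return "Questionable"
    return "Unreliable"

# Example: a well-established outlet
components = {"accuracy_rate": 92, "correction_policy": 85, "editorial_standards": 80,
              "bias_transparency": 70, "longevity": 95}
score = track_record_score(components)   # 86.75
print(score, quality_band(score))        # 86.75 Reliable
{{/code}}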

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.

**The Problem**:

* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But direct feedback creates a circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:
{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability
    Never triggered by individual claim analysis
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores
    READ source scores, never UPDATE them
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
{{/code}}

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:

* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases
* Make manual adjustments if needed

**Key Principles**:

✅ **Scoring and analysis are temporally separated**
* Source scoring: Weekly batch job
* Claim analysis: Real-time processing
* Never update scores during analysis

✅ **One-way data flow during processing**
* Claims READ source scores
* Claims NEVER WRITE source scores
* Updates happen in the background only

✅ **Predictable update cycle**
* Sources update every Sunday at 2 AM UTC
* Claims always use last week's scores
* No mid-week score changes

✅ **Audit trail**
* Log all score changes
* Track score history
* Explainable calculations

**Benefits**:

* No circular dependencies
* Predictable behavior
* Easier to reason about
* Simpler testing
* Clear audit trail

**Example Timeline**:

{{code}}
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
{{/code}}

=== 1.4 Scenario ===

**Purpose**: Different interpretations or contexts for evaluating claims

**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.

**Relationship**: One-to-many with Claims (**simplified for V1.0**: a scenario belongs to a single claim)

**Fields**:

* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to the evidence this scenario was extracted from
* **created_at** (timestamp): When the scenario was created
* **updated_at** (timestamp): Last modification

**How Found**: Evidence search → Extract context → Create scenario → Link to claim

**Example** (a data sketch follows the list):
For the claim "Vaccines reduce hospitalization":
* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from a trial paper
* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from a specialist study
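
A minimal data sketch of Scenario 1 from the example above, using a plain dataclass with the fields listed in this section; the concrete assumption keys and UUIDs are illustrative, not prescribed.

{{code language="python"}}
# Illustrative sketch; the assumption keys and UUID values are examples only.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4

@dataclass
class Scenario:
    claim_id: UUID                      # FK to the claim being evaluated
    description: str                    # human-readable context
    assumptions: dict                   # JSONB: key assumptions defining the context
    extracted_from: UUID                # FK to the evidence it was extracted from
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Scenario 1 for "Vaccines reduce hospitalization"
scenario_1 = Scenario(
    claim_id=uuid4(),                   # would be the real claim id
    description="Clinical trials (healthy adults 18-65, original strain)",
    assumptions={"population": "healthy adults 18-65",
                 "variant": "original strain",
                 "setting": "randomized controlled trial"},
    extracted_from=uuid4(),             # would be the trial-paper evidence id
)
{{/code}}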
**V2.0 Evolution**: A many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:

* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When the verdict was created
* **updated_at** (timestamp): Last modification

**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.

**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.

**Example** (an update sketch follows below):
For the claim "Exercise improves mental health" in the scenario "Clinical trials (healthy adults, structured programs)":

* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* The Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
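
A minimal sketch of the update-and-log flow described above. The dict-based records and the `save_edit` helper are assumptions for illustration, not the production API.

{{code language="python"}}
# Illustrative sketch; record shapes and the save_edit helper are assumptions.
import json
from datetime import datetime, timezone

def update_verdict(verdict, changes, user_id, reason, edit_type, save_edit):
    """Apply changes to a verdict dict and log before/after states via the Edit entity."""
    before = dict(verdict)                       # snapshot of the current state
    verdict.update(changes)
    verdict["updated_at"] = datetime.now(timezone.utc).isoformat()
    save_edit({
        "entity_type": "Verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,                      # None for system (AKEL) edits
        "before_state": json.dumps(before),
        "after_state": json.dumps(verdict),
        "edit_type": edit_type,                  # e.g. SYSTEM_REANALYSIS (see section 1.7)
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return verdict

# Example: the evidence update from the bullets above
# update_verdict(v, {"likelihood_range": "0.70-0.85 (likely true)",
#                    "uncertainty_factors": ["Lifestyle confounds remain"]},
#                user_id=None, reason="New evidence added",
#                edit_type="SYSTEM_REANALYSIS", save_edit=print)
{{/code}}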

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

=== User Reputation System ===

**V1.0 Approach**: Simple manual role assignment

**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove it necessary.

=== Roles (Manual Assignment) ===

**reader** (default):

* View published claims and evidence
* Browse and search content
* No editing permissions

**contributor**:
* Submit new claims
* Suggest edits to existing content
* Add evidence
* Requires manual promotion by a moderator/admin

**moderator**:
* Approve/reject contributor suggestions
* Flag inappropriate content
* Handle abuse reports
* Assigned by admins based on trust

**admin**:
* Manage users and roles
* System configuration
* Access to all features
* Founder-appointed initially

=== Contribution Tracking (Simple) ===

**Basic metrics only**:

* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

**No complex calculations**:
* No point systems
* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

=== Promotion Process ===

**Manual review by moderators/admins**:

1. User demonstrates value through contributions
2. Moderator reviews the user's contribution history
3. Moderator promotes the user to the contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):

* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

=== V2.0+ Evolution ===

**Add complex reputation when**:

* 100+ active contributors
* Manual role management becomes a bottleneck
* Clear patterns of abuse emerge requiring automation

**Future features may include**:
* Automated point calculations
* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes

=== Edit History Details ===

**What Gets Edited**:

* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence
* **Sources**: NOT versioned (continuous updates, not editorial decisions)

**Who Edits**:
* **Contributors** (sufficient reputation): Corrections, additions
* **Trusted Contributors** (sufficient reputation): Major improvements, approvals
* **Moderators**: Abuse handling, dispute resolution
* **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)

**Edit Types**:
* `CONTENT_CORRECTION`: User fixes a factual error
* `CLARIFICATION`: Improved wording
* `SYSTEM_REANALYSIS`: AKEL re-processed the claim
* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to a previous version

**Retention Policy** (5 years total; an archival sketch appears at the end of this section):

1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)

**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)

**Use Cases**:

* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, the retention policy, and use cases.
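
A minimal sketch of the hot → warm → cold rotation above, assuming psycopg2 and boto3; the table name //edits//, the bucket name, and the ~2.25-year cutoff (3 months hot + 2 years warm) are assumptions, not part of the specification.

{{code language="python"}}
# Illustrative archival job for the retention tiers above (names are assumptions).
import gzip
import json
from datetime import datetime, timedelta, timezone

import boto3
import psycopg2

def archive_cold_edits(conn, bucket="factharbor-edit-archive"):
    """Move edits older than ~3 months hot + 2 years warm from PostgreSQL to S3 (compressed)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=820)
    with conn.cursor() as cur:
        cur.execute("SELECT row_to_json(e) FROM edits e WHERE created_at < %s", (cutoff,))
        rows = [r[0] for r in cur.fetchall()]
        if not rows:
            return 0
        body = gzip.compress(json.dumps(rows, default=str).encode())
        key = f"edits/{cutoff.date()}.json.gz"
        boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
        cur.execute("DELETE FROM edits WHERE created_at < %s", (cutoff,))
    conn.commit()
    return len(rows)
{{/code}}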

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking

**Usage** (a recording sketch follows below):

* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs. treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
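
A minimal sketch of hourly metric recording with threshold alerting. The `db.insert` wrapper and `notify` callback are assumptions; it also assumes "higher than target" is the alert condition, as in the ErrorRate example above.

{{code language="python"}}
# Illustrative sketch; db.insert and notify are assumed helpers.
from datetime import datetime, timezone

def record_metric(db, notify, metric_type, category, value, target):
    """Insert one QualityMetric row and alert when the value breaches its target."""
    db.insert("quality_metrics", {
        "metric_type": metric_type,
        "category": category,
        "value": value,
        "target": target,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    # For error-rate style metrics, a value above target is a breach.
    if value > target:
        notify(f"{metric_type}/{category}: {value:.2f} exceeds target {target:.2f}")

# The ErrorRate example above (0.12 vs target 0.10) would trigger an alert.
{{/code}}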

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements

**Usage**:

* **Error capture**: When users flag issues or the system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error-rate reduction over time

**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.11 Core Data Model ERD ==

{{include reference="Test.FactHarbor V0\.9\.100.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.12 User Class Diagram ==

{{include reference="Test.FactHarbor V0\.9\.100.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:

* **Claim**: Every edit creates a new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

**How Versioning Works**:
* The entity table stores the **current state only**
* The Edit table stores **all historical states** (before_state, after_state as JSON)
* The version number increments with each edit
* Complete audit trail maintained forever

**Unversioned Entities** (current state only, no history):
* **Source**: Track record continuously updated (no versioned history, just the current score)
* **User**: Account state (reputation accumulated, not versioned)
* **QualityMetric**: Time-series data (each record is a point in time, not a version)
* **ErrorPattern**: System improvement queue (status tracked, not versioned)

**Example**:

{{code}}
Claim V1: "The sky is blue"
  → User edits →
Claim V2: "The sky is blue during daytime"
  → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
{{/code}}

== 2.5 Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs. compute dynamically?

**Trade-off**:

* **Store more**: Better reproducibility, faster, lower LLM costs | higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: 1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: 2 KB per evidence record (with excerpt)
* **Growth**: 3-10 evidence records per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: 500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: 2 KB per edit
* **Growth**: Linear with edits (about 20% of claims get edited)
* **Retention**: Hot storage 3 months → warm storage 2 years → archive to S3 3 years → delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: 500 bytes per flag
* **Growth**: A small percentage of claims get flagged (the summary table assumes 10%)
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: 1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: 200 bytes per metric
* **Growth**: Hourly = 8,760 records per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of the claim analysis (200-500 words)
* **Computed**: Once by AKEL when the claim is first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when the system significantly improves OR the claim is edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: 2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When evidence is added, the source track record changes significantly, or the system improves
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When the domain changes, evidence changes, or controversy is detected
* **Why store**: Same as confidence score
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios ====

⚠️ CRITICAL DECISION

* **What**: 2-5 possible interpretations of a claim, with assumptions
* **Current design**: Stored in the Scenario table
* **Alternative**: Compute on demand when a user views claim details
* **Storage cost**: 1 KB per scenario × 3 scenarios on average = 3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by 20% of users
* **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 second delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)

**Scenario Strategy** (APPROVED; read-path sketch below):

1. **Store scenarios** initially when the claim is analyzed
2. **Mark as stale** when the system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness
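
A minimal sketch of the approved read path: return stored scenarios unless they are marked stale (or the 30-day window has passed), in which case regenerate and persist them. The loader/regenerator/saver helpers are assumptions.

{{code language="python"}}
# Illustrative sketch; loader/regenerator/saver are assumed helpers.
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)

def get_scenarios(claim_id, load_scenarios, regenerate_scenarios, save_scenarios):
    """Return stored scenarios, regenerating them only when stale or expired."""
    scenarios, is_stale, cached_at = load_scenarios(claim_id)
    expired = cached_at is None or datetime.now(timezone.utc) - cached_at > CACHE_TTL
    if is_stale or expired:
        scenarios = regenerate_scenarios(claim_id)   # AKEL / LLM call
        save_scenarios(claim_id, scenarios)          # persist fresh versions, clear stale flag
    return scenarios
{{/code}}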

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time a claim is viewed
* **Why not store**: Changes as evidence/scenarios change; users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)

**Alternative**: Store the "last verdict" as a cached field, recompute only if the claim is edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching a search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in the User table
* **Alternative**: Compute from the Edit table
* **Trade-off**:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15 min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1 hr cache|Derivable

**Total storage per claim**: 18 KB (without edits and flags)

**For 1 million claims**:

* **Storage**: 18 GB (manageable)
* **PostgreSQL**: $50/month (standard instance)
* **Redis cache**: $20/month (1 GB instance)
* **S3 archives**: $5/month (old edits)
* **Total**: $75/month infrastructure

**LLM cost savings by caching**:
* Analysis summary stored: saves $0.03 per claim = $30K per 1M claims
* Scenarios stored: saves $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: saves $0.003 per claim = $3K per 1M claims
* **Total savings**: $35K per 1M claims vs. recomputing every time

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute** (see the sketch after this list):

1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:

* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)
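
A minimal sketch mapping the triggers above to the eager/lazy strategies. The event names and the `mark_stale`/`recompute_now` helpers are assumptions, not a fixed API.

{{code language="python"}}
# Illustrative sketch of the trigger table above (event names and helpers are assumptions).
STALE_TARGETS = {
    "claim_edited":         {"analysis", "scenarios", "verdict", "confidence", "risk"},
    "evidence_added":       {"scenarios", "verdict", "confidence"},
    "source_score_shift":   {"confidence", "verdict"},   # track record moved >10 points
    "system_improvement":   {"analysis", "scenarios", "verdict"},
    "controversy_detected": {"risk"},
}

def handle_event(event, claim_id, mark_stale, recompute_now):
    """Eager recomputation for user edits, lazy (mark stale) for everything else."""
    targets = STALE_TARGETS.get(event, set())
    if event == "claim_edited":
        recompute_now(claim_id, targets)      # eager path
    else:
        mark_stale(claim_id, targets)         # lazy: recomputed on next view or nightly batch
{{/code}}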

=== Database Size Projection ===

**Year 1**: 10K claims

* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims
* Storage: 1.8 GB
* Cost: $30/month

**Year 5**: 1M claims
* Storage: 18 GB
* Cost: $75/month

**Year 10**: 10M claims
* Storage: 180 GB
* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable; LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 4. What Gets Stored in the Database ==

=== 4.1 Primary Storage (PostgreSQL) ===

**Claims Table**:

* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:
* All evidence records
* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived

**Scenario Table**:
* All scenarios for each claim
* Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at

**Source Table**:
* Track record database (continuously updated)
* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count

**User Table**:
* All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted

**Edit Table**:
* Complete version history
* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at

**Flag Table**:
* User-reported issues
* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at

**ErrorPattern Table**:
* System improvement queue
* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at

**QualityMetric Table**:
* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp

=== 4.2 What's NOT Stored (Computed on-the-fly) ===

* **Verdicts**: Synthesized from evidence + scenarios when requested (a cached copy is kept and marked stale; see section 2.5)
* **Risk scores**: Recalculated when recomputation triggers fire (cached in the Claim table; see section 2.5)
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from the Elasticsearch index

=== 4.3 Cache Layer (Redis) ===

**Cached for performance** (a sketch follows below):

* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)
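
A minimal sketch of the TTLs above using redis-py; the key naming scheme and the `loader` callback are assumptions.

{{code language="python"}}
# Illustrative sketch of the Redis TTLs above (key names are assumptions).
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)

TTLS = {
    "claim": 3600,        # frequently accessed claims: 1 hour
    "search": 900,        # search results: 15 minutes
    "session": 86400,     # user sessions: 24 hours
    "source": 3600,       # source track records: 1 hour
}

def cache_get_or_load(kind, key, loader):
    """Return a cached JSON value, loading and caching it on a miss."""
    redis_key = f"{kind}:{key}"
    cached = r.get(redis_key)
    if cached is not None:
        return json.loads(cached)
    value = loader(key)                      # e.g., read from PostgreSQL
    r.setex(redis_key, TTLS[kind], json.dumps(value, default=str))
    return value
{{/code}}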

=== 4.4 File Storage (S3) ===

**Archived content**:

* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files

=== 4.5 Search Index (Elasticsearch) ===

**Indexed for search**:

* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)

Synchronized from PostgreSQL via change data capture or periodic sync (a sync sketch follows below).
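
A minimal sketch of the periodic-sync option, assuming psycopg2 and the official Elasticsearch Python client; the index and column names are illustrative, and change data capture would replace this approach in production.

{{code language="python"}}
# Illustrative periodic sync from PostgreSQL to Elasticsearch (names are assumptions).
import psycopg2
from elasticsearch import Elasticsearch, helpers

def sync_claims(pg_dsn, es_url, since):
    """Re-index claims changed since the last sync timestamp."""
    es = Elasticsearch(es_url)
    with psycopg2.connect(pg_dsn) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT id, assertion, domain, updated_at FROM claims WHERE updated_at > %s",
            (since,),
        )
        actions = (
            {
                "_index": "claims",
                "_id": str(claim_id),
                "_source": {"assertion": assertion, "domain": domain,
                            "updated_at": updated_at.isoformat()},
            }
            for claim_id, assertion, domain, updated_at in cur
        )
        helpers.bulk(es, actions)
{{/code}}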

== 5. Related Pages ==

* [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor V0\.9\.100.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor V0\.9\.100.Specification.Workflows.WebHome]]