Wiki source code of Data Model

Last modified by Robert Schaub on 2026/02/08 08:32

1 = Data Model =
2
3 {{warning}}
4 **Implementation Status (Updated January 2026)**
5
6 This specification describes the **target** normalized data model. Current implementation (v2.6.33) differs significantly:
7
8 * **Storage**: All data stored as **JSON blobs in SQLite**, not normalized PostgreSQL tables
9 * **Scenarios**: **Replaced with KeyFactors** - decomposition questions, not separate entities
10 * **Caching**: Redis cache **not implemented**; no claim caching yet
11 * **Source Scoring**: Uses external MBFC bundle, not internal track record calculation
12 * **User System**: Not implemented (no authentication in current version)
13
14 This specification remains valuable as the target architecture for future versions.
15
16 See `Docs/STATUS/Documentation_Inconsistencies.md` for full comparison.
17 {{/warning}}
18
19 FactHarbor's data model is **simple, focused, and designed for automated processing**.
20
21 == 1. Core Entities ==
22
23 === 1.1 Claim ===
24
25 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
26
27 ==== Performance Optimization: Denormalized Fields ====
28
29 **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
30 **Additional cached fields in claims table**:
31
32 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
33 ** Avoids joining the evidence table for listing/preview
34 ** Updated when evidence is added/removed
35 ** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
36 * **source_names** (TEXT[]): Array of source names for quick display
37 ** Avoids joining through evidence to sources
38 ** Updated when sources change
39 ** Format: `["New York Times", "Nature Journal", ...]`
40 * **scenario_count** (INTEGER): Number of scenarios for this claim
41 ** Quick metric without counting rows
42 ** Updated when scenarios are added/removed
43 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
44 ** Helps invalidate stale caches
45 ** Triggers a background refresh if too old
46 **Update Strategy**:
47 * **Immediate**: Update on claim edit (user-facing)
48 * **Deferred**: Update via background job every hour (non-critical)
49 * **Invalidation**: Clear cache when source data changes significantly
50 **Trade-offs**:
51 * ✅ 70% fewer joins on common queries
52 * ✅ Much faster claim list/search pages
53 * ✅ Better user experience
54 * ⚠️ Small storage increase (10%)
55 * ⚠️ Need to keep caches in sync
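As an illustration, the refresh logic behind these denormalized fields could look like the sketch below. All names here (`refresh_claim_cache`, the dict shapes) are hypothetical stand-ins, not part of the schema:

```python
from datetime import datetime, timezone

def refresh_claim_cache(claim, evidence_rows, max_snippets=5):
    """Rebuild the denormalized fields on a claim row from its evidence.

    claim: dict standing in for a claims-table row (hypothetical shape).
    evidence_rows: list of dicts with 'text', 'source', 'relevance'.
    """
    # Top-N most relevant snippets, mirroring the evidence_summary format above
    top = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)[:max_snippets]
    claim["evidence_summary"] = [
        {"text": e["text"], "source": e["source"], "relevance": e["relevance"]}
        for e in top
    ]
    # Distinct source names for quick display, first-seen order preserved
    claim["source_names"] = list(dict.fromkeys(e["source"] for e in evidence_rows))
    claim["scenario_count"] = len(claim.get("scenarios", []))
    claim["cache_updated_at"] = datetime.now(timezone.utc)
    return claim
```

In this sketch the function would be called both on the immediate path (claim edit) and from the hourly background job; only the trigger differs.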
56
57 === 1.2 Evidence ===
58
59 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
60
61 === 1.3 Source ===
62
63 **Purpose**: Track reliability of information sources over time
64 **Fields**:
65
66 * **id** (UUID): Unique identifier
67 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
68 * **domain** (text): Website domain (e.g., "nytimes.com")
69 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
70 * **track_record_score** (0-100): Overall reliability score
71 * **accuracy_history** (JSON): Historical accuracy data
72 * **correction_frequency** (float): How often source publishes corrections
73 * **last_updated** (timestamp): When track record last recalculated
74 **How It Works**:
75 * Initial score based on source type (70 for academic journals, 30 for unknown)
76 * Updated daily by background scheduler
77 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
78 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
79 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
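The weighting formula and quality thresholds above can be expressed directly. This is a sketch assuming every component is reported on the same 0-100 scale; the function names are illustrative, not the actual API:

```python
# Component weights from the formula above (sum to 1.0)
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """Weighted 0-100 score; each component is assumed to be 0-100."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

def quality_band(score):
    """Map a score to the quality thresholds listed above."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
```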
80
81 {{info}}
82 **Current Implementation (v2.6.33):** Source reliability uses external **MBFC (Media Bias/Fact Check) bundle** instead of internal track record calculation. Scores are loaded from a configurable JSON file. See `Docs/ARCHITECTURE/Source_Reliability.md`.
83 {{/info}}
84
85 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
87 **Key**: Automated source reliability tracking
88
89 ==== Source Scoring Process (Separation of Concerns) ====
90
91 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
92 **The Problem**:
93
94 * Source scores should influence claim verdicts
95 * Claim verdicts should update source scores
96 * But: Direct feedback creates circular dependency and potential feedback loops
97 **The Solution**: Temporal separation
98
99 ==== Weekly Background Job (Source Scoring) ====
100
101 Runs independently of claim analysis:
102 {{code language="python"}}def update_source_scores_weekly():
103     """
104     Background job: calculate source reliability.
105     Never triggered by individual claim analysis.
106     """
107     # Analyze all claims from the past week
108     claims = get_claims_from_past_week()
109     for source in get_all_sources():
110         # Calculate accuracy metrics
111         correct_verdicts = count_correct_verdicts_citing(source, claims)
112         total_citations = count_total_citations(source, claims)
113         accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
114         # Weight the accuracy by claim importance
115         weighted_score = calculate_weighted_score(source, accuracy, claims)
116         # Update the source record
117         source.track_record_score = weighted_score
118         source.total_citations = total_citations
119         source.last_updated = now()
120         source.save()
121 # Job runs: Sunday 2 AM UTC
122 # Never during claim processing{{/code}}
123
124 ==== Real-Time Claim Analysis (AKEL) ====
125
126 Uses source scores but never updates them:
127 {{code language="python"}}def analyze_claim(claim_text):
128     """
129     Real-time: analyze a claim using current source scores.
130     READ source scores, never UPDATE them.
131     """
132     # Gather evidence
133     evidence_list = gather_evidence(claim_text)
134     for evidence in evidence_list:
135         # READ source score (snapshot from last weekly update)
136         source = get_source(evidence.source_id)
137         source_score = source.track_record_score  # 0-100 scale
138         # Use the normalized score to weight evidence
139         evidence.weighted_relevance = evidence.relevance * (source_score / 100)
140     # Generate verdict using weighted evidence
141     verdict = synthesize_verdict(evidence_list)
142     # NEVER update source scores here
143     # That happens in the weekly background job
144     return verdict{{/code}}
145
146 ==== Monthly Audit (Quality Assurance) ====
147
148 Moderator review of flagged source scores:
149
150 * Verify scores make sense
151 * Detect gaming attempts
152 * Identify systematic biases
153 * Manual adjustments if needed
154 **Key Principles**:
155 ✅ **Scoring and analysis are temporally separated**
156 * Source scoring: Weekly batch job
157 * Claim analysis: Real-time processing
158 * Never update scores during analysis
159 ✅ **One-way data flow during processing**
160 * Claims READ source scores
161 * Claims NEVER WRITE source scores
162 * Updates happen in background only
163 ✅ **Predictable update cycle**
164 * Sources update every Sunday 2 AM
165 * Claims always use last week's scores
166 * No mid-week score changes
167 ✅ **Audit trail**
168 * Log all score changes
169 * Track score history
170 * Explainable calculations
171 **Benefits**:
172 * No circular dependencies
173 * Predictable behavior
174 * Easier to reason about
175 * Simpler testing
176 * Clear audit trail
177 **Example Timeline**:
178 ```
179 Sunday 2 AM: Calculate source scores for past week
180 → NYT score: 87 (up from 85)
181 → Blog X score: 52 (down from 61)
182 Monday-Saturday: Claims processed using these scores
183 → All claims this week use NYT=87
184 → All claims this week use Blog X=52
185 Next Sunday 2 AM: Recalculate scores including this week's claims
186 → NYT score: 89 (trending up)
187 → Blog X score: 48 (trending down)
188 ```
189
190 === 1.4 Scenario ===
191
192 {{warning}}
193 **Implementation Change:** Scenarios were **replaced with KeyFactors** in the current implementation. KeyFactors are decomposition questions discovered during the understanding phase, not separate stored entities. See `Docs/ARCHITECTURE/KeyFactors_Design.md` for the design rationale.
194 {{/warning}}
195
196 **Purpose**: Different interpretations or contexts for evaluating claims
197 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
198 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
199 **Fields**:
200
201 * **id** (UUID): Unique identifier
202 * **claim_id** (UUID): Foreign key to claim (one-to-many)
203 * **description** (text): Human-readable description of the scenario
204 * **assumptions** (JSONB): Key assumptions that define this scenario context
205 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
206 * **created_at** (timestamp): When scenario was created
207 * **updated_at** (timestamp): Last modification
208 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
209 **Example**:
210 For claim "Vaccines reduce hospitalization":
211 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
212 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
213 * Scenario 3: "Immunocompromised patients" from specialist study
214 **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
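The "Evidence search → Extract context → Create scenario → Link to claim" flow might be modeled as below. This is a hypothetical sketch using a dataclass stand-in; the real system would persist these rows in the Scenario table:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4

@dataclass
class Scenario:
    claim_id: UUID            # one-to-many: scenario belongs to a single claim (V1.0)
    description: str          # human-readable description
    assumptions: dict         # JSONB in the schema above
    extracted_from: UUID      # evidence this scenario was extracted from
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def scenario_from_evidence(claim_id, evidence_id, context):
    """Evidence search -> extract context -> create scenario -> link to claim."""
    return Scenario(
        claim_id=claim_id,
        description=context["description"],
        assumptions=context.get("assumptions", {}),
        extracted_from=evidence_id,
    )
```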
215
216 === 1.5 Verdict ===
217
218 **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
219
220 **Core Fields**:
221
222 * **id** (UUID): Primary key
223 * **scenario_id** (UUID FK): The scenario being assessed
224 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
225 * **confidence** (decimal 0-1): How confident we are in this assessment
226 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
227 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
228 * **created_at** (timestamp): When verdict was created
229 * **updated_at** (timestamp): Last modification
230
231 **Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
232
233 **Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
234
235 **Example**:
236 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
237
238 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
239 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
240 * Edit entity records the complete before/after change with timestamp and reason
241
242 **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
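A sketch of how a verdict update could be logged through the Edit entity. Dicts stand in for database rows here; `update_verdict` and the exact log shape are illustrative, not the actual API:

```python
import json
from datetime import datetime, timezone

def update_verdict(verdict, changes, user_id, reason, edit_log):
    """Apply changes to a verdict and record before/after states in the Edit log."""
    before = json.dumps(verdict, sort_keys=True)
    verdict.update(changes)
    edit_log.append({
        "entity_type": "Verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,          # None for system (AKEL) edits, per section 1.7
        "before_state": before,
        "after_state": json.dumps(verdict, sort_keys=True),
        "edit_type": "CONTENT_CORRECTION",
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return verdict
```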
243
244 === 1.6 User ===
245
246 Fields: username, email, **role** (Reader/Contributor/Moderator/Admin), **reputation**, contributions_count
247
248 === User Reputation System ===
249
250 **V1.0 Approach**: Simple manual role assignment
251 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
252
253 === Roles (Manual Assignment) ===
254
255 **reader** (default):
256
257 * View published claims and evidence
258 * Browse and search content
259 * No editing permissions
260 **contributor**:
261 * Submit new claims
262 * Suggest edits to existing content
263 * Add evidence
264 * Requires manual promotion by moderator/admin
265 **moderator**:
266 * Approve/reject contributor suggestions
267 * Flag inappropriate content
268 * Handle abuse reports
269 * Assigned by admins based on trust
270 **admin**:
271 * Manage users and roles
272 * System configuration
273 * Access to all features
274 * Founder-appointed initially
275
276 === Contribution Tracking (Simple) ===
277
278 **Basic metrics only**:
279
280 * `contributions_count`: Total number of contributions
281 * `created_at`: Account age
282 * `last_active`: Recent activity
283 **No complex calculations**:
284 * No point systems
285 * No automated privilege escalation
286 * No reputation decay
287 * No threshold-based promotions
288
289 === Promotion Process ===
290
291 **Manual review by moderators/admins**:
292
293 1. User demonstrates value through contributions
294 2. Moderator reviews user's contribution history
295 3. Moderator promotes user to contributor role
296 4. Admin promotes trusted contributors to moderator
297 **Criteria** (guidelines, not automated):
298
299 * Quality of contributions
300 * Consistency over time
301 * Collaborative behavior
302 * Understanding of project goals
303
304 === V2.0+ Evolution ===
305
306 **Add complex reputation when**:
307
308 * 100+ active contributors
309 * Manual role management becomes bottleneck
310 * Clear patterns of abuse emerge requiring automation
311 **Future features may include**:
312 * Automated point calculations
313 * Threshold-based promotions
314 * Reputation decay for inactive users
315 * Track record scoring for contributors
316 See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
317
318 === 1.7 Edit ===
319
320 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
321 **Purpose**: Complete audit trail for all content changes
322
323 === Edit History Details ===
324
325 **What Gets Edited**:
326
327 * **Claims** (20% edited): assertion, domain, status, scores, analysis
328 * **Evidence** (10% edited): excerpt, relevance_score, supports
329 * **Scenarios** (5% edited): description, assumptions, confidence
330 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
331 **Who Edits**:
332 * **Contributors** (rep sufficient): Corrections, additions
333 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
334 * **Moderators**: Abuse handling, dispute resolution
335 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
336 **Edit Types**:
337 * `CONTENT_CORRECTION`: User fixes factual error
338 * `CLARIFICATION`: Improved wording
339 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
340 * `MODERATION_ACTION`: Hide/unhide for abuse
341 * `REVERT`: Rollback to previous version
342 **Retention Policy** (5 years total):
343
344 1. **Hot storage** (3 months): PostgreSQL, instant access
345 2. **Warm storage** (2 years): Partitioned, slower queries
346 3. **Cold storage** (3 years): S3 compressed, download required
347 4. **Deletion**: After 5 years (except legal holds)
348 **Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
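The four retention tiers reduce to a simple age lookup. A sketch, with month and year lengths approximated in days:

```python
from datetime import timedelta

def storage_tier(edit_age: timedelta) -> str:
    """Pick a storage tier per the retention policy above (5 years total)."""
    if edit_age < timedelta(days=90):               # first 3 months
        return "hot"                                # PostgreSQL, instant access
    if edit_age < timedelta(days=90 + 2 * 365):     # next 2 years
        return "warm"                               # partitioned, slower queries
    if edit_age < timedelta(days=5 * 365):          # remainder of the 5 years
        return "cold"                               # S3 compressed, download required
    return "delete"                                 # past 5 years (barring legal holds)
```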
349 **Use Cases**:
350
351 * View claim history timeline
352 * Detect vandalism patterns
353 * Learn from user corrections (system improvement)
354 * Legal compliance (audit trail)
355 * Rollback capability
356 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
357
358 === 1.8 Flag ===
359
360 Fields: entity_id, reported_by, issue_type, status, resolution_note
361
362 === 1.9 QualityMetric ===
363
364 **Fields**: metric_type, category, value, target, timestamp
365 **Purpose**: Time-series quality tracking
366 **Usage**:
367
368 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
369 * **Quality dashboard**: Real-time display with trend charts
370 * **Alerting**: Automatic alerts when metrics exceed thresholds
371 * **A/B testing**: Compare control vs treatment metrics
372 * **Improvement validation**: Measure before/after changes
373 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
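Threshold alerting over these records is straightforward. A sketch using the example record format above (the function name is hypothetical):

```python
def check_thresholds(metrics):
    """Return alert strings for metrics whose value exceeds the target.

    For a metric like ErrorRate, a value above target is a breach; the
    record shape mirrors the QualityMetric example above.
    """
    return [
        f"{m['type']}/{m['category']}: {m['value']:.2f} exceeds target {m['target']:.2f}"
        for m in metrics
        if m["value"] > m["target"]
    ]
```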
374
375 === 1.10 ErrorPattern ===
376
377 **Fields**: error_category, claim_id, description, root_cause, frequency, status
378 **Purpose**: Capture errors to trigger system improvements
379 **Usage**:
380
381 * **Error capture**: When users flag issues or system detects problems
382 * **Pattern analysis**: Weekly grouping by category and frequency
383 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
384 * **Metrics**: Track error rate reduction over time
385 **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
386
387 == 1.11 Core Data Model ERD ==
388
389 {{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
390
391 == 1.12 User Class Diagram ==
392
393 {{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.User Class Diagram.WebHome"/}}
394
395 == 2. Versioning Strategy ==
396
397 **All Content Entities Are Versioned**:
398
399 * **Claim**: Every edit creates new version (V1→V2→V3...)
400 * **Evidence**: Changes tracked in edit history
401 * **Scenario**: Modifications versioned
402 **How Versioning Works**:
403 * Entity table stores **current state only**
404 * Edit table stores **all historical states** (before_state, after_state as JSON)
405 * Version number increments with each edit
406 * Complete audit trail maintained forever
407 **Unversioned Entities** (current state only, no history):
408 * **Source**: Track record continuously updated (not versioned history, just current score)
409 * **User**: Account state (reputation accumulated, not versioned)
410 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
411 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
412 **Example**:
413 ```
414 Claim V1: "The sky is blue"
415 → User edits →
416 Claim V2: "The sky is blue during daytime"
417 → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
418 ```
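The versioning example above can be sketched as follows: the entity table keeps only the current state, each edit records before/after, and a `REVERT` restores the prior state. This is a dict-based illustration, not the production code:

```python
def edit_claim(claim, new_assertion, edit_log):
    """Claim V(n) -> V(n+1): entity keeps current state, Edit keeps history."""
    entry = {"before": dict(claim)}   # snapshot of the pre-edit state
    claim["assertion"] = new_assertion
    claim["version"] += 1
    entry["after"] = dict(claim)      # snapshot of the post-edit state
    edit_log.append(entry)
    return claim

def revert(claim, edit_log):
    """REVERT edit type: restore the previous state from the latest edit."""
    last = edit_log.pop()
    claim.clear()
    claim.update(last["before"])
    return claim
```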
419
420 == 2.5. Storage vs Computation Strategy ==
421
422 **Critical architectural decision**: What to persist in databases vs compute dynamically?
423 **Trade-off**:
424
425 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
426 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
427
428 === Recommendation: Hybrid Approach ===
429
430 **STORE (in PostgreSQL):**
431
432 ==== Claims (Current State + History) ====
433
434 * **What**: assertion, domain, status, created_at, updated_at, version
435 * **Why**: Core entity, must be persistent
436 * **Also store**: confidence_score (computed once, then cached)
437 * **Size**: 1 KB per claim
438 * **Growth**: Linear with claims
439 * **Decision**: ✅ STORE - Essential
440
441 ==== Evidence (All Records) ====
442
443 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
444 * **Why**: Hard to re-gather, user contributions, reproducibility
445 * **Size**: 2 KB per evidence (with excerpt)
446 * **Growth**: 3-10 evidence per claim
447 * **Decision**: ✅ STORE - Essential for reproducibility
448
449 ==== Sources (Track Records) ====
450
451 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
452 * **Why**: Continuously updated, expensive to recompute
453 * **Size**: 500 bytes per source
454 * **Growth**: Slow (limited number of sources)
455 * **Decision**: ✅ STORE - Essential for quality
456
457 ==== Edit History (All Versions) ====
458
459 * **What**: before_state, after_state, user_id, reason, timestamp
460 * **Why**: Audit trail, legal requirement, reproducibility
461 * **Size**: 2 KB per edit
462 * **Growth**: Linear with edits (about 20% of claims get edited; see section 1.7)
463 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
464 * **Decision**: ✅ STORE - Essential for accountability
465
466 ==== Flags (User Reports) ====
467
468 * **What**: entity_id, reported_by, issue_type, description, status
469 * **Why**: Error detection, system improvement triggers
470 * **Size**: 500 bytes per flag
471 * **Growth**: Roughly 5-10% of claims get flagged
472 * **Decision**: ✅ STORE - Essential for improvement
473
474 ==== ErrorPatterns (System Improvement) ====
475
476 * **What**: error_category, claim_id, description, root_cause, frequency, status
477 * **Why**: Learning loop, prevent recurring errors
478 * **Size**: 1 KB per pattern
479 * **Growth**: Slow (limited patterns, many fixed)
480 * **Decision**: ✅ STORE - Essential for learning
481
482 ==== QualityMetrics (Time Series) ====
483
484 * **What**: metric_type, category, value, target, timestamp
485 * **Why**: Trend analysis, cannot recreate historical metrics
486 * **Size**: 200 bytes per metric
487 * **Growth**: Hourly = 8,760 per year per metric type
488 * **Retention**: 2 years hot, then aggregate and archive
489 * **Decision**: ✅ STORE - Essential for monitoring
490 **STORE (Computed Once, Then Cached):**
491
492 ==== Analysis Summary ====
493
494 * **What**: Neutral text summary of claim analysis (200-500 words)
495 * **Computed**: Once by AKEL when claim first analyzed
496 * **Stored in**: Claim table (text field)
497 * **Recomputed**: Only when system significantly improves OR claim edited
498 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
499 * **Size**: 2 KB per claim
500 * **Decision**: ✅ STORE (cached) - Cost-effective
501
502 ==== Confidence Score ====
503
504 * **What**: 0-100 score of analysis confidence
505 * **Computed**: Once by AKEL
506 * **Stored in**: Claim table (integer field)
507 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
508 * **Why store**: Cheap to store, expensive to compute, users need it fast
509 * **Size**: 4 bytes per claim
510 * **Decision**: ✅ STORE (cached) - Performance critical
511
512 ==== Risk Score ====
513
514 * **What**: 0-100 score of claim risk level
515 * **Computed**: Once by AKEL
516 * **Stored in**: Claim table (integer field)
517 * **Recomputed**: When domain changes, evidence changes, or controversy detected
518 * **Why store**: Same as confidence score
519 * **Size**: 4 bytes per claim
520 * **Decision**: ✅ STORE (cached) - Performance critical
521 **COMPUTE DYNAMICALLY (Never Store):**
522
523 ==== Scenarios ====
524
525 ⚠️ CRITICAL DECISION
526
527 * **What**: 2-5 possible interpretations of claim with assumptions
528 * **Current design**: Stored in Scenario table
529 * **Alternative**: Compute on-demand when user views claim details
530 * **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
531 * **Compute cost**: $0.005-0.01 per request (LLM API call)
532 * **Frequency**: Viewed in detail by 20% of users
533 * **Trade-off analysis**:
534 - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
535 - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
536 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
537 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
538 * **Decision**: ✅ STORE (hybrid approach below)
539 **Scenario Strategy** (APPROVED):
540
541 1. **Store scenarios** initially when claim analyzed
542 2. **Mark as stale** when system improves significantly
543 3. **Recompute on next view** if marked stale
544 4. **Cache for 30 days** if frequently accessed
545 5. **Result**: Best of both worlds - speed + freshness
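Steps 1-4 of the approved strategy amount to a stale-or-expired check before serving stored scenarios. A sketch, where the `recompute` callback stands in for the LLM call:

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # step 4: cache window for frequently accessed claims

def get_scenarios(claim, recompute, now=None):
    """Serve stored scenarios; recompute only when marked stale or expired."""
    now = now or datetime.now(timezone.utc)
    expired = now - claim["scenarios_cached_at"] > CACHE_TTL
    if claim.get("scenarios_stale") or expired:
        claim["scenarios"] = recompute(claim)   # LLM call in the real system
        claim["scenarios_cached_at"] = now
        claim["scenarios_stale"] = False
    return claim["scenarios"]
```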
546
547 ==== Verdict Synthesis ====
548
549
550
551 * **What**: Final conclusion text synthesizing all scenarios
552 * **Compute cost**: $0.002-0.005 per request
553 * **Frequency**: Every time claim viewed
554 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
555 * **Speed**: 2-3 seconds (acceptable)
556 * **Alternative**: Store the "last verdict" as a cached field; recompute only if the claim is edited or marked stale
557 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
558
559 ==== Search Results ====
560
561 * **What**: Lists of claims matching search query
562 * **Compute from**: Elasticsearch index
563 * **Cache**: 15 minutes in Redis for popular queries
564 * **Why not store permanently**: Constantly changing, infinite possible queries
565
566 ==== Aggregated Statistics ====
567
568 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
569 * **Compute from**: Database queries
570 * **Cache**: 1 hour in Redis
571 * **Why not store**: Can be derived, relatively cheap to compute
572
573 ==== User Reputation ====
574
575 * **What**: Score based on contributions
576 * **Current design**: Stored in User table
577 * **Alternative**: Compute from Edit table
578 * **Trade-off**:
579 - Stored: Fast, simple
580 - Computed: Always accurate, no denormalization
581 * **Frequency**: Read on every user action
582 * **Compute cost**: Simple COUNT query, milliseconds
583 * **Decision**: ✅ STORE - Performance critical, read-heavy
584
585 === Summary Table ===
586
587 | Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
588 |---|---|---|---|---|---|
589 | Claim core | ✅ | - | 1 KB | STORE | Essential |
590 | Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
591 | Sources | ✅ | - | 500 B (shared) | STORE | Track record |
592 | Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
593 | Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
594 | Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
595 | Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
596 | Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
597 | Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
598 | Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
599 | ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
600 | QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
601 | Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
602 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
603 **Total storage per claim**: 18 KB (without edits and flags)
604 **For 1 million claims**:
605
606 * **Storage**: 18 GB (manageable)
607 * **PostgreSQL**: $50/month (standard instance)
608 * **Redis cache**: $20/month (1 GB instance)
609 * **S3 archives**: $5/month (old edits)
610 * **Total**: $75/month infrastructure
611 **LLM cost savings by caching**:
612 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
613 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
614 * Verdict stored: Save $0.003 per claim = $3K per 1M claims
615 * **Total savings**: $35K per 1M claims vs recomputing every time
616
617 === Recomputation Triggers ===
618
619 **When to mark cached data as stale and recompute:**
620
621 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
622 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
623 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
624 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
625 5. **Controversy detected** (high flag rate) → Recompute: risk score
626 **Recomputation strategy**:
627
628 * **Eager**: Immediately recompute (for user edits)
629 * **Lazy**: Recompute on next view (for system improvements)
630 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
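The five triggers map naturally to sets of cached artifacts to invalidate. A sketch (the trigger names and the `stale` field are hypothetical labels, not schema fields):

```python
# Trigger -> cached artifacts to mark stale, per the numbered list above
RECOMPUTE_ON = {
    "claim_edited":       {"analysis", "scenarios", "verdict", "confidence", "risk"},
    "evidence_added":     {"scenarios", "verdict", "confidence"},
    "source_score_shift": {"confidence", "verdict"},   # track record moved >10 points
    "system_improvement": {"analysis", "scenarios", "verdict", "confidence", "risk"},
    "controversy":        {"risk"},                    # high flag rate detected
}

def mark_stale(claim, trigger):
    """Flag cached artifacts for recomputation; eager/lazy/batch is decided later."""
    claim.setdefault("stale", set()).update(RECOMPUTE_ON[trigger])
    return claim
```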
631
632 === Database Size Projection ===
633
634 **Year 1**: 10K claims
635
636 * Storage: 180 MB
637 * Cost: $10/month
638 **Year 3**: 100K claims
639 * Storage: 1.8 GB
640 * Cost: $30/month
641 **Year 5**: 1M claims
642 * Storage: 18 GB
643 * Cost: $75/month
644 **Year 10**: 10M claims
645 * Storage: 180 GB
646 * Cost: $300/month
647 * Optimization: Archive old claims to S3 ($5/TB/month)
648 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
649
650 == 2.6. Key Simplifications ==
651
652 * **Two content states only**: Published, Hidden
653 * **Four user roles only**: Reader, Contributor, Moderator, Admin (see section 1.6)
654 * **No complex versioning**: Linear edit history
655 * **Reputation-based permissions**: Not role hierarchy
656 * **Source track records**: Continuous evaluation
657
658 == 3. What Gets Stored in the Database ==
659
660 === 3.1 Primary Storage (PostgreSQL) ===
661
662 **Claims Table**:
663
664 * Current state only (latest version)
665 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
666 **Evidence Table**:
667 * All evidence records
668 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
669 **Scenario Table**:
670 * All scenarios for each claim
671 * Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
672 **Source Table**:
673 * Track record database (continuously updated)
674 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
675 **User Table**:
676 * All user accounts
677 * Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
678 **Edit Table**:
679 * Complete version history
680 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
681 **Flag Table**:
682 * User-reported issues
683 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
684 **ErrorPattern Table**:
685 * System improvement queue
686 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
687 **QualityMetric Table**:
688 * Time-series quality data
689 * Fields: id, metric_type, metric_category, value, target, timestamp
690
691 === 3.2 What's NOT Stored (Computed on-the-fly) ===
692
693 * **Verdicts**: Synthesized from evidence + scenarios when requested
694 * **Risk scores**: Recalculated based on current factors
695 * **Aggregated statistics**: Computed from base data
696 * **Search results**: Generated from Elasticsearch index
697
698 === 3.3 Cache Layer (Redis) ===
699
700 {{warning}}
701 **Implementation Status:** Redis caching is **NOT YET IMPLEMENTED**. Current implementation has no caching layer.
702 {{/warning}}
703
704 **Cached for performance (Planned)**:
705
706 * Frequently accessed claims (TTL: 1 hour)
707 * Search results (TTL: 15 minutes)
708 * User sessions (TTL: 24 hours)
709 * Source track records (TTL: 1 hour)
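Since the Redis layer is not built yet, here is a stdlib sketch of the planned TTL behavior: an in-process stand-in that mirrors the four cache kinds and TTLs above, not the eventual Redis implementation:

```python
import time

# Planned TTLs from the list above, in seconds
TTLS = {
    "claim": 3600,        # frequently accessed claims: 1 hour
    "search": 15 * 60,    # search results: 15 minutes
    "session": 24 * 3600, # user sessions: 24 hours
    "source": 3600,       # source track records: 1 hour
}

class TTLCache:
    """Minimal stand-in for the planned Redis layer."""
    def __init__(self, clock=time.monotonic):
        self._store, self._clock = {}, clock  # injectable clock for testing

    def set(self, kind, key, value):
        self._store[(kind, key)] = (value, self._clock() + TTLS[kind])

    def get(self, kind, key):
        item = self._store.get((kind, key))
        if item is None:
            return None
        value, expires = item
        if self._clock() >= expires:          # expired: evict and miss
            del self._store[(kind, key)]
            return None
        return value
```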
710
711 === 3.4 File Storage (S3) ===
712
713 **Archived content**:
714
715 * Old edit history (>3 months)
716 * Evidence documents (archived copies)
717 * Database backups
718 * Export files
719
720 === 3.5 Search Index (Elasticsearch) ===
721
722 **Indexed for search**:
723
724 * Claim assertions (full-text)
725 * Evidence excerpts (full-text)
726 * Scenario descriptions (full-text)
727 * Source names (autocomplete)
728 Synchronized from PostgreSQL via change data capture or periodic sync.
729
730 == 4. Related Pages ==
731
732 * [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
733 * [[Requirements>>Archive.FactHarbor 2026\.02\.08.Specification.Requirements.WebHome]]
734 * [[Workflows>>Archive.FactHarbor 2026\.02\.08.Specification.Workflows.WebHome]]