= Data Model =
FactHarbor's data model is **simple, focused, and designed for automated processing**.
== 1. Core Entities ==
=== 1.1 Claim ===
Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
==== Performance Optimization: Denormalized Fields ====
**Rationale**: The claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
**Additional cached fields in claims table**:
* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
* Avoids joining evidence table for listing/preview
* Updated when evidence is added/removed
* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
* Avoids joining through evidence to sources
* Updated when sources change
* Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
* Quick metric without counting rows
* Updated when scenarios added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
* Helps invalidate stale caches
* Triggers background refresh if too old
**Update Strategy**:
* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical; sketched below)
* **Invalidation**: Clear cache when source data changes significantly
**Trade-offs**:
* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Need to keep caches in sync
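A minimal sketch of the deferred refresh path, assuming hypothetical helpers (`fetch_top_evidence`, `fetch_source_names`, `count_scenarios`) and an ORM-style `claim` object; the production update logic may differ:
{{code language="python"}}
from datetime import datetime, timedelta

CACHE_MAX_AGE = timedelta(hours=1)  # matches the hourly background job

def refresh_claim_cache(claim):
    """Rebuild the denormalized fields on a claim row."""
    # Top 5 evidence snippets with relevance scores (JSONB column)
    claim.evidence_summary = fetch_top_evidence(claim.id, limit=5)
    # Flat array of source names for quick display (TEXT[] column)
    claim.source_names = fetch_source_names(claim.id)
    # Cheap counter instead of COUNT(*) at read time
    claim.scenario_count = count_scenarios(claim.id)
    claim.cache_updated_at = datetime.utcnow()
    claim.save()

def refresh_stale_caches(claims):
    """Background job: refresh caches older than the maximum age."""
    now = datetime.utcnow()
    for claim in claims:
        if claim.cache_updated_at is None or now - claim.cache_updated_at > CACHE_MAX_AGE:
            refresh_claim_cache(claim)
{{/code}}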
=== 1.2 Evidence ===
Fields: claim_id, source_id, excerpt, url, relevance_score, supports
=== 1.3 Source ===
**Purpose**: Track reliability of information sources over time
**Fields**:
* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")
* **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
* **track_record_score** (0-100): Overall reliability score
* **accuracy_history** (JSON): Historical accuracy data
* **correction_frequency** (float): How often source publishes corrections
* **last_updated** (timestamp): When track record last recalculated
**How It Works**:
* Initial score based on source type (70 for academic journals, 30 for unknown)
* Updated daily by background scheduler
* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%) (see the sketch after this list)
* Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
* Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
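A minimal sketch of the weighting formula above, assuming each component is already normalized to a 0-100 value; how the individual components are measured is out of scope here:
{{code language="python"}}
TRACK_RECORD_WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """Weighted 0-100 reliability score from per-component 0-100 values."""
    return sum(components[name] * weight for name, weight in TRACK_RECORD_WEIGHTS.items())

def quality_band(score):
    """Map a score to the quality thresholds used by the Track Record Check."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
{{/code}}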
**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage.
**Key**: Automated source reliability tracking
==== Source Scoring Process (Separation of Concerns) ====
**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
**The Problem**:
* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates circular dependency and potential feedback loops
**The Solution**: Temporal separation
==== Weekly Background Job (Source Scoring) ====
Runs independently of claim analysis:
{{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}
==== Real-Time Claim Analysis (AKEL) ====
Uses source scores but never updates them:
{{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence (track_record_score is 0-100, so normalize)
        evidence.weighted_relevance = evidence.relevance * (source_score / 100)
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
{{/code}}
==== Monthly Audit (Quality Assurance) ====
Moderator review of flagged source scores:
* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases
* Manual adjustments if needed
**Key Principles**:
✅ **Scoring and analysis are temporally separated**
* Source scoring: Weekly batch job
* Claim analysis: Real-time processing
* Never update scores during analysis
✅ **One-way data flow during processing**
* Claims READ source scores
* Claims NEVER WRITE source scores
* Updates happen in background only
✅ **Predictable update cycle**
* Sources update every Sunday 2 AM
* Claims always use last week's scores
* No mid-week score changes
✅ **Audit trail**
* Log all score changes
* Track score history
* Explainable calculations
**Benefits**:
* No circular dependencies
* Predictable behavior
* Easier to reason about
* Simpler testing
* Clear audit trail
**Example Timeline**:
```
Sunday 2 AM: Calculate source scores for past week
  → NYT score: 87 (up from 85)
  → Blog X score: 52 (down from 61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=87
  → All claims this week use Blog X=52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 89 (trending up)
  → Blog X score: 48 (trending down)
```
=== 1.4 Scenario ===
**Purpose**: Different interpretations or contexts for evaluating claims
**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
**Relationship**: One-to-many from Claim to Scenario (**simplified for V1.0**: each scenario belongs to a single claim)
**Fields**:
* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
* **created_at** (timestamp): When scenario was created
* **updated_at** (timestamp): Last modification
**How Found**: Evidence search → Extract context → Create scenario → Link to claim (sketched below)
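A minimal sketch of that pipeline, assuming hypothetical helpers (`search_evidence`, `extract_context`); the real extraction happens inside the AKEL pipeline and may differ:
{{code language="python"}}
import uuid
from datetime import datetime

def extract_scenarios(claim):
    """Evidence search -> extract context -> create scenario -> link to claim."""
    scenarios = []
    for evidence in search_evidence(claim.assertion):
        # Pull the context that defines this scenario (population, conditions, ...)
        context = extract_context(evidence)
        scenarios.append({
            "id": str(uuid.uuid4()),
            "claim_id": claim.id,                   # one-to-many: belongs to one claim
            "description": context["description"],
            "assumptions": context["assumptions"],  # stored as JSONB
            "extracted_from": evidence.id,          # evidence this scenario came from
            "created_at": datetime.utcnow(),
        })
    return scenarios
{{/code}}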
**Example**:
For claim "Vaccines reduce hospitalization":
* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from specialist study
**V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:
* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **created_at** (timestamp): When verdict was first created

**Versioned via VERDICT_VERSION**: Verdicts evolve as new evidence emerges or analysis improves. Each version captures:
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When this version was generated

**Relationship**: Each Scenario has multiple Verdicts over time (as understanding evolves). Each Verdict has multiple versions.

**Example**:
For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
* Initial verdict (v1): likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* Updated verdict (v2): likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]

**Key Design**: Separating Verdict from Scenario allows tracking how our understanding evolves without losing history.
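A minimal sketch of appending a new verdict version without losing history, assuming hypothetical `verdict` and version records shaped like the fields above:
{{code language="python"}}
from datetime import datetime

def add_verdict_version(verdict, likelihood_range, confidence,
                        explanation_summary, uncertainty_factors):
    """Append a new VERDICT_VERSION entry; earlier versions stay untouched."""
    version = {
        "verdict_id": verdict.id,
        "version": len(verdict.versions) + 1,        # v1, v2, v3, ...
        "likelihood_range": likelihood_range,        # e.g. "0.70-0.85 (likely true)"
        "confidence": confidence,                    # decimal 0-1
        "explanation_summary": explanation_summary,
        "uncertainty_factors": uncertainty_factors,  # text array
        "created_at": datetime.utcnow(),
    }
    verdict.versions.append(version)
    return version
{{/code}}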

=== 1.6 User ===
Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
=== User Reputation System ===
**V1.0 Approach**: Simple manual role assignment
**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
=== Roles (Manual Assignment) ===
**reader** (default):
* View published claims and evidence
* Browse and search content
* No editing permissions
**contributor**:
* Submit new claims
* Suggest edits to existing content
* Add evidence
* Requires manual promotion by moderator/admin
**moderator**:
* Approve/reject contributor suggestions
* Flag inappropriate content
* Handle abuse reports
* Assigned by admins based on trust
**admin**:
* Manage users and roles
* System configuration
* Access to all features
* Founder-appointed initially
=== Contribution Tracking (Simple) ===
**Basic metrics only**:
* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity
**No complex calculations**:
* No point systems
* No automated privilege escalation
* No reputation decay
* No threshold-based promotions
=== Promotion Process ===
**Manual review by moderators/admins**:
1. User demonstrates value through contributions
2. Moderator reviews user's contribution history
3. Moderator promotes user to contributor role
4. Admin promotes trusted contributors to moderator
**Criteria** (guidelines, not automated):
* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals
=== V2.0+ Evolution ===
**Add complex reputation when**:
* 100+ active contributors
* Manual role management becomes bottleneck
* Clear patterns of abuse emerge requiring automation
**Future features may include**:
* Automated point calculations
* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors
See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
=== 1.7 Edit ===
**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Purpose**: Complete audit trail for all content changes
=== Edit History Details ===
**What Gets Edited**:
* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence
* **Sources**: NOT versioned (continuous updates, not editorial decisions)
**Who Edits**:
* **Contributors** (rep sufficient): Corrections, additions
* **Trusted Contributors** (rep sufficient): Major improvements, approvals
* **Moderators**: Abuse handling, dispute resolution
* **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
**Edit Types**:
* `CONTENT_CORRECTION`: User fixes factual error
* `CLARIFICATION`: Improved wording
* `SYSTEM_REANALYSIS`: AKEL re-processed claim
* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to previous version
**Retention Policy** (5 years total):
1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)
**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
**Use Cases**:
* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability
See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases.
=== 1.8 Flag ===
Fields: entity_id, reported_by, issue_type, status, resolution_note
=== 1.9 QualityMetric ===
**Fields**: metric_type, category, value, target, timestamp
**Purpose**: Time-series quality tracking
**Usage**:
* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs treatment metrics
* **Improvement validation**: Measure before/after changes
**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
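A minimal sketch of the alerting check, assuming a hypothetical `send_alert` callable and metrics where exceeding the target is bad (as with error rates):
{{code language="python"}}
def check_quality_metric(metric, send_alert):
    """Alert when a recorded metric exceeds its target threshold."""
    if metric["value"] > metric["target"]:
        send_alert(
            f"{metric['type']} for {metric['category']} is {metric['value']:.2f}, "
            f"above target {metric['target']:.2f} (as of {metric['timestamp']})"
        )
{{/code}}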
=== 1.10 ErrorPattern ===
**Fields**: error_category, claim_id, description, root_cause, frequency, status
**Purpose**: Capture errors to trigger system improvements
**Usage**:
* **Error capture**: When users flag issues or system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error rate reduction over time
**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

== 1.11 Core Data Model ERD ==

{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.12 User Class Diagram ==
{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
== 2. Versioning Strategy ==
**All Content Entities Are Versioned**:
* **Claim**: Every edit creates new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned
**How Versioning Works**:
* Entity table stores **current state only**
* Edit table stores **all historical states** (before_state, after_state as JSON)
* Version number increments with each edit
* Complete audit trail maintained for the full retention period (see 1.7 Edit)
**Unversioned Entities** (current state only, no history):
* **Source**: Track record continuously updated (not versioned history, just current score)
* **User**: Account state (reputation accumulated, not versioned)
* **QualityMetric**: Time-series data (each record is a point in time, not a version)
* **ErrorPattern**: System improvement queue (status tracked, not versioned)
**Example**:
```
Claim V1: "The sky is blue"
→ User edits →
Claim V2: "The sky is blue during daytime"
→ EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```
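A minimal sketch of recording one such edit, assuming a hypothetical `save_edit` persistence helper; field names follow the Edit entity in 1.7:
{{code language="python"}}
import json
from datetime import datetime

def apply_claim_edit(claim, new_assertion, user_id, edit_type, reason, save_edit):
    """Update the claim's current state and record before/after in the edit history."""
    before = {"assertion": claim.assertion, "version": claim.version}
    claim.assertion = new_assertion
    claim.version += 1                      # V1 -> V2 -> V3 ...
    after = {"assertion": claim.assertion, "version": claim.version}
    save_edit({
        "entity_type": "Claim",
        "entity_id": claim.id,
        "user_id": user_id,                 # NULL for system (AKEL) edits
        "before_state": json.dumps(before),
        "after_state": json.dumps(after),
        "edit_type": edit_type,             # e.g. "CONTENT_CORRECTION"
        "reason": reason,
        "created_at": datetime.utcnow(),
    })
    claim.save()
{{/code}}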
== 2.5. Storage vs Computation Strategy ==
**Critical architectural decision**: What to persist in databases vs compute dynamically?
**Trade-off**:
* **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
=== Recommendation: Hybrid Approach ===
**STORE (in PostgreSQL):**
==== Claims (Current State + History) ====
* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: ~1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential
==== Evidence (All Records) ====
* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: ~2 KB per evidence (with excerpt)
* **Growth**: 3-10 evidence per claim
* **Decision**: ✅ STORE - Essential for reproducibility
==== Sources (Track Records) ====
* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: ~500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality
==== Edit History (All Versions) ====
* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: ~2 KB per edit
* **Growth**: Linear with edits (~20% of claims get edited)
* **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability
==== Flags (User Reports) ====
* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: ~500 bytes per flag
* **Growth**: ~5-10% of claims get flagged
* **Decision**: ✅ STORE - Essential for improvement
==== ErrorPatterns (System Improvement) ====
* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: ~1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning
==== QualityMetrics (Time Series) ====
* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: ~200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring
**STORE (Computed Once, Then Cached):**
==== Analysis Summary ====
* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when claim first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when system significantly improves OR claim edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: ~2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective
==== Confidence Score ====
* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When evidence added, source track record changes significantly, or system improves
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical
==== Risk Score ====
* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When domain changes, evidence changes, or controversy detected
* **Why store**: Same as confidence score
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical
**COMPUTE DYNAMICALLY (Never Store):**
==== Scenarios (⚠️ Critical Decision) ====
* **What**: 2-5 possible interpretations of claim with assumptions
* **Current design**: Stored in Scenario table
* **Alternative**: Compute on-demand when user views claim details
* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by ~20% of users
* **Trade-off analysis**:
- IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
- IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 seconds delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)
**Scenario Strategy** (APPROVED; see the sketch below):
1. **Store scenarios** initially when claim analyzed
2. **Mark as stale** when system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness
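A minimal sketch of steps 2-3 (stale marking and lazy recomputation), assuming hypothetical helpers (`recompute_scenarios`, `load_scenarios`); the 30-day cache layer of step 4 is omitted for brevity:
{{code language="python"}}
from datetime import datetime

def get_scenarios_for_view(claim):
    """Serve stored scenarios, recomputing them on view only when marked stale."""
    if claim.scenarios_stale:
        # System improved since the last analysis: refresh lazily on this view
        scenarios = recompute_scenarios(claim)   # LLM call, roughly $0.005-0.01
        claim.scenarios_stale = False
        claim.scenarios_refreshed_at = datetime.utcnow()
        claim.save()
        return scenarios
    return load_scenarios(claim.id)              # stored path: instant, no LLM cost
{{/code}}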
==== Verdict Synthesis ====
* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time claim viewed
* **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)
**Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur
==== Search Results ====
* **What**: Lists of claims matching search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries
==== Aggregated Statistics ====
* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute
==== User Reputation ====
* **What**: Score based on contributions
* **Current design**: Stored in User table
* **Alternative**: Compute from Edit table
* **Trade-off**:
- Stored: Fast, simple
- Computed: Always accurate, no denormalization
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy
=== Summary Table ===
|=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1hr cache|Derivable
**Total storage per claim**: ~18 KB (without edits and flags)
**For 1 million claims**:
* **Storage**: ~18 GB (manageable)
* **PostgreSQL**: ~$50/month (standard instance)
* **Redis cache**: ~$20/month (1 GB instance)
* **S3 archives**: ~$5/month (old edits)
* **Total**: ~$75/month infrastructure
**LLM cost savings by caching**:
* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
* **Total savings**: ~$35K per 1M claims vs recomputing every time
=== Recomputation Triggers ===
**When to mark cached data as stale and recompute** (a trigger-to-artifact mapping is sketched below):
1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score
**Recomputation strategy**:
* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)
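A minimal sketch of that trigger-to-artifact mapping, assuming a hypothetical `mark_stale` helper; whether the refresh then runs eagerly, lazily, or in the nightly batch is decided separately:
{{code language="python"}}
RECOMPUTE_ON = {
    "claim_edited":         {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "evidence_added":       {"scenarios", "verdict", "confidence_score"},
    "source_score_changed": {"confidence_score", "verdict"},   # track record moves >10 points
    "system_improvement":   {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "controversy_detected": {"risk_score"},
}

def handle_trigger(claim_id, trigger, mark_stale):
    """Mark the cached artifacts affected by this trigger as stale for one claim."""
    for artifact in RECOMPUTE_ON[trigger]:
        mark_stale(claim_id, artifact)
{{/code}}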
=== Database Size Projection ===
**Year 1**: 10K claims
* Storage: 180 MB
* Cost: $10/month
**Year 3**: 100K claims
* Storage: 1.8 GB
* Cost: $30/month
**Year 5**: 1M claims
* Storage: 18 GB
* Cost: $75/month
**Year 10**: 10M claims
* Storage: 180 GB
* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)
**Conclusion**: Storage costs are manageable, and the LLM cost savings are substantial.
== 2.6. Key Simplifications ==
* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation
== 3. What Gets Stored in the Database ==
=== 3.1 Primary Storage (PostgreSQL) ===
**Claims Table**:
* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
**Evidence Table**:
* All evidence records
* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
**Scenario Table**:
* All scenarios for each claim
* Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
**Source Table**:
* Track record database (continuously updated)
* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
**User Table**:
* All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
**Edit Table**:
* Complete version history
* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
**Flag Table**:
* User-reported issues
* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
**ErrorPattern Table**:
* System improvement queue
* Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
**QualityMetric Table**:
* Time-series quality data
* Fields: id, metric_type, metric_category, value, target, timestamp
=== 3.2 What's NOT Stored (Computed on-the-fly) ===
* **Verdicts**: Synthesized from evidence + scenarios when requested
* **Risk scores**: Recalculated based on current factors
* **Aggregated statistics**: Computed from base data
* **Search results**: Generated from Elasticsearch index
=== 3.3 Cache Layer (Redis) ===
**Cached for performance** (read-through pattern sketched below):
* Frequently accessed claims (TTL: 1 hour)
* Search results (TTL: 15 minutes)
* User sessions (TTL: 24 hours)
* Source track records (TTL: 1 hour)
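A minimal read-through caching sketch for the first item, assuming the redis-py client and a hypothetical `load_claim_from_db` helper; the TTL follows the list above:
{{code language="python"}}
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CLAIM_TTL_SECONDS = 3600  # frequently accessed claims: 1 hour

def get_claim_cached(claim_id):
    """Return a claim dict from Redis, falling back to PostgreSQL on a cache miss."""
    key = f"claim:{claim_id}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    claim = load_claim_from_db(claim_id)                    # miss: read primary storage
    cache.setex(key, CLAIM_TTL_SECONDS, json.dumps(claim))  # cache with 1-hour TTL
    return claim
{{/code}}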
=== 3.4 File Storage (S3) ===
**Archived content**:
* Old edit history (>3 months)
* Evidence documents (archived copies)
* Database backups
* Export files
=== 3.5 Search Index (Elasticsearch) ===
**Indexed for search**:
* Claim assertions (full-text)
* Evidence excerpts (full-text)
* Scenario descriptions (full-text)
* Source names (autocomplete)
Synchronized from PostgreSQL via change data capture or periodic sync.
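A minimal sketch of the periodic-sync variant, assuming the official Elasticsearch Python client and hypothetical `claims_updated_since` / `last_sync_time` helpers; change data capture would replace the polling loop:
{{code language="python"}}
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def sync_claims_to_index():
    """Index claim assertions updated since the last sync run."""
    for claim in claims_updated_since(last_sync_time()):
        es.index(
            index="claims",
            id=str(claim.id),
            document={
                "assertion": claim.assertion,   # full-text searchable
                "domain": claim.domain,
                "status": claim.status,
            },
        )
{{/code}}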
== 4. Related Pages ==
* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]