1 = Data Model =
2
3 FactHarbor's data model is **simple, focused, and designed for automated processing**.
4
5 == 1. Core Entities ==
6
7 === 1.1 Claim ===
8
9 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10
11 ==== Performance Optimization: Denormalized Fields ====
12
13 **Rationale**: The claims system is roughly 95% reads and 5% writes. Denormalizing commonly read data reduces joins and improves query performance by about 70%.
14 **Additional cached fields in the claims table**:
15
16 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining the evidence table for listing/preview
** Updated when evidence is added or removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 * **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
18 * **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios are added or removed
19 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
** Helps invalidate stale caches
** Triggers a background refresh if too old
20 **Update Strategy**:
21 * **Immediate**: Update on claim edit (user-facing)
22 * **Deferred**: Update via background job every hour (non-critical)
23 * **Invalidation**: Clear cache when source data changes significantly
24 **Trade-offs**:
25 * ✅ 70% fewer joins on common queries
26 * ✅ Much faster claim list/search pages
27 * ✅ Better user experience
28 * ⚠️ Small storage increase (10%)
29 * ⚠️ Need to keep caches in sync
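A minimal sketch of the cache-refresh step under this strategy (the ORM-style models, the `(evidence, source)` pairs, and the `save()` call are placeholders, not the actual implementation):

{{code language="python"}}
from datetime import datetime, timezone

TOP_EVIDENCE_LIMIT = 5

def refresh_claim_cache(claim, evidence_with_sources, scenario_count):
    """Rebuild the denormalized fields on a claim row.

    `evidence_with_sources` is a list of (evidence, source) pairs that were
    joined once; the claim object and save() stand in for the real ORM layer.
    """
    # Top 5 most relevant evidence snippets, stored as JSONB
    top = sorted(evidence_with_sources, key=lambda pair: pair[0].relevance_score, reverse=True)
    claim.evidence_summary = [
        {"text": ev.excerpt, "source": src.name, "relevance": ev.relevance_score}
        for ev, src in top[:TOP_EVIDENCE_LIMIT]
    ]

    # Distinct source names for quick display (TEXT[])
    claim.source_names = sorted({src.name for _, src in evidence_with_sources})

    # Cheap counters and cache bookkeeping
    claim.scenario_count = scenario_count
    claim.cache_updated_at = datetime.now(timezone.utc)
    claim.save()
{{/code}}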
30
31 === 1.2 Evidence ===
32
33 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
34
35 === 1.3 Source ===
36
37 **Purpose**: Track reliability of information sources over time
38 **Fields**:
39
40 * **id** (UUID): Unique identifier
41 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
42 * **domain** (text): Website domain (e.g., "nytimes.com")
43 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
44 * **track_record_score** (0-100): Overall reliability score
45 * **accuracy_history** (JSON): Historical accuracy data
46 * **correction_frequency** (float): How often source publishes corrections
47 * **last_updated** (timestamp): When track record last recalculated
48 **How It Works**:
49 * Initial score based on source type (70 for academic journals, 30 for unknown)
50 * Updated daily by background scheduler
51 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
52 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
53 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
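A sketch of the weighted formula and quality thresholds above, assuming each component is already normalized to a 0-100 scale (the component names are illustrative):

{{code language="python"}}
# Weights from the formula above; component scores are assumed to be 0-100.
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

THRESHOLDS = [(90, "Exceptional"), (70, "Reliable"), (50, "Acceptable"),
              (30, "Questionable"), (0, "Unreliable")]

def track_record_score(components: dict) -> float:
    """Weighted sum of the component scores (0-100 scale)."""
    return sum(WEIGHTS[name] * components.get(name, 0) for name in WEIGHTS)

def quality_band(score: float) -> str:
    """Map a numeric score to the quality thresholds listed above."""
    for floor, label in THRESHOLDS:
        if score >= floor:
            return label
    return "Unreliable"

# Example: an academic journal with a strong accuracy record
score = track_record_score({
    "accuracy_rate": 92, "correction_policy": 85, "editorial_standards": 90,
    "bias_transparency": 70, "longevity": 80})
print(score, quality_band(score))  # 87.5 -> "Reliable"
{{/code}}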
54 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
55 Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
56 **Key**: Automated source reliability tracking
57
58 ==== Source Scoring Process (Separation of Concerns) ====
59
60 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
61 **The Problem**:

* Source scores should influence claim verdicts
63 * Claim verdicts should update source scores
64 * But: Direct feedback creates circular dependency and potential feedback loops
65 **The Solution**: Temporal separation
66
67 ==== Weekly Background Job (Source Scoring) ====
68
69 Runs independently of claim analysis:
70 {{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()

    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5

        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)

        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
{{/code}}
71
72 ==== Real-Time Claim Analysis (AKEL) ====
73
74 Uses source scores but never updates them:
75 {{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)

    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score

        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score

    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)

    # NEVER update source scores here
    # That happens in the weekly background job
    return verdict
{{/code}}
76
77 ==== Monthly Audit (Quality Assurance) ====
78
79 Moderator review of flagged source scores:
80
81 * Verify scores make sense
82 * Detect gaming attempts
83 * Identify systematic biases
84 * Manual adjustments if needed
85 **Key Principles**:
86 ✅ **Scoring and analysis are temporally separated**
87 * Source scoring: Weekly batch job
88 * Claim analysis: Real-time processing
89 * Never update scores during analysis
90 ✅ **One-way data flow during processing**
91 * Claims READ source scores
92 * Claims NEVER WRITE source scores
93 * Updates happen in background only
94 ✅ **Predictable update cycle**
95 * Sources update every Sunday 2 AM
96 * Claims always use last week's scores
97 * No mid-week score changes
98 ✅ **Audit trail**
99 * Log all score changes
100 * Track score history
101 * Explainable calculations
102 **Benefits**:
103 * No circular dependencies
104 * Predictable behavior
105 * Easier to reason about
106 * Simpler testing
107 * Clear audit trail
108 **Example Timeline**:
109 ```
110 Sunday 2 AM: Calculate source scores for past week → NYT score: 0.87 (up from 0.85) → Blog X score: 0.52 (down from 0.61)
111 Monday-Saturday: Claims processed using these scores → All claims this week use NYT=0.87 → All claims this week use Blog X=0.52
112 Next Sunday 2 AM: Recalculate scores including this week's claims → NYT score: 0.89 (trending up) → Blog X score: 0.48 (trending down)
113 ```
114
115 === 1.4 Scenario ===
116
117 **Purpose**: Different interpretations or contexts for evaluating claims
118 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
119 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
120 **Fields**:
121
122 * **id** (UUID): Unique identifier
123 * **claim_id** (UUID): Foreign key to claim (one-to-many)
124 * **description** (text): Human-readable description of the scenario
125 * **assumptions** (JSONB): Key assumptions that define this scenario context
126 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
127 * **created_at** (timestamp): When scenario was created
128 * **updated_at** (timestamp): Last modification
129 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
130 **Example**: For claim "Vaccines reduce hospitalization":
131 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
132 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
133 * Scenario 3: "Immunocompromised patients" from specialist study
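For illustration, Scenario 1 above might be stored roughly like this (all values are invented; the JSONB column is shown as a Python dict):

{{code language="python"}}
# Illustrative scenario record; UUIDs and values are invented for the example.
scenario_1 = {
    "id": "0f2c7a4e-...",                 # UUID primary key (truncated)
    "claim_id": "b91d33e0-...",           # FK to the "Vaccines reduce hospitalization" claim
    "description": "Clinical trials (healthy adults 18-65, original strain)",
    "assumptions": {                       # JSONB: context that defines the scenario
        "population": "healthy adults 18-65",
        "variant": "original strain",
        "setting": "randomized clinical trial",
    },
    "extracted_from": "5d8e21aa-...",     # evidence record the scenario came from
    "created_at": "2025-12-17T10:00:00Z",
    "updated_at": "2025-12-17T10:00:00Z",
}
{{/code}}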
134 **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
**Core Fields**:

135 * **id** (UUID): Primary key
136 * **scenario_id** (UUID FK): The scenario being assessed
137 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
138 * **confidence** (decimal 0-1): How confident we are in this assessment
139 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
140 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
141 * **created_at** (timestamp): When verdict was created
142 * **updated_at** (timestamp): Last modification
**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
**Example**:
143 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
144 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
145 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
146 * Edit entity records the complete before/after change with timestamp and reason
**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

147 Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
148
149 === User Reputation System ===
150
151 **V1.0 Approach**: Simple manual role assignment
152 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
153
154 === Roles (Manual Assignment) ===
155
156 **reader** (default):
157
158 * View published claims and evidence
159 * Browse and search content
160 * No editing permissions
161 **contributor**:
162 * Submit new claims
163 * Suggest edits to existing content
164 * Add evidence
165 * Requires manual promotion by moderator/admin
166 **moderator**:
167 * Approve/reject contributor suggestions
168 * Flag inappropriate content
169 * Handle abuse reports
170 * Assigned by admins based on trust
171 **admin**:
172 * Manage users and roles
173 * System configuration
174 * Access to all features
175 * Founder-appointed initially
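A minimal sketch of how these manually assigned roles could gate actions (the permission names are illustrative, not a defined API):

{{code language="python"}}
# Permissions per role, mirroring the list above. Assignment is manual:
# an admin or moderator changes user.role; nothing is computed from points.
ROLE_PERMISSIONS = {
    "reader":      {"view", "search"},
    "contributor": {"view", "search", "submit_claim", "suggest_edit", "add_evidence"},
    "moderator":   {"view", "search", "submit_claim", "suggest_edit", "add_evidence",
                    "approve_suggestion", "flag_content", "handle_abuse"},
    "admin":       {"*"},  # all permissions
}

def can(user_role: str, action: str) -> bool:
    """Check whether a role is allowed to perform an action."""
    perms = ROLE_PERMISSIONS.get(user_role, set())
    return "*" in perms or action in perms

assert can("contributor", "submit_claim")
assert not can("reader", "submit_claim")
{{/code}}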
176
177 === Contribution Tracking (Simple) ===
178
179 **Basic metrics only**:
180
181 * `contributions_count`: Total number of contributions
182 * `created_at`: Account age
183 * `last_active`: Recent activity
184 **No complex calculations**:
185 * No point systems
186 * No automated privilege escalation
187 * No reputation decay
188 * No threshold-based promotions
189
190 === Promotion Process ===
191
192 **Manual review by moderators/admins**:
193
194 1. User demonstrates value through contributions
195 2. Moderator reviews user's contribution history
196 3. Moderator promotes user to contributor role
197 4. Admin promotes trusted contributors to moderator
198 **Criteria** (guidelines, not automated):
199
200 * Quality of contributions
201 * Consistency over time
202 * Collaborative behavior
203 * Understanding of project goals
204
205 === V2.0+ Evolution ===
206
207 **Add complex reputation when**:
208
209 * 100+ active contributors
210 * Manual role management becomes bottleneck
211 * Clear patterns of abuse emerge requiring automation
212 **Future features may include**:
213 * Automated point calculations
214 * Threshold-based promotions
215 * Reputation decay for inactive users
216 * Track record scoring for contributors
217 See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
218
219 === 1.7 Edit ===
220
221 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
222 **Purpose**: Complete audit trail for all content changes
223
224 === Edit History Details ===
225
226 **What Gets Edited**:
227
228 * **Claims** (20% edited): assertion, domain, status, scores, analysis
229 * **Evidence** (10% edited): excerpt, relevance_score, supports
230 * **Scenarios** (5% edited): description, assumptions, confidence
231 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
232 **Who Edits**:
233 * **Contributors** (rep sufficient): Corrections, additions
234 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
235 * **Moderators**: Abuse handling, dispute resolution
236 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
237 **Edit Types**:
238 * `CONTENT_CORRECTION`: User fixes factual error
239 * `CLARIFICATION`: Improved wording
240 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
241 * `MODERATION_ACTION`: Hide/unhide for abuse
242 * `REVERT`: Rollback to previous version
243 **Retention Policy** (5 years total):
244
245 1. **Hot storage** (3 months): PostgreSQL, instant access
246 2. **Warm storage** (2 years): Partitioned, slower queries
247 3. **Cold storage** (3 years): S3 compressed, download required
248 4. **Deletion**: After 5 years (except legal holds)
249 **Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
250 **Use Cases**:
251
252 * View claim history timeline
253 * Detect vandalism patterns
254 * Learn from user corrections (system improvement)
255 * Legal compliance (audit trail)
256 * Rollback capability
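A sketch of recording one edit under this model (the `db` handle is a placeholder for the real persistence layer; `user_id` is None for system edits):

{{code language="python"}}
import json
from datetime import datetime, timezone

def record_edit(entity_type, entity_id, user_id, before, after, edit_type, reason, db):
    """Append one row to the Edit table; entities themselves keep only current state."""
    db.insert("edit", {
        "entity_type": entity_type,          # "claim", "evidence", "scenario"
        "entity_id": entity_id,
        "user_id": user_id,                  # NULL for SYSTEM_REANALYSIS (AKEL)
        "before_state": json.dumps(before),  # full snapshot before the change
        "after_state": json.dumps(after),    # full snapshot after the change
        "edit_type": edit_type,              # e.g. "CONTENT_CORRECTION", "REVERT"
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
{{/code}}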
257 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
258
259 === 1.8 Flag ===
260
261 Fields: entity_id, reported_by, issue_type, status, resolution_note
262
263 === 1.9 QualityMetric ===
264
265 **Fields**: metric_type, category, value, target, timestamp
266 **Purpose**: Time-series quality tracking
267 **Usage**:
268
269 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
270 * **Quality dashboard**: Real-time display with trend charts
271 * **Alerting**: Automatic alerts when metrics exceed thresholds
272 * **A/B testing**: Compare control vs treatment metrics
273 * **Improvement validation**: Measure before/after changes
274 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
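A sketch of the hourly threshold check, using the example record above (the `alert` hook is a placeholder for the real notification channel):

{{code language="python"}}
def check_quality_metric(metric, alert):
    """Compare one QualityMetric record against its target and alert on breach."""
    # For error rates, lower is better: breach when value exceeds target.
    if metric["type"] == "ErrorRate" and metric["value"] > metric["target"]:
        alert(f'{metric["type"]}/{metric["category"]}: '
              f'{metric["value"]:.2f} exceeds target {metric["target"]:.2f}')

check_quality_metric(
    {"type": "ErrorRate", "category": "Politics", "value": 0.12,
     "target": 0.10, "timestamp": "2025-12-17"},
    alert=print)  # prints: ErrorRate/Politics: 0.12 exceeds target 0.10
{{/code}}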
275
276 === 1.10 ErrorPattern ===
277
278 **Fields**: error_category, claim_id, description, root_cause, frequency, status
279 **Purpose**: Capture errors to trigger system improvements
280 **Usage**:
281
282 * **Error capture**: When users flag issues or system detects problems
283 * **Pattern analysis**: Weekly grouping by category and frequency
284 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
285 * **Metrics**: Track error rate reduction over time
286 **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`

=== 1.11 Core Data Model ERD ===

{{include reference="Test.FactHarbor pre12 V0\.9\.70.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

=== 1.12 User Class Diagram ===

287 {{include reference="Test.FactHarbor pre12 V0\.9\.70.Specification.Diagrams.User Class Diagram.WebHome"/}}
288
289 == 2. Versioning Strategy ==
290
291 **All Content Entities Are Versioned**:
292
293 * **Claim**: Every edit creates new version (V1→V2→V3...)
294 * **Evidence**: Changes tracked in edit history
295 * **Scenario**: Modifications versioned
296 **How Versioning Works**:
297 * Entity table stores **current state only**
298 * Edit table stores **all historical states** (before_state, after_state as JSON)
299 * Version number increments with each edit
300 * Complete audit trail maintained forever
301 **Unversioned Entities** (current state only, no history):
302 * **Source**: Track record continuously updated (not versioned history, just current score)
303 * **User**: Account state (reputation accumulated, not versioned)
304 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
305 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
306 **Example**:
307 ```
308 Claim V1: "The sky is blue"
  → User edits
  → Claim V2: "The sky is blue during daytime"
  → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
309 ```
310
311 == 2.5. Storage vs Computation Strategy ==
312
313 **Critical architectural decision**: What to persist in databases vs compute dynamically?
314 **Trade-off**:
315
316 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
317 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
318
319 === Recommendation: Hybrid Approach ===
320
321 **STORE (in PostgreSQL):**
322
323 ==== Claims (Current State + History) ====
324
325 * **What**: assertion, domain, status, created_at, updated_at, version
326 * **Why**: Core entity, must be persistent
327 * **Also store**: confidence_score (computed once, then cached)
328 * **Size**: 1 KB per claim
329 * **Growth**: Linear with claims
330 * **Decision**: ✅ STORE - Essential
331
332 ==== Evidence (All Records) ====
333
334 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
335 * **Why**: Hard to re-gather, user contributions, reproducibility
336 * **Size**: 2 KB per evidence (with excerpt)
337 * **Growth**: 3-10 evidence per claim
338 * **Decision**: ✅ STORE - Essential for reproducibility
339
340 ==== Sources (Track Records) ====
341
342 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
343 * **Why**: Continuously updated, expensive to recompute
344 * **Size**: 500 bytes per source
345 * **Growth**: Slow (limited number of sources)
346 * **Decision**: ✅ STORE - Essential for quality
347
348 ==== Edit History (All Versions) ====
349
350 * **What**: before_state, after_state, user_id, reason, timestamp
351 * **Why**: Audit trail, legal requirement, reproducibility
352 * **Size**: 2 KB per edit
353 * **Growth**: Linear with edits (roughly 20% of claims are edited)
354 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
355 * **Decision**: ✅ STORE - Essential for accountability
356
357 ==== Flags (User Reports) ====
358
359 * **What**: entity_id, reported_by, issue_type, description, status
360 * **Why**: Error detection, system improvement triggers
361 * **Size**: 500 bytes per flag
362 * **Growth**: Small percentage of claims get flagged (~10% assumed in the summary table below)
363 * **Decision**: ✅ STORE - Essential for improvement
364
365 ==== ErrorPatterns (System Improvement) ====
366
367 * **What**: error_category, claim_id, description, root_cause, frequency, status
368 * **Why**: Learning loop, prevent recurring errors
369 * **Size**: 1 KB per pattern
370 * **Growth**: Slow (limited patterns, many fixed)
371 * **Decision**: ✅ STORE - Essential for learning
372
373 ==== QualityMetrics (Time Series) ====
374
375 * **What**: metric_type, category, value, target, timestamp
376 * **Why**: Trend analysis, cannot recreate historical metrics
377 * **Size**: 200 bytes per metric
378 * **Growth**: Hourly = 8,760 per year per metric type
379 * **Retention**: 2 years hot, then aggregate and archive
380 * **Decision**: ✅ STORE - Essential for monitoring
381 **STORE (Computed Once, Then Cached):**
382
383 ==== Analysis Summary ====
384
385 * **What**: Neutral text summary of claim analysis (200-500 words)
386 * **Computed**: Once by AKEL when claim first analyzed
387 * **Stored in**: Claim table (text field)
388 * **Recomputed**: Only when system significantly improves OR claim edited
389 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
390 * **Size**: 2 KB per claim
391 * **Decision**: ✅ STORE (cached) - Cost-effective
392
393 ==== Confidence Score ====
394
395 * **What**: 0-100 score of analysis confidence
396 * **Computed**: Once by AKEL
397 * **Stored in**: Claim table (integer field)
398 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
399 * **Why store**: Cheap to store, expensive to compute, users need it fast
400 * **Size**: 4 bytes per claim
401 * **Decision**: ✅ STORE (cached) - Performance critical
402
403 ==== Risk Score ====
404
405 * **What**: 0-100 score of claim risk level
406 * **Computed**: Once by AKEL
407 * **Stored in**: Claim table (integer field)
408 * **Recomputed**: When domain changes, evidence changes, or controversy detected
409 * **Why store**: Same as confidence score
410 * **Size**: 4 bytes per claim
411 * **Decision**: ✅ STORE (cached) - Performance critical
412 **COMPUTE DYNAMICALLY (Never Store):**
413
414 ==== Scenarios ====
415
416 ⚠️ CRITICAL DECISION
417
418 * **What**: 2-5 possible interpretations of claim with assumptions
419 * **Current design**: Stored in Scenario table
420 * **Alternative**: Compute on-demand when user views claim details
421 * **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
422 * **Compute cost**: $0.005-0.01 per request (LLM API call)
423 * **Frequency**: Viewed in detail by 20% of users
424 * **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
425 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
426 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
427 * **Decision**: ✅ STORE (hybrid approach below)
428 **Scenario Strategy** (APPROVED):
429
430 1. **Store scenarios** initially when claim analyzed
431 2. **Mark as stale** when system improves significantly
432 3. **Recompute on next view** if marked stale
433 4. **Cache for 30 days** if frequently accessed
434 5. **Result**: Best of both worlds - speed + freshness
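A sketch of the read path implied by this strategy (the `scenarios_stale` flag and the `db`/`akel` handles are illustrative):

{{code language="python"}}
def get_scenarios(claim, db, akel):
    """Serve stored scenarios if fresh; recompute on next view when marked stale.

    A separate short-lived cache (e.g. 30 days for frequently viewed claims)
    could sit in front of this without changing the logic.
    """
    scenarios = db.load_scenarios(claim.id)

    if claim.scenarios_stale or not scenarios:
        # Recompute after a system improvement (or on first analysis)
        scenarios = akel.extract_scenarios(claim)
        db.replace_scenarios(claim.id, scenarios)
        claim.scenarios_stale = False
        db.save(claim)

    return scenarios
{{/code}}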
435
436 ==== Verdict Synthesis ====
437
438 * **What**: Final conclusion text synthesizing all scenarios
440 * **Compute cost**: $0.002-0.005 per request
441 * **Frequency**: Every time claim viewed
442 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
443 * **Speed**: 2-3 seconds (acceptable)
444 **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
445 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
446
447 ==== Search Results ====
448
449 * **What**: Lists of claims matching search query
450 * **Compute from**: Elasticsearch index
451 * **Cache**: 15 minutes in Redis for popular queries
452 * **Why not store permanently**: Constantly changing, infinite possible queries
453
454 ==== Aggregated Statistics ====
455
456 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
457 * **Compute from**: Database queries
458 * **Cache**: 1 hour in Redis
459 * **Why not store**: Can be derived, relatively cheap to compute
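A sketch of the compute-plus-cache pattern for aggregates, assuming the redis-py client (the key name and the stats queries are illustrative):

{{code language="python"}}
import json
import redis  # redis-py client, assumed available

r = redis.Redis()
STATS_TTL_SECONDS = 3600  # 1 hour, per the cache policy above

def get_site_stats(db):
    """Cache-aside: serve aggregates from Redis, recompute at most hourly."""
    cached = r.get("stats:site")
    if cached:
        return json.loads(cached)

    # Derivable from base data, so it is never stored permanently
    stats = {
        "total_claims": db.count("claims"),
        "average_confidence": db.avg("claims", "confidence_score"),
    }
    r.setex("stats:site", STATS_TTL_SECONDS, json.dumps(stats))
    return stats
{{/code}}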
460
461 ==== User Reputation ====
462
463 * **What**: Score based on contributions
464 * **Current design**: Stored in User table
465 * **Alternative**: Compute from Edit table
466 * **Trade-off**:
** Stored: Fast, simple
** Computed: Always accurate, no denormalization
467 * **Frequency**: Read on every user action
468 * **Compute cost**: Simple COUNT query, milliseconds
469 * **Decision**: ✅ STORE - Performance critical, read-heavy
470
471 === Summary Table ===
472
473 |=Data Type|=Storage|=Compute|=Size per Claim|=Decision|=Rationale
|Claim core|✅|-|1 KB|STORE|Essential
|Evidence|✅|-|2 KB × 5 = 10 KB|STORE|Reproducibility
|Sources|✅|-|500 B (shared)|STORE|Track record
|Edit history|✅|-|2 KB × 20% = 400 B avg|STORE|Audit
|Analysis summary|✅|Once|2 KB|STORE (cached)|Cost-effective
|Confidence score|✅|Once|4 B|STORE (cached)|Fast access
|Risk score|✅|Once|4 B|STORE (cached)|Fast access
|Scenarios|✅|When stale|3 KB|STORE (hybrid)|Balance cost/speed
|Verdict|✅|When stale|1 KB|STORE (cached)|Fast access
|Flags|✅|-|500 B × 10% = 50 B avg|STORE|Improvement
|ErrorPatterns|✅|-|1 KB (global)|STORE|Learning
|QualityMetrics|✅|-|200 B (time series)|STORE|Trending
|Search results|-|✅|-|COMPUTE + 15 min cache|Dynamic
|Aggregations|-|✅|-|COMPUTE + 1 hr cache|Derivable
489 **Total storage per claim**: 18 KB (without edits and flags)
490 **For 1 million claims**:
491
492 * **Storage**: 18 GB (manageable)
493 * **PostgreSQL**: $50/month (standard instance)
494 * **Redis cache**: $20/month (1 GB instance)
495 * **S3 archives**: $5/month (old edits)
496 * **Total**: $75/month infrastructure
497 **LLM cost savings by caching**:
498 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
499 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
500 * **Total savings**: $35K per 1M claims vs recomputing every time
501
502 === Recomputation Triggers ===
503
504 **When to mark cached data as stale and recompute:**
505
506 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
507 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
508 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
509 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
510 5. **Controversy detected** (high flag rate) → Recompute: risk score
511 **Recomputation strategy**:
512
513 * **Eager**: Immediately recompute (for user edits)
514 * **Lazy**: Recompute on next view (for system improvements)
515 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
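A sketch of how these triggers could map onto the eager/lazy split (trigger names and helper callables are illustrative):

{{code language="python"}}
# Which cached artifacts each trigger invalidates (from the list above).
TRIGGER_TARGETS = {
    "claim_edited":         {"analysis", "scenarios", "verdict", "scores"},
    "evidence_added":       {"scenarios", "verdict", "confidence_score"},
    "source_score_shift":   {"confidence_score", "verdict"},   # change > 10 points
    "system_improvement":   {"analysis", "scenarios", "verdict", "scores"},
    "controversy_detected": {"risk_score"},
}

def handle_trigger(trigger, claim, recompute, mark_stale):
    """Eager for user edits, lazy (recompute on next view / nightly batch) otherwise."""
    targets = TRIGGER_TARGETS[trigger]
    if trigger == "claim_edited":
        recompute(claim, targets)    # eager: the editing user is waiting
    else:
        mark_stale(claim, targets)   # lazy: picked up on next view or in the nightly batch
{{/code}}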
516
517 === Database Size Projection ===
518
519 **Year 1**: 10K claims
520
521 * Storage: 180 MB
522 * Cost: $10/month
523 **Year 3**: 100K claims
* Storage: 1.8 GB
524 * Cost: $30/month
525 **Year 5**: 1M claims
526 * Storage: 18 GB
* Cost: $75/month
527 **Year 10**: 10M claims
528 * Storage: 180 GB
529 * Cost: $300/month
530 * Optimization: Archive old claims to S3 ($5/TB/month)
531 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
532
533 == 2.6. Key Simplifications ==
534
535 * **Two content states only**: Published, Hidden
536 * **Three user roles only**: Reader, Contributor, Moderator
537 * **No complex versioning**: Linear edit history
538 * **Reputation-based permissions**: Not role hierarchy
539 * **Source track records**: Continuous evaluation
540
541 == 3. What Gets Stored in the Database ==
542
543 === 3.1 Primary Storage (PostgreSQL) ===
544
545 **Claims Table**:
546
547 * Current state only (latest version)
548 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
549 **Evidence Table**:
550 * All evidence records
551 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
552 **Scenario Table**:
553 * All scenarios for each claim
554 * Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
555 **Source Table**:
556 * Track record database (continuously updated)
557 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
558 **User Table**:
559 * All user accounts
560 * Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
561 **Edit Table**:
562 * Complete version history
563 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
564 **Flag Table**:
565 * User-reported issues
566 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
567 **ErrorPattern Table**:
568 * System improvement queue
569 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
570 **QualityMetric Table**:
571 * Time-series quality data
572 * Fields: id, metric_type, metric_category, value, target, timestamp
573
574 === 3.2 What's NOT Stored (Computed on-the-fly) ===
575
576 * **Verdicts**: Synthesized from evidence + scenarios when requested
577 * **Risk scores**: Recalculated based on current factors
578 * **Aggregated statistics**: Computed from base data
579 * **Search results**: Generated from Elasticsearch index
580
581 === 3.3 Cache Layer (Redis) ===
582
583 **Cached for performance**:
584
585 * Frequently accessed claims (TTL: 1 hour)
586 * Search results (TTL: 15 minutes)
587 * User sessions (TTL: 24 hours)
588 * Source track records (TTL: 1 hour)
589
590 === 3.4 File Storage (S3) ===
591
592 **Archived content**:
593
594 * Old edit history (>3 months)
595 * Evidence documents (archived copies)
596 * Database backups
597 * Export files
598
599 === 3.5 Search Index (Elasticsearch) ===
600
601 **Indexed for search**:
602
603 * Claim assertions (full-text)
604 * Evidence excerpts (full-text)
605 * Scenario descriptions (full-text)
606 * Source names (autocomplete)
607 Synchronized from PostgreSQL via change data capture or periodic sync.
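A sketch of the periodic-sync variant, assuming an elasticsearch-py 8.x-style client (the index name and the changed-rows query are illustrative):

{{code language="python"}}
from elasticsearch import Elasticsearch  # elasticsearch-py client, assumed available

es = Elasticsearch("http://localhost:9200")

def sync_claims_to_index(db, since):
    """Periodic sync: push claims changed since the last run into the search index."""
    for claim in db.claims_updated_since(since):
        es.index(
            index="claims",
            id=str(claim.id),
            document={
                "assertion": claim.assertion,          # full-text searchable
                "domain": claim.domain,
                "status": claim.status,
                "confidence_score": claim.confidence_score,
                "updated_at": claim.updated_at.isoformat(),
            },
        )
{{/code}}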
608
609 == 4. Related Pages ==
610
611 * [[Architecture>>Test.FactHarbor pre12 V0\.9\.70.Specification.Architecture.WebHome]]
612 * [[Requirements>>Test.FactHarbor pre12 V0\.9\.70.Specification.Requirements.WebHome]]
613 * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]