Wiki source code of Data Model

1 = Data Model =
2 FactHarbor's data model is **simple, focused, and designed for automated processing**.
3 == 1. Core Entities ==
4 === 1.1 Claim ===
5 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
6 ==== Performance Optimization: Denormalized Fields ====
7 **Rationale**: The claims system is roughly 95% reads and 5% writes. Denormalizing common data reduces joins and improves query performance by about 70%.
8 **Additional cached fields in claims table**:
9 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
10 * Avoids joining evidence table for listing/preview
11 * Updated when evidence is added/removed
12 * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
13 * **source_names** (TEXT[]): Array of source names for quick display
14 * Avoids joining through evidence to sources
15 * Updated when sources change
16 * Format: `["New York Times", "Nature Journal", ...]`
17 * **scenario_count** (INTEGER): Number of scenarios for this claim
18 * Quick metric without counting rows
19 * Updated when scenarios added/removed
20 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
21 * Helps invalidate stale caches
22 * Triggers background refresh if too old
23 **Update Strategy**:
24 * **Immediate**: Update on claim edit (user-facing)
25 * **Deferred**: Update via background job every hour (non-critical)
26 * **Invalidation**: Clear cache when source data changes significantly
27 **Trade-offs**:
28 * ✅ 70% fewer joins on common queries
29 * ✅ Much faster claim list/search pages
30 * ✅ Better user experience
31 * ⚠️ Small storage increase (~10%)
32 * ⚠️ Need to keep caches in sync
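A minimal sketch of how the deferred cache refresh might rebuild these fields. The helpers `get_top_evidence`, `get_source_names`, and `count_scenarios`, and the ORM-style `claim` object, are illustrative assumptions rather than part of the schema:

{{code language="python"}}
from datetime import datetime, timezone

def refresh_claim_cache(claim):
    """Rebuild the denormalized fields on a claim row (hypothetical helpers)."""
    top_evidence = get_top_evidence(claim.id, limit=5)   # highest relevance first
    claim.evidence_summary = [
        {"text": e.excerpt, "source": e.source_name, "relevance": e.relevance_score}
        for e in top_evidence
    ]
    claim.source_names = get_source_names(claim.id)      # e.g. ["New York Times", ...]
    claim.scenario_count = count_scenarios(claim.id)
    claim.cache_updated_at = datetime.now(timezone.utc)
    claim.save()
{{/code}}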
33 === 1.2 Evidence ===
34 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
35 === 1.3 Source ===
36 **Purpose**: Track reliability of information sources over time
37 **Fields**:
38 * **id** (UUID): Unique identifier
39 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
40 * **domain** (text): Website domain (e.g., "nytimes.com")
41 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
42 * **track_record_score** (0-100): Overall reliability score
43 * **accuracy_history** (JSON): Historical accuracy data
44 * **correction_frequency** (float): How often source publishes corrections
45 * **last_updated** (timestamp): When track record last recalculated
46 **How It Works**:
47 * Initial score based on source type (70 for academic journals, 30 for unknown)
48 * Updated daily by background scheduler
49 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
50 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
51 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
52 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
54 **Key**: Automated source reliability tracking
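The weighting formula above can be written out directly. A minimal sketch, assuming each component is already normalized to a 0-100 scale; the function and parameter names are illustrative, not part of the schema:

{{code language="python"}}
def track_record_score(accuracy_rate, correction_policy, editorial_standards,
                       bias_transparency, longevity):
    """Weighted source reliability score (all inputs assumed on a 0-100 scale)."""
    return (0.50 * accuracy_rate
            + 0.20 * correction_policy
            + 0.15 * editorial_standards
            + 0.10 * bias_transparency
            + 0.05 * longevity)

# Example: track_record_score(90, 80, 80, 80, 80) == 85.0 -> "Reliable" (70-89)
{{/code}}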
55 ==== Source Scoring Process (Separation of Concerns) ====
56 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
57 **The Problem**:
58 * Source scores should influence claim verdicts
59 * Claim verdicts should update source scores
60 * But: Direct feedback creates circular dependency and potential feedback loops
61 **The Solution**: Temporal separation
62 ==== Weekly Background Job (Source Scoring) ====
63 Runs independently of claim analysis:
64 {{code language="python"}}
def update_source_scores_weekly():
    """
    Background job: Calculate source reliability
    Never triggered by individual claim analysis
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
86 {{/code}}
87 ==== Real-Time Claim Analysis (AKEL) ====
88 Uses source scores but never updates them:
89 {{code language="python"}}
def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores
    READ source scores, never UPDATE them
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
108 {{/code}}
109 ==== Monthly Audit (Quality Assurance) ====
110 Moderator review of flagged source scores:
111 * Verify scores make sense
112 * Detect gaming attempts
113 * Identify systematic biases
114 * Manual adjustments if needed
115 **Key Principles**:
116 ✅ **Scoring and analysis are temporally separated**
117 * Source scoring: Weekly batch job
118 * Claim analysis: Real-time processing
119 * Never update scores during analysis
120 ✅ **One-way data flow during processing**
121 * Claims READ source scores
122 * Claims NEVER WRITE source scores
123 * Updates happen in background only
124 ✅ **Predictable update cycle**
125 * Sources update every Sunday 2 AM
126 * Claims always use last week's scores
127 * No mid-week score changes
128 ✅ **Audit trail**
129 * Log all score changes
130 * Track score history
131 * Explainable calculations
132 **Benefits**:
133 * No circular dependencies
134 * Predictable behavior
135 * Easier to reason about
136 * Simpler testing
137 * Clear audit trail
138 **Example Timeline**:
139 ```
140 Sunday 2 AM: Calculate source scores for past week
141 → NYT score: 0.87 (up from 0.85)
142 → Blog X score: 0.52 (down from 0.61)
143 Monday-Saturday: Claims processed using these scores
144 → All claims this week use NYT=0.87
145 → All claims this week use Blog X=0.52
146 Next Sunday 2 AM: Recalculate scores including this week's claims
147 → NYT score: 0.89 (trending up)
148 → Blog X score: 0.48 (trending down)
149 ```
150 === 1.4 Scenario ===
151 **Purpose**: Different interpretations or contexts for evaluating claims
152 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
153 **Relationship**: One-to-many from Claim to Scenario (**simplified for V1.0**: each scenario belongs to a single claim)
154 **Fields**:
155 * **id** (UUID): Unique identifier
156 * **claim_id** (UUID): Foreign key to claim (one-to-many)
157 * **description** (text): Human-readable description of the scenario
158 * **assumptions** (JSONB): Key assumptions that define this scenario context
159 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
160 * **created_at** (timestamp): When scenario was created
161 * **updated_at** (timestamp): Last modification
162 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
163 **Example**:
164 For claim "Vaccines reduce hospitalization":
165 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
166 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
167 * Scenario 3: "Immunocompromised patients" from specialist study
168 **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
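A sketch of the evidence-to-scenario flow described above. `extract_context` stands in for whatever prompt AKEL uses to pull context out of an evidence excerpt; it is a hypothetical helper, not a defined API:

{{code language="python"}}
import uuid

def create_scenarios_for_claim(claim, evidence_list):
    """Extract one scenario per evidence context (hypothetical extract_context helper)."""
    scenarios = []
    for evidence in evidence_list:
        context = extract_context(claim.assertion, evidence.excerpt)  # e.g. population, variant, setting
        scenarios.append({
            "id": str(uuid.uuid4()),
            "claim_id": claim.id,
            "description": context["description"],   # "Clinical trials (healthy adults 18-65, ...)"
            "assumptions": context["assumptions"],    # stored as JSONB
            "extracted_from": evidence.id,
        })
    return scenarios
{{/code}}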
169
170 === 1.5 Verdict ===
171
172 **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
173
174 **Core Fields**:
175 * **id** (UUID): Primary key
176 * **scenario_id** (UUID FK): The scenario being assessed
177 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
178 * **confidence** (decimal 0-1): How confident we are in this assessment
179 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
180 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
181 * **created_at** (timestamp): When verdict was created
182 * **updated_at** (timestamp): Last modification
183
184 **Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
185
186 **Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
187
188 **Example**:
189 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
190 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
191 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
192 * Edit entity records the complete before/after change with timestamp and reason
193
194 **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
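A sketch of how a verdict update might be logged through the Edit entity (section 1.7). `record_edit` is a hypothetical helper wrapping an insert into the edit table; field names follow the entities defined above:

{{code language="python"}}
def update_verdict(verdict, new_range, new_factors, user_id, edit_type, reason):
    """Mutate the verdict and record the before/after snapshot as an Edit."""
    before = {"likelihood_range": verdict.likelihood_range,
              "uncertainty_factors": verdict.uncertainty_factors}
    verdict.likelihood_range = new_range
    verdict.uncertainty_factors = new_factors
    verdict.save()
    record_edit(entity_type="Verdict", entity_id=verdict.id, user_id=user_id,
                before_state=before,
                after_state={"likelihood_range": new_range,
                             "uncertainty_factors": new_factors},
                edit_type=edit_type,     # e.g. SYSTEM_REANALYSIS when AKEL re-processes
                reason=reason)
{{/code}}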
195
196 === 1.6 User ===
197 Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
198 === User Reputation System ===
199 **V1.0 Approach**: Simple manual role assignment
200 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
201 === Roles (Manual Assignment) ===
202 **reader** (default):
203 * View published claims and evidence
204 * Browse and search content
205 * No editing permissions
206 **contributor**:
207 * Submit new claims
208 * Suggest edits to existing content
209 * Add evidence
210 * Requires manual promotion by moderator/admin
211 **moderator**:
212 * Approve/reject contributor suggestions
213 * Flag inappropriate content
214 * Handle abuse reports
215 * Assigned by admins based on trust
216 **admin**:
217 * Manage users and roles
218 * System configuration
219 * Access to all features
220 * Founder-appointed initially
221 === Contribution Tracking (Simple) ===
222 **Basic metrics only**:
223 * `contributions_count`: Total number of contributions
224 * `created_at`: Account age
225 * `last_active`: Recent activity
226 **No complex calculations**:
227 * No point systems
228 * No automated privilege escalation
229 * No reputation decay
230 * No threshold-based promotions
231 === Promotion Process ===
232 **Manual review by moderators/admins**:
233 1. User demonstrates value through contributions
234 2. Moderator reviews user's contribution history
235 3. Moderator promotes user to contributor role
236 4. Admin promotes trusted contributors to moderator
237 **Criteria** (guidelines, not automated):
238 * Quality of contributions
239 * Consistency over time
240 * Collaborative behavior
241 * Understanding of project goals
242 === V2.0+ Evolution ===
243 **Add complex reputation when**:
244 * 100+ active contributors
245 * Manual role management becomes bottleneck
246 * Clear patterns of abuse emerge requiring automation
247 **Future features may include**:
248 * Automated point calculations
249 * Threshold-based promotions
250 * Reputation decay for inactive users
251 * Track record scoring for contributors
252 See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
253 === 1.7 Edit ===
254 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
255 **Purpose**: Complete audit trail for all content changes
256 === Edit History Details ===
257 **What Gets Edited**:
258 * **Claims** (20% edited): assertion, domain, status, scores, analysis
259 * **Evidence** (10% edited): excerpt, relevance_score, supports
260 * **Scenarios** (5% edited): description, assumptions, confidence
261 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
262 **Who Edits**:
263 * **Contributors** (rep sufficient): Corrections, additions
264 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
265 * **Moderators**: Abuse handling, dispute resolution
266 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
267 **Edit Types**:
268 * `CONTENT_CORRECTION`: User fixes factual error
269 * `CLARIFICATION`: Improved wording
270 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
271 * `MODERATION_ACTION`: Hide/unhide for abuse
272 * `REVERT`: Rollback to previous version
273 **Retention Policy** (5 years total):
274 1. **Hot storage** (3 months): PostgreSQL, instant access
275 2. **Warm storage** (2 years): Partitioned, slower queries
276 3. **Cold storage** (3 years): S3 compressed, download required
277 4. **Deletion**: After 5 years (except legal holds)
278 **Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
279 **Use Cases**:
280 * View claim history timeline
281 * Detect vandalism patterns
282 * Learn from user corrections (system improvement)
283 * Legal compliance (audit trail)
284 * Rollback capability
285 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
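A sketch of how the tiered retention policy above could run as a scheduled job. The cutoffs mirror the policy (3 months hot, 2 years warm, 5 years total); the storage helpers are hypothetical:

{{code language="python"}}
from datetime import datetime, timedelta, timezone

def apply_edit_retention():
    """Move edit records between storage tiers according to the retention policy."""
    now = datetime.now(timezone.utc)
    move_to_warm(older_than=now - timedelta(days=90))              # hot -> warm after 3 months
    archive_to_s3(older_than=now - timedelta(days=90 + 2 * 365))   # warm -> cold after 2 more years
    delete_edits(older_than=now - timedelta(days=5 * 365),         # delete after 5 years total
                 skip_legal_holds=True)
{{/code}}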
286 === 1.8 Flag ===
287 Fields: entity_id, reported_by, issue_type, status, resolution_note
288 === 1.9 QualityMetric ===
289 **Fields**: metric_type, category, value, target, timestamp
290 **Purpose**: Time-series quality tracking
291 **Usage**:
292 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
293 * **Quality dashboard**: Real-time display with trend charts
294 * **Alerting**: Automatic alerts when metrics exceed thresholds
295 * **A/B testing**: Compare control vs treatment metrics
296 * **Improvement validation**: Measure before/after changes
297 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
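A sketch of the threshold alerting described above; `alert` and the metric dictionary shape are illustrative, matching the example record:

{{code language="python"}}
def alert(message):
    # Placeholder: a real system would notify or page an operator
    print(f"ALERT: {message}")

def check_metric(metric):
    # Assumes "higher is worse" metrics such as ErrorRate; invert for score-type metrics
    if metric["value"] > metric["target"]:
        alert(f"{metric['type']} in {metric['category']} is "
              f"{metric['value']:.2f} (target {metric['target']:.2f})")

check_metric({"type": "ErrorRate", "category": "Politics",
              "value": 0.12, "target": 0.10, "timestamp": "2025-12-17"})
{{/code}}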
298 === 1.10 ErrorPattern ===
299 **Fields**: error_category, claim_id, description, root_cause, frequency, status
300 **Purpose**: Capture errors to trigger system improvements
301 **Usage**:
302 * **Error capture**: When users flag issues or system detects problems
303 * **Pattern analysis**: Weekly grouping by category and frequency
304 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
305 * **Metrics**: Track error rate reduction over time
306 **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
307
308 === 1.11 Core Data Model ERD ===
309
310 {{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
311
312 === 1.12 User Class Diagram ===
313 {{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
314 == 2. Versioning Strategy ==
315 **All Content Entities Are Versioned**:
316 * **Claim**: Every edit creates new version (V1→V2→V3...)
317 * **Evidence**: Changes tracked in edit history
318 * **Scenario**: Modifications versioned
319 **How Versioning Works**:
320 * Entity table stores **current state only**
321 * Edit table stores **all historical states** (before_state, after_state as JSON)
322 * Version number increments with each edit
323 * Complete audit trail maintained forever
324 **Unversioned Entities** (current state only, no history):
325 * **Source**: Track record continuously updated (not versioned history, just current score)
326 * **User**: Account state (reputation accumulated, not versioned)
327 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
328 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
329 **Example**:
330 ```
331 Claim V1: "The sky is blue"
332 → User edits →
333 Claim V2: "The sky is blue during daytime"
334 → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
335 ```
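A sketch of the edit flow in the example above: the entity table keeps only the current state, the version number is incremented, and the Edit table receives the before/after snapshot. `save_edit` and the ORM-style `claim` object are hypothetical:

{{code language="python"}}
import json

def edit_claim_assertion(claim, new_assertion, user_id, reason):
    """Apply an edit to a claim and log it in the Edit table."""
    before = {"assertion": claim.assertion, "version": claim.version}
    claim.assertion = new_assertion
    claim.version += 1                      # V1 -> V2 -> V3 ...
    claim.save()                            # entity table stores current state only
    save_edit(entity_type="Claim", entity_id=claim.id, user_id=user_id,
              before_state=json.dumps(before),
              after_state=json.dumps({"assertion": new_assertion,
                                      "version": claim.version}),
              edit_type="CLARIFICATION", reason=reason)
{{/code}}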
336 == 2.5. Storage vs Computation Strategy ==
337 **Critical architectural decision**: What to persist in databases vs compute dynamically?
338 **Trade-off**:
339 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
340 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
341 === Recommendation: Hybrid Approach ===
342 **STORE (in PostgreSQL):**
343 ==== Claims (Current State + History) ====
344 * **What**: assertion, domain, status, created_at, updated_at, version
345 * **Why**: Core entity, must be persistent
346 * **Also store**: confidence_score (computed once, then cached)
347 * **Size**: ~1 KB per claim
348 * **Growth**: Linear with claims
349 * **Decision**: ✅ STORE - Essential
350 ==== Evidence (All Records) ====
351 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
352 * **Why**: Hard to re-gather, user contributions, reproducibility
353 * **Size**: ~2 KB per evidence (with excerpt)
354 * **Growth**: 3-10 evidence per claim
355 * **Decision**: ✅ STORE - Essential for reproducibility
356 ==== Sources (Track Records) ====
357 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
358 * **Why**: Continuously updated, expensive to recompute
359 * **Size**: ~500 bytes per source
360 * **Growth**: Slow (limited number of sources)
361 * **Decision**: ✅ STORE - Essential for quality
362 ==== Edit History (All Versions) ====
363 * **What**: before_state, after_state, user_id, reason, timestamp
364 * **Why**: Audit trail, legal requirement, reproducibility
365 * **Size**: ~2 KB per edit
366 * **Growth**: Linear with edits (~20% of claims get edited)
367 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
368 * **Decision**: ✅ STORE - Essential for accountability
369 ==== Flags (User Reports) ====
370 * **What**: entity_id, reported_by, issue_type, description, status
371 * **Why**: Error detection, system improvement triggers
372 * **Size**: ~500 bytes per flag
373 * **Growth**: Roughly 5-10% of claims get flagged
374 * **Decision**: ✅ STORE - Essential for improvement
375 ==== ErrorPatterns (System Improvement) ====
376 * **What**: error_category, claim_id, description, root_cause, frequency, status
377 * **Why**: Learning loop, prevent recurring errors
378 * **Size**: ~1 KB per pattern
379 * **Growth**: Slow (limited patterns, many fixed)
380 * **Decision**: ✅ STORE - Essential for learning
381 ==== QualityMetrics (Time Series) ====
382 * **What**: metric_type, category, value, target, timestamp
383 * **Why**: Trend analysis, cannot recreate historical metrics
384 * **Size**: ~200 bytes per metric
385 * **Growth**: Hourly = 8,760 per year per metric type
386 * **Retention**: 2 years hot, then aggregate and archive
387 * **Decision**: ✅ STORE - Essential for monitoring
388 **STORE (Computed Once, Then Cached):**
389 ==== Analysis Summary ====
390 * **What**: Neutral text summary of claim analysis (200-500 words)
391 * **Computed**: Once by AKEL when claim first analyzed
392 * **Stored in**: Claim table (text field)
393 * **Recomputed**: Only when system significantly improves OR claim edited
394 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
395 * **Size**: ~2 KB per claim
396 * **Decision**: ✅ STORE (cached) - Cost-effective
397 ==== Confidence Score ====
398 * **What**: 0-100 score of analysis confidence
399 * **Computed**: Once by AKEL
400 * **Stored in**: Claim table (integer field)
401 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
402 * **Why store**: Cheap to store, expensive to compute, users need it fast
403 * **Size**: 4 bytes per claim
404 * **Decision**: ✅ STORE (cached) - Performance critical
405 ==== Risk Score ====
406 * **What**: 0-100 score of claim risk level
407 * **Computed**: Once by AKEL
408 * **Stored in**: Claim table (integer field)
409 * **Recomputed**: When domain changes, evidence changes, or controversy detected
410 * **Why store**: Same as confidence score
411 * **Size**: 4 bytes per claim
412 * **Decision**: ✅ STORE (cached) - Performance critical
413 **COMPUTE DYNAMICALLY (candidates; see per-item decisions below, several ended up stored or hybrid):**
414 ==== Scenarios (⚠️ CRITICAL DECISION) ====
415 * **What**: 2-5 possible interpretations of claim with assumptions
416 * **Current design**: Stored in Scenario table
417 * **Alternative**: Compute on-demand when user views claim details
418 * **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
419 * **Compute cost**: $0.005-0.01 per request (LLM API call)
420 * **Frequency**: Viewed in detail by ~20% of users
421 * **Trade-off analysis**:
422 - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
423 - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
424 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
425 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
426 * **Decision**: ✅ STORE (hybrid approach below)
427 **Scenario Strategy** (APPROVED):
428 1. **Store scenarios** initially when claim analyzed
429 2. **Mark as stale** when system improves significantly
430 3. **Recompute on next view** if marked stale
431 4. **Cache for 30 days** if frequently accessed
432 5. **Result**: Best of both worlds - speed + freshness
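A sketch of this hybrid strategy: serve stored scenarios unless they are marked stale, and recompute on the next view otherwise. The `scenarios_stale` flag and the helpers are assumptions for illustration; the 30-day caching of frequently accessed claims is omitted:

{{code language="python"}}
from datetime import datetime, timezone

def get_scenarios(claim):
    """Return stored scenarios, recomputing them lazily when marked stale."""
    scenarios = load_scenarios(claim.id)
    if scenarios and not claim.scenarios_stale:
        return scenarios                      # fast path: stored copy
    fresh = recompute_scenarios(claim)        # LLM call, roughly $0.005-0.01
    store_scenarios(claim.id, fresh)
    claim.scenarios_stale = False
    claim.cache_updated_at = datetime.now(timezone.utc)
    claim.save()
    return fresh
{{/code}}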
433 ==== Verdict Synthesis ====
434 * **What**: Final conclusion text synthesizing all scenarios
435 * **Compute cost**: $0.002-0.005 per request
436 * **Frequency**: Every time claim viewed
437 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
438 * **Speed**: 2-3 seconds (acceptable)
439 **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
440 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
441 ==== Search Results ====
442 * **What**: Lists of claims matching search query
443 * **Compute from**: Elasticsearch index
444 * **Cache**: 15 minutes in Redis for popular queries
445 * **Why not store permanently**: Constantly changing, infinite possible queries
446 ==== Aggregated Statistics ====
447 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
448 * **Compute from**: Database queries
449 * **Cache**: 1 hour in Redis
450 * **Why not store**: Can be derived, relatively cheap to compute
451 ==== User Reputation ====
452 * **What**: Score based on contributions
453 * **Current design**: Stored in User table
454 * **Alternative**: Compute from Edit table
455 * **Trade-off**:
456 - Stored: Fast, simple
457 - Computed: Always accurate, no denormalization
458 * **Frequency**: Read on every user action
459 * **Compute cost**: Simple COUNT query, milliseconds
460 * **Decision**: ✅ STORE - Performance critical, read-heavy
461 === Summary Table ===
462 | Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
463 |-----------|---------|---------|----------------|----------|-----------|
464 | Claim core | ✅ | - | 1 KB | STORE | Essential |
465 | Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
466 | Sources | ✅ | - | 500 B (shared) | STORE | Track record |
467 | Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
468 | Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
469 | Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
470 | Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
471 | Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
472 | Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
473 | Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
474 | ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
475 | QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
476 | Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
477 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
478 **Total storage per claim**: ~18 KB (without edits and flags)
479 **For 1 million claims**:
480 * **Storage**: ~18 GB (manageable)
481 * **PostgreSQL**: ~$50/month (standard instance)
482 * **Redis cache**: ~$20/month (1 GB instance)
483 * **S3 archives**: ~$5/month (old edits)
484 * **Total**: ~$75/month infrastructure
485 **LLM cost savings by caching**:
486 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
487 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
488 * Verdict stored: Save $0.003 per claim = $3K per 1M claims
489 * **Total savings**: ~$35K per 1M claims vs recomputing every time
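A back-of-envelope check of the figures above, per 1 million claims:

{{code language="python"}}
claims = 1_000_000
storage_gb = claims * 18 / 1_000_000          # ~18 KB per claim -> ~18 GB
summary_savings = claims * 0.03               # $30,000
scenario_savings = claims * 0.20 * 0.01       # $2,000 (only ~20% viewed in detail)
verdict_savings = claims * 0.003              # $3,000
print(storage_gb, summary_savings + scenario_savings + verdict_savings)  # -> 18.0 GB, $35,000
{{/code}}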
490 === Recomputation Triggers ===
491 **When to mark cached data as stale and recompute:**
492 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
493 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
494 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
495 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
496 5. **Controversy detected** (high flag rate) → Recompute: risk score
497 **Recomputation strategy**:
498 * **Eager**: Immediately recompute (for user edits)
499 * **Lazy**: Recompute on next view (for system improvements)
500 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
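A sketch of how these triggers might be dispatched, combining eager recomputation for user edits with lazy staleness for system-wide changes. The event names and helpers are illustrative:

{{code language="python"}}
def on_event(event, claim):
    """Route recomputation triggers to eager or lazy handling (hypothetical helpers)."""
    if event == "claim_edited":
        recompute_all(claim)                           # eager: analysis, scenarios, verdict, scores
    elif event == "evidence_added":
        recompute(claim, parts=["scenarios", "verdict", "confidence_score"])
    elif event == "source_score_changed_gt_10":
        recompute(claim, parts=["confidence_score", "verdict"])
    elif event == "system_improvement_deployed":
        mark_stale(claim)                              # lazy: recompute on next view
    elif event == "controversy_detected":
        recompute(claim, parts=["risk_score"])
{{/code}}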
501 === Database Size Projection ===
502 **Year 1**: 10K claims
503 * Storage: 180 MB
504 * Cost: $10/month
505 **Year 3**: 100K claims
506 * Storage: 1.8 GB
507 * Cost: $30/month
508 **Year 5**: 1M claims
509 * Storage: 18 GB
510 * Cost: $75/month
511 **Year 10**: 10M claims
512 * Storage: 180 GB
513 * Cost: $300/month
514 * Optimization: Archive old claims to S3 ($5/TB/month)
515 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
516 == 3. Key Simplifications ==
517 * **Two content states only**: Published, Hidden
518 * **Three user roles only**: Reader, Contributor, Moderator
519 * **No complex versioning**: Linear edit history
520 * **Reputation-based permissions**: Not role hierarchy
521 * **Source track records**: Continuous evaluation
522 == 4. What Gets Stored in the Database ==
523 === 4.1 Primary Storage (PostgreSQL) ===
524 **Claims Table**:
525 * Current state only (latest version)
526 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
527 **Evidence Table**:
528 * All evidence records
529 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
530 **Scenario Table**:
531 * All scenarios for each claim
532 * Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
533 **Source Table**:
534 * Track record database (continuously updated)
535 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
536 **User Table**:
537 * All user accounts
538 * Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
539 **Edit Table**:
540 * Complete version history
541 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
542 **Flag Table**:
543 * User-reported issues
544 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
545 **ErrorPattern Table**:
546 * System improvement queue
547 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
548 **QualityMetric Table**:
549 * Time-series quality data
550 * Fields: id, metric_type, metric_category, value, target, timestamp
551 === 4.2 What's NOT Stored (Computed on-the-fly) ===
552 * **Verdicts**: Synthesized from evidence + scenarios when requested
553 * **Risk scores**: Recalculated based on current factors
554 * **Aggregated statistics**: Computed from base data
555 * **Search results**: Generated from Elasticsearch index
556 === 4.3 Cache Layer (Redis) ===
557 **Cached for performance**:
558 * Frequently accessed claims (TTL: 1 hour)
559 * Search results (TTL: 15 minutes)
560 * User sessions (TTL: 24 hours)
561 * Source track records (TTL: 1 hour)
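A sketch of the claim cache using the `redis-py` client; the key naming, serialization, and the PostgreSQL helper are assumptions, with the TTL matching the list above:

{{code language="python"}}
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_claim_cached(claim_id):
    """Serve a claim from Redis if present, otherwise load from PostgreSQL and cache it."""
    cached = r.get(f"claim:{claim_id}")
    if cached is not None:
        return json.loads(cached)
    claim = load_claim_from_postgres(claim_id)             # hypothetical DB helper
    r.setex(f"claim:{claim_id}", 3600, json.dumps(claim))  # TTL: 1 hour
    return claim
{{/code}}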
562 === 4.4 File Storage (S3) ===
563 **Archived content**:
564 * Old edit history (>3 months)
565 * Evidence documents (archived copies)
566 * Database backups
567 * Export files
568 === 4.5 Search Index (Elasticsearch) ===
569 **Indexed for search**:
570 * Claim assertions (full-text)
571 * Evidence excerpts (full-text)
572 * Scenario descriptions (full-text)
573 * Source names (autocomplete)
574 Synchronized from PostgreSQL via change data capture or periodic sync.
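A sketch of the periodic-sync path using the official Python Elasticsearch client (8.x); the index name and document shape are assumptions:

{{code language="python"}}
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_claim(claim):
    """Push one claim row into the search index for full-text queries."""
    es.index(index="claims", id=str(claim["id"]),
             document={"assertion": claim["assertion"],   # full-text
                       "domain": claim["domain"],
                       "status": claim["status"]})
{{/code}}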
575 == 5. Related Pages ==
576 * [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
577 * [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
578 * [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]