= Data Model =

**Implementation Status (Updated January 2026)**

This specification describes the **target** normalized data model. Current implementation (v2.6.33) differs significantly:

* **Storage**: All data stored as **JSON blobs in SQLite**, not normalized PostgreSQL tables
* **Scenarios**: **Replaced with KeyFactors** - decomposition questions, not separate entities
* **Caching**: Redis cache **not implemented**; no claim caching yet
* **Source Scoring**: Uses external MBFC bundle, not internal track record calculation
* **User System**: Not implemented (no authentication in current version)

This specification remains valuable as the target architecture for future versions.

See `Docs/STATUS/Documentation_Inconsistencies.md` for full comparison.

FactHarbor's data model is **simple, focused, and designed for automated processing**.

== 1. Core Entities ==

=== 1.1 Claim ===

Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count

==== Performance Optimization: Denormalized Fields ====

**Rationale**: The claims system is 95% reads and 5% writes. Denormalizing commonly read data cuts joins on common queries by about 70%, which speeds up list and search pages.

**Additional cached fields in claims table**:
* **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
* Avoids joining evidence table for listing/preview
* Updated when evidence is added/removed
* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
* **source_names** (TEXT[]): Array of source names for quick display
* Avoids joining through evidence to sources
* Updated when sources change
* Format: `["New York Times", "Nature Journal", ...]`
* **scenario_count** (INTEGER): Number of scenarios for this claim
* Quick metric without counting rows
* Updated when scenarios added/removed
* **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
* Helps invalidate stale caches
* Triggers background refresh if too old

**Update Strategy**:
* **Immediate**: Update on claim edit (user-facing)
* **Deferred**: Update via background job every hour (non-critical)
* **Invalidation**: Clear cache when source data changes significantly
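
The update strategy above can be sketched as a staleness check on `cache_updated_at`. This is a minimal illustration, not the project's implementation: the helper name `needs_refresh`, the `source_changed` flag, and the one-hour threshold (borrowed from the deferred job interval) are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the hourly deferred job defines the maximum acceptable cache age.
MAX_CACHE_AGE = timedelta(hours=1)

def needs_refresh(cache_updated_at, now=None, source_changed=False):
    """Return True when a claim's denormalized fields should be rebuilt."""
    now = now or datetime.now(timezone.utc)
    if cache_updated_at is None or source_changed:
        # Never cached, or invalidated because source data changed
        return True
    return now - cache_updated_at > MAX_CACHE_AGE
```

A background job could call this per claim and refresh only the rows for which it returns `True`.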

**Trade-offs**:
* ✅ 70% fewer joins on common queries
* ✅ Much faster claim list/search pages
* ✅ Better user experience
* ⚠️ Small storage increase (~10%)
* ⚠️ Need to keep caches in sync

=== 1.2 Evidence ===

Fields: claim_id, source_id, excerpt, url, relevance_score, supports

=== 1.3 Source ===

**Purpose**: Track reliability of information sources over time

**Fields**:
* **id** (UUID): Unique identifier
* **name** (text): Source name (e.g., "New York Times", "Nature Journal")
* **domain** (text): Website domain (e.g., "nytimes.com")
* **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
* **track_record_score** (0-100): Overall reliability score
* **accuracy_history** (JSON): Historical accuracy data
* **correction_frequency** (float): How often source publishes corrections
* **last_updated** (timestamp): When track record last recalculated

**How It Works**:
* Initial score based on source type (70 for academic journals, 30 for unknown)
* Updated weekly by a background job (see Source Scoring Process below)
* Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
* Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
* Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
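
The weighted formula and the quality thresholds above can be written out as follows. This is a sketch under stated assumptions: each component is taken to be scored on the same 0-100 scale as `track_record_score`, and the function names are illustrative only.

```python
# Weights from the formula above.
# Assumption: every component value is on a 0-100 scale.
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """Weighted sum of the five components, giving a 0-100 score."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

def quality_tier(score):
    """Map a 0-100 score to the quality thresholds listed above."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
```

Because the weights sum to 1.0, a source that scores the same value on every component receives that value as its overall score.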

**Current Implementation (v2.6.33):** Source reliability uses external **MBFC (Media Bias/Fact Check) bundle** instead of internal track record calculation. Scores are loaded from a configurable JSON file. See `Docs/ARCHITECTURE/Source_Reliability.md`.

**See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage

Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**

**Key**: Automated source reliability tracking

==== Source Scoring Process (Separation of Concerns) ====

**Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.

**The Problem**:
* Source scores should influence claim verdicts
* Claim verdicts should update source scores
* But: Direct feedback creates circular dependency and potential feedback loops

**The Solution**: Temporal separation

==== Weekly Background Job (Source Scoring) ====

Runs independently of claim analysis:

```python
from datetime import datetime, timezone

# Helper functions (get_claims_from_past_week, get_all_sources, ...) are
# placeholders for data-access code defined elsewhere.
def update_source_scores_weekly():
    """
    Background job: calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from the past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics (neutral 0.5 when the source was not cited)
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = datetime.now(timezone.utc)
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
```

==== Real-Time Claim Analysis (AKEL) ====

Uses source scores but never updates them:

```python
# gather_evidence, get_source, and synthesize_verdict are placeholders
# for pipeline code defined elsewhere.
def analyze_claim(claim_text):
    """
    Real-time: analyze a claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance_score * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
```

==== Monthly Audit (Quality Assurance) ====

Moderator review of flagged source scores:
* Verify scores make sense
* Detect gaming attempts
* Identify systematic biases
* Manual adjustments if needed

**Key Principles**:

✅ **Scoring and analysis are temporally separated**
* Source scoring: Weekly batch job
* Claim analysis: Real-time processing
* Never update scores during analysis

✅ **One-way data flow during processing**
* Claims READ source scores
* Claims NEVER WRITE source scores
* Updates happen in background only

✅ **Predictable update cycle**
* Sources update every Sunday 2 AM
* Claims always use last week's scores
* No mid-week score changes

✅ **Audit trail**
* Log all score changes
* Track score history
* Explainable calculations

**Benefits**:
* No circular dependencies
* Predictable behavior
* Easier to reason about
* Simpler testing
* Clear audit trail

**Example Timeline**:

```

Sunday 2 AM: Calculate source scores for past week
  → NYT score: 0.87 (up from 0.85)
  → Blog X score: 0.52 (down from 0.61)
Monday-Saturday: Claims processed using these scores
  → All claims this week use NYT=0.87
  → All claims this week use Blog X=0.52
Next Sunday 2 AM: Recalculate scores including this week's claims
  → NYT score: 0.89 (trending up)
  → Blog X score: 0.48 (trending down)
```

=== 1.4 Scenario ===

**Implementation Change:** Scenarios were **replaced with KeyFactors** in the current implementation. KeyFactors are decomposition questions discovered during the understanding phase, not separate stored entities. See `Docs/ARCHITECTURE/KeyFactors_Design.md` for the design rationale.

**Purpose**: Different interpretations or contexts for evaluating claims

**Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.

**Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)

**Fields**:
* **id** (UUID): Unique identifier
* **claim_id** (UUID): Foreign key to claim (one-to-many)
* **description** (text): Human-readable description of the scenario
* **assumptions** (JSONB): Key assumptions that define this scenario context
* **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
* **created_at** (timestamp): When scenario was created
* **updated_at** (timestamp): Last modification

**How Found**: Evidence search → Extract context → Create scenario → Link to claim

**Example**:

For claim "Vaccines reduce hospitalization":
* Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
* Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
* Scenario 3: "Immunocompromised patients" from specialist study

**V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.

=== 1.5 Verdict ===

**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.

**Core Fields**:
* **id** (UUID): Primary key
* **scenario_id** (UUID FK): The scenario being assessed
* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
* **confidence** (decimal 0-1): How confident we are in this assessment
* **explanation_summary** (text): Human-readable reasoning explaining the verdict
* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
* **created_at** (timestamp): When verdict was created
* **updated_at** (timestamp): Last modification

**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.

**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.

**Example**:

For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
* Edit entity records the complete before/after change with timestamp and reason

**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.

=== 1.6 User ===

Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count

==== User Reputation System ====

**V1.0 Approach**: Simple manual role assignment

**Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.

==== Roles (Manual Assignment) ====

**reader** (default):
* View published claims and evidence
* Browse and search content
* No editing permissions

**contributor**:
* Submit new claims
* Suggest edits to existing content
* Add evidence
* Requires manual promotion by moderator/admin

**moderator**:
* Approve/reject contributor suggestions
* Flag inappropriate content
* Handle abuse reports
* Assigned by admins based on trust

**admin**:
* Manage users and roles
* System configuration
* Access to all features
* Founder-appointed initially
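
The manually assigned roles above could be checked with a plain lookup table. This is a hedged sketch: the action names are invented for illustration, and treating each role as a superset of the one below it is an assumption (the text only states that admins have access to all features).

```python
# Hypothetical action names; each role is assumed to include the
# permissions of the role below it.
READER = {"view_content", "search_content"}
CONTRIBUTOR = READER | {"submit_claim", "suggest_edit", "add_evidence"}
MODERATOR = CONTRIBUTOR | {"review_suggestion", "flag_content", "handle_abuse_report"}
ADMIN = MODERATOR | {"manage_users", "configure_system"}

PERMISSIONS = {
    "reader": READER,
    "contributor": CONTRIBUTOR,
    "moderator": MODERATOR,
    "admin": ADMIN,
}

def can(role, action):
    """Return True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role, set())
```

A static table like this matches the V1.0 approach: no points, thresholds, or automated escalation are involved.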

==== Contribution Tracking (Simple) ====

**Basic metrics only**:
* `contributions_count`: Total number of contributions
* `created_at`: Account age
* `last_active`: Recent activity

**No complex calculations**:
* No point systems
* No automated privilege escalation
* No reputation decay
* No threshold-based promotions

==== Promotion Process ====

**Manual review by moderators/admins**:
1. User demonstrates value through contributions
2. Moderator reviews user's contribution history
3. Moderator promotes user to contributor role
4. Admin promotes trusted contributors to moderator

**Criteria** (guidelines, not automated):
* Quality of contributions
* Consistency over time
* Collaborative behavior
* Understanding of project goals

==== V2.0+ Evolution ====

**Add complex reputation when**:
* 100+ active contributors
* Manual role management becomes bottleneck
* Clear patterns of abuse emerge requiring automation

**Future features may include**:
* Automated point calculations
* Threshold-based promotions
* Reputation decay for inactive users
* Track record scoring for contributors

See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.

=== 1.7 Edit ===

**Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at

**Purpose**: Complete audit trail for all content changes

==== Edit History Details ====

**What Gets Edited**:
* **Claims** (20% edited): assertion, domain, status, scores, analysis
* **Evidence** (10% edited): excerpt, relevance_score, supports
* **Scenarios** (5% edited): description, assumptions, confidence
* **Sources**: NOT versioned (continuous updates, not editorial decisions)

**Who Edits**:
* **Contributors** (rep sufficient): Corrections, additions
* **Trusted Contributors** (rep sufficient): Major improvements, approvals
* **Moderators**: Abuse handling, dispute resolution
* **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)

**Edit Types**:
* `CONTENT_CORRECTION`: User fixes factual error
* `CLARIFICATION`: Improved wording
* `SYSTEM_REANALYSIS`: AKEL re-processed claim
* `MODERATION_ACTION`: Hide/unhide for abuse
* `REVERT`: Rollback to previous version

**Retention Policy** (5 years total):
1. **Hot storage** (3 months): PostgreSQL, instant access
2. **Warm storage** (2 years): Partitioned, slower queries
3. **Cold storage** (3 years): S3 compressed, download required
4. **Deletion**: After 5 years (except legal holds)
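
The retention tiers above can be expressed as a function of an edit's age. A minimal sketch, with two assumptions worth flagging: the tier durations are treated as cumulative, and because 3 months + 2 years + 3 years slightly exceeds the stated 5-year total, the cold tier is cut off at the 5-year mark. The function and constant names are illustrative.

```python
from datetime import timedelta

HOT_LIMIT = timedelta(days=90)                     # 3 months
WARM_LIMIT = HOT_LIMIT + timedelta(days=2 * 365)   # plus 2 years
DELETE_AFTER = timedelta(days=5 * 365)             # 5 years total (assumed hard cap)

def storage_tier(edit_age, legal_hold=False):
    """Return the storage tier for an edit record of the given age."""
    if edit_age <= HOT_LIMIT:
        return "hot"
    if edit_age <= WARM_LIMIT:
        return "warm"
    if edit_age <= DELETE_AFTER or legal_hold:
        # Records under legal hold stay in cold storage past the cap
        return "cold"
    return "delete"
```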

**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)

**Use Cases**:
* View claim history timeline
* Detect vandalism patterns
* Learn from user corrections (system improvement)
* Legal compliance (audit trail)
* Rollback capability

See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases

=== 1.8 Flag ===

Fields: entity_id, reported_by, issue_type, status, resolution_note

=== 1.9 QualityMetric ===

**Fields**: metric_type, category, value, target, timestamp

**Purpose**: Time-series quality tracking

**Usage**:
* **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
* **Quality dashboard**: Real-time display with trend charts
* **Alerting**: Automatic alerts when metrics exceed thresholds
* **A/B testing**: Compare control vs treatment metrics
* **Improvement validation**: Measure before/after changes

**Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
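
The alerting rule can be illustrated with the example record above. This sketch assumes that, for error-style metrics, any value above `target` should raise an alert; metrics where higher is better would need the comparison reversed. The class and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class QualityMetric:
    metric_type: str
    category: str
    value: float
    target: float
    timestamp: str

def exceeds_target(metric):
    """Assumption: lower is better, so a value above target triggers an alert."""
    return metric.value > metric.target

# The example record from the text: error rate 0.12 against a 0.10 target
example = QualityMetric("ErrorRate", "Politics", 0.12, 0.10, "2025-12-17")
```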

=== 1.10 ErrorPattern ===

**Fields**: error_category, claim_id, description, root_cause, frequency, status

**Purpose**: Capture errors to trigger system improvements

**Usage**:
* **Error capture**: When users flag issues or system detects problems
* **Pattern analysis**: Weekly grouping by category and frequency
* **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
* **Metrics**: Track error rate reduction over time

**Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
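
The weekly pattern analysis (grouping by category and frequency) might look like the following. It is a sketch only: records are modeled as plain dicts keyed by the field names listed above, and `rank_error_categories` is a hypothetical helper.

```python
from collections import Counter

def rank_error_categories(patterns):
    """Sum `frequency` per `error_category`, most frequent category first."""
    totals = Counter()
    for pattern in patterns:
        totals[pattern["error_category"]] += pattern["frequency"]
    return totals.most_common()
```

The top entries of the returned list are natural candidates for the Analyze → Fix → Test → Deploy workflow.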

== 1.4 Core Data Model ERD ==

{{include reference="FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}

== 1.5 User Class Diagram ==

{{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}

== 2. Versioning Strategy ==

**All Content Entities Are Versioned**:
* **Claim**: Every edit creates new version (V1→V2→V3...)
* **Evidence**: Changes tracked in edit history
* **Scenario**: Modifications versioned

**How Versioning Works**:
* Entity table stores **current state only**
* Edit table stores **all historical states** (before_state, after_state as JSON)
* Version number increments with each edit
* Complete audit trail maintained for the full retention period (5 years, see section 1.7)

**Unversioned Entities** (current state only, no history):
* **Source**: Track record continuously updated (not versioned history, just current score)
* **User**: Account state (reputation accumulated, not versioned)
* **QualityMetric**: Time-series data (each record is a point in time, not a version)
* **ErrorPattern**: System improvement queue (status tracked, not versioned)

**Example**:

```

Claim V1: "The sky is blue"
  → User edits →
Claim V2: "The sky is blue during daytime"
  → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
```
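
The versioning flow in the example can be sketched in a few lines: the entity holds only the current state, while each change appends one Edit record with full before/after snapshots. This is an illustration under assumptions, with entities modeled as dicts, the edit table as a list, and `apply_edit` as a hypothetical helper.

```python
import copy
from datetime import datetime, timezone

def apply_edit(entity, entity_type, changes, edit_log,
               user_id=None, edit_type="CLARIFICATION", reason=""):
    """Update the current state in place and append one Edit record."""
    before_state = copy.deepcopy(entity)
    entity.update(changes)
    entity["version"] = entity.get("version", 1) + 1
    edit_log.append({
        "entity_type": entity_type,
        "entity_id": entity["id"],
        "user_id": user_id,  # None for system (AKEL) edits
        "before_state": before_state,
        "after_state": copy.deepcopy(entity),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc),
    })
    return entity
```

Storing full snapshots rather than diffs keeps rollback simple: a `REVERT` edit only has to reapply a stored `before_state`.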

== 2.5. Storage vs Computation Strategy ==

**Critical architectural decision**: What to persist in databases vs compute dynamically?

**Trade-off**:
* **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
* **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible

=== Recommendation: Hybrid Approach ===

**STORE (in PostgreSQL):**

==== Claims (Current State + History) ====

* **What**: assertion, domain, status, created_at, updated_at, version
* **Why**: Core entity, must be persistent
* **Also store**: confidence_score (computed once, then cached)
* **Size**: ~1 KB per claim
* **Growth**: Linear with claims
* **Decision**: ✅ STORE - Essential

==== Evidence (All Records) ====

* **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
* **Why**: Hard to re-gather, user contributions, reproducibility
* **Size**: ~2 KB per evidence (with excerpt)
* **Growth**: 3-10 evidence per claim
* **Decision**: ✅ STORE - Essential for reproducibility

==== Sources (Track Records) ====

* **What**: name, domain, track_record_score, accuracy_history, correction_frequency
* **Why**: Continuously updated, expensive to recompute
* **Size**: ~500 bytes per source
* **Growth**: Slow (limited number of sources)
* **Decision**: ✅ STORE - Essential for quality

==== Edit History (All Versions) ====

* **What**: before_state, after_state, user_id, reason, timestamp
* **Why**: Audit trail, legal requirement, reproducibility
* **Size**: ~2 KB per edit
* **Growth**: Linear with edits (~20% of claims get edited)
* **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
* **Decision**: ✅ STORE - Essential for accountability

==== Flags (User Reports) ====

* **What**: entity_id, reported_by, issue_type, description, status
* **Why**: Error detection, system improvement triggers
* **Size**: ~500 bytes per flag
* **Growth**: Proportional to claims (5% or more get flagged)
* **Decision**: ✅ STORE - Essential for improvement

==== ErrorPatterns (System Improvement) ====

* **What**: error_category, claim_id, description, root_cause, frequency, status
* **Why**: Learning loop, prevent recurring errors
* **Size**: ~1 KB per pattern
* **Growth**: Slow (limited patterns, many fixed)
* **Decision**: ✅ STORE - Essential for learning

==== QualityMetrics (Time Series) ====

* **What**: metric_type, category, value, target, timestamp
* **Why**: Trend analysis, cannot recreate historical metrics
* **Size**: ~200 bytes per metric
* **Growth**: Hourly = 8,760 per year per metric type
* **Retention**: 2 years hot, then aggregate and archive
* **Decision**: ✅ STORE - Essential for monitoring

**STORE (Computed Once, Then Cached):**

==== Analysis Summary ====

* **What**: Neutral text summary of claim analysis (200-500 words)
* **Computed**: Once by AKEL when claim first analyzed
* **Stored in**: Claim table (text field)
* **Recomputed**: Only when system significantly improves OR claim edited
* **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
* **Size**: ~2 KB per claim
* **Decision**: ✅ STORE (cached) - Cost-effective

==== Confidence Score ====

* **What**: 0-100 score of analysis confidence
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When evidence added, source track record changes significantly, or system improves
* **Why store**: Cheap to store, expensive to compute, users need it fast
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

==== Risk Score ====

* **What**: 0-100 score of claim risk level
* **Computed**: Once by AKEL
* **Stored in**: Claim table (integer field)
* **Recomputed**: When domain changes, evidence changes, or controversy detected
* **Why store**: Same as confidence score
* **Size**: 4 bytes per claim
* **Decision**: ✅ STORE (cached) - Performance critical

**COMPUTE DYNAMICALLY (Never Store):**

==== Scenarios (⚠️ CRITICAL DECISION) ====

* **What**: 2-5 possible interpretations of claim with assumptions
* **Current design**: Stored in Scenario table
* **Alternative**: Compute on-demand when user views claim details
* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
* **Compute cost**: $0.005-0.01 per request (LLM API call)
* **Frequency**: Viewed in detail by ~20% of users
* **Trade-off analysis**:
- IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
- IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
* **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
* **Speed**: Computed = 5-8 seconds delay, Stored = instant
* **Decision**: ✅ STORE (hybrid approach below)
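
The arithmetic behind the trade-off analysis can be reproduced directly. This sketch uses only the figures quoted above (1M claims, ~3 KB per claim, a 20% view rate, $0.01 per LLM call) and decimal units (1 GB = 1,000,000 KB); the function names are illustrative.

```python
def stored_size_gb(claims, kb_per_claim):
    """Storage needed when scenarios are persisted (decimal GB)."""
    return claims * kb_per_claim / 1_000_000

def monthly_compute_cost(claims, view_rate, cost_per_call):
    """LLM cost in dollars when scenarios are regenerated for each detailed view."""
    return claims * view_rate * cost_per_call

storage_gb = stored_size_gb(1_000_000, 3)                   # 3 GB if stored
llm_cost = monthly_compute_cost(1_000_000, 0.20, 0.01)      # about $2,000 if computed
```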

**Scenario Strategy** (APPROVED):
1. **Store scenarios** initially when claim analyzed
2. **Mark as stale** when system improves significantly
3. **Recompute on next view** if marked stale
4. **Cache for 30 days** if frequently accessed
5. **Result**: Best of both worlds - speed + freshness
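
Steps 1 to 3 of the strategy amount to lazy recomputation behind a stale flag. A minimal sketch, assuming the claim is a dict with hypothetical `scenarios` and `scenarios_stale` keys and that `recompute` wraps the (expensive) LLM call; the 30-day cache from step 4 is left out.

```python
def get_scenarios(claim, recompute):
    """Return stored scenarios, regenerating them only when missing or stale."""
    if claim.get("scenarios") is None or claim.get("scenarios_stale", False):
        claim["scenarios"] = recompute(claim)
        claim["scenarios_stale"] = False
    return claim["scenarios"]

def mark_stale(claim):
    """Called after a significant system improvement; defers work to the next view."""
    claim["scenarios_stale"] = True
```

Readers see stored scenarios instantly in the common case, and the LLM cost is paid only for claims that are actually viewed after an improvement.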

==== Verdict Synthesis ====

* **What**: Final conclusion text synthesizing all scenarios
* **Compute cost**: $0.002-0.005 per request
* **Frequency**: Every time claim viewed
* **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
* **Speed**: 2-3 seconds (acceptable)
* **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
* **Recommendation**: ✅ STORE cached version, mark stale when changes occur

==== Search Results ====

* **What**: Lists of claims matching search query
* **Compute from**: Elasticsearch index
* **Cache**: 15 minutes in Redis for popular queries
* **Why not store permanently**: Constantly changing, infinite possible queries

==== Aggregated Statistics ====

* **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
* **Compute from**: Database queries
* **Cache**: 1 hour in Redis
* **Why not store**: Can be derived, relatively cheap to compute

==== User Reputation ====

* **What**: Score based on contributions
* **Current design**: Stored in User table
* **Alternative**: Compute from Edit table
* **Trade-off**:
- Stored: Fast, simple
- Computed: Always accurate, no denormalization
* **Frequency**: Read on every user action
* **Compute cost**: Simple COUNT query, milliseconds
* **Decision**: ✅ STORE - Performance critical, read-heavy

=== Summary Table ===

| Data Type | Store | Compute | Size per Claim | Decision | Rationale |
|-----------|---------|---------|----------------|----------|-----------|
| Claim core | ✅ | - | 1 KB | STORE | Essential |
| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
| Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |

**Total storage per claim**: ~18 KB (without edits and flags)

**For 1 million claims**:
* **Storage**: ~18 GB (manageable)
* **PostgreSQL**: ~$50/month (standard instance)
* **Redis cache**: ~$20/month (1 GB instance)
* **S3 archives**: ~$5/month (old edits)
* **Total**: ~$75/month infrastructure

**LLM cost savings by caching**:
* Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
* Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
* **Total savings**: ~$35K per 1M claims vs recomputing every time
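
The savings total can be checked with the per-claim figures listed above. Nothing here is new data; the variable names are illustrative and each computation is assumed to happen once per claim (scenarios once per detailed view).

```python
CLAIMS = 1_000_000

# Per-claim LLM cost avoided by caching, in dollars (figures from the list above)
savings = {
    "analysis_summary": 0.03 * CLAIMS,   # $30K
    "scenarios": 0.01 * 0.20 * CLAIMS,   # $2K (only ~20% of claims viewed in detail)
    "verdict": 0.003 * CLAIMS,           # $3K
}
total_savings = sum(savings.values())    # about $35K per 1M claims
```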

=== Recomputation Triggers ===

**When to mark cached data as stale and recompute:**
1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
2. **Evidence added** → Recompute: scenarios, verdict, confidence score
3. **Source track record changes >10 points** → Recompute: confidence score, verdict
4. **System improvement deployed** → Mark affected claims stale, recompute on next view
5. **Controversy detected** (high flag rate) → Recompute: risk score

**Recomputation strategy**:
* **Eager**: Immediately recompute (for user edits)
* **Lazy**: Recompute on next view (for system improvements)
* **Batch**: Nightly re-evaluation of stale claims (if <1000)
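
The triggers above can be captured as a lookup from event to the cached fields that must be recomputed. This is a sketch with assumed names: the trigger keys are invented, "scores" is read as confidence plus risk score, and system improvements are omitted because they are handled lazily via a stale flag rather than through this table.

```python
# Hypothetical trigger names mapped to the cached fields they invalidate.
RECOMPUTE_ON = {
    "claim_edited": {"analysis", "scenarios", "verdict", "confidence_score", "risk_score"},
    "evidence_added": {"scenarios", "verdict", "confidence_score"},
    "source_score_changed": {"confidence_score", "verdict"},
    "controversy_detected": {"risk_score"},
}

def fields_to_recompute(trigger, score_delta=0):
    """Return the cached fields to refresh for a trigger (empty set if none)."""
    # Source track record changes only count when they exceed 10 points
    if trigger == "source_score_changed" and abs(score_delta) <= 10:
        return set()
    return set(RECOMPUTE_ON.get(trigger, set()))
```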

=== Database Size Projection ===

**Year 1**: 10K claims
* Storage: 180 MB
* Cost: $10/month

**Year 3**: 100K claims
* Storage: 1.8 GB
* Cost: $30/month

**Year 5**: 1M claims
* Storage: 18 GB
* Cost: $75/month

**Year 10**: 10M claims
* Storage: 180 GB
* Cost: $300/month
* Optimization: Archive old claims to S3 ($5/TB/month)

**Conclusion**: Storage costs are manageable, LLM cost savings are substantial.

== 3. Key Simplifications ==

* **Two content states only**: Published, Hidden
* **Three user roles only**: Reader, Contributor, Moderator
* **No complex versioning**: Linear edit history
* **Reputation-based permissions**: Not role hierarchy
* **Source track records**: Continuous evaluation

== 3. What Gets Stored in the Database ==

=== 3.1 Primary Storage (PostgreSQL) ===

**Claims Table**:
* Current state only (latest version)
* Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at

**Evidence Table**:
* All evidence records
* Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived

**Scenario Table**:
* All scenarios for each claim
* Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at

**Source Table**:
* Track record database (continuously updated)
* Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count

**User Table**:
* All user accounts
* Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted

**Edit Table**:
* Complete version history
* Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at

**Flag Table**:
* User-reported issues
* Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at