Wiki source code of Data Model
Version 1.6 by Robert Schaub on 2025/12/24 11:48
| line-number | content | author | version |
|---|---|---|---|
| 1 | = Data Model = | ||
| 2 | |||
| 3 | FactHarbor's data model is **simple, focused, and designed for automated processing**. | ||
| 4 | |||
| 5 | == 1. Core Entities == | ||
| 6 | |||
| 7 | === 1.1 Claim === | ||
| 8 | |||
| 9 | Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count | ||
| 10 | |||
| 11 | ==== Performance Optimization: Denormalized Fields ==== | ||
| 12 | |||
| 13 | **Rationale**: The claims system is roughly 95% reads and 5% writes. Denormalizing commonly displayed data reduces joins and improves query performance by about 70%. | ||
| 14 | **Additional cached fields in claims table**: | ||
| 15 | |||
| 16 | * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores | ||
| 17 | * Avoids joining evidence table for listing/preview | ||
| 18 | * Updated when evidence is added/removed | ||
| 19 | * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]` | ||
| 20 | * **source_names** (TEXT[]): Array of source names for quick display | ||
| 21 | * Avoids joining through evidence to sources | ||
| 22 | * Updated when sources change | ||
| 23 | * Format: `["New York Times", "Nature Journal", ...]` | ||
| 24 | * **scenario_count** (INTEGER): Number of scenarios for this claim | ||
| 25 | * Quick metric without counting rows | ||
| 26 | * Updated when scenarios added/removed | ||
| 27 | * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed | ||
| 28 | * Helps invalidate stale caches | ||
| 29 | * Triggers background refresh if too old | ||
| 30 | **Update Strategy** (see the sketch after the trade-off list): | ||
| 31 | * **Immediate**: Update on claim edit (user-facing) | ||
| 32 | * **Deferred**: Update via background job every hour (non-critical) | ||
| 33 | * **Invalidation**: Clear cache when source data changes significantly | ||
| 34 | **Trade-offs**: | ||
| 35 | * ✅ 70% fewer joins on common queries | ||
| 36 | * ✅ Much faster claim list/search pages | ||
| 37 | * ✅ Better user experience | ||
| 38 | * ⚠️ Small storage increase (10%) | ||
| 39 | * ⚠️ Need to keep caches in sync | ||
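| |||
| | The refresh job for these cached fields could look like the following sketch. The `Claim` model and the helpers `top_evidence_for`, `source_names_for`, and `count_scenarios` are assumptions for illustration, not part of the schema: | ||
| | {{code language="python"}}from datetime import datetime, timezone | ||
| | def refresh_claim_cache(claim): | ||
| |     """Rebuild the denormalized fields for one claim (sketch only).""" | ||
| |     # Top 5 snippets in the evidence_summary format above; the | ||
| |     # hypothetical query is assumed to join in source names | ||
| |     top = top_evidence_for(claim.id, limit=5) | ||
| |     claim.evidence_summary = [ | ||
| |         {"text": e.excerpt, "source": e.source_name, "relevance": e.relevance_score} | ||
| |         for e in top | ||
| |     ] | ||
| |     claim.source_names = source_names_for(claim.id)   # hypothetical query | ||
| |     claim.scenario_count = count_scenarios(claim.id)  # hypothetical query | ||
| |     claim.cache_updated_at = datetime.now(timezone.utc) | ||
| |     claim.save(){{/code}} | ||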
| 40 | |||
| 41 | === 1.2 Evidence === | ||
| 42 | |||
| 43 | Fields: claim_id, source_id, excerpt, url, relevance_score, supports | ||
| 44 | |||
| 45 | === 1.3 Source === | ||
| 46 | |||
| 47 | **Purpose**: Track reliability of information sources over time | ||
| 48 | **Fields**: | ||
| 49 | |||
| 50 | * **id** (UUID): Unique identifier | ||
| 51 | * **name** (text): Source name (e.g., "New York Times", "Nature Journal") | ||
| 52 | * **domain** (text): Website domain (e.g., "nytimes.com") | ||
| 53 | * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc. | ||
| 54 | * **track_record_score** (0-100): Overall reliability score | ||
| 55 | * **accuracy_history** (JSON): Historical accuracy data | ||
| 56 | * **correction_frequency** (float): How often source publishes corrections | ||
| 57 | * **last_updated** (timestamp): When track record last recalculated | ||
| 58 | **How It Works**: | ||
| 59 | * Initial score based on source type (70 for academic journals, 30 for unknown) | ||
| 60 | * Updated daily by background scheduler | ||
| 61 | * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%) | ||
| 62 | * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality | ||
| 63 | * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable | ||
| 64 | **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage | ||
| 66 | **Key**: Automated source reliability tracking | ||
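| |||
| | As a sketch, the weighting formula above could be implemented as follows; the five component inputs are assumed to be pre-normalized to 0-100: | ||
| | {{code language="python"}}WEIGHTS = { | ||
| |     "accuracy_rate": 0.50, | ||
| |     "correction_policy": 0.20, | ||
| |     "editorial_standards": 0.15, | ||
| |     "bias_transparency": 0.10, | ||
| |     "longevity": 0.05, | ||
| | } | ||
| | def track_record_score(components): | ||
| |     """Weighted overall reliability score (0-100) from per-factor scores.""" | ||
| |     return sum(weight * components[name] for name, weight in WEIGHTS.items()) | ||
| | # Example: track_record_score({"accuracy_rate": 90, "correction_policy": 85, | ||
| | #     "editorial_standards": 80, "bias_transparency": 70, "longevity": 95}) -> 85.75{{/code}} | ||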
| 67 | |||
| 68 | ==== Source Scoring Process (Separation of Concerns) ==== | ||
| 69 | |||
| 70 | **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis. | ||
| 71 | **The Problem**: | ||
| 72 | |||
| 73 | * Source scores should influence claim verdicts | ||
| 74 | * Claim verdicts should update source scores | ||
| 75 | * But: Direct feedback creates circular dependency and potential feedback loops | ||
| 76 | **The Solution**: Temporal separation | ||
| 77 | |||
| 78 | ==== Weekly Background Job (Source Scoring) ==== | ||
| 79 | |||
| 80 | Runs independently of claim analysis: | ||
| 81 | {{code language="python"}}def update_source_scores_weekly(): | ||
| 82 | """ | ||
| 83 | Background job: Calculate source reliability | ||
| 84 | Never triggered by individual claim analysis | ||
| 85 | """ | ||
| 86 | # Analyze all claims from past week | ||
| 87 | claims = get_claims_from_past_week() | ||
| 88 | for source in get_all_sources(): | ||
| 89 | # Calculate accuracy metrics | ||
| 90 | correct_verdicts = count_correct_verdicts_citing(source, claims) | ||
| 91 | total_citations = count_total_citations(source, claims) | ||
| 92 | accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5 | ||
| 93 | # Weight by claim importance | ||
| 94 | weighted_score = calculate_weighted_score(source, claims) | ||
| 95 | # Update source record | ||
| 96 | source.track_record_score = weighted_score | ||
| 97 | source.total_citations = total_citations | ||
| 98 | source.last_updated = now() | ||
| 99 | source.save() | ||
| 100 | # Job runs: Sunday 2 AM UTC | ||
| 101 | # Never during claim processing{{/code}} | ||
| 102 | |||
| 103 | ==== Real-Time Claim Analysis (AKEL) ==== | ||
| 104 | |||
| 105 | Uses source scores but never updates them: | ||
| 106 | {{code language="python"}}def analyze_claim(claim_text): | ||
| 107 | """ | ||
| 108 | Real-time: Analyze claim using current source scores | ||
| 109 | READ source scores, never UPDATE them | ||
| 110 | """ | ||
| 111 | # Gather evidence | ||
| 112 | evidence_list = gather_evidence(claim_text) | ||
| 113 | for evidence in evidence_list: | ||
| 114 | # READ source score (snapshot from last weekly update) | ||
| 115 | source = get_source(evidence.source_id) | ||
| 116 | source_score = source.track_record_score | ||
| 117 | # Use score to weight evidence | ||
| 118 | evidence.weighted_relevance = evidence.relevance * source_score | ||
| 119 | # Generate verdict using weighted evidence | ||
| 120 | verdict = synthesize_verdict(evidence_list) | ||
| 121 | # NEVER update source scores here | ||
| 122 | # That happens in weekly background job | ||
| 123 | return verdict{{/code}} | ||
| 124 | |||
| 125 | ==== Monthly Audit (Quality Assurance) ==== | ||
| 126 | |||
| 127 | Moderator review of flagged source scores: | ||
| 128 | |||
| 129 | * Verify scores make sense | ||
| 130 | * Detect gaming attempts | ||
| 131 | * Identify systematic biases | ||
| 132 | * Manual adjustments if needed | ||
| 133 | **Key Principles**: | ||
| 134 | ✅ **Scoring and analysis are temporally separated** | ||
| 135 | * Source scoring: Weekly batch job | ||
| 136 | * Claim analysis: Real-time processing | ||
| 137 | * Never update scores during analysis | ||
| 138 | ✅ **One-way data flow during processing** | ||
| 139 | * Claims READ source scores | ||
| 140 | * Claims NEVER WRITE source scores | ||
| 141 | * Updates happen in background only | ||
| 142 | ✅ **Predictable update cycle** | ||
| 143 | * Sources update every Sunday 2 AM | ||
| 144 | * Claims always use last week's scores | ||
| 145 | * No mid-week score changes | ||
| 146 | ✅ **Audit trail** | ||
| 147 | * Log all score changes | ||
| 148 | * Track score history | ||
| 149 | * Explainable calculations | ||
| 150 | **Benefits**: | ||
| 151 | * No circular dependencies | ||
| 152 | * Predictable behavior | ||
| 153 | * Easier to reason about | ||
| 154 | * Simpler testing | ||
| 155 | * Clear audit trail | ||
| 156 | **Example Timeline**: | ||
| 157 | ``` | ||
| 158 | Sunday 2 AM: Calculate source scores for past week | ||
| 159 | → NYT score: 87 (up from 85) | ||
| 160 | → Blog X score: 52 (down from 61) | ||
| 161 | Monday-Saturday: Claims processed using these scores | ||
| 162 | → All claims this week use NYT=87 | ||
| 163 | → All claims this week use Blog X=52 | ||
| 164 | Next Sunday 2 AM: Recalculate scores including this week's claims | ||
| 165 | → NYT score: 89 (trending up) | ||
| 166 | → Blog X score: 48 (trending down) | ||
| 167 | ``` | ||
| 168 | |||
| 169 | === 1.4 Scenario === | ||
| 170 | |||
| 171 | **Purpose**: Different interpretations or contexts for evaluating claims | ||
| 172 | **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated. | ||
| 173 | **Relationship**: Many-to-one with Claim (**simplified for V1.0**: each scenario belongs to a single claim; a claim can have many scenarios) | ||
| 174 | **Fields**: | ||
| 175 | |||
| 176 | * **id** (UUID): Unique identifier | ||
| 177 | * **claim_id** (UUID): Foreign key to the owning claim (many scenarios per claim) | ||
| 178 | * **description** (text): Human-readable description of the scenario | ||
| 179 | * **assumptions** (JSONB): Key assumptions that define this scenario context | ||
| 180 | * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from | ||
| 181 | * **created_at** (timestamp): When scenario was created | ||
| 182 | * **updated_at** (timestamp): Last modification | ||
| 183 | **How Found**: Evidence search → Extract context → Create scenario → Link to claim | ||
| 184 | **Example**: | ||
| 185 | For claim "Vaccines reduce hospitalization": | ||
| 186 | * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper | ||
| 187 | * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data | ||
| 188 | * Scenario 3: "Immunocompromised patients" from specialist study | ||
| 189 | **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality. | ||
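| |||
| | A sketch of the "Evidence search → Extract context → Create scenario → Link to claim" flow; `extract_context` stands in for the LLM extraction step, and it, the `Scenario` constructor, and `dedupe_and_save` are assumptions for illustration: | ||
| | {{code language="python"}}def create_scenarios_for_claim(claim, evidence_list): | ||
| |     """Derive the context each piece of evidence applies to and | ||
| |     store it as a Scenario linked to the claim (sketch only).""" | ||
| |     scenarios = [] | ||
| |     for evidence in evidence_list: | ||
| |         context = extract_context(evidence.excerpt)  # hypothetical LLM call | ||
| |         scenarios.append(Scenario( | ||
| |             claim_id=claim.id, | ||
| |             description=context.description,  # e.g. "Clinical trials (healthy adults 18-65)" | ||
| |             assumptions=context.assumptions,  # JSONB: population, conditions, timeframe | ||
| |             extracted_from=evidence.id, | ||
| |         )) | ||
| |     return dedupe_and_save(scenarios)  # hypothetical: merges near-duplicate contexts{{/code}} | ||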
| 190 | |||
| 191 | === 1.5 Verdict === | ||
| 192 | |||
| 193 | **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence. | ||
| 194 | |||
| 195 | **Core Fields**: | ||
| 196 | |||
| 197 | * **id** (UUID): Primary key | ||
| 198 | * **scenario_id** (UUID FK): The scenario being assessed | ||
| 199 | * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)") | ||
| 200 | * **confidence** (decimal 0-1): How confident we are in this assessment | ||
| 201 | * **explanation_summary** (text): Human-readable reasoning explaining the verdict | ||
| 202 | * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown") | ||
| 203 | * **created_at** (timestamp): When verdict was created | ||
| 204 | * **updated_at** (timestamp): Last modification | ||
| 205 | |||
| 206 | **Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states. | ||
| 207 | |||
| 208 | **Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity. | ||
| 209 | |||
| 210 | **Example**: | ||
| 211 | For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)": | ||
| 212 | |||
| 213 | * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"] | ||
| 214 | * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"] | ||
| 215 | * Edit entity records the complete before/after change with timestamp and reason | ||
| 216 | |||
| 217 | **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios. | ||
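| |||
| | A sketch of such an update, assuming a hypothetical `record_edit` helper that writes the Edit row described in section 1.7: | ||
| | {{code language="python"}}def update_verdict(verdict, new_range, new_factors, reason, user_id=None): | ||
| |     """Mutate the verdict in place and log the before/after states. | ||
| |     user_id=None marks a system (AKEL) edit. Sketch only.""" | ||
| |     before = {"likelihood_range": verdict.likelihood_range, | ||
| |               "uncertainty_factors": verdict.uncertainty_factors} | ||
| |     verdict.likelihood_range = new_range | ||
| |     verdict.uncertainty_factors = new_factors | ||
| |     verdict.save() | ||
| |     record_edit(entity_type="Verdict", entity_id=verdict.id, user_id=user_id, | ||
| |                 before_state=before, | ||
| |                 after_state={"likelihood_range": new_range, | ||
| |                              "uncertainty_factors": new_factors}, | ||
| |                 edit_type="SYSTEM_REANALYSIS", reason=reason){{/code}} | ||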
| 218 | |||
| 219 | === 1.6 User === | ||
| 220 | |||
| 221 | Fields: username, email, **role** (Reader/Contributor/Moderator/Admin), **reputation**, contributions_count | ||
| 222 | |||
| 223 | ==== User Reputation System ==== | ||
| 224 | |||
| 225 | **V1.0 Approach**: Simple manual role assignment | ||
| 226 | **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary. | ||
| 227 | |||
| 228 | ==== Roles (Manual Assignment) ==== | ||
| 229 | |||
| 230 | **reader** (default): | ||
| 231 | |||
| 232 | * View published claims and evidence | ||
| 233 | * Browse and search content | ||
| 234 | * No editing permissions | ||
| 235 | **contributor**: | ||
| 236 | * Submit new claims | ||
| 237 | * Suggest edits to existing content | ||
| 238 | * Add evidence | ||
| 239 | * Requires manual promotion by moderator/admin | ||
| 240 | **moderator**: | ||
| 241 | * Approve/reject contributor suggestions | ||
| 242 | * Flag inappropriate content | ||
| 243 | * Handle abuse reports | ||
| 244 | * Assigned by admins based on trust | ||
| 245 | **admin**: | ||
| 246 | * Manage users and roles | ||
| 247 | * System configuration | ||
| 248 | * Access to all features | ||
| 249 | * Founder-appointed initially | ||
| 250 | |||
| 251 | ==== Contribution Tracking (Simple) ==== | ||
| 252 | |||
| 253 | **Basic metrics only**: | ||
| 254 | |||
| 255 | * `contributions_count`: Total number of contributions | ||
| 256 | * `created_at`: Account age | ||
| 257 | * `last_active`: Recent activity | ||
| 258 | **No complex calculations**: | ||
| 259 | * No point systems | ||
| 260 | * No automated privilege escalation | ||
| 261 | * No reputation decay | ||
| 262 | * No threshold-based promotions | ||
| 263 | |||
| 264 | ==== Promotion Process ==== | ||
| 265 | |||
| 266 | **Manual review by moderators/admins**: | ||
| 267 | |||
| 268 | 1. User demonstrates value through contributions | ||
| 269 | 2. Moderator reviews user's contribution history | ||
| 270 | 3. Moderator promotes user to contributor role | ||
| 271 | 4. Admin promotes trusted contributors to moderator | ||
| 272 | **Criteria** (guidelines, not automated): | ||
| 273 | |||
| 274 | * Quality of contributions | ||
| 275 | * Consistency over time | ||
| 276 | * Collaborative behavior | ||
| 277 | * Understanding of project goals | ||
| 278 | |||
| 279 | ==== V2.0+ Evolution ==== | ||
| 280 | |||
| 281 | **Add complex reputation when**: | ||
| 282 | |||
| 283 | * 100+ active contributors | ||
| 284 | * Manual role management becomes bottleneck | ||
| 285 | * Clear patterns of abuse emerge requiring automation | ||
| 286 | **Future features may include**: | ||
| 287 | * Automated point calculations | ||
| 288 | * Threshold-based promotions | ||
| 289 | * Reputation decay for inactive users | ||
| 290 | * Track record scoring for contributors | ||
| 291 | See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers. | ||
| 292 | |||
| 293 | === 1.7 Edit === | ||
| 294 | |||
| 295 | **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at | ||
| 296 | **Purpose**: Complete audit trail for all content changes | ||
| 297 | |||
| 298 | ==== Edit History Details ==== | ||
| 299 | |||
| 300 | **What Gets Edited**: | ||
| 301 | |||
| 302 | * **Claims** (20% edited): assertion, domain, status, scores, analysis | ||
| 303 | * **Evidence** (10% edited): excerpt, relevance_score, supports | ||
| 304 | * **Scenarios** (5% edited): description, assumptions, confidence | ||
| 305 | * **Sources**: NOT versioned (continuous updates, not editorial decisions) | ||
| 306 | **Who Edits**: | ||
| 307 | * **Contributors** (manually promoted): Corrections, additions | ||
| 308 | * **Trusted contributors** (promoted by moderators): Major improvements, approvals | ||
| 309 | * **Moderators**: Abuse handling, dispute resolution | ||
| 310 | * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL) | ||
| 311 | **Edit Types**: | ||
| 312 | * `CONTENT_CORRECTION`: User fixes factual error | ||
| 313 | * `CLARIFICATION`: Improved wording | ||
| 314 | * `SYSTEM_REANALYSIS`: AKEL re-processed claim | ||
| 315 | * `MODERATION_ACTION`: Hide/unhide for abuse | ||
| 316 | * `REVERT`: Rollback to previous version | ||
| 317 | **Retention Policy** (5 years total): | ||
| 318 | |||
| 319 | 1. **Hot storage** (3 months): PostgreSQL, instant access | ||
| 320 | 2. **Warm storage** (2 years): Partitioned, slower queries | ||
| 321 | 3. **Cold storage** (3 years): S3 compressed, download required | ||
| 322 | 4. **Deletion**: After 5 years (except legal holds) | ||
| 323 | **Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit) | ||
| 324 | **Use Cases**: | ||
| 325 | |||
| 326 | * View claim history timeline | ||
| 327 | * Detect vandalism patterns | ||
| 328 | * Learn from user corrections (system improvement) | ||
| 329 | * Legal compliance (audit trail) | ||
| 330 | * Rollback capability | ||
| 331 | See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases | ||
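| |||
| | The retention tiers above imply a periodic archival job along these lines; `move_to_warm`, `archive_to_s3`, and `purge` are assumed helpers, not an existing API: | ||
| | {{code language="python"}}from datetime import datetime, timedelta, timezone | ||
| | def tier_edit_history(): | ||
| |     """Move edit records between storage tiers by age (sketch only).""" | ||
| |     now = datetime.now(timezone.utc) | ||
| |     # Hot -> warm after 3 months | ||
| |     move_to_warm(older_than=now - timedelta(days=90)) | ||
| |     # Warm -> cold (compressed in S3) after a further 2 years | ||
| |     archive_to_s3(older_than=now - timedelta(days=90 + 2 * 365)) | ||
| |     # Delete after 5 years total, except records under legal hold | ||
| |     purge(older_than=now - timedelta(days=5 * 365), respect_legal_holds=True){{/code}} | ||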
| 332 | |||
| 333 | === 1.8 Flag === | ||
| 334 | |||
| 335 | Fields: entity_id, reported_by, issue_type, status, resolution_note | ||
| 336 | |||
| 337 | === 1.9 QualityMetric === | ||
| 338 | |||
| 339 | **Fields**: metric_type, category, value, target, timestamp | ||
| 340 | **Purpose**: Time-series quality tracking | ||
| 341 | **Usage**: | ||
| 342 | |||
| 343 | * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times | ||
| 344 | * **Quality dashboard**: Real-time display with trend charts | ||
| 345 | * **Alerting**: Automatic alerts when metrics exceed thresholds | ||
| 346 | * **A/B testing**: Compare control vs treatment metrics | ||
| 347 | * **Improvement validation**: Measure before/after changes | ||
| 348 | **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}` | ||
| 349 | |||
| 350 | === 1.10 ErrorPattern === | ||
| 351 | |||
| 352 | **Fields**: error_category, claim_id, description, root_cause, frequency, status | ||
| 353 | **Purpose**: Capture errors to trigger system improvements | ||
| 354 | **Usage**: | ||
| 355 | |||
| 356 | * **Error capture**: When users flag issues or system detects problems | ||
| 357 | * **Pattern analysis**: Weekly grouping by category and frequency | ||
| 358 | * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor | ||
| 359 | * **Metrics**: Track error rate reduction over time | ||
| 360 | **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}` | ||
| 361 | |||
| 362 | === 1.11 Core Data Model ERD === | ||
| 363 | |||
| 364 | {{include reference="Test.FactHarbor V0\.9\.100.Specification.Diagrams.Core Data Model ERD.WebHome"/}} | ||
| 365 | |||
| 366 | === 1.12 User Class Diagram === | ||
| 367 | |||
| 368 | {{include reference="Test.FactHarbor V0\.9\.100.Specification.Diagrams.User Class Diagram.WebHome"/}} | ||
| 369 | |||
| 370 | == 2. Versioning Strategy == | ||
| 371 | |||
| 372 | **All Content Entities Are Versioned**: | ||
| 373 | |||
| 374 | * **Claim**: Every edit creates new version (V1→V2→V3...) | ||
| 375 | * **Evidence**: Changes tracked in edit history | ||
| 376 | * **Scenario**: Modifications versioned | ||
| 377 | **How Versioning Works**: | ||
| 378 | * Entity table stores **current state only** | ||
| 379 | * Edit table stores **all historical states** (before_state, after_state as JSON) | ||
| 380 | * Version number increments with each edit | ||
| 381 | * Complete audit trail maintained forever | ||
| 382 | **Unversioned Entities** (current state only, no history): | ||
| 383 | * **Source**: Track record continuously updated (not versioned history, just current score) | ||
| 384 | * **User**: Account state (reputation accumulated, not versioned) | ||
| 385 | * **QualityMetric**: Time-series data (each record is a point in time, not a version) | ||
| 386 | * **ErrorPattern**: System improvement queue (status tracked, not versioned) | ||
| 387 | **Example**: | ||
| 388 | ``` | ||
| 389 | Claim V1: "The sky is blue" | ||
| 390 | → User edits → | ||
| 391 | Claim V2: "The sky is blue during daytime" | ||
| 392 | → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"} | ||
| 393 | ``` | ||
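| |||
| | In code, the edit in this example could look like the following sketch; `Edit.create` is an assumed persistence helper: | ||
| | {{code language="python"}}def apply_claim_edit(claim, new_assertion, user_id, reason): | ||
| |     """Update the current state and append the change to the Edit table. | ||
| |     The Claim row keeps only the latest version (sketch only).""" | ||
| |     before = {"assertion": claim.assertion, "version": claim.version} | ||
| |     claim.assertion = new_assertion | ||
| |     claim.version += 1 | ||
| |     claim.save() | ||
| |     Edit.create(entity_type="Claim", entity_id=claim.id, user_id=user_id, | ||
| |                 before_state=before, | ||
| |                 after_state={"assertion": new_assertion, "version": claim.version}, | ||
| |                 edit_type="CONTENT_CORRECTION", reason=reason){{/code}} | ||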
| 394 | |||
| 395 | == 3. Storage vs Computation Strategy == | ||
| 396 | |||
| 397 | **Critical architectural decision**: What to persist in databases vs compute dynamically? | ||
| 398 | **Trade-off**: | ||
| 399 | |||
| 400 | * **Store more**: better reproducibility, faster responses, lower LLM costs; but higher storage and maintenance costs | ||
| 401 | * **Compute more**: lower storage and maintenance costs; but slower responses, higher LLM costs, less reproducibility | ||
| 402 | |||
| 403 | === Recommendation: Hybrid Approach === | ||
| 404 | |||
| 405 | **STORE (in PostgreSQL):** | ||
| 406 | |||
| 407 | ==== Claims (Current State + History) ==== | ||
| 408 | |||
| 409 | * **What**: assertion, domain, status, created_at, updated_at, version | ||
| 410 | * **Why**: Core entity, must be persistent | ||
| 411 | * **Also store**: confidence_score (computed once, then cached) | ||
| 412 | * **Size**: 1 KB per claim | ||
| 413 | * **Growth**: Linear with claims | ||
| 414 | * **Decision**: ✅ STORE - Essential | ||
| 415 | |||
| 416 | ==== Evidence (All Records) ==== | ||
| 417 | |||
| 418 | * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at | ||
| 419 | * **Why**: Hard to re-gather, user contributions, reproducibility | ||
| 420 | * **Size**: 2 KB per evidence (with excerpt) | ||
| 421 | * **Growth**: 3-10 evidence per claim | ||
| 422 | * **Decision**: ✅ STORE - Essential for reproducibility | ||
| 423 | |||
| 424 | ==== Sources (Track Records) ==== | ||
| 425 | |||
| 426 | * **What**: name, domain, track_record_score, accuracy_history, correction_frequency | ||
| 427 | * **Why**: Continuously updated, expensive to recompute | ||
| 428 | * **Size**: 500 bytes per source | ||
| 429 | * **Growth**: Slow (limited number of sources) | ||
| 430 | * **Decision**: ✅ STORE - Essential for quality | ||
| 431 | |||
| 432 | ==== Edit History (All Versions) ==== | ||
| 433 | |||
| 434 | * **What**: before_state, after_state, user_id, reason, timestamp | ||
| 435 | * **Why**: Audit trail, legal requirement, reproducibility | ||
| 436 | * **Size**: 2 KB per edit | ||
| 437 | * **Growth**: Linear with edits (roughly 20% of claims get edited) | ||
| 438 | * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total | ||
| 439 | * **Decision**: ✅ STORE - Essential for accountability | ||
| 440 | |||
| 441 | ==== Flags (User Reports) ==== | ||
| 442 | |||
| 443 | * **What**: entity_id, reported_by, issue_type, description, status | ||
| 444 | * **Why**: Error detection, system improvement triggers | ||
| 445 | * **Size**: 500 bytes per flag | ||
| 446 | * **Growth**: Roughly 5-10% of claims get flagged | ||
| 447 | * **Decision**: ✅ STORE - Essential for improvement | ||
| 448 | |||
| 449 | ==== ErrorPatterns (System Improvement) ==== | ||
| 450 | |||
| 451 | * **What**: error_category, claim_id, description, root_cause, frequency, status | ||
| 452 | * **Why**: Learning loop, prevent recurring errors | ||
| 453 | * **Size**: 1 KB per pattern | ||
| 454 | * **Growth**: Slow (limited patterns, many fixed) | ||
| 455 | * **Decision**: ✅ STORE - Essential for learning | ||
| 456 | |||
| 457 | ==== QualityMetrics (Time Series) ==== | ||
| 458 | |||
| 459 | * **What**: metric_type, category, value, target, timestamp | ||
| 460 | * **Why**: Trend analysis, cannot recreate historical metrics | ||
| 461 | * **Size**: 200 bytes per metric | ||
| 462 | * **Growth**: Hourly = 8,760 per year per metric type | ||
| 463 | * **Retention**: 2 years hot, then aggregate and archive | ||
| 464 | * **Decision**: ✅ STORE - Essential for monitoring | ||
| 465 | **STORE (Computed Once, Then Cached):** | ||
| 466 | |||
| 467 | ==== Analysis Summary ==== | ||
| 468 | |||
| 469 | * **What**: Neutral text summary of claim analysis (200-500 words) | ||
| 470 | * **Computed**: Once by AKEL when claim first analyzed | ||
| 471 | * **Stored in**: Claim table (text field) | ||
| 472 | * **Recomputed**: Only when system significantly improves OR claim edited | ||
| 473 | * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often | ||
| 474 | * **Size**: 2 KB per claim | ||
| 475 | * **Decision**: ✅ STORE (cached) - Cost-effective | ||
| 476 | |||
| 477 | ==== Confidence Score ==== | ||
| 478 | |||
| 479 | * **What**: 0-100 score of analysis confidence | ||
| 480 | * **Computed**: Once by AKEL | ||
| 481 | * **Stored in**: Claim table (integer field) | ||
| 482 | * **Recomputed**: When evidence added, source track record changes significantly, or system improves | ||
| 483 | * **Why store**: Cheap to store, expensive to compute, users need it fast | ||
| 484 | * **Size**: 4 bytes per claim | ||
| 485 | * **Decision**: ✅ STORE (cached) - Performance critical | ||
| 486 | |||
| 487 | ==== Risk Score ==== | ||
| 488 | |||
| 489 | * **What**: 0-100 score of claim risk level | ||
| 490 | * **Computed**: Once by AKEL | ||
| 491 | * **Stored in**: Claim table (integer field) | ||
| 492 | * **Recomputed**: When domain changes, evidence changes, or controversy detected | ||
| 493 | * **Why store**: Same as confidence score | ||
| 494 | * **Size**: 4 bytes per claim | ||
| 495 | * **Decision**: ✅ STORE (cached) - Performance critical | ||
| 496 | **COMPUTE DYNAMICALLY (candidates; several are ultimately stored with staleness marking):** | ||
| 497 | |||
| 498 | ==== Scenarios ==== | ||
| 499 | |||
| 500 | ⚠️ CRITICAL DECISION | ||
| 501 | |||
| 502 | * **What**: 2-5 possible interpretations of claim with assumptions | ||
| 503 | * **Current design**: Stored in Scenario table | ||
| 504 | * **Alternative**: Compute on-demand when user views claim details | ||
| 505 | * **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim | ||
| 506 | * **Compute cost**: $0.005-0.01 per request (LLM API call) | ||
| 507 | * **Frequency**: Viewed in detail by 20% of users | ||
| 508 | * **Trade-off analysis**: | ||
| 509 | - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access | ||
| 510 | - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs | ||
| 511 | * **Reproducibility**: Scenarios may improve as AI improves (good to recompute) | ||
| 512 | * **Speed**: Computed = 5-8 seconds delay, Stored = instant | ||
| 513 | * **Decision**: ✅ STORE (hybrid approach below) | ||
| 514 | **Scenario Strategy** (APPROVED): | ||
| 515 | |||
| 516 | 1. **Store scenarios** initially when claim analyzed | ||
| 517 | 2. **Mark as stale** when system improves significantly | ||
| 518 | 3. **Recompute on next view** if marked stale | ||
| 519 | 4. **Cache for 30 days** if frequently accessed | ||
| 520 | 5. **Result**: Best of both worlds - speed + freshness | ||
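| |||
| | A sketch of the read path for this strategy; the `scenarios_stale` flag and the helper functions are assumptions, not the final schema: | ||
| | {{code language="python"}}def get_scenarios(claim): | ||
| |     """Serve stored scenarios; recompute on view only if marked stale.""" | ||
| |     if claim.scenarios_stale: | ||
| |         scenarios = regenerate_scenarios(claim)  # hypothetical LLM call | ||
| |         save_scenarios(claim.id, scenarios)      # hypothetical persistence helper | ||
| |         claim.scenarios_stale = False | ||
| |         claim.save() | ||
| |         return scenarios | ||
| |     return load_scenarios(claim.id)  # instant: no LLM cost on the common path{{/code}} | ||
| | Frequently accessed results can additionally sit in a cache such as Redis for the 30-day window mentioned above. | ||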
| 521 | |||
| 522 | ==== Verdict Synthesis ==== | ||
| 523 | |||
| 524 | |||
| 525 | |||
| 526 | * **What**: Final conclusion text synthesizing all scenarios | ||
| 527 | * **Compute cost**: $0.002-0.005 per request | ||
| 528 | * **Frequency**: Every time claim viewed | ||
| 529 | * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis | ||
| 530 | * **Speed**: 2-3 seconds (acceptable) | ||
| 531 | **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale | ||
| 532 | **Recommendation**: ✅ STORE cached version, mark stale when changes occur | ||
| 533 | |||
| 534 | ==== Search Results ==== | ||
| 535 | |||
| 536 | * **What**: Lists of claims matching search query | ||
| 537 | * **Compute from**: Elasticsearch index | ||
| 538 | * **Cache**: 15 minutes in Redis for popular queries | ||
| 539 | * **Why not store permanently**: Constantly changing, infinite possible queries | ||
| 540 | |||
| 541 | ==== Aggregated Statistics ==== | ||
| 542 | |||
| 543 | * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc. | ||
| 544 | * **Compute from**: Database queries | ||
| 545 | * **Cache**: 1 hour in Redis | ||
| 546 | * **Why not store**: Can be derived, relatively cheap to compute | ||
| 547 | |||
| 548 | ==== User Reputation ==== | ||
| 549 | |||
| 550 | * **What**: Score based on contributions | ||
| 551 | * **Current design**: Stored in User table | ||
| 552 | * **Alternative**: Compute from Edit table | ||
| 553 | * **Trade-off**: | ||
| 554 | - Stored: Fast, simple | ||
| 555 | - Computed: Always accurate, no denormalization | ||
| 556 | * **Frequency**: Read on every user action | ||
| 557 | * **Compute cost**: Simple COUNT query, milliseconds | ||
| 558 | * **Decision**: ✅ STORE - Performance critical, read-heavy | ||
| 559 | |||
| 560 | === Summary Table === | ||
| 561 | |||
| 562 | | Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\ | ||
| 563 | |-|-|-|-|-|-|\\ | ||
| 564 | | Claim core | ✅ | - | 1 KB | STORE | Essential |\\ | ||
| 565 | | Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\ | ||
| 566 | | Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\ | ||
| 567 | | Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\ | ||
| 568 | | Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\ | ||
| 569 | | Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\ | ||
| 570 | | Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\ | ||
| 571 | | Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\ | ||
| 572 | | Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\ | ||
| 573 | | Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\ | ||
| 574 | | ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\ | ||
| 575 | | QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\ | ||
| 576 | | Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\ | ||
| 577 | | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable | | ||
| 578 | **Total storage per claim**: 18 KB (without edits and flags) | ||
| 579 | **For 1 million claims**: | ||
| 580 | |||
| 581 | * **Storage**: 18 GB (manageable) | ||
| 582 | * **PostgreSQL**: $50/month (standard instance) | ||
| 583 | * **Redis cache**: $20/month (1 GB instance) | ||
| 584 | * **S3 archives**: $5/month (old edits) | ||
| 585 | * **Total**: $75/month infrastructure | ||
| 586 | **LLM cost savings by caching**: | ||
| 587 | * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims | ||
| 588 | * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims | ||
| 589 | * Verdict stored: Save $0.003 per claim = $3K per 1M claims | ||
| 590 | * **Total savings**: $35K per 1M claims vs recomputing every time | ||
| 591 | |||
| 592 | === Recomputation Triggers === | ||
| 593 | |||
| 594 | **When to mark cached data as stale and recompute:** | ||
| 595 | |||
| 596 | 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores) | ||
| 597 | 2. **Evidence added** → Recompute: scenarios, verdict, confidence score | ||
| 598 | 3. **Source track record changes >10 points** → Recompute: confidence score, verdict | ||
| 599 | 4. **System improvement deployed** → Mark affected claims stale, recompute on next view | ||
| 600 | 5. **Controversy detected** (high flag rate) → Recompute: risk score | ||
| 601 | **Recomputation strategy**: | ||
| 602 | |||
| 603 | * **Eager**: Immediately recompute (for user edits) | ||
| 604 | * **Lazy**: Recompute on next view (for system improvements) | ||
| 605 | * **Batch**: Nightly re-evaluation of stale claims (if <1000) | ||
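| |||
| | The triggers and strategies above can be expressed as a small dispatch table; `mark_stale` and `recompute` are assumed helpers: | ||
| | {{code language="python"}}RECOMPUTE_ON = { | ||
| |     "claim_edited":         {"analysis", "scenarios", "verdict", "confidence", "risk"}, | ||
| |     "evidence_added":       {"scenarios", "verdict", "confidence"}, | ||
| |     "source_score_shift":   {"confidence", "verdict"},  # track record moved >10 points | ||
| |     "system_improvement":   {"analysis", "scenarios", "verdict"}, | ||
| |     "controversy_detected": {"risk"}, | ||
| | } | ||
| | def handle_trigger(event, claim): | ||
| |     """Mark affected artifacts stale; recompute eagerly only for user edits.""" | ||
| |     targets = RECOMPUTE_ON[event] | ||
| |     mark_stale(claim, targets)   # lazy path: recomputed on next view or nightly batch | ||
| |     if event == "claim_edited":  # eager path | ||
| |         recompute(claim, targets){{/code}} | ||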
| 606 | |||
| 607 | === Database Size Projection === | ||
| 608 | |||
| 609 | **Year 1**: 10K claims | ||
| 610 | |||
| 611 | * Storage: 180 MB | ||
| 612 | * Cost: $10/month | ||
| 613 | **Year 3**: 100K claims | ||
| 614 | * Storage: 1.8 GB | ||
| 615 | * Cost: $30/month | ||
| 616 | **Year 5**: 1M claims | ||
| 617 | * Storage: 18 GB | ||
| 618 | * Cost: $75/month | ||
| 619 | **Year 10**: 10M claims | ||
| 620 | * Storage: 180 GB | ||
| 621 | * Cost: $300/month | ||
| 622 | * Optimization: Archive old claims to S3 ($5/TB/month) | ||
| 623 | **Conclusion**: Storage costs are manageable, LLM cost savings are substantial. | ||
| 624 | |||
| 625 | == 4. Key Simplifications == | ||
| 626 | |||
| 627 | * **Two content states only**: Published, Hidden | ||
| 628 | * **Four user roles only**: Reader, Contributor, Moderator, Admin | ||
| 629 | * **No complex versioning**: Linear edit history | ||
| 630 | * **Manual role assignment**: No automated reputation system in V1.0 | ||
| 631 | * **Source track records**: Continuous evaluation | ||
| 632 | |||
| 633 | == 5. What Gets Stored in the Database == | ||
| 634 | |||
| 635 | === 5.1 Primary Storage (PostgreSQL) === | ||
| 636 | |||
| 637 | **Claims Table**: | ||
| 638 | |||
| 639 | * Current state only (latest version) | ||
| 640 | * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at | ||
| 641 | **Evidence Table**: | ||
| 642 | * All evidence records | ||
| 643 | * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived | ||
| 644 | **Scenario Table**: | ||
| 645 | * All scenarios for each claim | ||
| 646 | * Fields: id, claim_id, description, assumptions (JSONB), extracted_from, confidence, created_by, created_at | ||
| | **Verdict Table**: | ||
| | * One verdict per scenario (see section 1.5) | ||
| | * Fields: id, scenario_id, likelihood_range, confidence, explanation_summary, uncertainty_factors (text array), created_at, updated_at | ||
| 647 | **Source Table**: | ||
| 648 | * Track record database (continuously updated) | ||
| 649 | * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count | ||
| 650 | **User Table**: | ||
| 651 | * All user accounts | ||
| 652 | * Fields: id, username, email, role (Reader/Contributor/Moderator/Admin), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted | ||
| 653 | **Edit Table**: | ||
| 654 | * Complete version history | ||
| 655 | * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at | ||
| 656 | **Flag Table**: | ||
| 657 | * User-reported issues | ||
| 658 | * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at | ||
| 659 | **ErrorPattern Table**: | ||
| 660 | * System improvement queue | ||
| 661 | * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at | ||
| 662 | **QualityMetric Table**: | ||
| 663 | * Time-series quality data | ||
| 664 | * Fields: id, metric_type, metric_category, value, target, timestamp | ||
| 665 | |||
| 666 | === 5.2 What's Computed Dynamically or Cached === | ||
| 667 |||
| 668 | * **Verdict syntheses**: Stored as cached text, re-synthesized from evidence + scenarios only when marked stale (see Storage vs Computation Strategy) | ||
| 669 | * **Risk scores**: Cached, recalculated when domain, evidence, or controversy signals change | ||
| 670 | * **Aggregated statistics**: Computed from base data, cached briefly | ||
| 671 | * **Search results**: Generated from Elasticsearch index | ||
| 672 | |||
| 673 | === 5.3 Cache Layer (Redis) === | ||
| 674 | |||
| 675 | **Cached for performance**: | ||
| 676 | |||
| 677 | * Frequently accessed claims (TTL: 1 hour) | ||
| 678 | * Search results (TTL: 15 minutes) | ||
| 679 | * User sessions (TTL: 24 hours) | ||
| 680 | * Source track records (TTL: 1 hour) | ||
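| |||
| | A sketch of the TTL scheme using the redis-py client; the key names and payloads are illustrative: | ||
| | {{code language="python"}}import json | ||
| | import redis | ||
| | r = redis.Redis() | ||
| | TTL_SECONDS = {"claim": 3600, "search": 900, "session": 86400, "source": 3600} | ||
| | def cache_claim(claim_id, payload): | ||
| |     """Store a claim; Redis evicts it automatically when the TTL expires.""" | ||
| |     r.setex(f"claim:{claim_id}", TTL_SECONDS["claim"], json.dumps(payload)) | ||
| | def get_cached_claim(claim_id): | ||
| |     raw = r.get(f"claim:{claim_id}") | ||
| |     return json.loads(raw) if raw else None  # miss: fall through to PostgreSQL{{/code}} | ||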
| 681 | |||
| 682 | === 5.4 File Storage (S3) === | ||
| 683 | |||
| 684 | **Archived content**: | ||
| 685 | |||
| 686 | * Old edit history (>3 months) | ||
| 687 | * Evidence documents (archived copies) | ||
| 688 | * Database backups | ||
| 689 | * Export files | ||
| 690 | |||
| 691 | === 5.5 Search Index (Elasticsearch) === | ||
| 692 | |||
| 693 | **Indexed for search**: | ||
| 694 | |||
| 695 | * Claim assertions (full-text) | ||
| 696 | * Evidence excerpts (full-text) | ||
| 697 | * Scenario descriptions (full-text) | ||
| 698 | * Source names (autocomplete) | ||
| 699 | Synchronized from PostgreSQL via change data capture or periodic sync. | ||
| 700 | |||
| 701 | == 6. Related Pages == | ||
| 702 | |||
| 703 | * [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]] | ||
| 704 | * [[Requirements>>Test.FactHarbor V0\.9\.100.Specification.Requirements.WebHome]] | ||
| 705 | * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]] |