Wiki source code of Data Model

Last modified by Robert Schaub on 2026/02/08 08:32

1 = Data Model =
2
3 {{warning}}
4 **Implementation Status (Updated January 2026)**
5
6 This specification describes the **target** normalized data model. Current implementation (v2.6.33) differs significantly:
7
8 * **Storage**: All data stored as **JSON blobs in SQLite**, not normalized PostgreSQL tables
9 * **Scenarios**: **Replaced with KeyFactors** - decomposition questions, not separate entities
10 * **Caching**: Redis cache **not implemented**; no claim caching yet
11 * **Source Scoring**: Uses external MBFC bundle, not internal track record calculation
12 * **User System**: Not implemented (no authentication in current version)
13
14 This specification remains valuable as the target architecture for future versions.
15
16 See `Docs/STATUS/Documentation_Inconsistencies.md` for full comparison.
17 {{/warning}}
18
19 FactHarbor's data model is **simple, focused, and designed for automated processing**.
20
21 == 1. Core Entities ==
22
23 === 1.1 Claim ===
24
25 Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
26
27 ==== Performance Optimization: Denormalized Fields ====
28
29 **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
30 **Additional cached fields in claims table**:
31
32 * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
33 ** Avoids joining the evidence table for listing/preview
34 ** Updated when evidence is added/removed
35 ** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
36 * **source_names** (TEXT[]): Array of source names for quick display
37 ** Avoids joining through evidence to sources
38 ** Updated when sources change
39 ** Format: `["New York Times", "Nature Journal", ...]`
40 * **scenario_count** (INTEGER): Number of scenarios for this claim
41 ** Quick metric without counting rows
42 ** Updated when scenarios are added/removed
43 * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
44 ** Helps invalidate stale caches
45 ** Triggers a background refresh if too old
46 **Update Strategy**:
47 * **Immediate**: Update on claim edit (user-facing)
48 * **Deferred**: Update via background job every hour (non-critical)
49 * **Invalidation**: Clear cache when source data changes significantly
50 **Trade-offs**:
51 * ✅ 70% fewer joins on common queries
52 * ✅ Much faster claim list/search pages
53 * ✅ Better user experience
54 * ⚠️ Small storage increase (10%)
55 * ⚠️ Need to keep caches in sync
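As an illustration, the refresh logic behind these denormalized fields could look like the sketch below. All names here (`refresh_claim_cache`, the dict shapes) are hypothetical stand-ins, not part of the schema:

```python
from datetime import datetime, timezone

def refresh_claim_cache(claim, evidence_rows, max_snippets=5):
    """Rebuild the denormalized fields on a claim row from its evidence.

    claim: dict standing in for a claims-table row (hypothetical shape).
    evidence_rows: list of dicts with 'text', 'source', 'relevance'.
    """
    # Top-N most relevant snippets, mirroring the evidence_summary format above
    top = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)[:max_snippets]
    claim["evidence_summary"] = [
        {"text": e["text"], "source": e["source"], "relevance": e["relevance"]}
        for e in top
    ]
    # Distinct source names for quick display, first-seen order preserved
    claim["source_names"] = list(dict.fromkeys(e["source"] for e in evidence_rows))
    claim["scenario_count"] = len(claim.get("scenarios", []))
    claim["cache_updated_at"] = datetime.now(timezone.utc)
    return claim
```

In this sketch the function would be called both on the immediate path (claim edit) and from the hourly background job; only the trigger differs.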
56
57 === 1.2 Evidence ===
58
59 Fields: claim_id, source_id, excerpt, url, relevance_score, supports
60
61 === 1.3 Source ===
62
63 **Purpose**: Track reliability of information sources over time
64 **Fields**:
65
66 * **id** (UUID): Unique identifier
67 * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
68 * **domain** (text): Website domain (e.g., "nytimes.com")
69 * **type** (enum): NewsOutlet, AcademicJournal, GovernmentAgency, etc.
70 * **track_record_score** (0-100): Overall reliability score
71 * **accuracy_history** (JSON): Historical accuracy data
72 * **correction_frequency** (float): How often source publishes corrections
73 * **last_updated** (timestamp): When track record last recalculated
74 **How It Works**:
75 * Initial score based on source type (70 for academic journals, 30 for unknown)
76 * Updated daily by background scheduler
77 * Formula: accuracy_rate (50%) + correction_policy (20%) + editorial_standards (15%) + bias_transparency (10%) + longevity (5%)
78 * Track Record Check in AKEL pipeline: Adjusts evidence confidence based on source quality
79 * Quality thresholds: 90+=Exceptional, 70-89=Reliable, 50-69=Acceptable, 30-49=Questionable, <30=Unreliable
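The weighting formula and quality thresholds above can be expressed directly. This is a sketch assuming every component is reported on the same 0-100 scale; the function names are illustrative, not the actual API:

```python
# Component weights from the formula above (sum to 1.0)
WEIGHTS = {
    "accuracy_rate": 0.50,
    "correction_policy": 0.20,
    "editorial_standards": 0.15,
    "bias_transparency": 0.10,
    "longevity": 0.05,
}

def track_record_score(components):
    """Weighted 0-100 score; each component is assumed to be 0-100."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)

def quality_band(score):
    """Map a score to the quality thresholds listed above."""
    if score >= 90:
        return "Exceptional"
    if score >= 70:
        return "Reliable"
    if score >= 50:
        return "Acceptable"
    if score >= 30:
        return "Questionable"
    return "Unreliable"
```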
80
81 {{info}}
82 **Current Implementation (v2.6.33):** Source reliability uses external **MBFC (Media Bias/Fact Check) bundle** instead of internal track record calculation. Scores are loaded from a configurable JSON file. See `Docs/ARCHITECTURE/Source_Reliability.md`.
83 {{/info}}
84
85 **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
87 **Key**: Automated source reliability tracking
88
89 ==== Source Scoring Process (Separation of Concerns) ====
90
91 **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
92 **The Problem**:
93
94 * Source scores should influence claim verdicts
95 * Claim verdicts should update source scores
96 * But: Direct feedback creates circular dependency and potential feedback loops
97 **The Solution**: Temporal separation
98
99 ==== Weekly Background Job (Source Scoring) ====
100
101 Runs independently of claim analysis:
102 {{code language="python"}}def update_source_scores_weekly():
103     """
104     Background job: calculate source reliability.
105     Never triggered by individual claim analysis.
106     """
107     # Analyze all claims from the past week
108     claims = get_claims_from_past_week()
109     for source in get_all_sources():
110         # Calculate accuracy metrics
111         correct_verdicts = count_correct_verdicts_citing(source, claims)
112         total_citations = count_total_citations(source, claims)
113         accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
114         # Weight the accuracy by claim importance
115         weighted_score = calculate_weighted_score(source, accuracy, claims)
116         # Update the source record
117         source.track_record_score = weighted_score
118         source.total_citations = total_citations
119         source.last_updated = now()
120         source.save()
121 # Job runs: Sunday 2 AM UTC
122 # Never during claim processing{{/code}}
123
124 ==== Real-Time Claim Analysis (AKEL) ====
125
126 Uses source scores but never updates them:
127 {{code language="python"}}def analyze_claim(claim_text):
128     """
129     Real-time: analyze a claim using current source scores.
130     READ source scores, never UPDATE them.
131     """
132     # Gather evidence
133     evidence_list = gather_evidence(claim_text)
134     for evidence in evidence_list:
135         # READ source score (snapshot from last weekly update)
136         source = get_source(evidence.source_id)
137         source_score = source.track_record_score  # 0-100 scale
138         # Use the normalized score to weight evidence
139         evidence.weighted_relevance = evidence.relevance * (source_score / 100)
140     # Generate verdict using weighted evidence
141     verdict = synthesize_verdict(evidence_list)
142     # NEVER update source scores here
143     # That happens in the weekly background job
144     return verdict{{/code}}
145
146 ==== Monthly Audit (Quality Assurance) ====
147
148 Moderator review of flagged source scores:
149
150 * Verify scores make sense
151 * Detect gaming attempts
152 * Identify systematic biases
153 * Manual adjustments if needed
154 **Key Principles**:
155 ✅ **Scoring and analysis are temporally separated**
156 * Source scoring: Weekly batch job
157 * Claim analysis: Real-time processing
158 * Never update scores during analysis
159 ✅ **One-way data flow during processing**
160 * Claims READ source scores
161 * Claims NEVER WRITE source scores
162 * Updates happen in background only
163 ✅ **Predictable update cycle**
164 * Sources update every Sunday 2 AM
165 * Claims always use last week's scores
166 * No mid-week score changes
167 ✅ **Audit trail**
168 * Log all score changes
169 * Track score history
170 * Explainable calculations
171 **Benefits**:
172 * No circular dependencies
173 * Predictable behavior
174 * Easier to reason about
175 * Simpler testing
176 * Clear audit trail
177 **Example Timeline**:
178 ```
179 Sunday 2 AM: Calculate source scores for past week
180 → NYT score: 87 (up from 85)
181 → Blog X score: 52 (down from 61)
182 Monday-Saturday: Claims processed using these scores
183 → All claims this week use NYT=87
184 → All claims this week use Blog X=52
185 Next Sunday 2 AM: Recalculate scores including this week's claims
186 → NYT score: 89 (trending up)
187 → Blog X score: 48 (trending down)
188 ```
189
190 === 1.4 Scenario ===
191
192 {{warning}}
193 **Implementation Change:** Scenarios were **replaced with KeyFactors** in the current implementation. KeyFactors are decomposition questions discovered during the understanding phase, not separate stored entities. See `Docs/ARCHITECTURE/KeyFactors_Design.md` for the design rationale.
194 {{/warning}}
195
196 **Purpose**: Different interpretations or contexts for evaluating claims
197 **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
198 **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
199 **Fields**:
200
201 * **id** (UUID): Unique identifier
202 * **claim_id** (UUID): Foreign key to claim (one-to-many)
203 * **description** (text): Human-readable description of the scenario
204 * **assumptions** (JSONB): Key assumptions that define this scenario context
205 * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
206 * **created_at** (timestamp): When scenario was created
207 * **updated_at** (timestamp): Last modification
208 **How Found**: Evidence search → Extract context → Create scenario → Link to claim
209 **Example**:
210 For claim "Vaccines reduce hospitalization":
211 * Scenario 1: "Clinical trials (healthy adults 18-65, original strain)" from trial paper
212 * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
213 * Scenario 3: "Immunocompromised patients" from specialist study
214 **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
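The "Evidence search → Extract context → Create scenario → Link to claim" flow might be modeled as below. This is a hypothetical sketch using a dataclass stand-in; the real system would persist these rows in the Scenario table:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from uuid import UUID, uuid4

@dataclass
class Scenario:
    claim_id: UUID            # one-to-many: scenario belongs to a single claim (V1.0)
    description: str          # human-readable description
    assumptions: dict         # JSONB in the schema above
    extracted_from: UUID      # evidence this scenario was extracted from
    id: UUID = field(default_factory=uuid4)
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def scenario_from_evidence(claim_id, evidence_id, context):
    """Evidence search -> extract context -> create scenario -> link to claim."""
    return Scenario(
        claim_id=claim_id,
        description=context["description"],
        assumptions=context.get("assumptions", {}),
        extracted_from=evidence_id,
    )
```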
215
216 === 1.5 Verdict ===
217
218 **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
219
220 **Core Fields**:
221
222 * **id** (UUID): Primary key
223 * **scenario_id** (UUID FK): The scenario being assessed
224 * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
225 * **confidence** (decimal 0-1): How confident we are in this assessment
226 * **explanation_summary** (text): Human-readable reasoning explaining the verdict
227 * **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
228 * **created_at** (timestamp): When verdict was created
229 * **updated_at** (timestamp): Last modification
230
231 **Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
232
233 **Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
234
235 **Example**:
236 For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
237
238 * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
239 * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
240 * Edit entity records the complete before/after change with timestamp and reason
241
242 **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
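A sketch of how a verdict update could be logged through the Edit entity. Dicts stand in for database rows here; `update_verdict` and the exact log shape are illustrative, not the actual API:

```python
import json
from datetime import datetime, timezone

def update_verdict(verdict, changes, user_id, reason, edit_log):
    """Apply changes to a verdict and record before/after states in the Edit log."""
    before = json.dumps(verdict, sort_keys=True)
    verdict.update(changes)
    edit_log.append({
        "entity_type": "Verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,          # None for system (AKEL) edits, per section 1.7
        "before_state": before,
        "after_state": json.dumps(verdict, sort_keys=True),
        "edit_type": "CONTENT_CORRECTION",
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    })
    return verdict
```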
243
244 === 1.6 User ===
245
246 Fields: username, email, **role** (Reader/Contributor/Moderator/Admin), **reputation**, contributions_count
247
248 === User Reputation System ===
249
250 **V1.0 Approach**: Simple manual role assignment
251 **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
252
253 === Roles (Manual Assignment) ===
254
255 **reader** (default):
256
257 * View published claims and evidence
258 * Browse and search content
259 * No editing permissions
260 **contributor**:
261 * Submit new claims
262 * Suggest edits to existing content
263 * Add evidence
264 * Requires manual promotion by moderator/admin
265 **moderator**:
266 * Approve/reject contributor suggestions
267 * Flag inappropriate content
268 * Handle abuse reports
269 * Assigned by admins based on trust
270 **admin**:
271 * Manage users and roles
272 * System configuration
273 * Access to all features
274 * Founder-appointed initially
275
276 === Contribution Tracking (Simple) ===
277
278 **Basic metrics only**:
279
280 * `contributions_count`: Total number of contributions
281 * `created_at`: Account age
282 * `last_active`: Recent activity
283 **No complex calculations**:
284 * No point systems
285 * No automated privilege escalation
286 * No reputation decay
287 * No threshold-based promotions
288
289 === Promotion Process ===
290
291 **Manual review by moderators/admins**:
292
293 1. User demonstrates value through contributions
294 2. Moderator reviews user's contribution history
295 3. Moderator promotes user to contributor role
296 4. Admin promotes trusted contributors to moderator
297 **Criteria** (guidelines, not automated):
298
299 * Quality of contributions
300 * Consistency over time
301 * Collaborative behavior
302 * Understanding of project goals
303
304 === V2.0+ Evolution ===
305
306 **Add complex reputation when**:
307
308 * 100+ active contributors
309 * Manual role management becomes bottleneck
310 * Clear patterns of abuse emerge requiring automation
311 **Future features may include**:
312 * Automated point calculations
313 * Threshold-based promotions
314 * Reputation decay for inactive users
315 * Track record scoring for contributors
316 See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
317
318 === 1.7 Edit ===
319
320 **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
321 **Purpose**: Complete audit trail for all content changes
322
323 === Edit History Details ===
324
325 **What Gets Edited**:
326
327 * **Claims** (20% edited): assertion, domain, status, scores, analysis
328 * **Evidence** (10% edited): excerpt, relevance_score, supports
329 * **Scenarios** (5% edited): description, assumptions, confidence
330 * **Sources**: NOT versioned (continuous updates, not editorial decisions)
331 **Who Edits**:
332 * **Contributors** (rep sufficient): Corrections, additions
333 * **Trusted Contributors** (rep sufficient): Major improvements, approvals
334 * **Moderators**: Abuse handling, dispute resolution
335 * **System (AKEL)**: Re-analysis, automated improvements (user_id = NULL)
336 **Edit Types**:
337 * `CONTENT_CORRECTION`: User fixes factual error
338 * `CLARIFICATION`: Improved wording
339 * `SYSTEM_REANALYSIS`: AKEL re-processed claim
340 * `MODERATION_ACTION`: Hide/unhide for abuse
341 * `REVERT`: Rollback to previous version
342 **Retention Policy** (5 years total):
343
344 1. **Hot storage** (3 months): PostgreSQL, instant access
345 2. **Warm storage** (2 years): Partitioned, slower queries
346 3. **Cold storage** (3 years): S3 compressed, download required
347 4. **Deletion**: After 5 years (except legal holds)
348 **Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
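The four retention tiers reduce to a simple age lookup. A sketch, with month and year lengths approximated in days:

```python
from datetime import timedelta

def storage_tier(edit_age: timedelta) -> str:
    """Pick a storage tier per the retention policy above (5 years total)."""
    if edit_age < timedelta(days=90):               # first 3 months
        return "hot"                                # PostgreSQL, instant access
    if edit_age < timedelta(days=90 + 2 * 365):     # next 2 years
        return "warm"                               # partitioned, slower queries
    if edit_age < timedelta(days=5 * 365):          # remainder of the 5 years
        return "cold"                               # S3 compressed, download required
    return "delete"                                 # past 5 years (barring legal holds)
```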
349 **Use Cases**:
350
351 * View claim history timeline
352 * Detect vandalism patterns
353 * Learn from user corrections (system improvement)
354 * Legal compliance (audit trail)
355 * Rollback capability
356 See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
357
358 === 1.8 Flag ===
359
360 Fields: entity_id, reported_by, issue_type, status, resolution_note
361
362 === 1.9 QualityMetric ===
363
364 **Fields**: metric_type, category, value, target, timestamp
365 **Purpose**: Time-series quality tracking
366 **Usage**:
367
368 * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
369 * **Quality dashboard**: Real-time display with trend charts
370 * **Alerting**: Automatic alerts when metrics exceed thresholds
371 * **A/B testing**: Compare control vs treatment metrics
372 * **Improvement validation**: Measure before/after changes
373 **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
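Threshold alerting over these records is straightforward. A sketch using the example record format above (the function name is hypothetical):

```python
def check_thresholds(metrics):
    """Return alert strings for metrics whose value exceeds the target.

    For a metric like ErrorRate, a value above target is a breach; the
    record shape mirrors the QualityMetric example above.
    """
    return [
        f"{m['type']}/{m['category']}: {m['value']:.2f} exceeds target {m['target']:.2f}"
        for m in metrics
        if m["value"] > m["target"]
    ]
```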
374
375 === 1.10 ErrorPattern ===
376
377 **Fields**: error_category, claim_id, description, root_cause, frequency, status
378 **Purpose**: Capture errors to trigger system improvements
379 **Usage**:
380
381 * **Error capture**: When users flag issues or system detects problems
382 * **Pattern analysis**: Weekly grouping by category and frequency
383 * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
384 * **Metrics**: Track error rate reduction over time
385 **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
386
387 == 1.11 Core Data Model ERD ==
388
389 {{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
390
391 == 1.12 User Class Diagram ==
392
393 {{include reference="Archive.FactHarbor 2026\.02\.08.Specification.Diagrams.User Class Diagram.WebHome"/}}
394
395 == 2. Versioning Strategy ==
396
397 **All Content Entities Are Versioned**:
398
399 * **Claim**: Every edit creates new version (V1→V2→V3...)
400 * **Evidence**: Changes tracked in edit history
401 * **Scenario**: Modifications versioned
402 **How Versioning Works**:
403 * Entity table stores **current state only**
404 * Edit table stores **all historical states** (before_state, after_state as JSON)
405 * Version number increments with each edit
406 * Complete audit trail maintained forever
407 **Unversioned Entities** (current state only, no history):
408 * **Source**: Track record continuously updated (not versioned history, just current score)
409 * **User**: Account state (reputation accumulated, not versioned)
410 * **QualityMetric**: Time-series data (each record is a point in time, not a version)
411 * **ErrorPattern**: System improvement queue (status tracked, not versioned)
412 **Example**:
413 ```
414 Claim V1: "The sky is blue"
415 → User edits →
416 Claim V2: "The sky is blue during daytime"
417 → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
418 ```
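The versioning example above can be sketched as follows: the entity table keeps only the current state, each edit records before/after, and a `REVERT` restores the prior state. This is a dict-based illustration, not the production code:

```python
def edit_claim(claim, new_assertion, edit_log):
    """Claim V(n) -> V(n+1): entity keeps current state, Edit keeps history."""
    entry = {"before": dict(claim)}   # snapshot of the pre-edit state
    claim["assertion"] = new_assertion
    claim["version"] += 1
    entry["after"] = dict(claim)      # snapshot of the post-edit state
    edit_log.append(entry)
    return claim

def revert(claim, edit_log):
    """REVERT edit type: restore the previous state from the latest edit."""
    last = edit_log.pop()
    claim.clear()
    claim.update(last["before"])
    return claim
```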
419
420 == 2.5. Storage vs Computation Strategy ==
421
422 **Critical architectural decision**: What to persist in databases vs compute dynamically?
423 **Trade-off**:
424
425 * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
426 * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
427
428 === Recommendation: Hybrid Approach ===
429
430 **STORE (in PostgreSQL):**
431
432 ==== Claims (Current State + History) ====
433
434 * **What**: assertion, domain, status, created_at, updated_at, version
435 * **Why**: Core entity, must be persistent
436 * **Also store**: confidence_score (computed once, then cached)
437 * **Size**: 1 KB per claim
438 * **Growth**: Linear with claims
439 * **Decision**: ✅ STORE - Essential
440
441 ==== Evidence (All Records) ====
442
443 * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
444 * **Why**: Hard to re-gather, user contributions, reproducibility
445 * **Size**: 2 KB per evidence (with excerpt)
446 * **Growth**: 3-10 evidence per claim
447 * **Decision**: ✅ STORE - Essential for reproducibility
448
449 ==== Sources (Track Records) ====
450
451 * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
452 * **Why**: Continuously updated, expensive to recompute
453 * **Size**: 500 bytes per source
454 * **Growth**: Slow (limited number of sources)
455 * **Decision**: ✅ STORE - Essential for quality
456
457 ==== Edit History (All Versions) ====
458
459 * **What**: before_state, after_state, user_id, reason, timestamp
460 * **Why**: Audit trail, legal requirement, reproducibility
461 * **Size**: 2 KB per edit
462 * **Growth**: Linear with edits (about 20% of claims get edited; see section 1.7)
463 * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
464 * **Decision**: ✅ STORE - Essential for accountability
465
466 ==== Flags (User Reports) ====
467
468 * **What**: entity_id, reported_by, issue_type, description, status
469 * **Why**: Error detection, system improvement triggers
470 * **Size**: 500 bytes per flag
471 * **Growth**: Roughly 5-10% of claims get flagged
472 * **Decision**: ✅ STORE - Essential for improvement
473
474 ==== ErrorPatterns (System Improvement) ====
475
476 * **What**: error_category, claim_id, description, root_cause, frequency, status
477 * **Why**: Learning loop, prevent recurring errors
478 * **Size**: 1 KB per pattern
479 * **Growth**: Slow (limited patterns, many fixed)
480 * **Decision**: ✅ STORE - Essential for learning
481
482 ==== QualityMetrics (Time Series) ====
483
484 * **What**: metric_type, category, value, target, timestamp
485 * **Why**: Trend analysis, cannot recreate historical metrics
486 * **Size**: 200 bytes per metric
487 * **Growth**: Hourly = 8,760 per year per metric type
488 * **Retention**: 2 years hot, then aggregate and archive
489 * **Decision**: ✅ STORE - Essential for monitoring
490 **STORE (Computed Once, Then Cached):**
491
492 ==== Analysis Summary ====
493
494 * **What**: Neutral text summary of claim analysis (200-500 words)
495 * **Computed**: Once by AKEL when claim first analyzed
496 * **Stored in**: Claim table (text field)
497 * **Recomputed**: Only when system significantly improves OR claim edited
498 * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
499 * **Size**: 2 KB per claim
500 * **Decision**: ✅ STORE (cached) - Cost-effective
501
502 ==== Confidence Score ====
503
504 * **What**: 0-100 score of analysis confidence
505 * **Computed**: Once by AKEL
506 * **Stored in**: Claim table (integer field)
507 * **Recomputed**: When evidence added, source track record changes significantly, or system improves
508 * **Why store**: Cheap to store, expensive to compute, users need it fast
509 * **Size**: 4 bytes per claim
510 * **Decision**: ✅ STORE (cached) - Performance critical
511
512 ==== Risk Score ====
513
514 * **What**: 0-100 score of claim risk level
515 * **Computed**: Once by AKEL
516 * **Stored in**: Claim table (integer field)
517 * **Recomputed**: When domain changes, evidence changes, or controversy detected
518 * **Why store**: Same as confidence score
519 * **Size**: 4 bytes per claim
520 * **Decision**: ✅ STORE (cached) - Performance critical
521 **COMPUTE DYNAMICALLY (Never Store):**
522
523 ==== Scenarios ====
524
525 ⚠️ CRITICAL DECISION
526
527 * **What**: 2-5 possible interpretations of claim with assumptions
528 * **Current design**: Stored in Scenario table
529 * **Alternative**: Compute on-demand when user views claim details
530 * **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
531 * **Compute cost**: $0.005-0.01 per request (LLM API call)
532 * **Frequency**: Viewed in detail by 20% of users
533 * **Trade-off analysis**:
534 - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
535 - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
536 * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
537 * **Speed**: Computed = 5-8 seconds delay, Stored = instant
538 * **Decision**: ✅ STORE (hybrid approach below)
539 **Scenario Strategy** (APPROVED):
540
541 1. **Store scenarios** initially when claim analyzed
542 2. **Mark as stale** when system improves significantly
543 3. **Recompute on next view** if marked stale
544 4. **Cache for 30 days** if frequently accessed
545 5. **Result**: Best of both worlds - speed + freshness
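Steps 1-4 of the approved strategy amount to a stale-or-expired check before serving stored scenarios. A sketch, where the `recompute` callback stands in for the LLM call:

```python
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # step 4: cache window for frequently accessed claims

def get_scenarios(claim, recompute, now=None):
    """Serve stored scenarios; recompute only when marked stale or expired."""
    now = now or datetime.now(timezone.utc)
    expired = now - claim["scenarios_cached_at"] > CACHE_TTL
    if claim.get("scenarios_stale") or expired:
        claim["scenarios"] = recompute(claim)   # LLM call in the real system
        claim["scenarios_cached_at"] = now
        claim["scenarios_stale"] = False
    return claim["scenarios"]
```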
546
547 ==== Verdict Synthesis ====
548
549
550
551 * **What**: Final conclusion text synthesizing all scenarios
552 * **Compute cost**: $0.002-0.005 per request
553 * **Frequency**: Every time claim viewed
554 * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
555 * **Speed**: 2-3 seconds (acceptable)
556 * **Alternative**: Store the "last verdict" as a cached field; recompute only if the claim is edited or marked stale
557 * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
558
559 ==== Search Results ====
560
561 * **What**: Lists of claims matching search query
562 * **Compute from**: Elasticsearch index
563 * **Cache**: 15 minutes in Redis for popular queries
564 * **Why not store permanently**: Constantly changing, infinite possible queries
565
566 ==== Aggregated Statistics ====
567
568 * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
569 * **Compute from**: Database queries
570 * **Cache**: 1 hour in Redis
571 * **Why not store**: Can be derived, relatively cheap to compute
572
573 ==== User Reputation ====
574
575 * **What**: Score based on contributions
576 * **Current design**: Stored in User table
577 * **Alternative**: Compute from Edit table
578 * **Trade-off**:
579 - Stored: Fast, simple
580 - Computed: Always accurate, no denormalization
581 * **Frequency**: Read on every user action
582 * **Compute cost**: Simple COUNT query, milliseconds
583 * **Decision**: ✅ STORE - Performance critical, read-heavy
584
585 === Summary Table ===
586
587 | Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
588 |---|---|---|---|---|---|
589 | Claim core | ✅ | - | 1 KB | STORE | Essential |
590 | Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
591 | Sources | ✅ | - | 500 B (shared) | STORE | Track record |
592 | Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
593 | Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
594 | Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
595 | Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
596 | Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
597 | Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
598 | Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
599 | ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
600 | QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
601 | Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
602 | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
603 **Total storage per claim**: 18 KB (without edits and flags)
604 **For 1 million claims**:
605
606 * **Storage**: 18 GB (manageable)
607 * **PostgreSQL**: $50/month (standard instance)
608 * **Redis cache**: $20/month (1 GB instance)
609 * **S3 archives**: $5/month (old edits)
610 * **Total**: $75/month infrastructure
611 **LLM cost savings by caching**:
612 * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
613 * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
614 * Verdict stored: Save $0.003 per claim = $3K per 1M claims
615 * **Total savings**: $35K per 1M claims vs recomputing every time
616
617 === Recomputation Triggers ===
618
619 **When to mark cached data as stale and recompute:**
620
621 1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
622 2. **Evidence added** → Recompute: scenarios, verdict, confidence score
623 3. **Source track record changes >10 points** → Recompute: confidence score, verdict
624 4. **System improvement deployed** → Mark affected claims stale, recompute on next view
625 5. **Controversy detected** (high flag rate) → Recompute: risk score
626 **Recomputation strategy**:
627
628 * **Eager**: Immediately recompute (for user edits)
629 * **Lazy**: Recompute on next view (for system improvements)
630 * **Batch**: Nightly re-evaluation of stale claims (if <1000)
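The five triggers map naturally to sets of cached artifacts to invalidate. A sketch (the trigger names and the `stale` field are hypothetical labels, not schema fields):

```python
# Trigger -> cached artifacts to mark stale, per the numbered list above
RECOMPUTE_ON = {
    "claim_edited":       {"analysis", "scenarios", "verdict", "confidence", "risk"},
    "evidence_added":     {"scenarios", "verdict", "confidence"},
    "source_score_shift": {"confidence", "verdict"},   # track record moved >10 points
    "system_improvement": {"analysis", "scenarios", "verdict", "confidence", "risk"},
    "controversy":        {"risk"},                    # high flag rate detected
}

def mark_stale(claim, trigger):
    """Flag cached artifacts for recomputation; eager/lazy/batch is decided later."""
    claim.setdefault("stale", set()).update(RECOMPUTE_ON[trigger])
    return claim
```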
631
632 === Database Size Projection ===
633
634 **Year 1**: 10K claims
635
636 * Storage: 180 MB
637 * Cost: $10/month
638 **Year 3**: 100K claims
639 * Storage: 1.8 GB
640 * Cost: $30/month
641 **Year 5**: 1M claims
642 * Storage: 18 GB
643 * Cost: $75/month
644 **Year 10**: 10M claims
645 * Storage: 180 GB
646 * Cost: $300/month
647 * Optimization: Archive old claims to S3 ($5/TB/month)
648 **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
649
650 == 2.6. Key Simplifications ==
651
652 * **Two content states only**: Published, Hidden
653 * **Four user roles only**: Reader, Contributor, Moderator, Admin (see section 1.6)
654 * **No complex versioning**: Linear edit history
655 * **Reputation-based permissions**: Not role hierarchy
656 * **Source track records**: Continuous evaluation
657
658 == 3. What Gets Stored in the Database ==
659
660 === 3.1 Primary Storage (PostgreSQL) ===
661
662 **Claims Table**:
663
664 * Current state only (latest version)
665 * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
666 **Evidence Table**:
667 * All evidence records
668 * Fields: id, claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at, archived
669 **Scenario Table**:
670 * All scenarios for each claim
671 * Fields: id, claim_id, description, assumptions (text array), confidence, created_by, created_at
672 **Source Table**:
673 * Track record database (continuously updated)
674 * Fields: id, name, domain, type, track_record_score, accuracy_history (JSON), correction_frequency, last_updated, claim_count, corrections_count
675 **User Table**:
676 * All user accounts
677 * Fields: id, username, email, role (Reader/Contributor/Moderator), reputation, created_at, last_active, contributions_count, flags_submitted, flags_accepted
678 **Edit Table**:
679 * Complete version history
680 * Fields: id, entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
681 **Flag Table**:
682 * User-reported issues
683 * Fields: id, entity_type, entity_id, reported_by, issue_type, description, status, resolved_by, resolution_note, created_at, resolved_at
684 **ErrorPattern Table**:
685 * System improvement queue
686 * Fields: id, error_category, claim_id, description, root_cause, frequency, status, created_at, fixed_at
687 **QualityMetric Table**:
688 * Time-series quality data
689 * Fields: id, metric_type, metric_category, value, target, timestamp
690
691 === 3.2 What's NOT Stored (Computed on-the-fly) ===
692
693 * **Verdicts**: Synthesized from evidence + scenarios when requested
694 * **Risk scores**: Recalculated based on current factors
695 * **Aggregated statistics**: Computed from base data
696 * **Search results**: Generated from Elasticsearch index
697
698 === 3.3 Cache Layer (Redis) ===
699
700 {{warning}}
701 **Implementation Status:** Redis caching is **NOT YET IMPLEMENTED**. Current implementation has no caching layer.
702 {{/warning}}
703
704 **Cached for performance (Planned)**:
705
706 * Frequently accessed claims (TTL: 1 hour)
707 * Search results (TTL: 15 minutes)
708 * User sessions (TTL: 24 hours)
709 * Source track records (TTL: 1 hour)
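Since the Redis layer is not built yet, here is a stdlib sketch of the planned TTL behavior: an in-process stand-in that mirrors the four cache kinds and TTLs above, not the eventual Redis implementation:

```python
import time

# Planned TTLs from the list above, in seconds
TTLS = {
    "claim": 3600,        # frequently accessed claims: 1 hour
    "search": 15 * 60,    # search results: 15 minutes
    "session": 24 * 3600, # user sessions: 24 hours
    "source": 3600,       # source track records: 1 hour
}

class TTLCache:
    """Minimal stand-in for the planned Redis layer."""
    def __init__(self, clock=time.monotonic):
        self._store, self._clock = {}, clock  # injectable clock for testing

    def set(self, kind, key, value):
        self._store[(kind, key)] = (value, self._clock() + TTLS[kind])

    def get(self, kind, key):
        item = self._store.get((kind, key))
        if item is None:
            return None
        value, expires = item
        if self._clock() >= expires:          # expired: evict and miss
            del self._store[(kind, key)]
            return None
        return value
```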
710
711 === 3.4 File Storage (S3) ===
712
713 **Archived content**:
714
715 * Old edit history (>3 months)
716 * Evidence documents (archived copies)
717 * Database backups
718 * Export files
719
720 === 3.5 Search Index (Elasticsearch) ===
721
722 **Indexed for search**:
723
724 * Claim assertions (full-text)
725 * Evidence excerpts (full-text)
726 * Scenario descriptions (full-text)
727 * Source names (autocomplete)
728 Synchronized from PostgreSQL via change data capture or periodic sync.
729
730 == 4. Related Pages ==
731
732 * [[Architecture>>Archive.FactHarbor 2026\.02\.08.Specification.Architecture.WebHome]]
733 * [[Requirements>>Archive.FactHarbor 2026\.02\.08.Specification.Requirements.WebHome]]
734 * [[Workflows>>Archive.FactHarbor 2026\.02\.08.Specification.Workflows.WebHome]]