1 = The Article Verdict Problem
2
3 **Context:** Context-Aware Analysis Investigation
4 **Date:** December 23, 2025
5 **Status:** Solution Chosen for POC1 Testing
6
7 == 🎯 Executive Summary
8
9 **The Problem:** An article's overall credibility is not simply the average of its individual claim verdicts. An article with mostly accurate facts can still be misleading if the conclusion doesn't follow from the evidence.
10
11 **Investigation Scope:** 7 solution approaches analyzed for performance, cost, and complexity.
12
13 **Chosen Solution:** Single-Pass Holistic Analysis (Approach 1) for POC1 testing
14 - Enhance AI prompt to evaluate logical structure
15 - Zero additional cost or architecture changes
16 - Test with 30 articles to validate approach
17 - Mark as experimental - doesn't block POC1 success
18
19 **Fallback Plan:** If Approach 1 shows <70% accuracy, fall back to Weighted Aggregation (Approach 4) in POC1 or defer to POC2 with the Hybrid approach (Approach 6).
20
21 == 📋 The Core Problem
22
23 === Problem Statement
24
25 > "An analysis and verdict of the whole article is not the same as a summary of the analysis and verdicts of the parts (the claims)."
26
27 === Why This Matters
28
29 **Example: The Misleading Article**
30
31 ```
32 Article: "Coffee Cures Cancer!"
33
34 Individual Claims:
35 [1] Coffee contains antioxidants → ✅ WELL-SUPPORTED (95%)
36 [2] Antioxidants fight cancer → ✅ WELL-SUPPORTED (85%)
37 [3] Therefore, coffee cures cancer → ❌ REFUTED (10%)
38
39 Simple Aggregation:
40 - Verdict counts: 2 supported, 1 refuted
41 - Average confidence: 63% (mean of 95%, 85%, and 10%)
42 - Naive conclusion: "Mostly accurate article"
43
44 Reality:
45 - The MAIN CLAIM (coffee cures cancer) is FALSE
46 - Article commits logical fallacy (correlation ≠ causation)
47 - Article is MISLEADING despite containing accurate facts
48 - Readers could be harmed by false medical claim
49
50 Correct Assessment:
51 - Article verdict: MISLEADING / REFUTED
52 - Reason: Makes unsupported causal claim from correlational evidence
53 ```
54
55 === Why Simple Aggregation Fails
56
57 **Pattern 1: False Central Claim**
58 - 4 supporting facts (all true) ✅✅✅✅
59 - 1 main conclusion (false) ❌
60 - Simple average: 80% accurate
61 - Reality: Core argument is false → Article is MISLEADING
62
63 **Pattern 2: Accurate Facts, Wrong Conclusion**
64 - All individual facts are verifiable
65 - Conclusion doesn't follow from facts
66 - Logical fallacy (e.g., correlation → causation)
67 - Simple average looks good, article is dangerous
68
69 **Pattern 3: Misleading Framing**
70 - Facts are accurate
71 - Selective presentation creates false impression
72 - Headline doesn't match content
73 - Simple average misses the problem
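
The minimal Python sketch below makes Pattern 1 concrete; names like ##Claim## and ##is_main## are illustrative only, not from any existing FactHarbor code. The naive average looks reassuring while the main claim is refuted.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    confidence: int      # 0-100: how well the claim is supported
    is_main: bool = False

claims = [
    Claim("Coffee contains antioxidants", 95),
    Claim("Antioxidants fight cancer", 85),
    Claim("Coffee cures cancer", 10, is_main=True),
]

# Naive aggregation: a plain average over claim confidences
naive_score = sum(c.confidence for c in claims) / len(claims)
print(f"Naive average: {naive_score:.0f}%")                # ~63% -> looks "mostly accurate"

# Reality check: the main claim alone determines whether the article misleads
main_claim = next(c for c in claims if c.is_main)
print(f"Main claim confidence: {main_claim.confidence}%")  # 10% -> article is misleading
```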
74
75 == ✅ Chosen Solution: Single-Pass Holistic Analysis (POC1)
76
77 === Approach Overview
78
79 **How it works:**
80 - AI analyzes the entire article in ONE API call
81 - Evaluates both individual claims AND overall article credibility
82 - No pipeline changes - just enhanced prompting
83
84 === AI Prompt Enhancement
85
86 **Add to existing prompt:**
87 ```
88 After analyzing individual claims, evaluate the article as a whole:
89
90 1. What is the article's main argument or conclusion?
91 2. Does this conclusion logically follow from the evidence presented?
92 3. Are there logical fallacies? (correlation→causation, cherry-picking, etc.)
93 4. Even if individual facts are accurate, is the article's framing misleading?
94 5. Should the article verdict differ from the average of claim verdicts?
95
96 Provide:
97 - Individual claim verdicts
98 - Overall article verdict (may differ from claim average)
99 - Explanation if article verdict differs from claim pattern
100 ```
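
To show how little plumbing this needs, here is a minimal Python sketch that appends the section above to the existing prompt and sends everything in the same single call. ##call_llm## and ##base_claim_prompt## are hypothetical stand-ins for whatever client and prompt POC1 already uses; the "as JSON" tail is an assumption so the output matches the shape shown in the next section.

```python
# Minimal sketch; `call_llm` is a placeholder for POC1's existing model client.
HOLISTIC_SECTION = """After analyzing individual claims, evaluate the article as a whole:
1. What is the article's main argument or conclusion?
2. Does this conclusion logically follow from the evidence presented?
3. Are there logical fallacies? (correlation->causation, cherry-picking, etc.)
4. Even if individual facts are accurate, is the article's framing misleading?
5. Should the article verdict differ from the average of claim verdicts?
Provide individual claim verdicts, an overall article verdict, and an explanation
if the article verdict differs from the claim pattern, as JSON."""

def call_llm(prompt: str) -> str:
    """Placeholder: POC1 would call its existing model client here."""
    raise NotImplementedError

def analyze_article(article_text: str, base_claim_prompt: str) -> str:
    """Single API call, same as baseline POC1; returns the model's raw JSON text."""
    prompt = f"{base_claim_prompt}\n\n{HOLISTIC_SECTION}\n\nARTICLE:\n{article_text}"
    return call_llm(prompt)
```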
101
102 === Expected AI Output
103
104 ```json
105 {
106   "claims": [
107     {"text": "Coffee contains antioxidants", "verdict": "SUPPORTED", "confidence": 95},
108     {"text": "Antioxidants fight cancer", "verdict": "SUPPORTED", "confidence": 85},
109     {"text": "Coffee cures cancer", "verdict": "REFUTED", "confidence": 10}
110   ],
111   "article_analysis": {
112     "main_argument": "Coffee cures cancer",
113     "logical_assessment": "Article makes causal claim not supported by evidence",
114     "fallacy_detected": "correlation presented as causation",
115     "article_verdict": "MISLEADING",
116     "differs_from_claims": true,
117     "reasoning": "Despite two accurate supporting facts, the main conclusion is unsupported"
118   }
119 }
120 ```
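
A small parsing sketch for turning that JSON shape into plain Python objects; the types are hypothetical, not AKEL's actual data model.

```python
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class ClaimVerdict:
    text: str
    verdict: str        # e.g. SUPPORTED / REFUTED
    confidence: int     # 0-100

@dataclass
class ArticleAnalysis:
    main_argument: str
    article_verdict: str
    differs_from_claims: bool
    reasoning: str

def parse_result(raw_json: str) -> tuple[list[ClaimVerdict], Optional[ArticleAnalysis]]:
    data = json.loads(raw_json)
    claims = [ClaimVerdict(c["text"], c["verdict"], c["confidence"]) for c in data.get("claims", [])]
    aa = data.get("article_analysis")
    article = ArticleAnalysis(aa["main_argument"], aa["article_verdict"],
                              aa.get("differs_from_claims", False), aa.get("reasoning", "")) if aa else None
    return claims, article
```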
121
122 === Performance & Cost
123
124 **Performance:**
125 - Same as baseline POC1 (single API call)
126 - Fast response time
127 - No additional latency
128
129 **Cost:**
130 - Same as baseline POC1
131 - ~$0.015-0.025 per analysis
132 - No cost increase (just longer prompt)
133
134 **Architecture:**
135 - Zero changes to system architecture
136 - Just prompt engineering
137 - Easy to implement and test
138
139 === POC1 Testing Plan
140
141 **Test Set: 30 Articles**
142 - 10 straightforward (verdict = average works fine)
143 - 10 misleading (accurate facts, wrong conclusion)
144 - 10 complex/nuanced cases
145
146 **Success Criteria:**
147 - AI correctly identifies ≥70% of misleading articles
148 - AI doesn't over-flag straightforward articles
149 - Reasoning is comprehensible
150
151 **Success → Ship it in POC2 as standard feature**
152
153 **Partial Success (50-70%) → Try Approach 4 (Weighted Aggregation) or plan Approach 6 (Hybrid) for POC2**
154
155 **Failure (<50%) → Defer to POC2 with more sophisticated approach**
156
157 === Why This Approach for POC1
158
159 **Advantages:**
160 ✅ Zero additional cost (no extra API calls)
161 ✅ No architecture changes (just prompt)
162 ✅ Fast to implement and test
163 ✅ Fail-fast learning (find out if AI can do this)
164 ✅ If it works → problem solved with minimal effort
165 ✅ If it fails → informed decision for POC2
166
167 **Risks:**
168 ⚠️ AI might miss subtle logical issues
169 ⚠️ Relies entirely on AI's reasoning capability
170 ⚠️ Less structured than multi-pass approaches
171
172 **Mitigation:**
173 - Mark as "experimental" in POC1
174 - Don't block POC1 success on this feature
175 - Use results to inform POC2 design
176 - Have fallback approaches ready
177
178 == 📊 Complete Analysis: All Solution Approaches
179
180 We investigated 7 approaches for solving this problem. Here's the complete overview:
181
182 === Approach 1: Single-Pass Holistic Analysis ⭐ CHOSEN FOR POC1
183
184 **Concept:** AI analyzes article and evaluates both claims and overall credibility in one call.
185
186 **Pros:** Simplest, fastest, cheapest, no architecture changes
187 **Cons:** Relies on AI capability, might miss subtle issues
188 **Cost:** ~$0.020/analysis | **Speed:** Fast | **Complexity:** LOW
189
190 **When to use:** POC1 testing - validate if AI can do this at all
191
192 === Approach 2: Two-Pass Sequential Analysis
193
194 **Concept:** Pass 1 extracts claims, Pass 2 analyzes logical structure given the claims.
195
196 **Pros:** More focused analysis, better debugging, higher reliability
197 **Cons:** Slower (two API calls), more expensive, more complex
198 **Cost:** ~$0.030/analysis | **Speed:** Slower | **Complexity:** MEDIUM
199
200 **When to use:** POC2 if Approach 1 fails, or production for highest quality
201
202 === Approach 3: Structured Output with Explicit Relationships
203
204 **Concept:** AI outputs claim relationships explicitly (main claim, supporting claims, dependencies, logical validity).
205
206 **Pros:** Explicit structure, easier to validate, single API call
207 **Cons:** Complex prompt, relies on AI identifying relationships correctly
208 **Cost:** ~$0.023/analysis | **Speed:** Fast | **Complexity:** MEDIUM
209
210 **When to use:** POC1 if structured data valuable for UI/debugging
211
212 === Approach 4: Weighted Aggregation with Importance Scores
213
214 **Concept:** AI assigns importance weight (0-1) to each claim. Article verdict = weighted average.
215
216 **Pros:** Simple math, easy to explain, single API call
217 **Cons:** Reduces the verdict to a single number (loses nuance), doesn't identify fallacies explicitly
218 **Cost:** ~$0.020/analysis | **Speed:** Fast | **Complexity:** LOW
219
220 **When to use:** POC1 fallback if Approach 1 doesn't work well
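
For concreteness, a minimal sketch of the weighted aggregation idea. The weights would come from the AI; the example weights (main claim at 0.6) are purely illustrative.

```python
def weighted_article_score(claims: list[dict]) -> float:
    """Article score = importance-weighted average of claim confidences."""
    total_weight = sum(c["weight"] for c in claims) or 1.0
    return sum(c["confidence"] * c["weight"] for c in claims) / total_weight

claims = [
    {"text": "Coffee contains antioxidants", "confidence": 95, "weight": 0.2},
    {"text": "Antioxidants fight cancer",    "confidence": 85, "weight": 0.2},
    {"text": "Coffee cures cancer",          "confidence": 10, "weight": 0.6},  # main claim dominates
]
print(weighted_article_score(claims))  # 42.0 -> leans MISLEADING, unlike the unweighted ~63%
```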
221
222 === Approach 5: Post-Processing Heuristics
223
224 **Concept:** Rule-based detection of logical issues after claim extraction (e.g., "if article contains 'causes' but only correlational evidence, flag causal fallacy").
225
226 **Pros:** Cheapest, deterministic, explainable, no extra API calls
227 **Cons:** Brittle rules, high maintenance, false positives/negatives
228 **Cost:** ~$0.018/analysis | **Speed:** Fast | **Complexity:** MEDIUM
229
230 **When to use:** Add to any other approach for robustness
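
A sketch of what one such rule might look like. The keyword list is illustrative and deliberately simple, which is exactly why these rules are brittle.

```python
import re

# Illustrative causal-language rule; a real rule set would be larger and tuned.
CAUSAL_PATTERN = re.compile(r"\b(cures?|causes?|prevents?|eliminates?)\b", re.IGNORECASE)

def causal_claim_unsupported(headline: str, claims: list[dict]) -> bool:
    """Flag when the headline asserts causation but no well-supported claim does."""
    if not CAUSAL_PATTERN.search(headline):
        return False
    supported_causal = any(
        CAUSAL_PATTERN.search(c["text"]) and c["verdict"] == "SUPPORTED" for c in claims
    )
    return not supported_causal

claims = [
    {"text": "Coffee contains antioxidants", "verdict": "SUPPORTED"},
    {"text": "Antioxidants fight cancer", "verdict": "SUPPORTED"},
    {"text": "Coffee cures cancer", "verdict": "REFUTED"},
]
print(causal_claim_unsupported("Coffee Cures Cancer!", claims))  # True -> flag causal fallacy
```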
231
232 === Approach 6: Hybrid (Weighted Aggregation + Heuristics) ⭐ RECOMMENDED FOR POC2
233
234 **Concept:** Combine AI-assigned weights with rule-based fallacy detection.
235
236 **Pros:** Best of both worlds, robust, still single API call, cost-effective
237 **Cons:** More complex than single approach, need to tune interaction
238 **Cost:** ~$0.020/analysis | **Speed:** Fast | **Complexity:** MED-HIGH
239
240 **When to use:** POC2 for robust production-ready solution
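
A short sketch of how the two previous sketches could be combined; the thresholds are illustrative and would need tuning.

```python
def hybrid_article_verdict(headline: str, claims: list[dict]) -> str:
    """Combine the weighted score (Approach 4 sketch) with the heuristic flag (Approach 5 sketch)."""
    if causal_claim_unsupported(headline, claims):   # claims carry "text" and "verdict"
        return "MISLEADING"
    score = weighted_article_score(claims)           # claims also carry "confidence" and "weight"
    if score >= 70:
        return "WELL-SUPPORTED"
    if score >= 40:
        return "MIXED"
    return "REFUTED"
```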
241
242 === Approach 7: LLM-as-Judge (Verification Pass)
243
244 **Concept:** Pass 1 generates verdict, Pass 2 verifies if verdict matches article content.
245
246 **Pros:** AI checks AI, high reliability, catches mistakes
247 **Cons:** Slower (two calls), more expensive, verification might also err
248 **Cost:** ~$0.030/analysis | **Speed:** Slower | **Complexity:** MEDIUM
249
250 **When to use:** Production if quality is paramount
251
252 === Comparison Matrix
253
254 | Approach | API Calls | Cost | Speed | Complexity | Reliability | Best For |
255 |----------|-----------|------|-------|------------|-------------|----------|
256 | 1. Single-Pass ⭐ | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 |
257 | 2. Two-Pass | 2 | 💰💰 Med | 🐌 Slow | 🟡 Med | ✅ High | POC2/Prod |
258 | 3. Structured | 1 | 💰 Low | ⚡ Fast | 🟡 Med | ✅ High | POC1 |
259 | 4. Weighted | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 |
260 | 5. Heuristics | 1 | 💰 Lowest | ⚡⚡ Fastest | 🟡 Med | ⚠️ Medium | Any |
261 | 6. Hybrid ⭐ | 1 | 💰 Low | ⚡ Fast | 🟠 Med-High | ✅ High | POC2 |
262 | 7. Judge | 2 | 💰💰 Med | 🐌 Slow | 🟡 Med | ✅ High | Production |
263
264 === Phased Recommendation
265
266 **POC1 (Immediate):**
267 - Test Approach 1 (Single-Pass Holistic)
268 - Mark as experimental
269 - Gather data on AI capability
270
271 **POC2 (If POC1 validates approach):**
272 - Upgrade to Approach 6 (Hybrid)
273 - Add heuristics for robustness
274 - Target 85%+ accuracy
275
276 **Production (Post-POC2):**
277 - If quality issues: Consider Approach 7 (LLM-as-Judge)
278 - If quality acceptable: Keep Approach 6
279 - Target 90%+ accuracy
280
281 == 🎯 Decision Framework
282
283 === POC1 Evaluation Criteria
284
285 After testing with 30 articles:
286
287 **If AI Accuracy ≥70%:**
288 - ✅ Approach validated!
289 - ✅ Ship as standard feature in POC2
290 - ✅ Consider adding heuristics (Approach 6) for robustness
291
292 **If AI Accuracy 50-70%:**
293 - ⚠️ Promising but needs improvement
294 - ⚠️ Try Approach 4 (Weighted Aggregation) in POC1
295 - ⚠️ Plan Approach 6 (Hybrid) for POC2
296
297 **If AI Accuracy <50%:**
298 - ❌ Current AI can't do this reliably
299 - ❌ Defer to POC2 or post-POC2
300 - ❌ Consider Approach 2 or 7 (two-pass) for production
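
Expressed as code, the three brackets above reduce to a single decision rule; the function name and return strings are just labels for those branches.

```python
def poc1_next_step(misleading_accuracy: float) -> str:
    """misleading_accuracy = fraction of Category 2 (misleading) articles correctly flagged."""
    if misleading_accuracy >= 0.70:
        return "Validated: ship in POC2, consider adding heuristics (Approach 6)"
    if misleading_accuracy >= 0.50:
        return "Partial: try Approach 4 in POC1, plan Approach 6 for POC2"
    return "Not reliable yet: defer to POC2, consider Approach 2 or 7"
```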
301
302 === Why This Matters for POC1
303
304 **Testing this in POC1:**
305 - Validates core capability (can AI do nuanced reasoning?)
306 - Informs POC2 architecture decisions
307 - Zero cost to try (just prompt enhancement)
308 - Fail-fast principle (test hardest part first)
309
310 **Not testing this in POC1:**
311 - Keeps POC1 scope minimal
312 - Focuses on core claim extraction
313 - But misses early learning opportunity
314
315 **Decision:** Test it, mark as experimental, don't block POC1 success on it.
316
317 == 📝 Implementation Notes
318
319 === What AKEL Must Do
320
321 **For POC1 (Approach 1):**
322 1. Enhanced prompt with logical analysis section
323 2. Parse AI output for both claim-level and article-level verdicts
324 3. Display both verdicts to user
325 4. Track accuracy on test set
326
327 **No architecture changes needed.**
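
A sketch of how those four steps could hang together, reusing the ##analyze_article## and ##parse_result## sketches from earlier on this page; the names are illustrative, not AKEL's actual interfaces.

```python
def run_poc1_analysis(article_text: str, base_claim_prompt: str) -> dict:
    raw_json = analyze_article(article_text, base_claim_prompt)  # step 1: enhanced prompt, one call
    claims, article = parse_result(raw_json)                     # step 2: claim + article verdicts
    return {                                                     # step 3: both verdicts for display
        "claims": [(c.text, c.verdict, c.confidence) for c in claims],
        "article_verdict": article.article_verdict if article else None,  # None -> fallback path
    }
    # step 4 (accuracy tracking) happens offline against the 30-article test set
```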
328
329 === What Gets Displayed to Users
330
331 **Output Format:**
332 ```
333 ANALYSIS SUMMARY (4-6 sentences, context-aware):
334 "This article argues that coffee cures cancer based on evidence about
335 antioxidants. We analyzed 3 claims: two supporting facts about coffee's
336 chemical properties are well-supported, but the main causal claim is
337 refuted by current evidence. The article confuses correlation with
338 causation. Overall assessment: MISLEADING - makes an unsupported
339 medical claim despite citing some accurate facts."
340
341 CLAIMS VERDICTS:
342 [1] Coffee contains antioxidants: WELL-SUPPORTED (95%)
343 [2] Antioxidants fight cancer: WELL-SUPPORTED (85%)
344 [3] Coffee cures cancer: REFUTED (10%)
345
346 ARTICLE VERDICT: MISLEADING
347 The article's main conclusion is not supported by the evidence presented.
348 ```
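
A sketch of rendering the parsed objects into that layout; the summary text comes from the model, and this formatting function is illustrative.

```python
def render_report(summary: str, claims, article_verdict: str, article_reason: str) -> str:
    lines = ["ANALYSIS SUMMARY:", summary, "", "CLAIMS VERDICTS:"]
    for i, c in enumerate(claims, start=1):
        lines.append(f"[{i}] {c.text}: {c.verdict} ({c.confidence}%)")
    lines += ["", f"ARTICLE VERDICT: {article_verdict}", article_reason]
    return "\n".join(lines)
```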
349
350 === Error Handling
351
352 **If AI fails to provide article-level analysis:**
353 - Fall back to claim-average verdict
354 - Log failure for analysis
355 - Don't break the analysis
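
A sketch of that fallback, assuming the parsed objects from the earlier sketch; the thresholds and logger name are illustrative.

```python
import logging

logger = logging.getLogger("article_verdict")

def article_verdict_or_fallback(claims, article) -> str:
    """Use the AI's article-level verdict when present; otherwise fall back to the claim average."""
    if article is not None:
        return article.article_verdict
    logger.warning("No article_analysis in model output; falling back to claim average")
    avg = sum(c.confidence for c in claims) / len(claims) if claims else 0
    return "WELL-SUPPORTED" if avg >= 70 else "MIXED" if avg >= 40 else "REFUTED"
```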
356
357 **If AI over-flags straightforward articles:**
358 - Review prompt tuning
359 - Consider adding confidence threshold
360 - Track false positive rate
361
362 == 🔬 Testing Strategy
363
364 === Test Set Composition
365
366 **Category 1: Straightforward Articles (10 articles)**
367 - Clear claims with matching overall message
368 - Verdict = average should work fine
369 - Tests that we don't over-flag
370
371 **Category 2: Misleading Articles (10 articles)**
372 - Accurate facts, unsupported conclusion
373 - Logical fallacies present
374 - Verdict ≠ average
375 - Core test of capability
376
377 **Category 3: Complex/Nuanced (10 articles)**
378 - Gray areas
379 - Multiple valid interpretations
380 - Tests nuance handling
381
382 === Success Metrics
383
384 **Quantitative:**
385 - ≥70% accuracy on Category 2 (misleading articles)
386 - ≤30% false positives on Category 1 (straightforward)
387 - ≥50% accuracy on Category 3 (complex)
388
389 **Qualitative:**
390 - Reasoning is comprehensible to humans
391 - False positives are explainable
392 - False negatives reveal clear AI limitations
393
394 === Documentation
395
396 **For each test case, record:**
397 - Article summary
398 - AI's claim verdicts
399 - AI's article verdict
400 - AI's reasoning
401 - Human judgment (correct/incorrect)
402 - Notes on why AI succeeded/failed
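
A sketch of a per-article record with those fields, plus the two headline checks from the success metrics; the field names and the boolean ##human_says_correct## are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TestCaseResult:
    article_summary: str
    category: int              # 1 = straightforward, 2 = misleading, 3 = complex/nuanced
    ai_claim_verdicts: list
    ai_article_verdict: str
    ai_reasoning: str
    human_says_correct: bool   # human judgment of the article-level verdict
    notes: str = ""

def category_accuracy(results: list, category: int) -> float:
    subset = [r for r in results if r.category == category]
    return sum(r.human_says_correct for r in subset) / len(subset) if subset else 0.0

# POC1 checks: >=70% accuracy on misleading articles, <=30% false positives on straightforward ones
# misleading_ok     = category_accuracy(results, 2) >= 0.70
# false_positive_ok = (1 - category_accuracy(results, 1)) <= 0.30
```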
403
404 == 💡 Key Insights
405
406 === What This Tests
407
408 **Core Capability:**
409 Can AI understand that article credibility depends on:
410 1. Logical structure (does conclusion follow?)
411 2. Claim importance (main vs. supporting)
412 3. Reasoning quality (sound vs. fallacious)
413
414 **Not just:**
415 - Accuracy of individual facts
416 - Simple averages
417 - Keyword matching
418
419 === Why This Is Important
420
421 **For FactHarbor's Mission:**
422 - Prevents misleading "mostly accurate" verdicts
423 - Catches dangerous misinformation (medical, financial)
424 - Provides nuanced analysis users can trust
425
426 **For POC Validation:**
427 - Tests most challenging capability
428 - If AI can do this, everything else is easier
429 - If AI can't, we know early and adjust
430
431 === Strategic Value
432
433 **If Approach 1 works (≥70% accuracy):**
434 - ✅ Solved complex problem with zero architecture changes
435 - ✅ No cost increase
436 - ✅ Differentiation from competitors who only check facts
437 - ✅ Foundation for more sophisticated features
438
439 **If Approach 1 doesn't work:**
440 - ✅ Learned AI limitations early
441 - ✅ Informed decision for POC2
442 - ✅ Can plan proper solution (Approach 6 or 7)
443
444 == 🎓 Lessons Learned
445
446 === From Investigation
447
448 **AI Capabilities:**
449 - Modern LLMs (e.g., Claude Sonnet 4.5) can handle nuanced reasoning
450 - But reliability varies significantly
451 - Need to test, not assume
452
453 **Cost-Performance Trade-offs:**
454 - Single-pass approaches: Fast and cheap but less reliable
455 - Multi-pass approaches: Slower and more expensive but more robust
456 - Hybrid approaches: Best balance for production
457
458 **Architecture Decisions:**
459 - Don't over-engineer before validating need
460 - Test simplest approach first
461 - Have fallback plans ready
462
463 === For POC1
464
465 **Keep It Simple:**
466 - Test Approach 1 with minimal changes
467 - Mark as experimental
468 - Use results to guide POC2
469
470 **Fail Fast:**
471 - 30-article test set reveals capability quickly
472 - Better to learn in POC1 than after building complex architecture
473
474 **Document Everything:**
475 - Track AI failures
476 - Understand patterns
477 - Inform future improvements
478
479 == ✅ Summary
480
481 **Problem:** Article credibility ≠ average of claim verdicts
482
483 **Investigation:** 7 approaches analyzed for cost, speed, reliability
484
485 **Chosen Solution:** Single-Pass Holistic Analysis (Approach 1)
486 - Test in POC1 with enhanced prompt
487 - Zero cost increase, no architecture changes
488 - Validate if AI can do nuanced reasoning
489
490 **Success Criteria:** ≥70% accuracy detecting misleading articles
491
492 **Fallback Plans:**
493 - POC1: Try Approach 4 (Weighted Aggregation)
494 - POC2: Implement Approach 6 (Hybrid)
495 - Production: Consider Approach 7 (LLM-as-Judge)
496
497 **Next Steps:**
498 1. Create 30-article test set
499 2. Enhance AI prompt
500 3. Test and measure accuracy
501 4. Use results to inform POC2 design
502
503 **This is POC1's key experimental feature!** 🎯