1 = The Article Verdict Problem
2
3 **Context:** Context-Aware Analysis Investigation
4 **Date:** December 23, 2025
5 **Status:** Solution Chosen for POC1 Testing
6
7 == 🎯 Executive Summary
8
9 **The Problem:** An article's overall credibility is not simply the average of its individual claim verdicts. An article with mostly accurate facts can still be misleading if the conclusion doesn't follow from the evidence.
10
11 **Investigation Scope:** 7 solution approaches analyzed for performance, cost, and complexity.
12
13 **Chosen Solution:** Single-Pass Holistic Analysis (Approach 1) for POC1 testing
14 - Enhance AI prompt to evaluate logical structure
15 - Zero additional cost or architecture changes
16 - Test with 30 articles to validate approach
17 - Mark as experimental - doesn't block POC1 success
18
19 **Fallback Plan:** If Approach 1 shows <70% accuracy, fall back to Weighted Aggregation (Approach 4) in POC1 or defer to POC2 with the Hybrid approach (Approach 6).
20
21 == 📋 The Core Problem
22
23 === Problem Statement
24
25 > "An analysis and verdict of the whole article is not the same as a summary of the analysis and verdicts of the parts (the claims)."
26
27 === Why This Matters
28
29 **Example: The Misleading Article**
30
31 ```
32 Article: "Coffee Cures Cancer!"
33
34 Individual Claims:
35 [1] Coffee contains antioxidants → ✅ WELL-SUPPORTED (95%)
36 [2] Antioxidants fight cancer → ✅ WELL-SUPPORTED (85%)
37 [3] Therefore, coffee cures cancer → ❌ REFUTED (10%)
38
39 Simple Aggregation:
40 - Verdict counts: 2 supported, 1 refuted
41 - Average confidence: 63% (mean of 95%, 85%, and 10%)
42 - Naive conclusion: "Mostly accurate article"
43
44 Reality:
45 - The MAIN CLAIM (coffee cures cancer) is FALSE
46 - Article commits logical fallacy (correlation ≠ causation)
47 - Article is MISLEADING despite containing accurate facts
48 - Readers could be harmed by false medical claim
49
50 Correct Assessment:
51 - Article verdict: MISLEADING / REFUTED
52 - Reason: Makes unsupported causal claim from correlational evidence
53 ```
54
55 === Why Simple Aggregation Fails
56
57 **Pattern 1: False Central Claim**
58 - 4 supporting facts (all true) ✅✅✅✅
59 - 1 main conclusion (false) ❌
60 - Simple average: 80% accurate
61 - Reality: Core argument is false → Article is MISLEADING
62
63 **Pattern 2: Accurate Facts, Wrong Conclusion**
64 - All individual facts are verifiable
65 - Conclusion doesn't follow from facts
66 - Logical fallacy (e.g., correlation → causation)
67 - Simple average looks good, article is dangerous
68
69 **Pattern 3: Misleading Framing**
70 - Facts are accurate
71 - Selective presentation creates false impression
72 - Headline doesn't match content
73 - Simple average misses the problem
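
The minimal Python sketch below makes Pattern 1 concrete; names like ##Claim## and ##is_main## are illustrative only, not from any existing FactHarbor code. The naive average looks reassuring while the main claim is refuted.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    text: str
    confidence: int      # 0-100: how well the claim is supported
    is_main: bool = False

claims = [
    Claim("Coffee contains antioxidants", 95),
    Claim("Antioxidants fight cancer", 85),
    Claim("Coffee cures cancer", 10, is_main=True),
]

# Naive aggregation: a plain average over claim confidences
naive_score = sum(c.confidence for c in claims) / len(claims)
print(f"Naive average: {naive_score:.0f}%")                # ~63% -> looks "mostly accurate"

# Reality check: the main claim alone determines whether the article misleads
main_claim = next(c for c in claims if c.is_main)
print(f"Main claim confidence: {main_claim.confidence}%")  # 10% -> article is misleading
```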
74
75 == ✅ Chosen Solution: Single-Pass Holistic Analysis (POC1)
76
77 === Approach Overview
78
79 **How it works:**
80 - AI analyzes the entire article in ONE API call
81 - Evaluates both individual claims AND overall article credibility
82 - No pipeline changes - just enhanced prompting
83
84 === AI Prompt Enhancement
85
86 **Add to existing prompt:**
87 ```
88 After analyzing individual claims, evaluate the article as a whole:
89
90 1. What is the article's main argument or conclusion?
91 2. Does this conclusion logically follow from the evidence presented?
92 3. Are there logical fallacies? (correlation→causation, cherry-picking, etc.)
93 4. Even if individual facts are accurate, is the article's framing misleading?
94 5. Should the article verdict differ from the average of claim verdicts?
95
96 Provide:
97 - Individual claim verdicts
98 - Overall article verdict (may differ from claim average)
99 - Explanation if article verdict differs from claim pattern
100 ```
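
To show how little plumbing this needs, here is a minimal Python sketch that appends the section above to the existing prompt and sends everything in the same single call. ##call_llm## and ##base_claim_prompt## are hypothetical stand-ins for whatever client and prompt POC1 already uses; the "as JSON" tail is an assumption so the output matches the shape shown in the next section.

```python
# Minimal sketch; `call_llm` is a placeholder for POC1's existing model client.
HOLISTIC_SECTION = """After analyzing individual claims, evaluate the article as a whole:
1. What is the article's main argument or conclusion?
2. Does this conclusion logically follow from the evidence presented?
3. Are there logical fallacies? (correlation->causation, cherry-picking, etc.)
4. Even if individual facts are accurate, is the article's framing misleading?
5. Should the article verdict differ from the average of claim verdicts?
Provide individual claim verdicts, an overall article verdict, and an explanation
if the article verdict differs from the claim pattern, as JSON."""

def call_llm(prompt: str) -> str:
    """Placeholder: POC1 would call its existing model client here."""
    raise NotImplementedError

def analyze_article(article_text: str, base_claim_prompt: str) -> str:
    """Single API call, same as baseline POC1; returns the model's raw JSON text."""
    prompt = f"{base_claim_prompt}\n\n{HOLISTIC_SECTION}\n\nARTICLE:\n{article_text}"
    return call_llm(prompt)
```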
101
102 === Expected AI Output
103
104 ```json
105 {
106   "claims": [
107     {"text": "Coffee contains antioxidants", "verdict": "SUPPORTED", "confidence": 95},
108     {"text": "Antioxidants fight cancer", "verdict": "SUPPORTED", "confidence": 85},
109     {"text": "Coffee cures cancer", "verdict": "REFUTED", "confidence": 10}
110   ],
111   "article_analysis": {
112     "main_argument": "Coffee cures cancer",
113     "logical_assessment": "Article makes causal claim not supported by evidence",
114     "fallacy_detected": "correlation presented as causation",
115     "article_verdict": "MISLEADING",
116     "differs_from_claims": true,
117     "reasoning": "Despite two accurate supporting facts, the main conclusion is unsupported"
118   }
119 }
120 ```
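
A small parsing sketch for turning that JSON shape into plain Python objects; the types are hypothetical, not AKEL's actual data model.

```python
from dataclasses import dataclass
from typing import Optional
import json

@dataclass
class ClaimVerdict:
    text: str
    verdict: str        # e.g. SUPPORTED / REFUTED
    confidence: int     # 0-100

@dataclass
class ArticleAnalysis:
    main_argument: str
    article_verdict: str
    differs_from_claims: bool
    reasoning: str

def parse_result(raw_json: str) -> tuple[list[ClaimVerdict], Optional[ArticleAnalysis]]:
    data = json.loads(raw_json)
    claims = [ClaimVerdict(c["text"], c["verdict"], c["confidence"]) for c in data.get("claims", [])]
    aa = data.get("article_analysis")
    article = ArticleAnalysis(aa["main_argument"], aa["article_verdict"],
                              aa.get("differs_from_claims", False), aa.get("reasoning", "")) if aa else None
    return claims, article
```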
121
122 === Performance & Cost
123
124 **Performance:**
125 - Same as baseline POC1 (single API call)
126 - Fast response time
127 - No additional latency
128
129 **Cost:**
130 - Same as baseline POC1
131 - ~$0.015-0.025 per analysis
132 - No cost increase (just longer prompt)
133
134 **Architecture:**
135 - Zero changes to system architecture
136 - Just prompt engineering
137 - Easy to implement and test
138
139 === POC1 Testing Plan
140
141 **Test Set: 30 Articles**
142 - 10 straightforward (verdict = average works fine)
143 - 10 misleading (accurate facts, wrong conclusion)
144 - 10 complex/nuanced cases
145
146 **Success Criteria:**
147 - AI correctly identifies ≥70% of misleading articles
148 - AI doesn't over-flag straightforward articles
149 - Reasoning is comprehensible
150
151 **Success → Ship it in POC2 as standard feature**
152
153 **Partial Success (50-70%) → Try Approach 4 (Weighted Aggregation) or plan Approach 6 (Hybrid) for POC2**
154
155 **Failure (<50%) → Defer to POC2 with more sophisticated approach**
156
157 === Why This Approach for POC1
158
159 **Advantages:**
160 ✅ Zero additional cost (no extra API calls)
161 ✅ No architecture changes (just prompt)
162 ✅ Fast to implement and test
163 ✅ Fail-fast learning (find out if AI can do this)
164 ✅ If it works → problem solved with minimal effort
165 ✅ If it fails → informed decision for POC2
166
167 **Risks:**
168 ⚠️ AI might miss subtle logical issues
169 ⚠️ Relies entirely on AI's reasoning capability
170 ⚠️ Less structured than multi-pass approaches
171
172 **Mitigation:**
173 - Mark as "experimental" in POC1
174 - Don't block POC1 success on this feature
175 - Use results to inform POC2 design
176 - Have fallback approaches ready
177
178 == 📊 Complete Analysis: All Solution Approaches
179
180 We investigated 7 approaches for solving this problem. Here's the complete overview:
181
182 === Approach 1: Single-Pass Holistic Analysis ⭐ CHOSEN FOR POC1
183
184 **Concept:** AI analyzes article and evaluates both claims and overall credibility in one call.
185
186 **Pros:** Simplest, fastest, cheapest, no architecture changes
187 **Cons:** Relies on AI capability, might miss subtle issues
188 **Cost:** ~$0.020/analysis | **Speed:** Fast | **Complexity:** LOW
189
190 **When to use:** POC1 testing - validate if AI can do this at all
191
192 === Approach 2: Two-Pass Sequential Analysis
193
194 **Concept:** Pass 1 extracts claims, Pass 2 analyzes logical structure given the claims.
195
196 **Pros:** More focused analysis, better debugging, higher reliability
197 **Cons:** Slower (two API calls), more expensive, more complex
198 **Cost:** ~$0.030/analysis | **Speed:** Slower | **Complexity:** MEDIUM
199
200 **When to use:** POC2 if Approach 1 fails, or production for highest quality
201
202 === Approach 3: Structured Output with Explicit Relationships
203
204 **Concept:** AI outputs claim relationships explicitly (main claim, supporting claims, dependencies, logical validity).
205
206 **Pros:** Explicit structure, easier to validate, single API call
207 **Cons:** Complex prompt, relies on AI identifying relationships correctly
208 **Cost:** ~$0.023/analysis | **Speed:** Fast | **Complexity:** MEDIUM
209
210 **When to use:** POC1 if structured data valuable for UI/debugging
211
212 === Approach 4: Weighted Aggregation with Importance Scores
213
214 **Concept:** AI assigns importance weight (0-1) to each claim. Article verdict = weighted average.
215
216 **Pros:** Simple math, easy to explain, single API call
217 **Cons:** Reduces the verdict to a single number (loses nuance), doesn't identify fallacies explicitly
218 **Cost:** ~$0.020/analysis | **Speed:** Fast | **Complexity:** LOW
219
220 **When to use:** POC1 fallback if Approach 1 doesn't work well
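
For concreteness, a minimal sketch of the weighted aggregation idea. The weights would come from the AI; the example weights (main claim at 0.6) are purely illustrative.

```python
def weighted_article_score(claims: list[dict]) -> float:
    """Article score = importance-weighted average of claim confidences."""
    total_weight = sum(c["weight"] for c in claims) or 1.0
    return sum(c["confidence"] * c["weight"] for c in claims) / total_weight

claims = [
    {"text": "Coffee contains antioxidants", "confidence": 95, "weight": 0.2},
    {"text": "Antioxidants fight cancer",    "confidence": 85, "weight": 0.2},
    {"text": "Coffee cures cancer",          "confidence": 10, "weight": 0.6},  # main claim dominates
]
print(weighted_article_score(claims))  # 42.0 -> leans MISLEADING, unlike the unweighted ~63%
```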
221
222 === Approach 5: Post-Processing Heuristics
223
224 **Concept:** Rule-based detection of logical issues after claim extraction (e.g., "if article contains 'causes' but only correlational evidence, flag causal fallacy").
225
226 **Pros:** Cheapest, deterministic, explainable, no extra API calls
227 **Cons:** Brittle rules, high maintenance, false positives/negatives
228 **Cost:** ~$0.018/analysis | **Speed:** Fast | **Complexity:** MEDIUM
229
230 **When to use:** Add to any other approach for robustness
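
A sketch of what one such rule might look like. The keyword list is illustrative and deliberately simple, which is exactly why these rules are brittle.

```python
import re

# Illustrative causal-language rule; a real rule set would be larger and tuned.
CAUSAL_PATTERN = re.compile(r"\b(cures?|causes?|prevents?|eliminates?)\b", re.IGNORECASE)

def causal_claim_unsupported(headline: str, claims: list[dict]) -> bool:
    """Flag when the headline asserts causation but no well-supported claim does."""
    if not CAUSAL_PATTERN.search(headline):
        return False
    supported_causal = any(
        CAUSAL_PATTERN.search(c["text"]) and c["verdict"] == "SUPPORTED" for c in claims
    )
    return not supported_causal

claims = [
    {"text": "Coffee contains antioxidants", "verdict": "SUPPORTED"},
    {"text": "Antioxidants fight cancer", "verdict": "SUPPORTED"},
    {"text": "Coffee cures cancer", "verdict": "REFUTED"},
]
print(causal_claim_unsupported("Coffee Cures Cancer!", claims))  # True -> flag causal fallacy
```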
231
232 === Approach 6: Hybrid (Weighted Aggregation + Heuristics) ⭐ RECOMMENDED FOR POC2
233
234 **Concept:** Combine AI-assigned weights with rule-based fallacy detection.
235
236 **Pros:** Best of both worlds, robust, still single API call, cost-effective
237 **Cons:** More complex than single approach, need to tune interaction
238 **Cost:** ~$0.020/analysis | **Speed:** Fast | **Complexity:** MED-HIGH
239
240 **When to use:** POC2 for robust production-ready solution
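
A short sketch of how the two previous sketches could be combined; the thresholds are illustrative and would need tuning.

```python
def hybrid_article_verdict(headline: str, claims: list[dict]) -> str:
    """Combine the weighted score (Approach 4 sketch) with the heuristic flag (Approach 5 sketch)."""
    if causal_claim_unsupported(headline, claims):   # claims carry "text" and "verdict"
        return "MISLEADING"
    score = weighted_article_score(claims)           # claims also carry "confidence" and "weight"
    if score >= 70:
        return "WELL-SUPPORTED"
    if score >= 40:
        return "MIXED"
    return "REFUTED"
```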
241
242 === Approach 7: LLM-as-Judge (Verification Pass)
243
244 **Concept:** Pass 1 generates verdict, Pass 2 verifies if verdict matches article content.
245
246 **Pros:** AI checks AI, high reliability, catches mistakes
247 **Cons:** Slower (two calls), more expensive, verification might also err
248 **Cost:** ~$0.030/analysis | **Speed:** Slower | **Complexity:** MEDIUM
249
250 **When to use:** Production if quality is paramount
251
252 === Comparison Matrix
253
254 | Approach | API Calls | Cost | Speed | Complexity | Reliability | Best For |
255 |----------|-----------|------|-------|------------|-------------|----------|
256 | 1. Single-Pass ⭐ | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 |
257 | 2. Two-Pass | 2 | 💰💰 Med | 🐌 Slow | 🟡 Med | ✅ High | POC2/Prod |
258 | 3. Structured | 1 | 💰 Low | ⚡ Fast | 🟡 Med | ✅ High | POC1 |
259 | 4. Weighted | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 |
260 | 5. Heuristics | 1 | 💰 Lowest | ⚡⚡ Fastest | 🟡 Med | ⚠️ Medium | Any |
261 | 6. Hybrid ⭐ | 1 | 💰 Low | ⚡ Fast | 🟠 Med-High | ✅ High | POC2 |
262 | 7. Judge | 2 | 💰💰 Med | 🐌 Slow | 🟡 Med | ✅ High | Production |
263
264 === Phased Recommendation
265
266 **POC1 (Immediate):**
267 - Test Approach 1 (Single-Pass Holistic)
268 - Mark as experimental
269 - Gather data on AI capability
270
271 **POC2 (If POC1 validates approach):**
272 - Upgrade to Approach 6 (Hybrid)
273 - Add heuristics for robustness
274 - Target 85%+ accuracy
275
276 **Production (Post-POC2):**
277 - If quality issues: Consider Approach 7 (LLM-as-Judge)
278 - If quality acceptable: Keep Approach 6
279 - Target 90%+ accuracy
280
281 == 🎯 Decision Framework
282
283 === POC1 Evaluation Criteria
284
285 After testing with 30 articles:
286
287 **If AI Accuracy ≥70%:**
288 - ✅ Approach validated!
289 - ✅ Ship as standard feature in POC2
290 - ✅ Consider adding heuristics (Approach 6) for robustness
291
292 **If AI Accuracy 50-70%:**
293 - ⚠️ Promising but needs improvement
294 - ⚠️ Try Approach 4 (Weighted Aggregation) in POC1
295 - ⚠️ Plan Approach 6 (Hybrid) for POC2
296
297 **If AI Accuracy <50%:**
298 - ❌ Current AI can't do this reliably
299 - ❌ Defer to POC2 or post-POC2
300 - ❌ Consider Approach 2 or 7 (two-pass) for production
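
Expressed as code, the three brackets above reduce to a single decision rule; the function name and return strings are just labels for those branches.

```python
def poc1_next_step(misleading_accuracy: float) -> str:
    """misleading_accuracy = fraction of Category 2 (misleading) articles correctly flagged."""
    if misleading_accuracy >= 0.70:
        return "Validated: ship in POC2, consider adding heuristics (Approach 6)"
    if misleading_accuracy >= 0.50:
        return "Partial: try Approach 4 in POC1, plan Approach 6 for POC2"
    return "Not reliable yet: defer to POC2, consider Approach 2 or 7"
```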
301
302 === Why This Matters for POC1
303
304 **Testing this in POC1:**
305 - Validates core capability (can AI do nuanced reasoning?)
306 - Informs POC2 architecture decisions
307 - Zero cost to try (just prompt enhancement)
308 - Fail-fast principle (test hardest part first)
309
310 **Not testing this in POC1:**
311 - Keeps POC1 scope minimal
312 - Focuses on core claim extraction
313 - But misses early learning opportunity
314
315 **Decision:** Test it, mark as experimental, don't block POC1 success on it.
316
317 == 📝 Implementation Notes
318
319 === What AKEL Must Do
320
321 **For POC1 (Approach 1):**
322 1. Enhanced prompt with logical analysis section
323 2. Parse AI output for both claim-level and article-level verdicts
324 3. Display both verdicts to user
325 4. Track accuracy on test set
326
327 **No architecture changes needed.**
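
A sketch of how those four steps could hang together, reusing the ##analyze_article## and ##parse_result## sketches from earlier on this page; the names are illustrative, not AKEL's actual interfaces.

```python
def run_poc1_analysis(article_text: str, base_claim_prompt: str) -> dict:
    raw_json = analyze_article(article_text, base_claim_prompt)  # step 1: enhanced prompt, one call
    claims, article = parse_result(raw_json)                     # step 2: claim + article verdicts
    return {                                                     # step 3: both verdicts for display
        "claims": [(c.text, c.verdict, c.confidence) for c in claims],
        "article_verdict": article.article_verdict if article else None,  # None -> fallback path
    }
    # step 4 (accuracy tracking) happens offline against the 30-article test set
```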
328
329 === What Gets Displayed to Users
330
331 **Output Format:**
332 ```
333 ANALYSIS SUMMARY (4-6 sentences, context-aware):
334 "This article argues that coffee cures cancer based on evidence about
335 antioxidants. We analyzed 3 claims: two supporting facts about coffee's
336 chemical properties are well-supported, but the main causal claim is
337 refuted by current evidence. The article confuses correlation with
338 causation. Overall assessment: MISLEADING - makes an unsupported
339 medical claim despite citing some accurate facts."
340
341 CLAIMS VERDICTS:
342 [1] Coffee contains antioxidants: WELL-SUPPORTED (95%)
343 [2] Antioxidants fight cancer: WELL-SUPPORTED (85%)
344 [3] Coffee cures cancer: REFUTED (10%)
345
346 ARTICLE VERDICT: MISLEADING
347 The article's main conclusion is not supported by the evidence presented.
348 ```
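
A sketch of rendering the parsed objects into that layout; the summary text comes from the model, and this formatting function is illustrative.

```python
def render_report(summary: str, claims, article_verdict: str, article_reason: str) -> str:
    lines = ["ANALYSIS SUMMARY:", summary, "", "CLAIMS VERDICTS:"]
    for i, c in enumerate(claims, start=1):
        lines.append(f"[{i}] {c.text}: {c.verdict} ({c.confidence}%)")
    lines += ["", f"ARTICLE VERDICT: {article_verdict}", article_reason]
    return "\n".join(lines)
```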
349
350 === Error Handling
351
352 **If AI fails to provide article-level analysis:**
353 - Fall back to claim-average verdict
354 - Log failure for analysis
355 - Don't break the analysis
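
A sketch of that fallback, assuming the parsed objects from the earlier sketch; the thresholds and logger name are illustrative.

```python
import logging

logger = logging.getLogger("article_verdict")

def article_verdict_or_fallback(claims, article) -> str:
    """Use the AI's article-level verdict when present; otherwise fall back to the claim average."""
    if article is not None:
        return article.article_verdict
    logger.warning("No article_analysis in model output; falling back to claim average")
    avg = sum(c.confidence for c in claims) / len(claims) if claims else 0
    return "WELL-SUPPORTED" if avg >= 70 else "MIXED" if avg >= 40 else "REFUTED"
```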
356
357 **If AI over-flags straightforward articles:**
358 - Review prompt tuning
359 - Consider adding confidence threshold
360 - Track false positive rate
361
362 == 🔬 Testing Strategy
363
364 === Test Set Composition
365
366 **Category 1: Straightforward Articles (10 articles)**
367 - Clear claims with matching overall message
368 - Verdict = average should work fine
369 - Tests that we don't over-flag
370
371 **Category 2: Misleading Articles (10 articles)**
372 - Accurate facts, unsupported conclusion
373 - Logical fallacies present
374 - Verdict ≠ average
375 - Core test of capability
376
377 **Category 3: Complex/Nuanced (10 articles)**
378 - Gray areas
379 - Multiple valid interpretations
380 - Tests nuance handling
381
382 === Success Metrics
383
384 **Quantitative:**
385 - ≥70% accuracy on Category 2 (misleading articles)
386 - ≤30% false positives on Category 1 (straightforward)
387 - ≥50% accuracy on Category 3 (complex)
388
389 **Qualitative:**
390 - Reasoning is comprehensible to humans
391 - False positives are explainable
392 - False negatives reveal clear AI limitations
393
394 === Documentation
395
396 **For each test case, record:**
397 - Article summary
398 - AI's claim verdicts
399 - AI's article verdict
400 - AI's reasoning
401 - Human judgment (correct/incorrect)
402 - Notes on why AI succeeded/failed
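
A sketch of a per-article record with those fields, plus the two headline checks from the success metrics; the field names and the boolean ##human_says_correct## are illustrative.

```python
from dataclasses import dataclass

@dataclass
class TestCaseResult:
    article_summary: str
    category: int              # 1 = straightforward, 2 = misleading, 3 = complex/nuanced
    ai_claim_verdicts: list
    ai_article_verdict: str
    ai_reasoning: str
    human_says_correct: bool   # human judgment of the article-level verdict
    notes: str = ""

def category_accuracy(results: list, category: int) -> float:
    subset = [r for r in results if r.category == category]
    return sum(r.human_says_correct for r in subset) / len(subset) if subset else 0.0

# POC1 checks: >=70% accuracy on misleading articles, <=30% false positives on straightforward ones
# misleading_ok     = category_accuracy(results, 2) >= 0.70
# false_positive_ok = (1 - category_accuracy(results, 1)) <= 0.30
```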
403
404 == 💡 Key Insights
405
406 === What This Tests
407
408 **Core Capability:**
409 Can AI understand that article credibility depends on:
410 1. Logical structure (does conclusion follow?)
411 2. Claim importance (main vs. supporting)
412 3. Reasoning quality (sound vs. fallacious)
413
414 **Not just:**
415 - Accuracy of individual facts
416 - Simple averages
417 - Keyword matching
418
419 === Why This Is Important
420
421 **For FactHarbor's Mission:**
422 - Prevents misleading "mostly accurate" verdicts
423 - Catches dangerous misinformation (medical, financial)
424 - Provides nuanced analysis users can trust
425
426 **For POC Validation:**
427 - Tests most challenging capability
428 - If AI can do this, everything else is easier
429 - If AI can't, we know early and adjust
430
431 === Strategic Value
432
433 **If Approach 1 works (≥70% accuracy):**
434 - ✅ Solved complex problem with zero architecture changes
435 - ✅ No cost increase
436 - ✅ Differentiation from competitors who only check facts
437 - ✅ Foundation for more sophisticated features
438
439 **If Approach 1 doesn't work:**
440 - ✅ Learned AI limitations early
441 - ✅ Informed decision for POC2
442 - ✅ Can plan proper solution (Approach 6 or 7)
443
444 == 🎓 Lessons Learned
445
446 === From Investigation
447
448 **AI Capabilities:**
449 - Modern LLMs (e.g., Claude Sonnet 4.5) can handle nuanced reasoning
450 - But reliability varies significantly
451 - Need to test, not assume
452
453 **Cost-Performance Trade-offs:**
454 - Single-pass approaches: Fast and cheap but less reliable
455 - Multi-pass approaches: Slower and more expensive but more robust
456 - Hybrid approaches: Best balance for production
457
458 **Architecture Decisions:**
459 - Don't over-engineer before validating need
460 - Test simplest approach first
461 - Have fallback plans ready
462
463 === For POC1
464
465 **Keep It Simple:**
466 - Test Approach 1 with minimal changes
467 - Mark as experimental
468 - Use results to guide POC2
469
470 **Fail Fast:**
471 - 30-article test set reveals capability quickly
472 - Better to learn in POC1 than after building complex architecture
473
474 **Document Everything:**
475 - Track AI failures
476 - Understand patterns
477 - Inform future improvements
478
479 == ✅ Summary
480
481 **Problem:** Article credibility ≠ average of claim verdicts
482
483 **Investigation:** 7 approaches analyzed for cost, speed, reliability
484
485 **Chosen Solution:** Single-Pass Holistic Analysis (Approach 1)
486 - Test in POC1 with enhanced prompt
487 - Zero cost increase, no architecture changes
488 - Validate if AI can do nuanced reasoning
489
490 **Success Criteria:** ≥70% accuracy detecting misleading articles
491
492 **Fallback Plans:**
493 - POC1: Try Approach 4 (Weighted Aggregation)
494 - POC2: Implement Approach 6 (Hybrid)
495 - Production: Consider Approach 7 (LLM-as-Judge)
496
497 **Next Steps:**
498 1. Create 30-article test set
499 2. Enhance AI prompt
500 3. Test and measure accuracy
501 4. Use results to inform POC2 design
502
503 **This is POC1's key experimental feature!** 🎯