The Article Verdict Problem
Context: Context-Aware Analysis Investigation
Date: December 23, 2025
Status: Solution Chosen for POC1 Testing
🎯 Executive Summary
The Problem: An article's overall credibility is not simply the average of its individual claim verdicts. An article with mostly accurate facts can still be misleading if the conclusion doesn't follow from the evidence.
Investigation Scope: 7 solution approaches analyzed for performance, cost, and complexity.
Chosen Solution: Single-Pass Holistic Analysis (Approach 1) for POC1 testing
- Enhance AI prompt to evaluate logical structure
- Zero additional cost or architecture changes
- Test with 30 articles to validate approach
- Mark as experimental - doesn't block POC1 success
Fallback Plan: If Approach 1 shows <70% accuracy, implement Weighted Aggregation (Approach 4) or defer to POC2 with Hybrid approach (Approach 6).
🔍 The Core Problem
Problem Statement
"An analysis and verdict of the whole article is not the same as a summary of the analysis and verdicts of the parts (the claims)."
Why This Matters
Example: The Misleading Article
```
Article: "Coffee Cures Cancer!"
Individual Claims:
[1] Coffee contains antioxidants → ✅ WELL-SUPPORTED (95%)
[2] Antioxidants fight cancer → ✅ WELL-SUPPORTED (85%)
[3] Therefore, coffee cures cancer → ❌ REFUTED (10%)
Simple Aggregation:
- Verdict counts: 2 supported, 1 refuted
- Average confidence: 63% (2/3 claims somewhat supported)
- Naive conclusion: "Mostly accurate article"
Reality:
- The MAIN CLAIM (coffee cures cancer) is FALSE
- Article commits logical fallacy (correlation ≠ causation)
- Article is MISLEADING despite containing accurate facts
- Readers could be harmed by false medical claim
Correct Assessment:
- Article verdict: MISLEADING / REFUTED
- Reason: Makes unsupported causal claim from correlational evidence
```
Why Simple Aggregation Fails
Pattern 1: False Central Claim
- 4 supporting facts (all true) ✅✅✅✅
- 1 main conclusion (false) ❌
- Simple average: 80% accurate
- Reality: Core argument is false → Article is MISLEADING
Pattern 2: Accurate Facts, Wrong Conclusion
- All individual facts are verifiable
- Conclusion doesn't follow from facts
- Logical fallacy (e.g., correlation ≠ causation)
- Simple average looks good, but the article is dangerous
Pattern 3: Misleading Framing
- Facts are accurate
- Selective presentation creates false impression
- Headline doesn't match content
- Simple average misses the problem
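The first failure pattern can be sketched in a few lines of Python. The claims and confidences come from the coffee example above; the `is_main_claim` flag and the 50% threshold are illustrative assumptions, not part of the real pipeline:

```python
# Sketch of why simple aggregation fails, using the "Coffee Cures Cancer!"
# example. The is_main_claim flag is an illustrative assumption.
claims = [
    {"text": "Coffee contains antioxidants", "confidence": 95, "is_main_claim": False},
    {"text": "Antioxidants fight cancer", "confidence": 85, "is_main_claim": False},
    {"text": "Coffee cures cancer", "confidence": 10, "is_main_claim": True},
]

# Naive aggregation: average the confidences across all claims.
naive_score = sum(c["confidence"] for c in claims) / len(claims)
print(f"Naive average: {naive_score:.0f}%")  # 63% -> looks "mostly accurate"

# Importance-aware check: if the main claim is refuted, the article is
# misleading no matter how well the supporting facts hold up.
main = next(c for c in claims if c["is_main_claim"])
verdict = "MISLEADING" if main["confidence"] < 50 else "SUPPORTED"
print(f"Main-claim verdict: {verdict}")  # MISLEADING
```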
✅ Chosen Solution: Single-Pass Holistic Analysis (POC1)
Approach Overview
How it works:
- AI analyzes the entire article in ONE API call
- Evaluates both individual claims AND overall article credibility
- No pipeline changes - just enhanced prompting
AI Prompt Enhancement
Add to existing prompt:
```
After analyzing individual claims, evaluate the article as a whole:
1. What is the article's main argument or conclusion?
2. Does this conclusion logically follow from the evidence presented?
3. Are there logical fallacies? (correlation → causation, cherry-picking, etc.)
4. Even if individual facts are accurate, is the article's framing misleading?
5. Should the article verdict differ from the average of claim verdicts?
Provide:
- Individual claim verdicts
- Overall article verdict (may differ from claim average)
- Explanation if article verdict differs from claim pattern
```
Expected AI Output
```json
{
"claims": [
{"text": "Coffee contains antioxidants", "verdict": "SUPPORTED", "confidence": 95},
{"text": "Antioxidants fight cancer", "verdict": "SUPPORTED", "confidence": 85},
{"text": "Coffee cures cancer", "verdict": "REFUTED", "confidence": 10}
],
"article_analysis": {
"main_argument": "Coffee cures cancer",
"logical_assessment": "Article makes causal claim not supported by evidence",
"fallacy_detected": "correlation presented as causation",
"article_verdict": "MISLEADING",
"differs_from_claims": true,
"reasoning": "Despite two accurate supporting facts, the main conclusion is unsupported"
}
}
```
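A minimal sketch of how the output above might be parsed to extract both verdict levels, assuming the AI returns JSON shaped like the example. The `raw_response` string stands in for a real API reply:

```python
import json

# Parse the expected AI output. Field names match the example JSON above;
# the raw_response literal is a stand-in for a real model response.
raw_response = """{
  "claims": [
    {"text": "Coffee contains antioxidants", "verdict": "SUPPORTED", "confidence": 95},
    {"text": "Coffee cures cancer", "verdict": "REFUTED", "confidence": 10}
  ],
  "article_analysis": {
    "article_verdict": "MISLEADING",
    "differs_from_claims": true,
    "reasoning": "Main conclusion is unsupported"
  }
}"""

data = json.loads(raw_response)

# Claim-level verdicts for the per-claim display.
claim_verdicts = [(c["text"], c["verdict"]) for c in data["claims"]]

# Article-level verdict, which may differ from the claim pattern.
article_verdict = data["article_analysis"]["article_verdict"]
print(article_verdict)  # MISLEADING
```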
Performance & Cost
Performance:
- Same as baseline POC1 (single API call)
- Fast response time
- No additional latency
Cost:
- Same as baseline POC1
- $0.015-0.025 per analysis
- No cost increase (just longer prompt)
Architecture:
- Zero changes to system architecture
- Just prompt engineering
- Easy to implement and test
POC1 Testing Plan
Test Set: 30 Articles
- 10 straightforward (verdict = average works fine)
- 10 misleading (accurate facts, wrong conclusion)
- 10 complex/nuanced cases
Success Criteria:
- AI correctly identifies ≥70% of misleading articles
- AI doesn't over-flag straightforward articles
- Reasoning is comprehensible
Success → Ship it in POC2 as standard feature
Partial Success (50-70%) → Try Approach 4 (Weighted Aggregation) or plan Approach 6 (Hybrid) for POC2
Failure (<50%) → Defer to POC2 with more sophisticated approach
Why This Approach for POC1
Advantages:
✅ Zero additional cost (no extra API calls)
✅ No architecture changes (just prompt)
✅ Fast to implement and test
✅ Fail-fast learning (find out if AI can do this)
✅ If it works → problem solved with minimal effort
✅ If it fails → informed decision for POC2
Risks:
⚠️ AI might miss subtle logical issues
⚠️ Relies entirely on AI's reasoning capability
⚠️ Less structured than multi-pass approaches
Mitigation:
- Mark as "experimental" in POC1
- Don't block POC1 success on this feature
- Use results to inform POC2 design
- Have fallback approaches ready
📊 Complete Analysis: All Solution Approaches
We investigated 7 approaches for solving this problem. Here's the complete overview:
Approach 1: Single-Pass Holistic Analysis ⭐ CHOSEN FOR POC1
Concept: AI analyzes article and evaluates both claims and overall credibility in one call.
Pros: Simplest, fastest, cheapest, no architecture changes
Cons: Relies on AI capability, might miss subtle issues
Cost: $0.020/analysis | Speed: Fast | Complexity: LOW
When to use: POC1 testing - validate if AI can do this at all
Approach 2: Two-Pass Sequential Analysis
Concept: Pass 1 extracts claims, Pass 2 analyzes logical structure given the claims.
Pros: More focused analysis, better debugging, higher reliability
Cons: Slower (two API calls), more expensive, more complex
Cost: $0.030/analysis | Speed: Slower | Complexity: MEDIUM
When to use: POC2 if Approach 1 fails, or production for highest quality
Approach 3: Structured Output with Explicit Relationships
Concept: AI outputs claim relationships explicitly (main claim, supporting claims, dependencies, logical validity).
Pros: Explicit structure, easier to validate, single API call
Cons: Complex prompt, relies on AI identifying relationships correctly
Cost: $0.023/analysis | Speed: Fast | Complexity: MEDIUM
When to use: POC1 if structured data valuable for UI/debugging
Approach 4: Weighted Aggregation with Importance Scores
Concept: AI assigns importance weight (0-1) to each claim. Article verdict = weighted average.
Pros: Simple math, easy to explain, single API call
Cons: Reduces to number (loses nuance), doesn't identify fallacies explicitly
Cost: $0.020/analysis | Speed: Fast | Complexity: LOW
When to use: POC1 fallback if Approach 1 doesn't work well
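The weighted-average math behind Approach 4 is a one-liner. The importance weights below are illustrative, not real AI output; note how weighting the refuted main claim pulls the article score well below the naive 63% average:

```python
# Sketch of Approach 4: the AI assigns each claim an importance weight (0-1),
# and the article score is the importance-weighted average of confidences.
def weighted_article_score(claims):
    total_weight = sum(c["importance"] for c in claims)
    return sum(c["confidence"] * c["importance"] for c in claims) / total_weight

# Illustrative weights: the main causal claim dominates the verdict.
claims = [
    {"text": "Coffee contains antioxidants", "confidence": 95, "importance": 0.2},
    {"text": "Antioxidants fight cancer", "confidence": 85, "importance": 0.2},
    {"text": "Coffee cures cancer", "confidence": 10, "importance": 0.6},
]

print(f"{weighted_article_score(claims):.0f}%")  # 42%, vs. a naive average of 63%
```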
Approach 5: Post-Processing Heuristics
Concept: Rule-based detection of logical issues after claim extraction (e.g., "if article contains 'causes' but only correlational evidence, flag causal fallacy").
Pros: Cheapest, deterministic, explainable, no extra API calls
Cons: Brittle rules, high maintenance, false positives/negatives
Cost: $0.018/analysis | Speed: Fast | Complexity: MEDIUM
When to use: Add to any other approach for robustness
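One way the example rule from Approach 5 might look in code. The keyword lists and function name are illustrative assumptions; a production rule set would need far broader coverage and tuning against false positives:

```python
import re

# Sketch of a rule-based check: flag a causal fallacy when the headline uses
# causal language but the evidence sentences only carry correlational support.
# Keyword lists are illustrative, not exhaustive.
CAUSAL_WORDS = re.compile(r"\b(cures?|causes?|prevents?|leads to)\b", re.IGNORECASE)
CORRELATIONAL_WORDS = re.compile(r"\b(linked to|associated with|correlat\w+)\b", re.IGNORECASE)

def detect_causal_fallacy(headline, evidence_sentences):
    makes_causal_claim = bool(CAUSAL_WORDS.search(headline))
    only_correlational = all(
        CORRELATIONAL_WORDS.search(s) and not CAUSAL_WORDS.search(s)
        for s in evidence_sentences
    )
    return makes_causal_claim and only_correlational

flagged = detect_causal_fallacy(
    "Coffee Cures Cancer!",
    ["Coffee is linked to antioxidant intake.",
     "Antioxidants are associated with lower cancer rates."],
)
print(flagged)  # True
```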
Approach 6: Hybrid (Weighted Aggregation + Heuristics) ⭐ RECOMMENDED FOR POC2
Concept: Combine AI-assigned weights with rule-based fallacy detection.
Pros: Best of both worlds, robust, still single API call, cost-effective
Cons: More complex than single approach, need to tune interaction
Cost: $0.020/analysis | Speed: Fast | Complexity: MED-HIGH
When to use: POC2 for robust production-ready solution
Approach 7: LLM-as-Judge (Verification Pass)
Concept: Pass 1 generates verdict, Pass 2 verifies if verdict matches article content.
Pros: AI checks AI, high reliability, catches mistakes
Cons: Slower (two calls), more expensive, verification might also err
Cost: $0.030/analysis | Speed: Slower | Complexity: MEDIUM
When to use: Production if quality is paramount
Comparison Matrix
| Approach | API Calls | Cost | Speed | Complexity | Reliability | Best For |
|----------|-----------|------|-------|------------|-------------|----------|
| 1. Single-Pass ⭐ | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 |
| 2. Two-Pass | 2 | 💰💰 Med | 🐢 Slow | 🟡 Med | ✅ High | POC2/Prod |
| 3. Structured | 1 | 💰 Low | ⚡ Fast | 🟡 Med | ✅ High | POC1 |
| 4. Weighted | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 |
| 5. Heuristics | 1 | 💰 Lowest | ⚡⚡ Fastest | 🟡 Med | ⚠️ Medium | Any |
| 6. Hybrid ⭐ | 1 | 💰 Low | ⚡ Fast | 🔴 Med-High | ✅ High | POC2 |
| 7. Judge | 2 | 💰💰 Med | 🐢 Slow | 🟡 Med | ✅ High | Production |
Phased Recommendation
POC1 (Immediate):
- Test Approach 1 (Single-Pass Holistic)
- Mark as experimental
- Gather data on AI capability
POC2 (If POC1 validates approach):
- Upgrade to Approach 6 (Hybrid)
- Add heuristics for robustness
- Target 85%+ accuracy
Production (Post-POC2):
- If quality issues: Consider Approach 7 (LLM-as-Judge)
- If quality acceptable: Keep Approach 6
- Target 90%+ accuracy
🎯 Decision Framework
POC1 Evaluation Criteria
After testing with 30 articles:
If AI Accuracy ≥70%:
- ✅ Approach validated!
- ✅ Ship as standard feature in POC2
- ✅ Consider adding heuristics (Approach 6) for robustness
If AI Accuracy 50-70%:
- ⚠️ Promising but needs improvement
- ⚠️ Try Approach 4 (Weighted Aggregation) in POC1
- ⚠️ Plan Approach 6 (Hybrid) for POC2
If AI Accuracy <50%:
- ❌ Current AI can't do this reliably
- ❌ Defer to POC2 or post-POC2
- ❌ Consider Approach 2 or 7 (two-pass) for production
Why This Matters for POC1
Testing this in POC1:
- Validates core capability (can AI do nuanced reasoning?)
- Informs POC2 architecture decisions
- Zero cost to try (just prompt enhancement)
- Fail-fast principle (test hardest part first)
Not testing this in POC1:
- Keeps POC1 scope minimal
- Focuses on core claim extraction
- But misses early learning opportunity
Decision: Test it, mark as experimental, don't block POC1 success on it.
📋 Implementation Notes
What AKEL Must Do
For POC1 (Approach 1):
1. Enhanced prompt with logical analysis section
2. Parse AI output for both claim-level and article-level verdicts
3. Display both verdicts to user
4. Track accuracy on test set
No architecture changes needed.
What Gets Displayed to Users
Output Format:
```
ANALYSIS SUMMARY (4-6 sentences, context-aware):
"This article argues that coffee cures cancer based on evidence about
antioxidants. We analyzed 3 claims: two supporting facts about coffee's
chemical properties are well-supported, but the main causal claim is
refuted by current evidence. The article confuses correlation with
causation. Overall assessment: MISLEADING - makes an unsupported
medical claim despite citing some accurate facts."
CLAIMS VERDICTS:
[1] Coffee contains antioxidants: WELL-SUPPORTED (95%)
[2] Antioxidants fight cancer: WELL-SUPPORTED (85%)
[3] Coffee cures cancer: REFUTED (10%)
ARTICLE VERDICT: MISLEADING
The article's main conclusion is not supported by the evidence presented.
```
Error Handling
If AI fails to provide article-level analysis:
- Fall back to claim-average verdict
- Log failure for analysis
- Don't break the analysis
If AI over-flags straightforward articles:
- Review prompt tuning
- Consider adding confidence threshold
- Track false positive rate
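The first fallback rule above might be sketched as follows, assuming the parsed JSON structure from the expected-output example. The function name and verdict thresholds are illustrative assumptions:

```python
import logging

# Sketch of the fallback: if the AI response lacks article-level analysis,
# derive a verdict from the claim average and log the failure for review.
# Thresholds (70/40) are illustrative, not tuned values.
def article_verdict_with_fallback(data):
    analysis = data.get("article_analysis")
    if analysis and "article_verdict" in analysis:
        return analysis["article_verdict"]
    logging.warning("No article-level analysis in AI output; using claim average")
    avg = sum(c["confidence"] for c in data["claims"]) / len(data["claims"])
    return "SUPPORTED" if avg >= 70 else "MIXED" if avg >= 40 else "REFUTED"

# A response missing article_analysis falls back to the claim average.
partial = {"claims": [{"confidence": 95}, {"confidence": 85}, {"confidence": 10}]}
print(article_verdict_with_fallback(partial))  # MIXED (average is 63%)
```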
🔬 Testing Strategy
Test Set Composition
Category 1: Straightforward Articles (10 articles)
- Clear claims with matching overall message
- Verdict = average should work fine
- Tests that we don't over-flag
Category 2: Misleading Articles (10 articles)
- Accurate facts, unsupported conclusion
- Logical fallacies present
- Verdict ≠ average
- Core test of capability
Category 3: Complex/Nuanced (10 articles)
- Gray areas
- Multiple valid interpretations
- Tests nuance handling
Success Metrics
Quantitative:
- ≥70% accuracy on Category 2 (misleading articles)
- ≤30% false positives on Category 1 (straightforward)
- ≥50% accuracy on Category 3 (complex)
Qualitative:
- Reasoning is comprehensible to humans
- False positives are explainable
- False negatives reveal clear AI limitations
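These metrics fall straight out of the per-case records described below. A minimal sketch; the record fields and sample results are illustrative, not real test data:

```python
# Compute per-category accuracy from recorded test cases. Each record notes
# its category and whether the human judged the AI's article verdict correct.
def accuracy(results, category):
    subset = [r for r in results if r["category"] == category]
    return sum(r["ai_correct"] for r in subset) / len(subset)

# Illustrative sample results (not real data): 8/10 on misleading articles,
# 9/10 on straightforward ones.
results = (
    [{"category": "misleading", "ai_correct": True}] * 8
    + [{"category": "misleading", "ai_correct": False}] * 2
    + [{"category": "straightforward", "ai_correct": True}] * 9
    + [{"category": "straightforward", "ai_correct": False}] * 1
)

print(f"Category 2 accuracy: {accuracy(results, 'misleading'):.0%}")
print(f"Category 1 false-positive rate: {1 - accuracy(results, 'straightforward'):.0%}")
```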
Documentation
For each test case, record:
- Article summary
- AI's claim verdicts
- AI's article verdict
- AI's reasoning
- Human judgment (correct/incorrect)
- Notes on why AI succeeded/failed
💡 Key Insights
What This Tests
Core Capability:
Can AI understand that article credibility depends on:
1. Logical structure (does the conclusion follow?)
2. Claim importance (main vs. supporting)
3. Reasoning quality (sound vs. fallacious)
Not just:
- Accuracy of individual facts
- Simple averages
- Keyword matching
Why This Is Important
For FactHarbor's Mission:
- Prevents misleading "mostly accurate" verdicts
- Catches dangerous misinformation (medical, financial)
- Provides nuanced analysis users can trust
For POC Validation:
- Tests most challenging capability
- If AI can do this, everything else is easier
- If AI can't, we know early and adjust
Strategic Value
If Approach 1 works (≥70% accuracy):
- ✅ Solved complex problem with zero architecture changes
- ✅ No cost increase
- ✅ Differentiation from competitors who only check facts
- ✅ Foundation for more sophisticated features
If Approach 1 doesn't work:
- ✅ Learned AI limitations early
- ✅ Informed decision for POC2
- ✅ Can plan proper solution (Approach 6 or 7)
📚 Lessons Learned
From Investigation
AI Capabilities:
- Modern LLMs (Sonnet 4.5) can do nuanced reasoning
- But reliability varies significantly
- Need to test, not assume
Cost-Performance Trade-offs:
- Single-pass approaches: Fast and cheap but less reliable
- Multi-pass approaches: Slower and expensive but more robust
- Hybrid approaches: Best balance for production
Architecture Decisions:
- Don't over-engineer before validating need
- Test simplest approach first
- Have fallback plans ready
For POC1
Keep It Simple:
- Test Approach 1 with minimal changes
- Mark as experimental
- Use results to guide POC2
Fail Fast:
- 30-article test set reveals capability quickly
- Better to learn in POC1 than after building complex architecture
Document Everything:
- Track AI failures
- Understand patterns
- Inform future improvements
✅ Summary
Problem: Article credibility ≠ average of claim verdicts
Investigation: 7 approaches analyzed for cost, speed, reliability
Chosen Solution: Single-Pass Holistic Analysis (Approach 1)
- Test in POC1 with enhanced prompt
- Zero cost increase, no architecture changes
- Validate if AI can do nuanced reasoning
Success Criteria: ≥70% accuracy detecting misleading articles
Fallback Plans:
- POC1: Try Approach 4 (Weighted Aggregation)
- POC2: Implement Approach 6 (Hybrid)
- Production: Consider Approach 7 (LLM-as-Judge)
Next Steps:
1. Create 30-article test set
2. Enhance AI prompt
3. Test and measure accuracy
4. Use results to inform POC2 design
This is POC1's key experimental feature! 🎯