Last modified by Robert Schaub on 2025/12/24 09:59

= POC2: Robust Quality & Reliability =

**Phase Goal:** Prove AKEL produces high-quality outputs consistently at scale

**Success Metric:** <5% hallucination rate, all 4 quality gates operational


== 1. Overview ==

POC2 extends POC1 by implementing the full quality assurance framework (all 4 gates), adding evidence deduplication, and processing significantly more test articles to validate system reliability at scale.

**Key Innovation:** Complete quality validation pipeline catches all categories of errors

**What We're Proving:**

* All 4 quality gates work together effectively
* Evidence deduplication prevents artificial inflation
* System maintains quality at larger scale
* Quality metrics dashboard provides actionable insights
== 2. New Requirements ==

=== 2.1 NFR11: Complete Quality Assurance Framework ===

**Add Gates 2 & 3** (POC1 had only Gates 1 & 4)

==== Gate 2: Evidence Relevance Validation ====

**Purpose:** Ensure AI-linked evidence actually relates to the claim

**Validation Checks:**

1. **Semantic Similarity:** Cosine similarity between claim and evidence embeddings ≥ 0.6
2. **Entity Overlap:** At least 1 shared named entity between claim and evidence
3. **Topic Relevance:** Evidence discusses the claim's subject matter (score ≥ 0.5)

**Action if Failed:**

* Discard irrelevant evidence (don't count it)
* If <2 relevant evidence items remain → "Insufficient Evidence" verdict
* Log discarded evidence for quality review

**Target:** 0% of cited evidence is off-topic
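The Gate 2 checks above can be sketched as a simple filter. This is an illustrative sketch, not the AKEL implementation: the 0.6 similarity threshold and the <2-items rule come from the spec, but the data shapes (`embedding`, `entities`) and function names are hypothetical, and real embeddings would come from a model rather than the toy vectors used here.

```python
import math

# Thresholds from the Gate 2 spec above.
SIMILARITY_THRESHOLD = 0.6
MIN_RELEVANT_EVIDENCE = 2

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def passes_gate2(claim, evidence):
    """Evidence passes Gate 2 if it is semantically similar to the claim
    AND shares at least one named entity with it (topic check omitted)."""
    similar = cosine_similarity(claim["embedding"], evidence["embedding"]) >= SIMILARITY_THRESHOLD
    overlap = bool(set(claim["entities"]) & set(evidence["entities"]))
    return similar and overlap

def filter_evidence(claim, evidence_items):
    """Discard irrelevant evidence; return (kept_items, verdict_override)."""
    kept = [e for e in evidence_items if passes_gate2(claim, e)]
    verdict = "Insufficient Evidence" if len(kept) < MIN_RELEVANT_EVIDENCE else None
    return kept, verdict
```

With toy 2-dimensional embeddings, off-topic evidence (orthogonal vector, no shared entities) is dropped, and a claim left with fewer than two relevant items gets the "Insufficient Evidence" verdict.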


==== Gate 3: Scenario Coherence Check ====

**Purpose:** Validate scenarios are logical, complete, and meaningfully different

**Validation Checks:**

1. **Completeness:** All required fields populated (assumptions, scope, evidence context)
2. **Internal Consistency:** Assumptions don't contradict each other (contradiction score <0.3)
3. **Distinctiveness:** Scenarios are meaningfully different (pairwise similarity <0.8)
4. **Minimum Detail:** At least 1 specific assumption per scenario

**Action if Failed:**

* Merge duplicate scenarios
* Flag contradictory assumptions for review
* Reduce confidence score by 20%
* Do not publish if <2 distinct scenarios

**Target:** 0% duplicate scenarios, all scenarios internally consistent
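The distinctiveness and minimum-detail checks can be sketched as follows. This is a sketch under stated assumptions: Jaccard overlap of assumption lists stands in for the embedding similarity the spec implies, the 0.8 threshold comes from the spec, and all names (`check_scenarios`, the `assumptions` field) are hypothetical.

```python
DISTINCTNESS_THRESHOLD = 0.8  # scenarios at/above this similarity are merged

def jaccard(a, b):
    """Set-overlap similarity, a stand-in for embedding similarity."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def check_scenarios(scenarios):
    """Drop near-duplicate scenarios and decide publishability.

    Each scenario is a dict with an 'assumptions' list
    (Minimum Detail: at least 1 assumption per scenario).
    """
    for s in scenarios:
        if not s.get("assumptions"):
            raise ValueError("incomplete scenario: missing assumptions")
    distinct = []
    for s in scenarios:
        if all(jaccard(s["assumptions"], d["assumptions"]) < DISTINCTNESS_THRESHOLD
               for d in distinct):
            distinct.append(s)  # keep; near-duplicates are merged into the first copy
    publishable = len(distinct) >= 2  # "Do not publish if <2 distinct scenarios"
    return distinct, publishable
```

Two scenarios sharing all assumptions collapse into one, and an article is only publishable if at least two genuinely different scenarios survive.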


=== 2.2 FR54: Evidence Deduplication (NEW) ===

**Importance:** HIGH
**Fulfills:** Accurate evidence counting, prevents artificial inflation

**Purpose:** Prevent counting the same evidence multiple times when it is cited by different sources

**Problem:**

* Wire services (AP, Reuters) redistribute the same content
* Different sites cite the same original study
* Aggregators copy primary sources
* AKEL might count this as "5 sources" when it's really 1

**Solution: Content Fingerprinting**

* Generate SHA-256 hash of normalized text
* Detect near-duplicates (≥85% similarity) using fuzzy matching
* Track which sources cited each unique piece of evidence
* Display provenance chain to user

**Target:** Duplicate detection >95% accurate, evidence counts reflect reality
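The fingerprinting solution above can be sketched in a few lines of standard-library Python. This is a minimal sketch, assuming `difflib.SequenceMatcher` as the fuzzy matcher and simple lowercase/whitespace normalization; a production system would likely use MinHash or shingling for scale, and all function names here are illustrative.

```python
import hashlib
from difflib import SequenceMatcher

NEAR_DUPLICATE_THRESHOLD = 0.85  # the "≥85% similarity" rule from the spec above

def normalize(text):
    """Lowercase and collapse whitespace before hashing or comparing."""
    return " ".join(text.lower().split())

def fingerprint(text):
    """SHA-256 content hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def deduplicate(evidence_items):
    """Collapse exact and near-duplicate evidence, keeping provenance.

    evidence_items: list of (source_name, text) pairs.
    Returns unique entries, each listing every source that cited it.
    """
    unique = []
    for source, text in evidence_items:
        norm, fp = normalize(text), fingerprint(text)
        match = next(
            (u for u in unique
             if u["hash"] == fp
             or SequenceMatcher(None, norm, u["text"]).ratio() >= NEAR_DUPLICATE_THRESHOLD),
            None)
        if match:
            match["sources"].append(source)  # provenance chain grows
        else:
            unique.append({"text": norm, "hash": fp, "sources": [source]})
    return unique
```

A wire story republished verbatim by AP and Reuters thus counts as one piece of evidence with two sources in its provenance chain, rather than as two independent sources.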


=== 2.3 NFR13: Quality Metrics Dashboard (Internal) ===

**Importance:** HIGH
**Fulfills:** Real-time quality monitoring during development

**Dashboard Metrics:**

* Claim processing statistics
* Gate performance (pass/fail rates for each gate)
* Evidence quality metrics
* Hallucination rate tracking
* Processing performance

**Target:** Dashboard functional, all metrics tracked, exportable
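The metrics above reduce to a small in-memory collector. This is a sketch, not the dashboard itself: the class and method names are hypothetical, and a real dashboard would persist and export these counters rather than hold them in memory.

```python
from collections import Counter

class QualityDashboard:
    """Minimal in-memory collector for the gate and hallucination metrics."""

    def __init__(self):
        self.gate_results = Counter()  # keys like ("gate2", "pass")
        self.claims_processed = 0
        self.hallucinations = 0

    def record_gate(self, gate, passed):
        self.gate_results[(gate, "pass" if passed else "fail")] += 1

    def record_claim(self, hallucinated):
        self.claims_processed += 1
        self.hallucinations += int(hallucinated)

    def gate_pass_rate(self, gate):
        """Pass rate for one gate, or None if it has not run yet."""
        p = self.gate_results[(gate, "pass")]
        f = self.gate_results[(gate, "fail")]
        return p / (p + f) if p + f else None

    def hallucination_rate(self):
        """Fraction of processed claims flagged as hallucinated."""
        return (self.hallucinations / self.claims_processed
                if self.claims_processed else None)
```

The <5% hallucination target then becomes a one-line alert check, e.g. `dashboard.hallucination_rate() > 0.05`.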


== 3. Success Criteria ==

**✅ Quality:**

* Hallucination rate <5% (target: <3%)
* Average quality rating ≥8.0/10
* 0 critical failures (no falsehoods reach publishable state)
* Gates correctly identify >95% of low-quality outputs

**✅ All 4 Gates Operational:**

* Gate 1: Claim validation working
* Gate 2: Evidence relevance filtering working
* Gate 3: Scenario coherence checking working
* Gate 4: Verdict confidence assessment working

**✅ Evidence Deduplication:**

* Duplicate detection >95% accurate
* Evidence counts reflect reality
* Provenance tracked correctly

**✅ Metrics Dashboard:**

* All metrics implemented and tracking
* Dashboard functional and useful
* Alerts trigger appropriately

== 4. Architecture Notes ==

**POC2 Enhanced Architecture:**

{{code}}
Input → AKEL Processing → All 4 Quality Gates → Display
        (claims + scenarios    (1: Claim validation
         + evidence linking     2: Evidence relevance
         + verdicts)            3: Scenario coherence
                                4: Verdict confidence)
{{/code}}

**Key Additions from POC1:**

* Scenario generation component
* Evidence deduplication system
* Gates 2 & 3 implementation
* Quality metrics collection

**Still Simplified vs. Full System:**

* Single AKEL orchestration (not a multi-component pipeline)
* No review queue
* No federation architecture

**See:** [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]] for details


== 5. Context-Aware Analysis (Conditional Feature) ==

**Status:** Depends on POC1 experimental test results

**Background:**

POC1 tested context-aware analysis as an experimental feature using Approach 1 (Single-Pass Holistic Analysis). The goal is to detect when articles use accurate individual claims but reach misleading conclusions through faulty logic or selective presentation.

**See:** [[Article Verdict Problem>>Test.FactHarbor.Specification.POC.Article-Verdict-Problem]] for the complete investigation

=== 5.1 POC2 Implementation Path ===

**Decision based on POC1 test results (30-article test set):**

==== If POC1 Accuracy ≥70% (Success) ====

**Action:** Implement as a standard feature (no longer experimental)

**Enhancement to FR4:**

* Context-aware analysis becomes part of the standard Analysis Summary
* Article verdict may differ from the simple claim average
* AI evaluates logical structure and reasoning quality

**Potential Upgrade to Approach 6 (Hybrid):**

* Add weighted claim importance (some claims are more central than others)
* Add rule-based fallacy detection alongside AI reasoning
* Combine AI judgment with heuristic checks for robustness

**Target:** Maintain ≥70% accuracy at detecting misleading articles

==== If POC1 Accuracy 50-70% (Promising) ====

**Action:** Implement alternative Approach 4 (Weighted Aggregation)

**Instead of holistic analysis:**

* AI assigns importance weights (0-1) to each claim
* Weights are based on claim centrality, evidence strength, and logical role
* Article verdict = weighted average of claim verdicts
* More structured than pure AI reasoning

**Rationale:** If holistic reasoning is inconsistent, structured weighting may work better
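The weighted aggregation in Approach 4 reduces to a weighted average. A minimal sketch, assuming each claim carries an AI-assigned truthfulness `score` and importance `weight` in [0, 1] (field names are illustrative, not the AKEL schema):

```python
def article_score(claims):
    """Weighted average of per-claim truth scores (Approach 4).

    claims: list of dicts with 'score' (0 = false, 1 = true)
    and 'weight' (0 = peripheral, 1 = central).
    """
    total_weight = sum(c["weight"] for c in claims)
    if total_weight == 0:
        raise ValueError("no weighted claims to aggregate")
    return sum(c["score"] * c["weight"] for c in claims) / total_weight
```

The point of the weighting: a false but central claim (high weight) drags the article score far below the simple average, which is exactly the "misleading article built from mostly accurate claims" case the simple average misses.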

==== If POC1 Accuracy <50% (Insufficient) ====

**Action:** Defer context-aware analysis to post-POC2

**Fallback:**

* Focus on individual claim accuracy only
* Article verdict = simple average of claim verdicts
* Known limitation: this may miss misleading articles built from accurate claims

**Future consideration:** Try Approach 7 (LLM-as-Judge) with better models in future releases

=== 5.2 Testing in POC2 ===

**If the context-aware feature is implemented:**

* Expand the test set from 30 to 100 articles
* Include more diverse article types (op-eds, news, analysis, advocacy)
* Track the false positive rate (flagging good articles as misleading)
* Validate with subject matter experts when possible

**Success Metrics:**

* ≥70% accuracy on misleading article detection
* <15% false positive rate
* Reasoning is comprehensible to users

=== 5.3 Architecture Notes ===

**Context-aware analysis adds NO additional API calls**

The enhanced analysis happens within the existing AKEL workflow:

{{code}}
Standard Flow:           Context-Aware Enhancement:
1. Extract claims        1. Extract claims + mark central claims
2. Find evidence         2. Find evidence
3. Generate verdicts     3. Generate verdicts
4. Write summary         4. Write context-aware summary
                            (evaluates article structure)
{{/code}}

**Cost:** $0 increase (same number of API calls; only the prompt is enhanced)

**See:** [[POC Requirements>>Test.FactHarbor.Specification.POC.Requirements]] Component 1 for implementation details


== Related Pages ==

* [[POC1>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.POC1.WebHome]] - Previous phase
* [[Beta 0>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.Beta0.WebHome]] - Next phase
* [[Roadmap Overview>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.WebHome]]
* [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]]

**Document Status:** ✅ POC2 Specification Complete - Waiting for POC1 Completion
**Version:** V0.9.70