Wiki source code of POC2: Robust Quality & Reliability
Last modified by Robert Schaub on 2025/12/24 09:59
= POC2: Robust Quality & Reliability =

**Phase Goal:** Prove AKEL produces high-quality outputs consistently at scale

**Success Metric:** <5% hallucination rate, all 4 quality gates operational


== 1. Overview ==

POC2 extends POC1 by implementing the full quality assurance framework (all 4 gates), adding evidence deduplication, and processing significantly more test articles to validate system reliability at scale.

**Key Innovation:** The complete quality validation pipeline is designed to catch errors across all categories

**What We're Proving:**

* All 4 quality gates work together effectively
* Evidence deduplication prevents artificial inflation of source counts
* The system maintains quality at larger scale
* The quality metrics dashboard provides actionable insights
== 2. New Requirements ==

=== 2.1 NFR11: Complete Quality Assurance Framework ===

**Add Gates 2 & 3** (POC1 implemented only Gates 1 & 4)

==== Gate 2: Evidence Relevance Validation ====

**Purpose:** Ensure AI-linked evidence actually relates to the claim

**Validation Checks:**

1. **Semantic Similarity:** Cosine similarity between claim and evidence embeddings ≥ 0.6
2. **Entity Overlap:** At least 1 shared named entity between claim and evidence
3. **Topic Relevance:** Evidence discusses the claim's subject matter (relevance score ≥ 0.5)

**Action if Failed:**

* Discard irrelevant evidence (don't count it)
* If <2 relevant evidence items remain → "Insufficient Evidence" verdict
* Log discarded evidence for quality review

**Target:** 0% of cited evidence is off-topic
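The three checks and the fallback action above can be sketched as a single filter function. This is a minimal illustration, not the AKEL implementation: the field names (`embedding`, `entities`, `topic_score`) are assumptions, and the thresholds come from the spec.

```python
# Sketch of Gate 2: evidence relevance validation.
# Field names on claim/evidence dicts are hypothetical; thresholds are from the spec.
import math

SIM_THRESHOLD = 0.6      # minimum cosine similarity (claim vs. evidence embedding)
TOPIC_THRESHOLD = 0.5    # minimum topic-relevance score
MIN_RELEVANT = 2         # below this, the verdict becomes "Insufficient Evidence"

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def gate2_filter(claim, evidence_items):
    """Split evidence into relevant/discarded; return a verdict override if too few remain."""
    relevant, discarded = [], []
    for ev in evidence_items:
        ok = (
            cosine_similarity(claim["embedding"], ev["embedding"]) >= SIM_THRESHOLD
            and len(set(claim["entities"]) & set(ev["entities"])) >= 1
            and ev["topic_score"] >= TOPIC_THRESHOLD
        )
        (relevant if ok else discarded).append(ev)  # discarded items are kept for quality review
    override = "Insufficient Evidence" if len(relevant) < MIN_RELEVANT else None
    return relevant, discarded, override
```

Note that discarded evidence is returned rather than dropped, matching the "log discarded evidence for quality review" action.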


==== Gate 3: Scenario Coherence Check ====

**Purpose:** Validate that scenarios are logical, complete, and meaningfully different

**Validation Checks:**

1. **Completeness:** All required fields populated (assumptions, scope, evidence context)
2. **Internal Consistency:** Assumptions don't contradict each other (contradiction score <0.3)
3. **Distinctiveness:** Scenarios are meaningfully different (pairwise similarity <0.8)
4. **Minimum Detail:** At least 1 specific assumption per scenario

**Action if Failed:**

* Merge duplicate scenarios
* Flag contradictory assumptions for review
* Reduce the confidence score by 20%
* Do not publish if <2 distinct scenarios remain

**Target:** 0% duplicate scenarios, all scenarios internally consistent
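The distinctiveness and consistency actions can be sketched as follows. The `similarity` and `contradiction` scoring functions are assumed to be supplied by the model layer; only the thresholds and the 20% confidence penalty come from the spec.

```python
# Sketch of Gate 3 actions: merge near-duplicate scenarios, flag contradictions,
# apply the confidence penalty, and gate publication on scenario count.
# similarity(a, b) and contradiction(s) are assumed model-provided scores in [0, 1].
SIMILARITY_MAX = 0.8       # scenarios at or above this similarity count as duplicates
CONTRADICTION_MAX = 0.3    # internal contradiction score must stay below this
CONFIDENCE_PENALTY = 0.8   # a failed check reduces confidence by 20%

def gate3_check(scenarios, similarity, contradiction, confidence):
    distinct = []
    for s in scenarios:
        if any(similarity(s, d) >= SIMILARITY_MAX for d in distinct):
            continue  # merge: drop the near-duplicate scenario
        distinct.append(s)
    flagged = [s for s in distinct if contradiction(s) >= CONTRADICTION_MAX]
    if len(distinct) < len(scenarios) or flagged:
        confidence *= CONFIDENCE_PENALTY
    publishable = len(distinct) >= 2  # do not publish with <2 distinct scenarios
    return distinct, flagged, confidence, publishable
```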


=== 2.2 FR54: Evidence Deduplication (NEW) ===

**Importance:** HIGH
**Fulfills:** Accurate evidence counting, prevents artificial inflation

**Purpose:** Prevent counting the same piece of evidence multiple times when it is cited by different sources

**Problem:**

* Wire services (AP, Reuters) redistribute the same content
* Different sites cite the same original study
* Aggregators copy primary sources
* AKEL might count this as "5 sources" when it's really 1

**Solution: Content Fingerprinting**

* Generate a SHA-256 hash of the normalized text
* Detect near-duplicates (≥85% similarity) using fuzzy matching
* Track which sources cited each unique piece of evidence
* Display the provenance chain to the user

**Target:** Duplicate detection >95% accurate, evidence counts reflect reality
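The fingerprinting scheme above can be sketched with Python's standard library: SHA-256 for exact duplicates and `difflib.SequenceMatcher` as a stand-in fuzzy matcher for the ≥85% near-duplicate check (the production system may use a different similarity algorithm; the class and method names here are illustrative).

```python
# Sketch of FR54 content fingerprinting: exact SHA-256 matching plus a
# fuzzy near-duplicate pass, with per-fingerprint provenance tracking.
import hashlib
from difflib import SequenceMatcher

NEAR_DUP_THRESHOLD = 0.85  # ≥85% similarity counts as the same evidence

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't change the hash."""
    return " ".join(text.lower().split())

def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

class EvidenceIndex:
    """Maps each unique piece of evidence to the sources that cited it."""
    def __init__(self):
        self.by_hash = {}     # fingerprint -> canonical normalized text
        self.provenance = {}  # fingerprint -> list of citing sources

    def add(self, text, source):
        norm, h = normalize(text), fingerprint(text)
        if h not in self.by_hash:
            # Fuzzy pass: fold near-duplicates into an existing entry.
            for known_h, known_text in self.by_hash.items():
                if SequenceMatcher(None, norm, known_text).ratio() >= NEAR_DUP_THRESHOLD:
                    h = known_h
                    break
            else:
                self.by_hash[h] = norm
        self.provenance.setdefault(h, []).append(source)
        return h

    def unique_count(self):
        return len(self.by_hash)
```

With this index, three citations of essentially the same passage collapse to one unique evidence item while the provenance chain still lists all three sources.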


=== 2.3 NFR13: Quality Metrics Dashboard (Internal) ===

**Importance:** HIGH
**Fulfills:** Real-time quality monitoring during development

**Dashboard Metrics:**

* Claim processing statistics
* Gate performance (pass/fail rates for each gate)
* Evidence quality metrics
* Hallucination rate tracking
* Processing performance

**Target:** Dashboard functional, all metrics tracked and exportable
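A minimal sketch of the gate pass/fail tracking that would feed such a dashboard, assuming a simple in-process counter (the actual metrics backend and metric names are unspecified in this document):

```python
# Minimal sketch of NFR13 gate-performance tracking with an exportable snapshot.
from collections import defaultdict

class QualityMetrics:
    def __init__(self):
        self.counts = defaultdict(lambda: {"pass": 0, "fail": 0})

    def record(self, gate, passed):
        self.counts[gate]["pass" if passed else "fail"] += 1

    def pass_rate(self, gate):
        c = self.counts[gate]
        total = c["pass"] + c["fail"]
        return c["pass"] / total if total else None

    def export(self):
        """Snapshot for the dashboard / export: counts plus derived pass rate per gate."""
        return {g: {**c, "pass_rate": self.pass_rate(g)} for g, c in self.counts.items()}
```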


== 3. Success Criteria ==

**✅ Quality:**

* Hallucination rate <5% (target: <3%)
* Average quality rating ≥8.0/10
* 0 critical failures (no false content published)
* Gates correctly identify >95% of low-quality outputs

**✅ All 4 Gates Operational:**

* Gate 1: Claim validation working
* Gate 2: Evidence relevance filtering working
* Gate 3: Scenario coherence checking working
* Gate 4: Verdict confidence assessment working

**✅ Evidence Deduplication:**

* Duplicate detection >95% accurate
* Evidence counts reflect reality
* Provenance tracked correctly

**✅ Metrics Dashboard:**

* All metrics implemented and tracking
* Dashboard functional and useful
* Alerts trigger appropriately

== 4. Architecture Notes ==

**POC2 Enhanced Architecture:**

{{code}}
Input → AKEL Processing → All 4 Quality Gates → Display
        (claims + scenarios    (1: Claim validation
         + evidence linking     2: Evidence relevance
         + verdicts)            3: Scenario coherence
                                4: Verdict confidence)
{{/code}}
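The gate sequence in the diagram above can be sketched as a simple pipeline. Gate functions and the shape of the analysis object are assumptions; the point is only that all four gates run in order and failures are collected rather than aborting the run:

```python
# Sketch of the POC2 quality-gate sequence. Each gate is assumed to take the
# analysis and return (possibly-modified analysis, passed-flag).
def run_quality_gates(analysis, gates):
    """Apply the gates in order; collect the names of any that fail."""
    failures = []
    for name, gate in gates:
        analysis, ok = gate(analysis)
        if not ok:
            failures.append(name)
    return analysis, failures
```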

**Key Additions from POC1:**

* Scenario generation component
* Evidence deduplication system
* Gates 2 & 3 implementation
* Quality metrics collection

**Still Simplified vs. Full System:**

* Single AKEL orchestration (not the multi-component pipeline)
* No review queue
* No federation architecture

**See:** [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]] for details


== 5. Context-Aware Analysis (Conditional Feature) ==

**Status:** Depends on POC1 experimental test results

**Background:**

POC1 tested context-aware analysis as an experimental feature using Approach 1 (Single-Pass Holistic Analysis). The goal is to detect when articles use accurate individual claims but reach misleading conclusions through faulty logic or selective presentation.

**See:** [[Article Verdict Problem>>Test.FactHarbor.Specification.POC.Article-Verdict-Problem]] for the complete investigation

=== 5.1 POC2 Implementation Path ===

**Decision based on POC1 test results (30-article test set):**

==== If POC1 Accuracy ≥70% (Success) ====

**Action:** Implement as a standard feature (no longer experimental)

**Enhancement to FR4:**

* Context-aware analysis becomes part of the standard Analysis Summary
* The article verdict may differ from the simple claim average
* AI evaluates logical structure and reasoning quality

**Potential Upgrade to Approach 6 (Hybrid):**

* Add weighted claim importance (some claims are more central than others)
* Add rule-based fallacy detection alongside AI reasoning
* Combine AI judgment with heuristic checks for robustness

**Target:** Maintain ≥70% accuracy at detecting misleading articles

==== If POC1 Accuracy ≥50% and <70% (Promising) ====

**Action:** Implement alternative Approach 4 (Weighted Aggregation)

**Instead of holistic analysis:**

* AI assigns an importance weight (0-1) to each claim
* Weights are based on claim centrality, evidence strength, and logical role
* Article verdict = weighted average of claim verdicts
* More structured than pure AI reasoning

**Rationale:** If holistic reasoning is inconsistent, structured weighting may work better
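The weighted-average aggregation in Approach 4 reduces to a short formula. This sketch assumes verdict scores in [0, 1] (1 = fully accurate) and AI-assigned weights in [0, 1]; neither scale is fixed by this document:

```python
# Sketch of Approach 4 (Weighted Aggregation): the article verdict is the
# weighted average of per-claim verdict scores.
def article_verdict(claims):
    """claims: list of (verdict_score, weight) pairs; returns None if all weights are zero."""
    total_weight = sum(w for _, w in claims)
    if total_weight == 0:
        return None  # nothing to aggregate
    return sum(score * w for score, w in claims) / total_weight
```

For example, one central accurate claim (weight 0.9) and one peripheral false claim (weight 0.1) yield an article verdict of 0.9, whereas a simple average would give 0.5.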

==== If POC1 Accuracy <50% (Insufficient) ====

**Action:** Defer context-aware analysis to post-POC2

**Fallback:**

* Focus on individual claim accuracy only
* Article verdict = simple average of claim verdicts
* Note the limitation: this may miss misleading articles built from accurate claims

**Future consideration:** Try Approach 7 (LLM-as-Judge) with better models in future releases

=== 5.2 Testing in POC2 ===

**If the context-aware feature is implemented:**

* Expand the test set from 30 to 100 articles
* Include more diverse article types (op-eds, news, analysis, advocacy)
* Track the false positive rate (flagging good articles as misleading)
* Validate with subject matter experts when possible

**Success Metrics:**

* ≥70% accuracy on misleading article detection
* <15% false positive rate
* Reasoning is comprehensible to users
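The two numeric metrics above can be computed directly from labeled test results. This sketch assumes boolean labels where `True` means the article is genuinely misleading and a prediction of `True` means AKEL flagged it:

```python
# Sketch of the 5.2 evaluation metrics: detection accuracy and false positive
# rate (good articles wrongly flagged as misleading) over a labeled test set.
def evaluate(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    false_pos = sum(p and not l for p, l in zip(predictions, labels))
    good_articles = sum(not l for l in labels)
    return {
        "accuracy": correct / len(labels),
        "false_positive_rate": false_pos / good_articles if good_articles else 0.0,
    }
```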

=== 5.3 Architecture Notes ===

**Context-aware analysis adds NO additional API calls**

The enhanced analysis happens within the existing AKEL workflow:

{{code}}
Standard Flow:           Context-Aware Enhancement:
1. Extract claims        1. Extract claims + mark central claims
2. Find evidence         2. Find evidence
3. Generate verdicts     3. Generate verdicts
4. Write summary         4. Write context-aware summary
                            (evaluates article structure)
{{/code}}

**Cost:** $0 increase (same API calls, enhanced prompt only)

**See:** [[POC Requirements>>Test.FactHarbor.Specification.POC.Requirements]] Component 1 for implementation details


== Related Pages ==

* [[POC1>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.POC1.WebHome]] - Previous phase
* [[Beta 0>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.Beta0.WebHome]] - Next phase
* [[Roadmap Overview>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.WebHome]]
* [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]]

**Document Status:** ✅ POC2 Specification Complete - Waiting for POC1 Completion
**Version:** V0.9.70