Last modified by Robert Schaub on 2025/12/24 21:53
= POC2: Robust Quality & Reliability =

**Phase Goal:** Prove AKEL produces high-quality outputs consistently at scale

**Success Metric:** <5% hallucination rate, all 4 quality gates operational

== 1. Overview ==

POC2 extends POC1 by implementing the full quality assurance framework (all 4 gates), adding evidence deduplication, and processing significantly more test articles to validate system reliability at scale.

**Key Innovation:** A complete quality validation pipeline that catches all categories of errors

**What We're Proving:**

* All 4 quality gates work together effectively
* Evidence deduplication prevents artificial inflation
* The system maintains quality at larger scale
* The quality metrics dashboard provides actionable insights

== 2. New Requirements ==

=== 2.1 NFR11: Complete Quality Assurance Framework ===

**Add Gates 2 & 3** (POC1 implemented only Gates 1 & 4)

==== Gate 2: Evidence Relevance Validation ====

**Purpose:** Ensure AI-linked evidence actually relates to the claim

**Validation Checks:**

1. **Semantic Similarity:** Cosine similarity between claim and evidence embeddings ≥ 0.6
2. **Entity Overlap:** At least 1 shared named entity between claim and evidence
3. **Topic Relevance:** Evidence discusses the claim's subject matter (score ≥ 0.5)

**Action if Failed:**

* Discard irrelevant evidence (don't count it)
* If <2 relevant evidence items remain → "Insufficient Evidence" verdict
* Log discarded evidence for quality review

**Target:** 0% of cited evidence is off-topic
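The three checks above can be combined into a single pass/fail gate. The following is an illustrative sketch, not AKEL's implementation: `gate2_passes` and its inputs (precomputed embedding vectors, extracted entity lists, and an upstream topic-relevance score) are assumed names.

```python
# Illustrative sketch of Gate 2 (not AKEL's actual code). Inputs are assumed
# to be precomputed upstream: claim/evidence embeddings, named-entity lists,
# and a topic-relevance score.
SIM_THRESHOLD = 0.6    # semantic similarity floor (check 1)
TOPIC_THRESHOLD = 0.5  # topic relevance floor (check 3)

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def gate2_passes(claim_vec, evidence_vec, claim_entities, evidence_entities,
                 topic_score):
    """Evidence passes Gate 2 only if all three relevance checks succeed."""
    if cosine(claim_vec, evidence_vec) < SIM_THRESHOLD:
        return False                           # check 1: semantic similarity
    if not set(claim_entities) & set(evidence_entities):
        return False                           # check 2: >= 1 shared entity
    return topic_score >= TOPIC_THRESHOLD      # check 3: topic relevance
```

Evidence failing any check would then be discarded rather than counted, per the "Action if Failed" rules above.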

==== Gate 3: Scenario Coherence Check ====

**Purpose:** Validate that scenarios are logical, complete, and meaningfully different

**Validation Checks:**

1. **Completeness:** All required fields populated (assumptions, scope, evidence context)
2. **Internal Consistency:** Assumptions don't contradict each other (contradiction score <0.3)
3. **Distinctiveness:** Scenarios are meaningfully different (pairwise similarity <0.8)
4. **Minimum Detail:** At least 1 specific assumption per scenario

**Action if Failed:**

* Merge duplicate scenarios
* Flag contradictory assumptions for review
* Reduce confidence score by 20%
* Do not publish if <2 distinct scenarios remain

**Target:** 0% duplicate scenarios, all scenarios internally consistent
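The distinctiveness and publishability rules above can be sketched minimally as follows, assuming a pluggable `similarity` function (AKEL's actual similarity metric is not specified in this document):

```python
# Hypothetical sketch of Gate 3's distinctiveness rule: any pair of scenarios
# with similarity >= 0.8 is treated as duplicates and merged (here, merging
# simply drops the later one).
def merge_duplicate_scenarios(scenarios, similarity, threshold=0.8):
    """Return scenarios with near-duplicates (similarity >= threshold) merged."""
    distinct = []
    for scenario in scenarios:
        if any(similarity(scenario, kept) >= threshold for kept in distinct):
            continue  # duplicate of an already-kept scenario: drop it
        distinct.append(scenario)
    return distinct

def gate3_publishable(scenarios):
    """Per the rules above, publishing requires at least 2 distinct scenarios."""
    return len(scenarios) >= 2
```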

=== 2.2 FR54: Evidence Deduplication (NEW) ===

**Importance:** HIGH
**Fulfills:** Accurate evidence counting, prevents artificial inflation

**Purpose:** Prevent counting the same evidence multiple times when it is cited by different sources

**Problem:**

* Wire services (AP, Reuters) redistribute the same content
* Different sites cite the same original study
* Aggregators copy primary sources
* AKEL might count this as "5 sources" when it's really 1

**Solution: Content Fingerprinting**

* Generate a SHA-256 hash of the normalized text
* Detect near-duplicates (≥85% similarity) using fuzzy matching
* Track which sources cited each unique piece of evidence
* Display the provenance chain to the user

**Target:** Duplicate detection >95% accurate, evidence counts reflect reality
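The fingerprinting steps above can be sketched roughly as follows. The `normalize` and `dedupe` helpers, and the use of `difflib` as a stand-in fuzzy matcher, are illustrative assumptions rather than FR54's prescribed implementation:

```python
# Illustrative sketch of FR54 content fingerprinting. Exact duplicates hash
# identically after normalization; near-duplicates (>= 85% similar) are
# caught with difflib as a stand-in for a production fuzzy matcher.
import hashlib
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace before hashing."""
    return " ".join(text.lower().split())

def fingerprint(text):
    """SHA-256 hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def dedupe(evidence_items, fuzzy_threshold=0.85):
    """Group (source, text) items; track which sources cited each unique piece."""
    unique = []  # each entry: [fingerprint, canonical_text, [sources]]
    for source, text in evidence_items:
        fp = fingerprint(text)
        for entry in unique:
            if entry[0] == fp or SequenceMatcher(
                    None, normalize(text), entry[1]).ratio() >= fuzzy_threshold:
                entry[2].append(source)  # same evidence: extend provenance chain
                break
        else:
            unique.append([fp, normalize(text), [source]])
    return unique
```

With this grouping, redistributed wire copy counts as one piece of evidence, and the per-entry source list is the provenance chain shown to the user.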

=== 2.3 NFR13: Quality Metrics Dashboard (Internal) ===

**Importance:** HIGH
**Fulfills:** Real-time quality monitoring during development

**Dashboard Metrics:**

* Claim processing statistics
* Gate performance (pass/fail rates for each gate)
* Evidence quality metrics
* Hallucination rate tracking
* Processing performance

**Target:** Dashboard functional, all metrics tracked and exportable
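One way the gate-performance metric above might be collected and exported; the `GateMetrics` class and its export schema are hypothetical, not the real dashboard's API:

```python
# Hypothetical collector for per-gate pass/fail rates, the second dashboard
# metric listed above. Exported as a plain dict so it can be serialized.
from collections import defaultdict

class GateMetrics:
    def __init__(self):
        self.counts = defaultdict(lambda: {"pass": 0, "fail": 0})

    def record(self, gate, passed):
        """Record one gate outcome (gate number, pass/fail)."""
        self.counts[gate]["pass" if passed else "fail"] += 1

    def pass_rate(self, gate):
        """Fraction of recorded checks for this gate that passed."""
        c = self.counts[gate]
        total = c["pass"] + c["fail"]
        return c["pass"] / total if total else 0.0

    def export(self):
        """Exportable snapshot: gate number -> pass rate."""
        return {gate: self.pass_rate(gate) for gate in sorted(self.counts)}
```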

== 3. Success Criteria ==

**✅ Quality:**

* Hallucination rate <5% (target: <3%)
* Average quality rating ≥8.0/10
* 0 critical failures (publishable falsehoods)
* Gates correctly identify >95% of low-quality outputs

**✅ All 4 Gates Operational:**

* Gate 1: Claim validation working
* Gate 2: Evidence relevance filtering working
* Gate 3: Scenario coherence checking working
* Gate 4: Verdict confidence assessment working

**✅ Evidence Deduplication:**

* Duplicate detection >95% accurate
* Evidence counts reflect reality
* Provenance tracked correctly

**✅ Metrics Dashboard:**

* All metrics implemented and tracking
* Dashboard functional and useful
* Alerts trigger appropriately

== 4. Architecture Notes ==

**POC2 Enhanced Architecture:**

{{code}}
Input → AKEL Processing → All 4 Quality Gates → Display
        (claims + scenarios   (1: Claim validation
         + evidence linking    2: Evidence relevance
         + verdicts)           3: Scenario coherence
                               4: Verdict confidence)
{{/code}}

**Key Additions from POC1:**

* Scenario generation component
* Evidence deduplication system
* Gates 2 & 3 implementation
* Quality metrics collection

**Still Simplified vs. Full System:**

* Single AKEL orchestration (not a multi-component pipeline)
* No review queue
* No federation architecture

**See:** [[Architecture>>FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]] for details

== 5. Context-Aware Analysis (Conditional Feature) ==

**Status:** Depends on POC1 experimental test results

**Background:**

POC1 tested context-aware analysis as an experimental feature using Approach 1 (Single-Pass Holistic Analysis). The goal is to detect when articles use accurate individual claims but reach misleading conclusions through faulty logic or selective presentation.

**See:** [[Article Verdict Problem>>FactHarbor.Specification.POC.Article-Verdict-Problem]] for the complete investigation

=== 5.1 POC2 Implementation Path ===

**Decision based on POC1 test results (30-article test set):**

==== If POC1 Accuracy ≥70% (Success) ====

**Action:** Implement as a standard feature (no longer experimental)

**Enhancement to FR4:**

* Context-aware analysis becomes part of the standard Analysis Summary
* The article verdict may differ from a simple claim average
* AI evaluates logical structure and reasoning quality

**Potential Upgrade to Approach 6 (Hybrid):**

* Add weighted claim importance (some claims are more central than others)
* Add rule-based fallacy detection alongside AI reasoning
* Combine AI judgment with heuristic checks for robustness

**Target:** Maintain ≥70% accuracy at detecting misleading articles

==== If POC1 Accuracy ≥50% and <70% (Promising) ====

**Action:** Implement alternative Approach 4 (Weighted Aggregation)

**Instead of holistic analysis:**

* AI assigns an importance weight (0-1) to each claim
* Weights are based on: claim centrality, evidence strength, logical role
* Article verdict = weighted average of claim verdicts
* More structured than pure AI reasoning

**Rationale:** If holistic reasoning proves inconsistent, structured weighting may work better
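Approach 4's aggregation rule can be sketched as below. Mapping verdicts onto 0-1 truth scores is an assumption made here for illustration; AKEL's actual verdict scale may differ:

```python
# Sketch of Approach 4 (Weighted Aggregation): article verdict is the
# importance-weighted average of per-claim truth scores. Weights come from
# the AI (claim centrality, evidence strength, logical role).
def article_verdict(claims):
    """claims: list of (weight, truth_score) pairs, both in [0, 1].

    Returns the weighted average truth score, or None when there is
    nothing to aggregate (no claims, or all weights zero).
    """
    total_weight = sum(weight for weight, _ in claims)
    if total_weight == 0:
        return None
    return sum(weight * score for weight, score in claims) / total_weight
```

A central true claim (high weight) then dominates a peripheral false one, so the article verdict need not equal the simple claim average.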

==== If POC1 Accuracy <50% (Insufficient) ====

**Action:** Defer context-aware analysis to post-POC2

**Fallback:**

* Focus on individual claim accuracy only
* Article verdict = simple average of claim verdicts
* Note the limitation: may miss misleading articles built from accurate claims

**Future consideration:** Try Approach 7 (LLM-as-Judge) with better models in future releases

=== 5.2 Testing in POC2 ===

**If the context-aware feature is implemented:**

* Expand the test set from 30 to 100 articles
* Include more diverse article types (op-eds, news, analysis, advocacy)
* Track the false positive rate (flagging good articles as misleading)
* Validate with subject matter experts when possible

**Success Metrics:**

* ≥70% accuracy on misleading article detection
* <15% false positive rate
* Reasoning is comprehensible to users

=== 5.3 Architecture Notes ===

**Context-aware analysis adds NO additional API calls**

The enhanced analysis happens within the existing AKEL workflow:

{{code}}
Standard Flow:              Context-Aware Enhancement:
1. Extract claims           1. Extract claims + mark central claims
2. Find evidence            2. Find evidence
3. Generate verdicts        3. Generate verdicts
4. Write summary            4. Write context-aware summary
                               (evaluates article structure)
{{/code}}

**Cost:** $0 increase (same API calls, enhanced prompt only)

**See:** [[POC Requirements>>FactHarbor.Specification.POC.Requirements]] Component 1 for implementation details

== Related Pages ==

* [[POC1>>FactHarbor pre10 V0\.9\.70.Roadmap.POC1.WebHome]] - Previous phase
* [[Beta 0>>FactHarbor pre10 V0\.9\.70.Roadmap.Beta0.WebHome]] - Next phase
* [[Roadmap Overview>>FactHarbor pre10 V0\.9\.70.Roadmap.WebHome]]
* [[Architecture>>FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]]

**Document Status:** ✅ POC2 Specification Complete - Waiting for POC1 Completion
**Version:** V0.9.70