Wiki source code of POC2: Robust Quality & Reliability
Last modified by Robert Schaub on 2025/12/24 09:59
= POC2: Robust Quality & Reliability =

**Phase Goal:** Prove AKEL produces high-quality outputs consistently at scale

**Success Metric:** <5% hallucination rate, all 4 quality gates operational


== 1. Overview ==

POC2 extends POC1 by implementing the full quality assurance framework (all 4 gates), adding evidence deduplication, and processing significantly more test articles to validate system reliability at scale.

**Key Innovation:** The complete quality validation pipeline is designed to catch errors across all categories

**What We're Proving:**

* All 4 quality gates work together effectively
* Evidence deduplication prevents artificial inflation of source counts
* The system maintains quality at larger scale
* The quality metrics dashboard provides actionable insights
== 2. New Requirements ==

=== 2.1 NFR11: Complete Quality Assurance Framework ===

**Add Gates 2 & 3** (POC1 implemented only Gates 1 & 4)

==== Gate 2: Evidence Relevance Validation ====

**Purpose:** Ensure AI-linked evidence actually relates to the claim

**Validation Checks:**

1. **Semantic Similarity:** Cosine similarity between claim and evidence embeddings ≥ 0.6
2. **Entity Overlap:** At least 1 shared named entity between claim and evidence
3. **Topic Relevance:** Evidence discusses the claim's subject matter (relevance score ≥ 0.5)

**Action if Failed:**

* Discard irrelevant evidence (don't count it)
* If <2 relevant evidence items remain → "Insufficient Evidence" verdict
* Log discarded evidence for quality review

**Target:** 0% of cited evidence is off-topic
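The three checks and the fallback action above can be sketched as a single filter function. This is a minimal illustration, not the AKEL implementation: the field names (`embedding`, `entities`, `topic_score`) are assumptions, and the thresholds come from the spec.

```python
# Sketch of Gate 2: evidence relevance validation.
# Field names on claim/evidence dicts are hypothetical; thresholds are from the spec.
import math

SIM_THRESHOLD = 0.6      # minimum cosine similarity (claim vs. evidence embedding)
TOPIC_THRESHOLD = 0.5    # minimum topic-relevance score
MIN_RELEVANT = 2         # below this, the verdict becomes "Insufficient Evidence"

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def gate2_filter(claim, evidence_items):
    """Split evidence into relevant/discarded; return a verdict override if too few remain."""
    relevant, discarded = [], []
    for ev in evidence_items:
        ok = (
            cosine_similarity(claim["embedding"], ev["embedding"]) >= SIM_THRESHOLD
            and len(set(claim["entities"]) & set(ev["entities"])) >= 1
            and ev["topic_score"] >= TOPIC_THRESHOLD
        )
        (relevant if ok else discarded).append(ev)  # discarded items are kept for quality review
    override = "Insufficient Evidence" if len(relevant) < MIN_RELEVANT else None
    return relevant, discarded, override
```

Note that discarded evidence is returned rather than dropped, matching the "log discarded evidence for quality review" action.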


==== Gate 3: Scenario Coherence Check ====

**Purpose:** Validate that scenarios are logical, complete, and meaningfully different

**Validation Checks:**

1. **Completeness:** All required fields populated (assumptions, scope, evidence context)
2. **Internal Consistency:** Assumptions don't contradict each other (contradiction score <0.3)
3. **Distinctiveness:** Scenarios are meaningfully different (pairwise similarity <0.8)
4. **Minimum Detail:** At least 1 specific assumption per scenario

**Action if Failed:**

* Merge duplicate scenarios
* Flag contradictory assumptions for review
* Reduce the confidence score by 20%
* Do not publish if <2 distinct scenarios remain

**Target:** 0% duplicate scenarios, all scenarios internally consistent
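The distinctiveness and consistency actions can be sketched as follows. The `similarity` and `contradiction` scoring functions are assumed to be supplied by the model layer; only the thresholds and the 20% confidence penalty come from the spec.

```python
# Sketch of Gate 3 actions: merge near-duplicate scenarios, flag contradictions,
# apply the confidence penalty, and gate publication on scenario count.
# similarity(a, b) and contradiction(s) are assumed model-provided scores in [0, 1].
SIMILARITY_MAX = 0.8       # scenarios at or above this similarity count as duplicates
CONTRADICTION_MAX = 0.3    # internal contradiction score must stay below this
CONFIDENCE_PENALTY = 0.8   # a failed check reduces confidence by 20%

def gate3_check(scenarios, similarity, contradiction, confidence):
    distinct = []
    for s in scenarios:
        if any(similarity(s, d) >= SIMILARITY_MAX for d in distinct):
            continue  # merge: drop the near-duplicate scenario
        distinct.append(s)
    flagged = [s for s in distinct if contradiction(s) >= CONTRADICTION_MAX]
    if len(distinct) < len(scenarios) or flagged:
        confidence *= CONFIDENCE_PENALTY
    publishable = len(distinct) >= 2  # do not publish with <2 distinct scenarios
    return distinct, flagged, confidence, publishable
```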


=== 2.2 FR54: Evidence Deduplication (NEW) ===

**Importance:** HIGH
**Fulfills:** Accurate evidence counting, prevents artificial inflation

**Purpose:** Prevent counting the same piece of evidence multiple times when it is cited by different sources

**Problem:**

* Wire services (AP, Reuters) redistribute the same content
* Different sites cite the same original study
* Aggregators copy primary sources
* AKEL might count this as "5 sources" when it's really 1

**Solution: Content Fingerprinting**

* Generate a SHA-256 hash of the normalized text
* Detect near-duplicates (≥85% similarity) using fuzzy matching
* Track which sources cited each unique piece of evidence
* Display the provenance chain to the user

**Target:** Duplicate detection >95% accurate, evidence counts reflect reality
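The fingerprinting scheme above can be sketched with Python's standard library: SHA-256 for exact duplicates and `difflib.SequenceMatcher` as a stand-in fuzzy matcher for the ≥85% near-duplicate check (the production system may use a different similarity algorithm; the class and method names here are illustrative).

```python
# Sketch of FR54 content fingerprinting: exact SHA-256 matching plus a
# fuzzy near-duplicate pass, with per-fingerprint provenance tracking.
import hashlib
from difflib import SequenceMatcher

NEAR_DUP_THRESHOLD = 0.85  # ≥85% similarity counts as the same evidence

def normalize(text):
    """Lowercase and collapse whitespace so formatting differences don't change the hash."""
    return " ".join(text.lower().split())

def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

class EvidenceIndex:
    """Maps each unique piece of evidence to the sources that cited it."""
    def __init__(self):
        self.by_hash = {}     # fingerprint -> canonical normalized text
        self.provenance = {}  # fingerprint -> list of citing sources

    def add(self, text, source):
        norm, h = normalize(text), fingerprint(text)
        if h not in self.by_hash:
            # Fuzzy pass: fold near-duplicates into an existing entry.
            for known_h, known_text in self.by_hash.items():
                if SequenceMatcher(None, norm, known_text).ratio() >= NEAR_DUP_THRESHOLD:
                    h = known_h
                    break
            else:
                self.by_hash[h] = norm
        self.provenance.setdefault(h, []).append(source)
        return h

    def unique_count(self):
        return len(self.by_hash)
```

With this index, three citations of essentially the same passage collapse to one unique evidence item while the provenance chain still lists all three sources.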


=== 2.3 NFR13: Quality Metrics Dashboard (Internal) ===

**Importance:** HIGH
**Fulfills:** Real-time quality monitoring during development

**Dashboard Metrics:**

* Claim processing statistics
* Gate performance (pass/fail rates for each gate)
* Evidence quality metrics
* Hallucination rate tracking
* Processing performance

**Target:** Dashboard functional, all metrics tracked and exportable
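A minimal sketch of the gate pass/fail tracking that would feed such a dashboard, assuming a simple in-process counter (the actual metrics backend and metric names are unspecified in this document):

```python
# Minimal sketch of NFR13 gate-performance tracking with an exportable snapshot.
from collections import defaultdict

class QualityMetrics:
    def __init__(self):
        self.counts = defaultdict(lambda: {"pass": 0, "fail": 0})

    def record(self, gate, passed):
        self.counts[gate]["pass" if passed else "fail"] += 1

    def pass_rate(self, gate):
        c = self.counts[gate]
        total = c["pass"] + c["fail"]
        return c["pass"] / total if total else None

    def export(self):
        """Snapshot for the dashboard / export: counts plus derived pass rate per gate."""
        return {g: {**c, "pass_rate": self.pass_rate(g)} for g, c in self.counts.items()}
```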


== 3. Success Criteria ==

**✅ Quality:**

* Hallucination rate <5% (target: <3%)
* Average quality rating ≥8.0/10
* 0 critical failures (no false content published)
* Gates correctly identify >95% of low-quality outputs

**✅ All 4 Gates Operational:**

* Gate 1: Claim validation working
* Gate 2: Evidence relevance filtering working
* Gate 3: Scenario coherence checking working
* Gate 4: Verdict confidence assessment working

**✅ Evidence Deduplication:**

* Duplicate detection >95% accurate
* Evidence counts reflect reality
* Provenance tracked correctly

**✅ Metrics Dashboard:**

* All metrics implemented and tracking
* Dashboard functional and useful
* Alerts trigger appropriately

== 4. Architecture Notes ==

**POC2 Enhanced Architecture:**

{{code}}
Input → AKEL Processing → All 4 Quality Gates → Display
        (claims + scenarios    (1: Claim validation
         + evidence linking     2: Evidence relevance
         + verdicts)            3: Scenario coherence
                                4: Verdict confidence)
{{/code}}
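The gate sequence in the diagram above can be sketched as a simple pipeline. Gate functions and the shape of the analysis object are assumptions; the point is only that all four gates run in order and failures are collected rather than aborting the run:

```python
# Sketch of the POC2 quality-gate sequence. Each gate is assumed to take the
# analysis and return (possibly-modified analysis, passed-flag).
def run_quality_gates(analysis, gates):
    """Apply the gates in order; collect the names of any that fail."""
    failures = []
    for name, gate in gates:
        analysis, ok = gate(analysis)
        if not ok:
            failures.append(name)
    return analysis, failures
```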

**Key Additions from POC1:**

* Scenario generation component
* Evidence deduplication system
* Gates 2 & 3 implementation
* Quality metrics collection

**Still Simplified vs. Full System:**

* Single AKEL orchestration (not the multi-component pipeline)
* No review queue
* No federation architecture

**See:** [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]] for details


== 5. Context-Aware Analysis (Conditional Feature) ==

**Status:** Depends on POC1 experimental test results

**Background:**

POC1 tested context-aware analysis as an experimental feature using Approach 1 (Single-Pass Holistic Analysis). The goal is to detect when articles use accurate individual claims but reach misleading conclusions through faulty logic or selective presentation.

**See:** [[Article Verdict Problem>>Test.FactHarbor.Specification.POC.Article-Verdict-Problem]] for the complete investigation

=== 5.1 POC2 Implementation Path ===

**Decision based on POC1 test results (30-article test set):**

==== If POC1 Accuracy ≥70% (Success) ====

**Action:** Implement as a standard feature (no longer experimental)

**Enhancement to FR4:**

* Context-aware analysis becomes part of the standard Analysis Summary
* The article verdict may differ from the simple claim average
* AI evaluates logical structure and reasoning quality

**Potential Upgrade to Approach 6 (Hybrid):**

* Add weighted claim importance (some claims are more central than others)
* Add rule-based fallacy detection alongside AI reasoning
* Combine AI judgment with heuristic checks for robustness

**Target:** Maintain ≥70% accuracy at detecting misleading articles

==== If POC1 Accuracy ≥50% and <70% (Promising) ====

**Action:** Implement alternative Approach 4 (Weighted Aggregation)

**Instead of holistic analysis:**

* AI assigns an importance weight (0-1) to each claim
* Weights are based on claim centrality, evidence strength, and logical role
* Article verdict = weighted average of claim verdicts
* More structured than pure AI reasoning

**Rationale:** If holistic reasoning is inconsistent, structured weighting may work better
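The weighted-average aggregation in Approach 4 reduces to a short formula. This sketch assumes verdict scores in [0, 1] (1 = fully accurate) and AI-assigned weights in [0, 1]; neither scale is fixed by this document:

```python
# Sketch of Approach 4 (Weighted Aggregation): the article verdict is the
# weighted average of per-claim verdict scores.
def article_verdict(claims):
    """claims: list of (verdict_score, weight) pairs; returns None if all weights are zero."""
    total_weight = sum(w for _, w in claims)
    if total_weight == 0:
        return None  # nothing to aggregate
    return sum(score * w for score, w in claims) / total_weight
```

For example, one central accurate claim (weight 0.9) and one peripheral false claim (weight 0.1) yield an article verdict of 0.9, whereas a simple average would give 0.5.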

==== If POC1 Accuracy <50% (Insufficient) ====

**Action:** Defer context-aware analysis to post-POC2

**Fallback:**

* Focus on individual claim accuracy only
* Article verdict = simple average of claim verdicts
* Note the limitation: this may miss misleading articles built from accurate claims

**Future consideration:** Try Approach 7 (LLM-as-Judge) with better models in future releases

=== 5.2 Testing in POC2 ===

**If the context-aware feature is implemented:**

* Expand the test set from 30 to 100 articles
* Include more diverse article types (op-eds, news, analysis, advocacy)
* Track the false positive rate (flagging good articles as misleading)
* Validate with subject matter experts when possible

**Success Metrics:**

* ≥70% accuracy on misleading article detection
* <15% false positive rate
* Reasoning is comprehensible to users
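The two numeric metrics above can be computed directly from labeled test results. This sketch assumes boolean labels where `True` means the article is genuinely misleading and a prediction of `True` means AKEL flagged it:

```python
# Sketch of the 5.2 evaluation metrics: detection accuracy and false positive
# rate (good articles wrongly flagged as misleading) over a labeled test set.
def evaluate(predictions, labels):
    correct = sum(p == l for p, l in zip(predictions, labels))
    false_pos = sum(p and not l for p, l in zip(predictions, labels))
    good_articles = sum(not l for l in labels)
    return {
        "accuracy": correct / len(labels),
        "false_positive_rate": false_pos / good_articles if good_articles else 0.0,
    }
```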

=== 5.3 Architecture Notes ===

**Context-aware analysis adds NO additional API calls**

The enhanced analysis happens within the existing AKEL workflow:

{{code}}
Standard Flow:           Context-Aware Enhancement:
1. Extract claims        1. Extract claims + mark central claims
2. Find evidence         2. Find evidence
3. Generate verdicts     3. Generate verdicts
4. Write summary         4. Write context-aware summary
                            (evaluates article structure)
{{/code}}

**Cost:** $0 increase (same API calls, enhanced prompt only)

**See:** [[POC Requirements>>Test.FactHarbor.Specification.POC.Requirements]] Component 1 for implementation details


== Related Pages ==

* [[POC1>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.POC1.WebHome]] - Previous phase
* [[Beta 0>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.Beta0.WebHome]] - Next phase
* [[Roadmap Overview>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.WebHome]]
* [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]]

**Document Status:** ✅ POC2 Specification Complete - Waiting for POC1 Completion
**Version:** V0.9.70