Last modified by Robert Schaub on 2025/12/24 21:53

= POC2: Robust Quality & Reliability =

**Phase Goal:** Prove AKEL produces high-quality outputs consistently at scale

**Success Metric:** <5% hallucination rate, all 4 quality gates operational

== 1. Overview ==

POC2 extends POC1 by implementing the full quality assurance framework (all 4 gates), adding evidence deduplication, and processing significantly more test articles to validate system reliability at scale.

**Key Innovation:** Complete quality validation pipeline catches all categories of errors

**What We're Proving:**

* All 4 quality gates work together effectively
* Evidence deduplication prevents artificial inflation
* System maintains quality at larger scale
* Quality metrics dashboard provides actionable insights

== 2. New Requirements ==

=== 2.1 NFR11: Complete Quality Assurance Framework ===

**Add Gates 2 & 3** (POC1 had only Gates 1 & 4)

==== Gate 2: Evidence Relevance Validation ====

**Purpose:** Ensure AI-linked evidence actually relates to the claim

**Validation Checks:**

1. **Semantic Similarity:** Cosine similarity between claim and evidence embeddings ≥ 0.6
2. **Entity Overlap:** At least 1 shared named entity between claim and evidence
3. **Topic Relevance:** Evidence discusses the claim's subject matter (score ≥ 0.5)

**Action if Failed:**

* Discard irrelevant evidence (don't count it)
* If <2 relevant evidence items remain → "Insufficient Evidence" verdict
* Log discarded evidence for quality review

**Target:** 0% of cited evidence is off-topic
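The three checks above can be sketched as follows. This is a minimal illustration, not the production implementation: the embedding vectors, entity sets, and topic scores are assumed to be supplied by upstream components (embedding model, NER, topic classifier), and all function and field names here are hypothetical.

```python
import math

SIM_THRESHOLD = 0.6    # check 1: minimum claim-evidence cosine similarity
TOPIC_THRESHOLD = 0.5  # check 3: minimum topic-relevance score
MIN_EVIDENCE = 2       # below this -> "Insufficient Evidence" verdict

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def passes_gate2(claim, evidence):
    """Evidence passes Gate 2 only if all three checks succeed."""
    return (
        cosine_similarity(claim["embedding"], evidence["embedding"]) >= SIM_THRESHOLD
        and len(set(claim["entities"]) & set(evidence["entities"])) >= 1
        and evidence["topic_score"] >= TOPIC_THRESHOLD
    )

def filter_evidence(claim, evidence_items):
    """Apply the failure actions: discard, log, and downgrade the verdict."""
    relevant = [e for e in evidence_items if passes_gate2(claim, e)]
    discarded = [e for e in evidence_items if e not in relevant]  # kept for quality review
    verdict = "Insufficient Evidence" if len(relevant) < MIN_EVIDENCE else None
    return relevant, discarded, verdict
```

Note that the gate only filters; it never adds evidence, so it cannot itself introduce hallucinated support.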

==== Gate 3: Scenario Coherence Check ====

**Purpose:** Validate scenarios are logical, complete, and meaningfully different

**Validation Checks:**

1. **Completeness:** All required fields populated (assumptions, scope, evidence context)
2. **Internal Consistency:** Assumptions don't contradict each other (contradiction score < 0.3)
3. **Distinctiveness:** Scenarios are meaningfully different (pairwise similarity < 0.8)
4. **Minimum Detail:** At least 1 specific assumption per scenario

**Action if Failed:**

* Merge duplicate scenarios
* Flag contradictory assumptions for review
* Reduce confidence score by 20%
* Do not publish if <2 distinct scenarios

**Target:** 0% duplicate scenarios, all scenarios internally consistent
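A minimal sketch of how Gate 3's checks and failure actions could compose. All names are assumptions; the internal-consistency (contradiction) check is omitted because it needs an NLI-style model, and the similarity function is supplied by the caller (e.g. embedding cosine).

```python
REQUIRED_FIELDS = ("assumptions", "scope", "evidence_context")
MAX_SIMILARITY = 0.8  # scenarios at or above this are treated as duplicates

def is_complete(scenario):
    """Checks 1 & 4: required fields populated, >=1 specific assumption."""
    return (all(scenario.get(f) for f in REQUIRED_FIELDS)
            and len(scenario["assumptions"]) >= 1)

def deduplicate(scenarios, similarity):
    """Check 3: keep only scenarios distinct from everything kept so far."""
    distinct = []
    for s in scenarios:
        if all(similarity(s, d) < MAX_SIMILARITY for d in distinct):
            distinct.append(s)
    return distinct

def gate3(scenarios, similarity, confidence):
    """Apply Gate 3 actions: drop/merge, penalize confidence, block publishing."""
    complete = [s for s in scenarios if is_complete(s)]
    distinct = deduplicate(complete, similarity)
    if len(distinct) < len(scenarios):  # something was dropped or merged
        confidence *= 0.8               # reduce confidence score by 20%
    publishable = len(distinct) >= 2    # need >=2 distinct scenarios
    return distinct, confidence, publishable
```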

=== 2.2 FR54: Evidence Deduplication (NEW) ===

**Importance:** HIGH
**Fulfills:** Accurate evidence counting, prevents artificial inflation

**Purpose:** Prevent counting the same evidence multiple times when it is cited by different sources

**Problem:**

* Wire services (AP, Reuters) redistribute the same content
* Different sites cite the same original study
* Aggregators copy primary sources
* AKEL might count this as "5 sources" when it's really 1

**Solution: Content Fingerprinting**

* Generate SHA-256 hash of normalized text
* Detect near-duplicates (≥85% similarity) using fuzzy matching
* Track which sources cited each unique piece of evidence
* Display provenance chain to user

**Target:** Duplicate detection >95% accurate, evidence counts reflect reality
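The fingerprinting steps above can be sketched with the standard library alone. `difflib.SequenceMatcher` stands in here for whatever fuzzy matcher is ultimately chosen (a production system at scale would more likely use MinHash or SimHash); function names and the record shape are illustrative assumptions.

```python
import hashlib
import re
from difflib import SequenceMatcher

NEAR_DUP_THRESHOLD = 0.85  # >=85% similarity counts as the same evidence

def normalize(text):
    """Lowercase and collapse whitespace/punctuation before hashing."""
    return re.sub(r"\W+", " ", text.lower()).strip()

def fingerprint(text):
    """SHA-256 of the normalized text: catches exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def dedupe_evidence(items):
    """Fold exact and near-duplicates together, tracking provenance.

    Each unique piece keeps the list of sources that cited it, so the UI
    can display the provenance chain and counts reflect reality.
    """
    unique = []  # list of {"hash", "text", "sources"}
    for item in items:
        h = fingerprint(item["text"])
        match = None
        for u in unique:
            if u["hash"] == h or SequenceMatcher(
                None, normalize(item["text"]), u["text"]
            ).ratio() >= NEAR_DUP_THRESHOLD:
                match = u
                break
        if match:
            match["sources"].append(item["source"])  # same evidence, new citer
        else:
            unique.append({"hash": h, "text": normalize(item["text"]),
                           "sources": [item["source"]]})
    return unique
```

With this in place, "5 sources" that all republished one wire story collapse into 1 piece of evidence with 5 entries in its provenance list.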

=== 2.3 NFR13: Quality Metrics Dashboard (Internal) ===

**Importance:** HIGH
**Fulfills:** Real-time quality monitoring during development

**Dashboard Metrics:**

* Claim processing statistics
* Gate performance (pass/fail rates for each gate)
* Evidence quality metrics
* Hallucination rate tracking
* Processing performance

**Target:** Dashboard functional, all metrics tracked, exportable
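As an illustration only (the names and structure are assumptions, not the specified design), a collector covering the metrics above can be very small, with `export()` providing the "exportable" snapshot for the dashboard:

```python
from collections import Counter

class QualityMetrics:
    """Minimal in-memory collector for the dashboard metrics above."""

    def __init__(self):
        self.claims_processed = 0
        self.gate_results = Counter()  # ("gate2", "pass") -> count
        self.hallucinations = 0

    def record_claim(self, gate_outcomes, hallucinated=False):
        """gate_outcomes maps gate name -> bool, e.g. {"gate1": True}."""
        self.claims_processed += 1
        for gate, passed in gate_outcomes.items():
            self.gate_results[(gate, "pass" if passed else "fail")] += 1
        if hallucinated:
            self.hallucinations += 1

    def export(self):
        """Snapshot suitable for the dashboard or offline analysis."""
        rate = (self.hallucinations / self.claims_processed
                if self.claims_processed else 0.0)
        return {
            "claims_processed": self.claims_processed,
            "gate_pass_fail": dict(self.gate_results),
            "hallucination_rate": rate,
        }
```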

== 3. Success Criteria ==

**✅ Quality:**

* Hallucination rate <5% (target: <3%)
* Average quality rating ≥8.0/10
* 0 critical failures (publishable falsities)
* Gates correctly identify >95% of low-quality outputs

**✅ All 4 Gates Operational:**

* Gate 1: Claim validation working
* Gate 2: Evidence relevance filtering working
* Gate 3: Scenario coherence checking working
* Gate 4: Verdict confidence assessment working

**✅ Evidence Deduplication:**

* Duplicate detection >95% accurate
* Evidence counts reflect reality
* Provenance tracked correctly

**✅ Metrics Dashboard:**

* All metrics implemented and tracking
* Dashboard functional and useful
* Alerts trigger appropriately

== 4. Architecture Notes ==

**POC2 Enhanced Architecture:**

{{code}}
Input → AKEL Processing → All 4 Quality Gates → Display
        (claims + scenarios   (1: Claim validation
         + evidence linking    2: Evidence relevance
         + verdicts)           3: Scenario coherence
                               4: Verdict confidence)
{{/code}}
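The gate chain in the diagram can be sketched as a sequential pipeline where the first hard failure stops publication. This is a hedged illustration only; the `(ok, result)` stage contract and all names are assumptions, not the specified interface.

```python
def run_pipeline(akel_output, stages):
    """Run AKEL output through the quality gates in order.

    Each stage is a (name, function) pair; the function returns
    (ok, result). The first failing gate blocks publication.
    """
    result = akel_output
    for name, stage in stages:
        ok, result = stage(result)
        if not ok:
            return {"published": False, "failed_gate": name}
    return {"published": True, "output": result}
```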

**Key Additions from POC1:**

* Scenario generation component
* Evidence deduplication system
* Gates 2 & 3 implementation
* Quality metrics collection

**Still Simplified vs. Full System:**

* Single AKEL orchestration (not a multi-component pipeline)
* No review queue
* No federation architecture

**See:** [[Architecture>>FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]] for details

== 5. Context-Aware Analysis (Conditional Feature) ==

**Status:** Depends on POC1 experimental test results

**Background:**

POC1 tested context-aware analysis as an experimental feature using Approach 1 (Single-Pass Holistic Analysis). The goal is to detect when articles use accurate individual claims but reach misleading conclusions through faulty logic or selective presentation.

**See:** [[Article Verdict Problem>>FactHarbor.Specification.POC.Article-Verdict-Problem]] for the complete investigation

=== 5.1 POC2 Implementation Path ===

**Decision based on POC1 test results (30-article test set):**

==== If POC1 Accuracy ≥70% (Success) ====

**Action:** Implement as a standard feature (no longer experimental)

**Enhancement to FR4:**

* Context-aware analysis becomes part of the standard Analysis Summary
* Article verdict may differ from the simple claim average
* AI evaluates logical structure and reasoning quality

**Potential Upgrade to Approach 6 (Hybrid):**

* Add weighted claim importance (some claims are more central than others)
* Add rule-based fallacy detection alongside AI reasoning
* Combine AI judgment with heuristic checks for robustness

**Target:** Maintain ≥70% accuracy at detecting misleading articles

==== If POC1 Accuracy 50-70% (Promising) ====

**Action:** Implement alternative Approach 4 (Weighted Aggregation)

**Instead of holistic analysis:**

* AI assigns an importance weight (0-1) to each claim
* Weights are based on claim centrality, evidence strength, and logical role
* Article verdict = weighted average of claim verdicts
* More structured than pure AI reasoning

**Rationale:** If holistic reasoning proves inconsistent, structured weighting may work better
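The aggregation step itself is straightforward. A sketch, assuming claim verdicts are normalized to a 0-1 scale (the actual verdict representation is not specified here):

```python
def article_verdict(claims):
    """Approach 4: importance-weighted average of claim verdicts.

    Each claim is assumed to carry:
      "verdict" - float in [0, 1] (0 = false, 1 = true)
      "weight"  - float in [0, 1], the AI-assigned importance
    """
    total_weight = sum(c["weight"] for c in claims)
    if total_weight == 0:
        return None  # no weighted signal; defer to human review
    return sum(c["verdict"] * c["weight"] for c in claims) / total_weight
```

The effect is that a false but central claim drags the article verdict down far more than a false footnote would under a simple average.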

==== If POC1 Accuracy <50% (Insufficient) ====

**Action:** Defer context-aware analysis to post-POC2

**Fallback:**

* Focus on individual claim accuracy only
* Article verdict = simple average of claim verdicts
* Note limitation: may miss misleading articles built from accurate claims

**Future consideration:** Try Approach 7 (LLM-as-Judge) with better models in future releases

=== 5.2 Testing in POC2 ===

**If the context-aware feature is implemented:**

* Expand the test set from 30 to 100 articles
* Include more diverse article types (op-eds, news, analysis, advocacy)
* Track the false positive rate (flagging good articles as misleading)
* Validate with subject matter experts when possible

**Success Metrics:**

* ≥70% accuracy on misleading article detection
* <15% false positive rate
* Reasoning is comprehensible to users

=== 5.3 Architecture Notes ===

**Context-aware analysis adds NO additional API calls**

The enhanced analysis happens within the existing AKEL workflow:

{{code}}
Standard Flow:          Context-Aware Enhancement:
1. Extract claims       1. Extract claims + mark central claims
2. Find evidence        2. Find evidence
3. Generate verdicts    3. Generate verdicts
4. Write summary        4. Write context-aware summary
                           (evaluates article structure)
{{/code}}

**Cost:** $0 increase (same API calls, enhanced prompt only)

**See:** [[POC Requirements>>FactHarbor.Specification.POC.Requirements]] Component 1 for implementation details

== Related Pages ==

* [[POC1>>FactHarbor pre10 V0\.9\.70.Roadmap.POC1.WebHome]] - Previous phase
* [[Beta 0>>FactHarbor pre10 V0\.9\.70.Roadmap.Beta0.WebHome]] - Next phase
* [[Roadmap Overview>>FactHarbor pre10 V0\.9\.70.Roadmap.WebHome]]
* [[Architecture>>FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]]

**Document Status:** ✅ POC2 Specification Complete - Waiting for POC1 Completion
**Version:** V0.9.70