Wiki source code of POC2: Robust Quality & Reliability

Last modified by Robert Schaub on 2025/12/23 11:02

	1.1	1	= POC2: Robust Quality & Reliability =
		2
		3	Phase Goal: Prove AKEL produces high-quality outputs consistently at scale
		4
		5	Success Metric: <5% hallucination rate, all 4 quality gates operational
		6
		7
		8	== 1. Overview ==
		9
		10	POC2 extends POC1 by implementing the full quality assurance framework (all 4 gates), adding evidence deduplication, and processing significantly more test articles to validate system reliability at scale.
		11
		12	Key Innovation: A complete quality validation pipeline whose four gates together cover claim, evidence, scenario, and verdict errors
		13
		14	What We're Proving:
		15
		16	* All 4 quality gates work together effectively
		17	* Evidence deduplication prevents artificial inflation
		18	* System maintains quality at larger scale
		19	* Quality metrics dashboard provides actionable insights
		20
		21	== 2. New Requirements ==
		22
		23	=== 2.1 NFR11: Complete Quality Assurance Framework ===
		24
		25	Add Gates 2 & 3 (POC1 had only Gates 1 & 4)
		26
		27	==== Gate 2: Evidence Relevance Validation ====
		28
		29	Purpose: Ensure AI-linked evidence actually relates to the claim
		30
		31	Validation Checks:
		32
		33	1. Semantic Similarity: Cosine similarity between claim and evidence embeddings ≥ 0.6
		34	2. Entity Overlap: At least 1 shared named entity between claim and evidence
		35	3. Topic Relevance: Evidence discusses the claim's subject matter (score ≥ 0.5)
		36
		37	Action if Failed:
		38
		39	* Discard irrelevant evidence (don't count it)
		40	* If fewer than 2 relevant evidence items remain, return an "Insufficient Evidence" verdict
		41	* Log discarded evidence for quality review
		42
		43	Target: 0% of evidence cited is off-topic
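
The Gate 2 checks and failure actions above can be sketched as follows. This is a minimal illustration, not the AKEL implementation: the `Evidence` record, the `gate2` function, and the idea that embeddings, named entities, and topic scores arrive precomputed are all assumptions made for the example. The spec also does not say whether all three checks must pass, so the sketch assumes they must.

```python
import math
from dataclasses import dataclass

SIMILARITY_MIN = 0.6  # cosine similarity threshold from the spec
TOPIC_MIN = 0.5       # topic relevance threshold from the spec
MIN_RELEVANT = 2      # fewer relevant items than this -> "Insufficient Evidence"


@dataclass
class Evidence:
    """Hypothetical evidence record; all fields are assumed to be filled upstream."""
    text: str
    embedding: list[float]
    entities: set[str]
    topic_score: float


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def gate2(claim_embedding, claim_entities, evidence_items):
    """Split evidence into kept/discarded and return a verdict override, if any."""
    kept, discarded = [], []
    for ev in evidence_items:
        relevant = (
            cosine(claim_embedding, ev.embedding) >= SIMILARITY_MIN
            and len(claim_entities & ev.entities) >= 1  # at least 1 shared entity
            and ev.topic_score >= TOPIC_MIN
        )
        (kept if relevant else discarded).append(ev)
    # Discarded items are returned so the caller can log them for quality review.
    verdict = "Insufficient Evidence" if len(kept) < MIN_RELEVANT else None
    return kept, discarded, verdict
```

Returning the discarded items alongside the kept ones keeps the "log discarded evidence" step in the caller's hands, which is one possible design among several.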
		44
		45
		46	==== Gate 3: Scenario Coherence Check ====
		47
		48	Purpose: Validate scenarios are logical, complete, and meaningfully different
		49
		50	Validation Checks:
		51
		52	1. Completeness: All required fields populated (assumptions, scope, evidence context)
		53	2. Internal Consistency: Assumptions don't contradict each other (contradiction score <0.3)
		54	3. Distinctiveness: Scenarios are meaningfully different (pairwise similarity <0.8)
		55	4. Minimum Detail: At least 1 specific assumption per scenario
		56
		57	Action if Failed:
		58
		59	* Merge duplicate scenarios
		60	* Flag contradictory assumptions for review
		61	* Reduce confidence score by 20%
		62	* Do not publish if <2 distinct scenarios
		63
		64	Target: 0% duplicate scenarios, all scenarios internally consistent
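
A rough sketch of the Gate 3 checks and actions above, under several stated assumptions: the `Scenario` fields mirror the required fields named in the spec, `contradiction_score` is assumed to be computed upstream, token-level Jaccard overlap stands in for whatever similarity measure AKEL actually uses, duplicates are "merged" by keeping the first occurrence, incomplete scenarios are dropped, and "reduce confidence score by 20%" is read as multiplying by 0.8. None of these choices is confirmed by the spec.

```python
from dataclasses import dataclass

CONTRADICTION_MAX = 0.3   # spec: contradiction score must stay below this
SIMILARITY_MAX = 0.8      # spec: scenario similarity must stay below this
CONFIDENCE_PENALTY = 0.2  # spec: reduce confidence by 20% (assumed multiplicative)
MIN_SCENARIOS = 2         # spec: do not publish with fewer distinct scenarios


@dataclass
class Scenario:
    """Hypothetical scenario record with the required fields from the spec."""
    assumptions: list[str]
    scope: str
    evidence_context: str
    contradiction_score: float = 0.0  # assumed to be computed upstream


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a stand-in for the real similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def scenario_text(s: Scenario) -> str:
    return " ".join(s.assumptions + [s.scope, s.evidence_context])


def gate3(scenarios, confidence):
    """Return (distinct scenarios, flagged scenarios, confidence, publish flag)."""
    distinct, flagged, failed = [], [], False
    for s in scenarios:
        # Completeness and minimum detail: every field set, >= 1 assumption.
        if not (s.assumptions and s.scope and s.evidence_context):
            failed = True
            continue
        text = scenario_text(s)
        if any(jaccard(text, scenario_text(d)) >= SIMILARITY_MAX for d in distinct):
            failed = True  # duplicate: keep only the earlier scenario
            continue
        if s.contradiction_score >= CONTRADICTION_MAX:
            failed = True
            flagged.append(s)  # contradictory assumptions go to review
        distinct.append(s)
    if failed:
        confidence *= 1 - CONFIDENCE_PENALTY
    return distinct, flagged, confidence, len(distinct) >= MIN_SCENARIOS
```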
		65
		66
		67	=== 2.2 FR54: Evidence Deduplication (NEW) ===
		68
		69	Priority: HIGH
		70	Fulfills: Accurate evidence counting, prevents artificial inflation
		71
		72	Purpose: Prevent counting the same evidence multiple times when it is cited by different sources
		73
		74	Problem:
		75
		76	* Wire services (AP, Reuters) redistribute same content
		77	* Different sites cite the same original study
		78	* Aggregators copy primary sources
		79	* AKEL might count this as "5 sources" when it's really 1
		80
		81	Solution: Content Fingerprinting
		82
		83	* Generate SHA-256 hash of normalized text to detect exact duplicates
		84	* Detect near-duplicates (≥85% similarity) using fuzzy matching, which an exact hash alone cannot catch
		85	* Track which sources cited each unique piece of evidence
		86	* Display provenance chain to user
		87
		88	Target: Duplicate detection >95% accurate, evidence counts reflect reality
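
One way the content fingerprinting described above could look, as a sketch under assumptions: the normalization rules (NFKC, lowercasing, stripping punctuation, collapsing whitespace) are illustrative, `difflib.SequenceMatcher` from the Python standard library stands in for the unspecified fuzzy matcher, and the `EvidenceIndex` class is a hypothetical name. The pairwise comparison is quadratic and only suitable for small test sets.

```python
import hashlib
import re
import unicodedata
from difflib import SequenceMatcher

NEAR_DUPLICATE_MIN = 0.85  # similarity threshold from the spec


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, lowercase, no punctuation, single spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of the normalized text (exact-duplicate key)."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class EvidenceIndex:
    """Tracks unique evidence and which sources cited each item."""

    def __init__(self):
        self.items = {}  # fingerprint -> {"text": normalized text, "sources": [...]}

    def add(self, text: str, source: str) -> str:
        norm = normalize(text)
        key = fingerprint(text)
        if key not in self.items:
            # No exact match: fall back to fuzzy matching against known items.
            for other_key, item in self.items.items():
                ratio = SequenceMatcher(None, norm, item["text"]).ratio()
                if ratio >= NEAR_DUPLICATE_MIN:
                    key = other_key
                    break
            else:
                self.items[key] = {"text": norm, "sources": []}
        self.items[key]["sources"].append(source)  # provenance chain
        return key

    def unique_count(self) -> int:
        return len(self.items)
```

With an index like this, a wire story republished by several outlets would contribute one unique evidence item with several entries in its `sources` list, which is the provenance chain the spec asks to display.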
		89
		90
		91	=== 2.3 NFR13: Quality Metrics Dashboard (Internal) ===
		92
		93	Priority: HIGH
		94	Fulfills: Real-time quality monitoring during development
		95
		96	Dashboard Metrics:
		97
		98	* Claim processing statistics
		99	* Gate performance (pass/fail rates for each gate)
		100	* Evidence quality metrics
		101	* Hallucination rate tracking
		102	* Processing performance
		103
		104	Target: Dashboard functional, all metrics tracked, exportable
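
A small sketch of how the dashboard metrics above might be collected and exported. The `QualityMetrics` class, its method names, and the JSON export format are assumptions for illustration; the spec only requires that the metrics be tracked and exportable. Whether a claim counts as hallucinated is assumed to be decided elsewhere (for example by a human reviewer) and passed in as a flag.

```python
import json
from collections import Counter


class QualityMetrics:
    """Hypothetical in-memory collector for the internal dashboard."""

    def __init__(self):
        self.claims_processed = 0
        self.hallucinations = 0
        self.gate_results = Counter()  # (gate number, "pass" | "fail") -> count

    def record_gate(self, gate: int, passed: bool) -> None:
        self.gate_results[(gate, "pass" if passed else "fail")] += 1

    def record_claim(self, hallucinated: bool) -> None:
        self.claims_processed += 1
        self.hallucinations += int(hallucinated)

    def hallucination_rate(self) -> float:
        if not self.claims_processed:
            return 0.0
        return self.hallucinations / self.claims_processed

    def gate_pass_rate(self, gate: int):
        passed = self.gate_results[(gate, "pass")]
        failed = self.gate_results[(gate, "fail")]
        total = passed + failed
        return passed / total if total else None  # None: gate not exercised yet

    def export_json(self) -> str:
        return json.dumps(
            {
                "claims_processed": self.claims_processed,
                "hallucination_rate": self.hallucination_rate(),
                "gate_pass_rates": {
                    f"gate_{g}": self.gate_pass_rate(g) for g in (1, 2, 3, 4)
                },
            },
            indent=2,
        )
```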
		105
		106
		107	== 3. Success Criteria ==
		108
		109	✅ Quality:
		110
		111	* Hallucination rate <5% (target: <3%)
		112	* Average quality rating ≥8.0/10
		113	* 0 critical failures (publishable falsehoods)
		114	* Gates correctly identify >95% of low-quality outputs
		115
		116	✅ All 4 Gates Operational:
		117
		118	* Gate 1: Claim validation working
		119	* Gate 2: Evidence relevance filtering working
		120	* Gate 3: Scenario coherence checking working
		121	* Gate 4: Verdict confidence assessment working
		122
		123	✅ Evidence Deduplication:
		124
		125	* Duplicate detection >95% accurate
		126	* Evidence counts reflect reality
		127	* Provenance tracked correctly
		128
		129	✅ Metrics Dashboard:
		130
		131	* All metrics implemented and tracking
		132	* Dashboard functional and useful
		133	* Alerts trigger appropriately
		134
		135	== 4. Architecture Notes ==
		136
		137	POC2 Enhanced Architecture:
		138
		139	{{code}}
		140	Input → AKEL Processing → All 4 Quality Gates → Display
		141	        (claims + scenarios  (1: Claim validation
		142	        + evidence linking    2: Evidence relevance
		143	        + verdicts)           3: Scenario coherence
		144	                              4: Verdict confidence)
		145	{{/code}}
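
The flow in the diagram can be read as AKEL output passing through the four gates in order. The sketch below only illustrates that ordering: the `run_gates` function, the `(passed, reason)` return convention, and the shape of the result dict are hypothetical, and the real gates apply the gate-specific actions defined in section 2.1 rather than a simple pass/fail.

```python
from typing import Callable

# Assumed convention: a gate takes the AKEL result and returns (passed, reason).
Gate = Callable[[dict], tuple[bool, str]]


def run_gates(result: dict, gates: list[tuple[str, Gate]]) -> dict:
    """Run every gate in order and collect a per-gate report."""
    report = []
    for name, gate in gates:
        passed, reason = gate(result)
        report.append({"gate": name, "passed": passed, "reason": reason})
    return {"all_passed": all(r["passed"] for r in report), "gates": report}


# Usage with two stub gates (placeholders, not the real checks):
stub_gates = [
    ("1: Claim validation", lambda r: (bool(r.get("claim")), "claim present")),
    ("2: Evidence relevance", lambda r: (len(r.get("evidence", [])) >= 2,
                                         "at least 2 evidence items")),
]
summary = run_gates({"claim": "example claim", "evidence": ["e1"]}, stub_gates)
```

Running every gate even after a failure, rather than stopping at the first one, gives the metrics dashboard a pass/fail record per gate for each processed claim.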
		146
		147	Key Additions since POC1:
		148
		149	* Scenario generation component
		150	* Evidence deduplication system
		151	* Gates 2 & 3 implementation
		152	* Quality metrics collection
		153
		154	Still Simplified vs. Full System:
		155
		156	* Single AKEL orchestration (not multi-component pipeline)
		157	* No review queue
		158	* No federation architecture
		159
		160	See: [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]] for details
		161
		162
		163	== Related Pages ==
		164
		165	* [[POC1>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.POC1.WebHome]] - Previous phase
		166	* [[Beta 0>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.Beta0.WebHome]] - Next phase
		167	* [[Roadmap Overview>>Test.FactHarbor pre10 V0\.9\.70.Roadmap.WebHome]]
		168	* [[Architecture>>Test.FactHarbor pre10 V0\.9\.70.Specification.Architecture.WebHome]]
		169
		170	Document Status: ✅ POC2 Specification Complete - Waiting for POC1 Completion
		171	Version: V0.9.70