Last modified by Robert Schaub on 2026/02/08 08:32

= System Performance Metrics =

**What we monitor to ensure AKEL performs well.**

== 1. Purpose ==

These metrics tell us:

* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?

**Principle**: Measure to improve, not to judge.

== 2. Metric Categories ==

=== 2.1 AKEL Performance ===

**Processing speed and reliability**

=== 2.2 Content Quality ===

**Output quality and user satisfaction**

=== 2.3 System Health ===

**Infrastructure and operational metrics**

=== 2.4 User Experience ===

**How users interact with the system**

== 3. AKEL Performance Metrics ==

=== 3.1 Processing Time ===

**Metric**: Time from claim submission to verdict publication

**Measurements**:

* P50 (median): 50% of claims processed within X seconds
* P95: 95% of claims processed within Y seconds
* P99: 99% of claims processed within Z seconds

**Targets**:

* P50: ≤ 12 seconds
* P95: ≤ 18 seconds
* P99: ≤ 25 seconds

**Alert thresholds**:

* P95 > 20 seconds: Monitor closely
* P95 > 25 seconds: Investigate immediately
* P95 > 30 seconds: Emergency - intervention required

**Why it matters**: Slow processing = poor user experience

**Improvement ideas**:

* Optimize evidence extraction
* Better caching
* Parallel processing
* Database query optimization

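The percentile targets above can be checked with a simple nearest-rank computation; a minimal sketch in Python (the function names and report shape are illustrative, not an existing AKEL API):

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample value such that
    at least pct% of samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # ceil(pct/100 * n) as an integer rank, clamped to at least 1
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]

def processing_time_report(durations_s):
    """Summarize claim-processing durations against the section 3.1 targets."""
    targets = {50: 12, 95: 18, 99: 25}
    report = {}
    for pct, target in targets.items():
        value = percentile(durations_s, pct)
        report[f"P{pct}"] = {"seconds": value, "within_target": value <= target}
    return report
```

Nearest-rank is the simplest percentile definition; a monitoring backend may interpolate instead, which gives slightly different values on small samples.
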
=== 3.2 Success Rate ===

**Metric**: % of claims successfully processed without errors

**Target**: ≥ 99%

**Alert thresholds**:

* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency

**Common failure causes**:

* Timeout (evidence extraction took too long)
* Parse error (claim text unparsable)
* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)

**Why it matters**: Errors frustrate users and reduce trust

=== 3.3 Evidence Completeness ===

**Metric**: % of claims where AKEL found sufficient evidence

**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources

**Target**: ≥ 80%

**Alert thresholds**:

* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed

**Why it matters**: Incomplete evidence = low confidence verdicts

**Improvement ideas**:

* Better search algorithms
* More source integrations
* Improved relevance scoring

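The completeness bar (≥3 pieces of evidence from ≥2 distinct sources) translates directly into code; a sketch, assuming each evidence item carries a `source` field (the field name is an assumption):

```python
def is_evidence_complete(evidence_items, min_items=3, min_sources=2):
    """True when a claim has at least min_items pieces of evidence
    drawn from at least min_sources distinct sources."""
    sources = {item["source"] for item in evidence_items}
    return len(evidence_items) >= min_items and len(sources) >= min_sources

def completeness_rate(claims_evidence):
    """% of claims meeting the completeness bar (target: >= 80%).
    Takes a list of per-claim evidence lists."""
    if not claims_evidence:
        return 0.0
    complete = sum(1 for ev in claims_evidence if is_evidence_complete(ev))
    return 100.0 * complete / len(claims_evidence)
```
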
=== 3.4 Source Diversity ===

**Metric**: Average number of distinct sources per claim

**Target**: ≥ 3.0 sources per claim

**Alert thresholds**:

* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed

**Why it matters**: Multiple sources increase confidence and reduce bias

=== 3.5 Scenario Coverage ===

**Metric**: % of claims with at least one scenario extracted

**Target**: ≥ 75%

**Why it matters**: Scenarios provide context for verdicts

== 4. Content Quality Metrics ==

=== 4.1 Confidence Distribution ===

**Metric**: Distribution of confidence scores across claims

**Target**: Roughly normal distribution

* 10% very low confidence (0.0-0.3)
* 20% low confidence (0.3-0.5)
* 40% medium confidence (0.5-0.7)
* 20% high confidence (0.7-0.9)
* 10% very high confidence (0.9-1.0)

**Alert thresholds**:

* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias

**Why it matters**: Confidence should reflect actual uncertainty

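The target distribution can be monitored by bucketing scores and checking the two threshold alerts; a minimal sketch (bucket names and alert strings are illustrative):

```python
# Bucket boundaries from the target distribution in section 4.1
BUCKETS = [
    ("very low", 0.0, 0.3),
    ("low", 0.3, 0.5),
    ("medium", 0.5, 0.7),
    ("high", 0.7, 0.9),
    ("very high", 0.9, 1.0),
]

def confidence_distribution(scores):
    """Share of claims (in %) per confidence bucket.
    Upper bounds are exclusive except for the last bucket."""
    counts = {name: 0 for name, _, _ in BUCKETS}
    for s in scores:
        for name, lo, hi in BUCKETS:
            if lo <= s < hi or (hi == 1.0 and s == 1.0):
                counts[name] += 1
                break
    total = len(scores) or 1
    return {name: 100.0 * n / total for name, n in counts.items()}

def distribution_alerts(dist):
    """Flag the two >30% alert conditions from section 4.1."""
    alerts = []
    if dist["very low"] > 30:
        alerts.append("Evidence extraction issues")
    if dist["very high"] > 30:
        alerts.append("Overconfident verdicts")
    return alerts
```
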
=== 4.2 Contradiction Rate ===

**Metric**: % of claims with internal contradictions detected

**Target**: ≤ 5%

**Alert thresholds**:

* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed

**Why it matters**: High contradiction rate suggests poor evidence quality or logic errors

=== 4.3 User Feedback Ratio ===

**Metric**: Helpful vs. unhelpful user ratings

**Target**: ≥ 70% helpful

**Alert thresholds**:

* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency

**Why it matters**: Direct measure of user satisfaction

=== 4.4 False Positive/Negative Rate ===

**Metric**: When humans review flagged items, how often was AKEL right?

**Measurement**:

* False positive: AKEL flagged for review, but actually fine
* False negative: Missed something that should have been flagged

**Targets**:

* False positive rate: ≤ 20%
* False negative rate: ≤ 5%

**Why it matters**: Balance between catching problems and not crying wolf

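Given human review outcomes, both rates reduce to counting; a sketch, assuming each review is recorded as a `(flagged_by_akel, actually_problematic)` pair of booleans (the data shape is an assumption):

```python
def review_error_rates(reviews):
    """Compute (false_positive_rate, false_negative_rate) in %.

    False positive rate: of items AKEL flagged, share a human judged fine.
    False negative rate: of truly problematic items, share AKEL missed.
    """
    flagged = [r for r in reviews if r[0]]
    problematic = [r for r in reviews if r[1]]
    fp = sum(1 for _, actual in flagged if not actual)
    fn = sum(1 for was_flagged, _ in problematic if not was_flagged)
    fp_rate = 100.0 * fp / len(flagged) if flagged else 0.0
    fn_rate = 100.0 * fn / len(problematic) if problematic else 0.0
    return fp_rate, fn_rate
```

Note the false negative rate can only be estimated from items that reached human review by some other route, since truly missed items are invisible by definition.
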
== 5. System Health Metrics ==

=== 5.1 Uptime ===

**Metric**: % of time system is available and functional

**Target**: ≥ 99.9% (less than 45 minutes downtime per month)

**Alert**: Immediate notification on any downtime

**Why it matters**: Users expect 24/7 availability

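The 45-minute figure follows directly from the target: 0.1% of a month (43.2 minutes for 30 days, 44.6 for 31). A quick check:

```python
def downtime_budget_minutes(uptime_target_pct, days=30):
    """Allowed downtime per period for a given uptime target.
    99.9% over a 30-day month leaves roughly 43 minutes."""
    return (100.0 - uptime_target_pct) / 100.0 * days * 24 * 60
```
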
=== 5.2 Error Rate ===

**Metric**: Errors per 1000 requests

**Target**: ≤ 1 error per 1000 requests (0.1%)

**Alert thresholds**:

* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency

**Why it matters**: Errors disrupt user experience

=== 5.3 Database Performance ===

**Metrics**:

* Query response time (P95)
* Connection pool utilization
* Slow query frequency

**Targets**:

* P95 query time: ≤ 50ms
* Connection pool: ≤ 80% utilized
* Slow queries (>1s): ≤ 10 per hour

**Why it matters**: Database bottlenecks slow the entire system

=== 5.4 Cache Hit Rate ===

**Metric**: % of requests served from cache vs. database

**Target**: ≥ 80%

**Why it matters**: Higher cache hit rate = faster responses, less DB load

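Both rate metrics above are simple ratios, sketched here for completeness (function names are illustrative):

```python
def errors_per_1000(errors, requests):
    """Error rate normalized per 1000 requests (target: <= 1)."""
    return 1000.0 * errors / requests if requests else 0.0

def cache_hit_rate(hits, misses):
    """% of requests served from cache (target: >= 80%)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0
```
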
=== 5.5 Resource Utilization ===

**Metrics**:

* CPU utilization
* Memory utilization
* Disk I/O
* Network bandwidth

**Targets**:

* Average CPU: ≤ 60%
* Peak CPU: ≤ 85%
* Memory: ≤ 80%
* Disk I/O: ≤ 70%

**Alert**: Any metric consistently >85%

**Why it matters**: Headroom absorbs traffic spikes and prevents resource exhaustion

== 6. User Experience Metrics ==

=== 6.1 Time to First Verdict ===

**Metric**: Time from user submitting claim to seeing initial verdict

**Target**: ≤ 15 seconds

**Why it matters**: User perception of speed

=== 6.2 Claim Submission Rate ===

**Metric**: Claims submitted per day/hour

**Monitoring**: Track trends, detect anomalies

**Why it matters**: Understand usage patterns, capacity planning

=== 6.3 User Retention ===

**Metric**: % of users who return after first visit

**Target**: ≥ 30% (1-week retention)

**Why it matters**: Indicates system usefulness

=== 6.4 Feature Usage ===

**Metrics**:

* % of users who explore evidence
* % who check scenarios
* % who view source track records

**Why it matters**: Understand how users interact with the system

== 7. Metric Dashboard ==

=== 7.1 Real-Time Dashboard ===

**Always visible**:

* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status

**Update frequency**: Every 30 seconds

=== 7.2 Daily Dashboard ===

**Reviewed daily**:

* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary

=== 7.3 Weekly Reports ===

**Reviewed weekly**:

* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues

=== 7.4 Monthly/Quarterly Reports ===

**Comprehensive analysis**:

* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress

== 8. Alert System ==

=== 8.1 Alert Levels ===

**Info**: Metric outside target, but within acceptable range

* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)

**Warning**: Metric outside acceptable range

* Action: Investigate within 24 hours
* Example: Success rate 97% (acceptable >98%)

**Critical**: Metric severely degraded

* Action: Investigate immediately
* Example: Error rate 2% (acceptable <0.5%)

**Emergency**: System failure or severe degradation

* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s

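As one concrete reading of how the P95 thresholds from section 3.1 map onto these levels (the exact mapping is an assumption, not stated verbatim on this page):

```python
def p95_alert_level(p95_seconds):
    """Map P95 processing time to an alert level, combining the
    section 3.1 thresholds with the section 8.1 level definitions
    (target 18s, acceptable < 20s)."""
    if p95_seconds > 30:
        return "emergency"
    if p95_seconds > 25:
        return "critical"
    if p95_seconds > 20:
        return "warning"
    if p95_seconds > 18:
        return "info"
    return "ok"
```
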
=== 8.2 Alert Channels ===

**Slack/Discord**: All alerts

**Email**: Warning and above

**SMS**: Critical and emergency only

**PagerDuty**: Emergency only

=== 8.3 On-Call Rotation ===

**Technical Coordinator**: Primary on-call

**Backup**: Designated team member

**Responsibilities**:

* Respond to alerts within SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents

== 9. Metric-Driven Improvement ==

=== 9.1 Prioritization ===

**Focus improvements on**:

* Metrics furthest from target
* Metrics with biggest user impact
* Metrics easiest to improve
* Strategic priorities

=== 9.2 Success Criteria ===

**Every improvement project should**:

* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings

**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"

=== 9.3 A/B Testing ===

**When feasible**:

* Run two versions
* Measure metric differences
* Choose based on data
* Roll out winner

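"Choose based on data" usually means testing whether the metric difference is statistically meaningful; a minimal two-proportion z-test sketch, e.g. for comparing helpful-rating ratios between variants A and B (the test choice is a common convention, not something this page prescribes):

```python
from math import sqrt, erf

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two proportions,
    e.g. 'helpful' ratings under variants A and B.
    Returns (z, p_value); small p_value suggests a real difference."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value
```

For small samples or strongly skewed proportions, an exact test or a longer collection window would be more appropriate.
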
== 10. Bias and Fairness Metrics ==

=== 10.1 Domain Balance ===

**Metric**: Confidence distribution by domain

**Target**: Similar distributions across domains

**Alert**: One domain consistently much lower/higher confidence

**Why it matters**: Ensure no systematic domain bias

=== 10.2 Source Type Balance ===

**Metric**: Evidence distribution by source type

**Target**: Diverse source types represented

**Alert**: Over-reliance on one source type

**Why it matters**: Prevent source type bias

=== 10.3 Geographic Balance ===

**Metric**: Source geographic distribution

**Target**: Multiple regions represented

**Alert**: Over-concentration in one region

**Why it matters**: Reduce geographic/cultural bias

== 11. Experimental Metrics ==

**New metrics to test**:

* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs desktop usage

**Process**:

1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard

== 12. Anti-Patterns ==

**Don't**:

* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient
* ❌ Game metrics (destroys their value)
* ❌ Blame individuals for metric failures
* ❌ Let metrics become the goal (they're tools)

**Do**:

* ✅ Focus on actionable metrics
* ✅ Set ambitious but achievable targets
* ✅ Respond to metric signals
* ✅ Continuously validate metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around

== 13. Related Pages ==

* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>Archive.FactHarbor 2026\.02\.08.Organisation.Governance.WebHome]] - Quarterly performance reviews

----

**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.