Wiki source code of System Performance Metrics

Last modified by Robert Schaub on 2026/02/08 08:32

1 = System Performance Metrics =
2
3 **What we monitor to ensure AKEL performs well.**
4
5 == 1. Purpose ==
6
7 These metrics tell us:
8
9 * ✅ Is AKEL performing within acceptable ranges?
10 * ✅ Where should we focus improvement efforts?
11 * ✅ When do humans need to intervene?
12 * ✅ Are our changes improving things?
13 **Principle**: Measure to improve, not to judge.
14
15 == 2. Metric Categories ==
16
17 === 2.1 AKEL Performance ===
18
19 **Processing speed and reliability**
20
21 === 2.2 Content Quality ===
22
23 **Output quality and user satisfaction**
24
25 === 2.3 System Health ===
26
27 **Infrastructure and operational metrics**
28
29 === 2.4 User Experience ===
30
31 **How users interact with the system**
32
33 == 3. AKEL Performance Metrics ==
34
35 === 3.1 Processing Time ===
36
37 **Metric**: Time from claim submission to verdict publication
38 **Measurements**:
39
40 * P50 (median): 50% of claims processed within X seconds
41 * P95: 95% of claims processed within Y seconds
42 * P99: 99% of claims processed within Z seconds
43 **Targets**:
44 * P50: ≤ 12 seconds
45 * P95: ≤ 18 seconds
46 * P99: ≤ 25 seconds
47 **Alert thresholds**:
48 * P95 > 20 seconds: Monitor closely
49 * P95 > 25 seconds: Investigate immediately
50 * P95 > 30 seconds: Emergency - intervention required
51 **Why it matters**: Slow processing = poor user experience
52 **Improvement ideas**:
53 * Optimize evidence extraction
54 * Better caching
55 * Parallel processing
56 * Database query optimization
57
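The percentile targets and P95 alert thresholds above can be computed directly from raw latency samples. A minimal sketch, assuming latencies are collected in seconds (function names are illustrative, not part of AKEL):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ranked = sorted(samples)
    # ceil(pct * n / 100) - 1, clamped to a valid index
    idx = max(0, -(-pct * len(ranked) // 100) - 1)
    return ranked[int(idx)]

def check_processing_time(samples):
    """Return (P50, P95, P99) plus the alert level implied by the P95 thresholds."""
    p50, p95, p99 = (percentile(samples, p) for p in (50, 95, 99))
    if p95 > 30:
        level = "emergency"
    elif p95 > 25:
        level = "investigate"
    elif p95 > 20:
        level = "monitor"
    else:
        level = "ok"
    return (p50, p95, p99), level
```

The nearest-rank method is one of several percentile definitions; monitoring backends may interpolate instead, so the exact P95 value can differ slightly between tools.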
58 === 3.2 Success Rate ===
59
60 **Metric**: % of claims successfully processed without errors
61 **Target**: ≥ 99%
62 **Alert thresholds**:
63
64 * 98-99%: Monitor
65 * 95-98%: Investigate
66 * <95%: Emergency
67 **Common failure causes**:
68 * Timeout (evidence extraction took too long)
69 * Parse error (claim text unparsable)
70 * External API failure (source unavailable)
71 * Resource exhaustion (memory/CPU)
72 **Why it matters**: Errors frustrate users and reduce trust
73
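The alert bands above translate into a simple classifier; a minimal sketch (the function name and hourly-window assumption are illustrative):

```python
def success_rate_alert(succeeded, total):
    """Map a window's success rate to the alert bands from section 3.2."""
    rate = 100.0 * succeeded / total
    if rate < 95:
        return rate, "emergency"
    if rate < 98:
        return rate, "investigate"
    if rate < 99:
        return rate, "monitor"
    return rate, "ok"
```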
74 === 3.3 Evidence Completeness ===
75
76 **Metric**: % of claims where AKEL found sufficient evidence
77 **Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources
78 **Target**: ≥ 80%
79 **Alert thresholds**:
80
81 * 75-80%: Monitor
82 * 70-75%: Investigate
83 * <70%: Intervention needed
84 **Why it matters**: Incomplete evidence = low confidence verdicts
85 **Improvement ideas**:
86 * Better search algorithms
87 * More source integrations
88 * Improved relevance scoring
89
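The completeness rule (≥3 pieces of evidence from ≥2 distinct sources) can be checked per claim and aggregated; a minimal sketch, assuming evidence arrives as (source_id, snippet) pairs (an assumed shape, not AKEL's actual schema):

```python
def is_complete(evidence):
    """Section 3.3 rule: >= 3 pieces of evidence from >= 2 distinct sources.
    `evidence` is a list of (source_id, snippet) pairs (assumed shape)."""
    sources = {src for src, _ in evidence}
    return len(evidence) >= 3 and len(sources) >= 2

def completeness_rate(claims):
    """% of claims whose evidence meets the completeness rule (target >= 80%)."""
    if not claims:
        return 0.0
    complete = sum(1 for ev in claims if is_complete(ev))
    return 100.0 * complete / len(claims)
```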
90 === 3.4 Source Diversity ===
91
92 **Metric**: Average number of distinct sources per claim
93 **Target**: ≥ 3.0 sources per claim
94 **Alert thresholds**:
95
96 * 2.5-3.0: Monitor
97 * 2.0-2.5: Investigate
98 * <2.0: Intervention needed
99 **Why it matters**: Multiple sources increase confidence and reduce bias
100
101 === 3.5 Scenario Coverage ===
102
103 **Metric**: % of claims with at least one scenario extracted
104 **Target**: ≥ 75%
105 **Why it matters**: Scenarios provide context for verdicts
106
107 == 4. Content Quality Metrics ==
108
109 === 4.1 Confidence Distribution ===
110
111 **Metric**: Distribution of confidence scores across claims
112 **Target**: Roughly normal distribution
113
114 * 10% very low confidence (0.0-0.3)
115 * 20% low confidence (0.3-0.5)
116 * 40% medium confidence (0.5-0.7)
117 * 20% high confidence (0.7-0.9)
118 * 10% very high confidence (0.9-1.0)
119 **Alert thresholds**:
120 * >30% very low confidence: Evidence extraction issues
121 * >30% very high confidence: Too aggressive/overconfident
122 * Heavily skewed distribution: Systematic bias
123 **Why it matters**: Confidence should reflect actual uncertainty
124
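The five confidence bands and the distribution alerts above can be computed as follows; a minimal sketch (band edges taken from this section, function names illustrative):

```python
def confidence_histogram(scores):
    """Bucket confidence scores into the five bands from section 4.1,
    returning each band's share as a percentage."""
    bands = [(0.0, 0.3), (0.3, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.01)]
    counts = [0] * len(bands)
    for s in scores:
        for i, (lo, hi) in enumerate(bands):
            if lo <= s < hi:
                counts[i] += 1
                break
    total = len(scores) or 1
    return [100.0 * c / total for c in counts]

def distribution_alerts(shares):
    """Flag the distribution problems listed above."""
    alerts = []
    if shares[0] > 30:
        alerts.append("evidence extraction issues (>30% very low confidence)")
    if shares[4] > 30:
        alerts.append("overconfident (>30% very high confidence)")
    return alerts
```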
125 === 4.2 Contradiction Rate ===
126
127 **Metric**: % of claims with internal contradictions detected
128 **Target**: ≤ 5%
129 **Alert thresholds**:
130
131 * 5-10%: Monitor
132 * 10-15%: Investigate
133 * >15%: Intervention needed
134 **Why it matters**: High contradiction rate suggests poor evidence quality or logic errors
135
136 === 4.3 User Feedback Ratio ===
137
138 **Metric**: Helpful vs unhelpful user ratings
139 **Target**: ≥ 70% helpful
140 **Alert thresholds**:
141
142 * 60-70%: Monitor
143 * 50-60%: Investigate
144 * <50%: Emergency
145 **Why it matters**: Direct measure of user satisfaction
146
147 === 4.4 False Positive/Negative Rate ===
148
149 **Metric**: Accuracy of AKEL's review flags, as judged by human reviewers
150 **Measurement**:
151
152 * False positive: AKEL flagged an item for review that was actually fine
153 * False negative: AKEL missed an item that should have been flagged
154 **Target**:
155 * False positive rate: ≤ 20%
156 * False negative rate: ≤ 5%
157 **Why it matters**: Balance between catching problems and not crying wolf
158
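The two rates above can be derived from human review outcomes; a minimal sketch, assuming each review is recorded as a (akel_flagged, actually_a_problem) pair. Note the denominators are an assumption: false positives are measured against flagged items, false negatives against unflagged ones, which the section does not pin down explicitly.

```python
def review_rates(reviews):
    """False positive / false negative rates from human review outcomes.
    `reviews` is a list of (akel_flagged, actually_a_problem) bool pairs
    (an assumed shape for illustration)."""
    flagged = [problem for did_flag, problem in reviews if did_flag]
    unflagged = [problem for did_flag, problem in reviews if not did_flag]
    fp = sum(1 for problem in flagged if not problem)
    fn = sum(1 for problem in unflagged if problem)
    fp_rate = 100.0 * fp / len(flagged) if flagged else 0.0
    fn_rate = 100.0 * fn / len(unflagged) if unflagged else 0.0
    return fp_rate, fn_rate
```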
159 == 5. System Health Metrics ==
160
161 === 5.1 Uptime ===
162
163 **Metric**: % of time system is available and functional
164 **Target**: ≥ 99.9% (less than 45 minutes downtime per month)
165 **Alert**: Immediate notification on any downtime
166 **Why it matters**: Users expect 24/7 availability
167
168 === 5.2 Error Rate ===
169
170 **Metric**: Errors per 1000 requests
171 **Target**: ≤ 1 error per 1000 requests (0.1%)
172 **Alert thresholds**:
173
174 * 1-5 per 1000: Monitor
175 * 5-10 per 1000: Investigate
176 * >10 per 1000: Emergency
177 **Why it matters**: Errors disrupt user experience
178
179 === 5.3 Database Performance ===
180
181 **Metrics**:
182
183 * Query response time (P95)
184 * Connection pool utilization
185 * Slow query frequency
186 **Targets**:
187 * P95 query time: ≤ 50ms
188 * Connection pool: ≤ 80% utilized
189 * Slow queries (>1s): ≤ 10 per hour
190 **Why it matters**: Database bottlenecks slow entire system
191
192 === 5.4 Cache Hit Rate ===
193
194 **Metric**: % of requests served from cache vs. database
195 **Target**: ≥ 80%
196 **Why it matters**: Higher cache hit rate = faster responses, less DB load
197
198 === 5.5 Resource Utilization ===
199
200 **Metrics**:
201
202 * CPU utilization
203 * Memory utilization
204 * Disk I/O
205 * Network bandwidth
206 **Targets**:
207 * Average CPU: ≤ 60%
208 * Peak CPU: ≤ 85%
209 * Memory: ≤ 80%
210 * Disk I/O: ≤ 70%
211 **Alert**: Any metric consistently >85%
212 **Why it matters**: Headroom for traffic spikes, prevents resource exhaustion
213
214 == 6. User Experience Metrics ==
215
216 === 6.1 Time to First Verdict ===
217
218 **Metric**: Time from user submitting claim to seeing initial verdict
219 **Target**: ≤ 15 seconds
220 **Why it matters**: User perception of speed
221
222 === 6.2 Claim Submission Rate ===
223
224 **Metric**: Claims submitted per day/hour
225 **Monitoring**: Track trends, detect anomalies
226 **Why it matters**: Understand usage patterns, capacity planning
227
228 === 6.3 User Retention ===
229
230 **Metric**: % of users who return after first visit
231 **Target**: ≥ 30% (1-week retention)
232 **Why it matters**: Indicates system usefulness
233
234 === 6.4 Feature Usage ===
235
236 **Metrics**:
237
238 * % of users who explore evidence
239 * % who check scenarios
240 * % who view source track records
241 **Why it matters**: Understand how users interact with system
242
243 == 7. Metric Dashboard ==
244
245 === 7.1 Real-Time Dashboard ===
246
247 **Always visible**:
248
249 * Current processing time (P95)
250 * Success rate (last hour)
251 * Error rate (last hour)
252 * System health status
253 **Update frequency**: Every 30 seconds
254
255 === 7.2 Daily Dashboard ===
256
257 **Reviewed daily**:
258
259 * All AKEL performance metrics
260 * Content quality metrics
261 * System health trends
262 * User feedback summary
263
264 === 7.3 Weekly Reports ===
265
266 **Reviewed weekly**:
267
268 * Trends over time
269 * Week-over-week comparisons
270 * Improvement priorities
271 * Outstanding issues
272
273 === 7.4 Monthly/Quarterly Reports ===
274
275 **Comprehensive analysis**:
276
277 * Long-term trends
278 * Seasonal patterns
279 * Strategic metrics
280 * Goal progress
281
282 == 8. Alert System ==
283
284 === 8.1 Alert Levels ===
285
286 **Info**: Metric outside target, but within acceptable range
287
288 * Action: Note in daily review
289 * Example: P95 processing time 19s (target 18s, acceptable <20s)
290 **Warning**: Metric outside acceptable range
291 * Action: Investigate within 24 hours
292 * Example: Success rate 97% (acceptable >98%)
293 **Critical**: Metric severely degraded
294 * Action: Investigate immediately
295 * Example: Error rate 2% (acceptable <0.5%)
296 **Emergency**: System failure or severe degradation
297 * Action: Page on-call, all hands
298 * Example: Uptime <95%, P95 >30s
299
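Since each metric has its own bands, the four levels can share one classifier; a minimal sketch (the `alert_level` helper and the edge tuples are illustrative, not part of AKEL):

```python
# Level names match section 8.1; each metric supplies its own band edges.
LEVELS = ("info", "warning", "critical", "emergency")

def alert_level(value, edges, higher_is_worse=True):
    """Classify a metric reading against ascending severity edges.
    `edges` = (info_at, warning_at, critical_at, emergency_at)."""
    if not higher_is_worse:
        value, edges = -value, tuple(-e for e in edges)
    level = None
    for name, edge in zip(LEVELS, edges):
        if value >= edge:
            level = name
    return level or "ok"

# Example bands for P95 processing time (seconds), from sections 3.1 and 8.1:
P95_EDGES = (18, 20, 25, 30)
```

For metrics where lower is worse (e.g. success rate), pass `higher_is_worse=False` with the edges expressed as the values at which each level begins.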
300 === 8.2 Alert Channels ===
301
302 **Slack/Discord**: All alerts
303 **Email**: Warning and above
304 **SMS**: Critical and emergency only
305 **PagerDuty**: Emergency only
306
307 === 8.3 On-Call Rotation ===
308
309 **Technical Coordinator**: Primary on-call
310 **Backup**: Designated team member
311 **Responsibilities**:
312
313 * Respond to alerts within SLA
314 * Investigate and diagnose issues
315 * Implement fixes or escalate
316 * Document incidents
317
318 == 9. Metric-Driven Improvement ==
319
320 === 9.1 Prioritization ===
321
322 **Focus improvements on**:
323
324 * Metrics furthest from target
325 * Metrics with biggest user impact
326 * Metrics easiest to improve
327 * Strategic priorities
328
329 === 9.2 Success Criteria ===
330
331 **Every improvement project should**:
332
333 * Target specific metrics
334 * Set concrete improvement goals
335 * Measure before and after
336 * Document learnings
337 **Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"
338
339 === 9.3 A/B Testing ===
340
341 **When feasible**:
342
343 * Run two versions
344 * Measure metric differences
345 * Choose based on data
346 * Roll out winner
347
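"Choose based on data" implies a significance check, not just a raw comparison. One common approach, sketched here under the assumption that the metric is a proportion (e.g. helpful-rating rate), is a two-proportion z-test; the function names and 95% threshold are illustrative choices, not an AKEL-mandated procedure:

```python
import math

def ab_zscore(success_a, n_a, success_b, n_b):
    """Two-proportion z-score comparing version B's rate against version A's."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

def pick_winner(success_a, n_a, success_b, n_b, z_crit=1.96):
    """Roll out B only if its rate is significantly higher (~95% confidence)."""
    z = ab_zscore(success_a, n_a, success_b, n_b)
    return "B" if z > z_crit else "A"
```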
348 == 10. Bias and Fairness Metrics ==
349
350 === 10.1 Domain Balance ===
351
352 **Metric**: Confidence distribution by domain
353 **Target**: Similar distributions across domains
354 **Alert**: One domain consistently much lower/higher confidence
355 **Why it matters**: Ensure no systematic domain bias
356
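One simple way to operationalize "similar distributions across domains" is to flag domains whose mean confidence sits far from the overall mean; a minimal sketch (the `max_gap` tolerance is an illustrative assumption, not an AKEL-defined threshold, and a full check would compare whole distributions, not just means):

```python
def domain_balance(scores_by_domain, max_gap=0.15):
    """Flag domains whose mean confidence deviates from the overall mean
    by more than `max_gap`. Input: {domain: [confidence scores]}."""
    all_scores = [s for scores in scores_by_domain.values() for s in scores]
    overall = sum(all_scores) / len(all_scores)
    flagged = {}
    for domain, scores in scores_by_domain.items():
        mean = sum(scores) / len(scores)
        if abs(mean - overall) > max_gap:
            flagged[domain] = round(mean, 3)
    return flagged
```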
357 === 10.2 Source Type Balance ===
358
359 **Metric**: Evidence distribution by source type
360 **Target**: Diverse source types represented
361 **Alert**: Over-reliance on one source type
362 **Why it matters**: Prevent source type bias
363
364 === 10.3 Geographic Balance ===
365
366 **Metric**: Source geographic distribution
367 **Target**: Multiple regions represented
368 **Alert**: Over-concentration in one region
369 **Why it matters**: Reduce geographic/cultural bias
370
371 == 11. Experimental Metrics ==
372
373 **New metrics to test**:
374
375 * User engagement time
376 * Evidence exploration depth
377 * Cross-reference usage
378 * Mobile vs desktop usage
379 **Process**:
380
381 1. Define metric hypothesis
382 2. Implement tracking
383 3. Collect data for 1 month
384 4. Evaluate usefulness
385 5. Add to standard set or discard
386
387 == 12. Anti-Patterns ==
388
389 **Don't**:
390
391 * ❌ Measure too many things (focus on what matters)
392 * ❌ Set unrealistic targets (demotivating)
393 * ❌ Ignore metrics when inconvenient
394 * ❌ Game metrics (destroys their value)
395 * ❌ Blame individuals for metric failures
396 * ❌ Let metrics become the goal (they're tools)
397 **Do**:
398 * ✅ Focus on actionable metrics
399 * ✅ Set ambitious but achievable targets
400 * ✅ Respond to metric signals
401 * ✅ Continuously validate metrics still matter
402 * ✅ Use metrics for system improvement, not people evaluation
403 * ✅ Remember: metrics serve users, not the other way around
404
405 == 13. Related Pages ==
406
407 * [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
408 * [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
409 * [[Governance>>Archive.FactHarbor 2026\.02\.08.Organisation.Governance.WebHome]] - Quarterly performance reviews
410 ----
411 **Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.