Wiki source code of System Performance Metrics
Last modified by Robert Schaub on 2026/02/08 08:32
= System Performance Metrics =

**What we monitor to ensure AKEL performs well.**

== 1. Purpose ==

These metrics tell us:

* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?

**Principle**: Measure to improve, not to judge.

== 2. Metric Categories ==

=== 2.1 AKEL Performance ===

**Processing speed and reliability**

=== 2.2 Content Quality ===

**Output quality and user satisfaction**

=== 2.3 System Health ===

**Infrastructure and operational metrics**

=== 2.4 User Experience ===

**How users interact with the system**

== 3. AKEL Performance Metrics ==

=== 3.1 Processing Time ===

**Metric**: Time from claim submission to verdict publication

**Measurements**:

* P50 (median): 50% of claims processed within X seconds
* P95: 95% of claims processed within Y seconds
* P99: 99% of claims processed within Z seconds

**Targets**:

* P50: ≤ 12 seconds
* P95: ≤ 18 seconds
* P99: ≤ 25 seconds

**Alert thresholds**:

* P95 > 20 seconds: Monitor closely
* P95 > 25 seconds: Investigate immediately
* P95 > 30 seconds: Emergency - intervention required

**Why it matters**: Slow processing = poor user experience

**Improvement ideas**:

* Optimize evidence extraction
* Better caching
* Parallel processing
* Database query optimization
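
As an illustrative sketch, the percentile targets and alert tiers above can be computed from raw latency samples like this (nearest-rank percentiles; the function names and sample data are assumptions, not part of AKEL):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(0, round(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def p95_alert(p95_seconds):
    """Map P95 processing time to the alert tiers defined above."""
    if p95_seconds > 30:
        return "emergency"
    if p95_seconds > 25:
        return "investigate"
    if p95_seconds > 20:
        return "monitor"
    return "ok"

latencies = [8, 9, 10, 11, 12, 13, 14, 16, 19, 22]  # seconds, made up
print(percentile(latencies, 50), percentile(latencies, 95))
print(p95_alert(percentile(latencies, 95)))
```

In production these percentiles would come from the monitoring stack rather than in-process sorting; the point is only how the thresholds map to actions.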

=== 3.2 Success Rate ===

**Metric**: % of claims successfully processed without errors

**Target**: ≥ 99%

**Alert thresholds**:

* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency

**Common failure causes**:

* Timeout (evidence extraction took too long)
* Parse error (claim text unparsable)
* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)

**Why it matters**: Errors frustrate users and reduce trust
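
A minimal sketch of the band mapping and failure-cause tally above; the `{"ok": ..., "cause": ...}` job-record shape is an assumption:

```python
from collections import Counter

def success_rate_alert(rate):
    """Map the success rate (as a fraction) to the alert bands above."""
    if rate >= 0.99:
        return "ok"
    if rate >= 0.98:
        return "monitor"
    if rate >= 0.95:
        return "investigate"
    return "emergency"

# Illustrative job records: 197 successes, 3 failures.
jobs = [{"ok": True}] * 197 + [
    {"ok": False, "cause": "timeout"},
    {"ok": False, "cause": "timeout"},
    {"ok": False, "cause": "parse_error"},
]
rate = sum(j["ok"] for j in jobs) / len(jobs)
causes = Counter(j["cause"] for j in jobs if not j["ok"])
print(rate, success_rate_alert(rate), causes.most_common(1))
```

Tallying causes alongside the rate matters because the response differs: timeouts point at evidence extraction, parse errors at claim intake.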

=== 3.3 Evidence Completeness ===

**Metric**: % of claims where AKEL found sufficient evidence

**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources

**Target**: ≥ 80%

**Alert thresholds**:

* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed

**Why it matters**: Incomplete evidence = low confidence verdicts

**Improvement ideas**:

* Better search algorithms
* More source integrations
* Improved relevance scoring
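
The measurement rule above (≥3 evidence items from ≥2 distinct sources) can be sketched directly; the evidence-record shape is an assumption:

```python
def has_sufficient_evidence(evidence_items):
    """True when a claim has ≥3 evidence items from ≥2 distinct sources."""
    sources = {item["source"] for item in evidence_items}
    return len(evidence_items) >= 3 and len(sources) >= 2

claims = {  # illustrative: claim id -> evidence items
    "c1": [{"source": "a"}, {"source": "b"}, {"source": "a"}],  # ok
    "c2": [{"source": "a"}, {"source": "a"}, {"source": "a"}],  # one source
    "c3": [{"source": "a"}],                                    # too few items
}
coverage = sum(has_sufficient_evidence(ev) for ev in claims.values()) / len(claims)
print(coverage)  # compare against the ≥80% target
```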

=== 3.4 Source Diversity ===

**Metric**: Average number of distinct sources per claim

**Target**: ≥ 3.0 sources per claim

**Alert thresholds**:

* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed

**Why it matters**: Multiple sources increase confidence and reduce bias

=== 3.5 Scenario Coverage ===

**Metric**: % of claims with at least one scenario extracted

**Target**: ≥ 75%

**Why it matters**: Scenarios provide context for verdicts

== 4. Content Quality Metrics ==

=== 4.1 Confidence Distribution ===

**Metric**: Distribution of confidence scores across claims

**Target**: Roughly normal distribution

* 10% very low confidence (0.0-0.3)
* 20% low confidence (0.3-0.5)
* 40% medium confidence (0.5-0.7)
* 20% high confidence (0.7-0.9)
* 10% very high confidence (0.9-1.0)

**Alert thresholds**:

* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias

**Why it matters**: Confidence should reflect actual uncertainty
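
A sketch of bucketing scores into the five bands above and flagging the >30% tail conditions; band edges follow this page, everything else is illustrative:

```python
BAND_EDGES = [(0.3, "very_low"), (0.5, "low"), (0.7, "medium"),
              (0.9, "high"), (1.01, "very_high")]

def confidence_histogram(scores):
    """Count scores per confidence band."""
    counts = {name: 0 for _, name in BAND_EDGES}
    for s in scores:
        for upper, name in BAND_EDGES:
            if s < upper:
                counts[name] += 1
                break
    return counts

def distribution_alerts(scores):
    """Flag the >30% tail conditions from the alert thresholds above."""
    counts = confidence_histogram(scores)
    n = len(scores)
    alerts = []
    if counts["very_low"] / n > 0.30:
        alerts.append("evidence extraction issues")
    if counts["very_high"] / n > 0.30:
        alerts.append("overconfident")
    return alerts

print(distribution_alerts([0.1, 0.2, 0.25, 0.6, 0.8]))
```

Detecting "heavily skewed distribution" would need a proper skewness statistic; the tail checks are the simple part shown here.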

=== 4.2 Contradiction Rate ===

**Metric**: % of claims with internal contradictions detected

**Target**: ≤ 5%

**Alert thresholds**:

* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed

**Why it matters**: High contradiction rate suggests poor evidence quality or logic errors

=== 4.3 User Feedback Ratio ===

**Metric**: Ratio of helpful to unhelpful user ratings

**Target**: ≥ 70% helpful

**Alert thresholds**:

* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency

**Why it matters**: Direct measure of user satisfaction

=== 4.4 False Positive/Negative Rate ===

**Metric**: When humans review flagged items, how often was AKEL right?

**Measurement**:

* False positive: AKEL flagged an item for review, but it was actually fine
* False negative: AKEL missed something that should have been flagged

**Targets**:

* False positive rate: ≤ 20%
* False negative rate: ≤ 5%

**Why it matters**: Balance between catching problems and not crying wolf

== 5. System Health Metrics ==

=== 5.1 Uptime ===

**Metric**: % of time system is available and functional

**Target**: ≥ 99.9% (less than 45 minutes downtime per month)

**Alert**: Immediate notification on any downtime

**Why it matters**: Users expect 24/7 availability
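
The 45-minute figure above follows from the target: as a sketch (assuming a 30-day month), an uptime percentage converts to a monthly downtime budget like this:

```python
def downtime_budget_minutes(uptime_pct, days=30):
    """Monthly downtime allowed by an uptime target (30-day month assumed)."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - uptime_pct / 100)

print(downtime_budget_minutes(99.9))   # roughly 43 minutes per month
print(downtime_budget_minutes(99.99))  # roughly 4.3 minutes per month
```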

=== 5.2 Error Rate ===

**Metric**: Errors per 1000 requests

**Target**: ≤ 1 error per 1000 requests (0.1%)

**Alert thresholds**:

* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency

**Why it matters**: Errors disrupt user experience

=== 5.3 Database Performance ===

**Metrics**:

* Query response time (P95)
* Connection pool utilization
* Slow query frequency

**Targets**:

* P95 query time: ≤ 50ms
* Connection pool: ≤ 80% utilized
* Slow queries (>1s): ≤ 10 per hour

**Why it matters**: Database bottlenecks slow entire system

=== 5.4 Cache Hit Rate ===

**Metric**: % of requests served from cache vs. database

**Target**: ≥ 80%

**Why it matters**: Higher cache hit rate = faster responses, less DB load
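
The computation is just hits over total lookups; a tiny sketch (counter names are assumptions):

```python
def cache_hit_rate(hits, misses):
    """Fraction of lookups served from cache rather than the database."""
    total = hits + misses
    return hits / total if total else 0.0

print(cache_hit_rate(850, 150))  # 0.85, above the 80% target
```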

=== 5.5 Resource Utilization ===

**Metrics**:

* CPU utilization
* Memory utilization
* Disk I/O
* Network bandwidth

**Targets**:

* Average CPU: ≤ 60%
* Peak CPU: ≤ 85%
* Memory: ≤ 80%
* Disk I/O: ≤ 70%

**Alert**: Any metric consistently >85%

**Why it matters**: Headroom absorbs traffic spikes and prevents resource exhaustion

== 6. User Experience Metrics ==

=== 6.1 Time to First Verdict ===

**Metric**: Time from user submitting claim to seeing initial verdict

**Target**: ≤ 15 seconds

**Why it matters**: User perception of speed

=== 6.2 Claim Submission Rate ===

**Metric**: Claims submitted per day/hour

**Monitoring**: Track trends, detect anomalies

**Why it matters**: Understand usage patterns, inform capacity planning

=== 6.3 User Retention ===

**Metric**: % of users who return after first visit

**Target**: ≥ 30% (1-week retention)

**Why it matters**: Indicates system usefulness

=== 6.4 Feature Usage ===

**Metrics**:

* % of users who explore evidence
* % who check scenarios
* % who view source track records

**Why it matters**: Understand how users interact with the system

== 7. Metric Dashboard ==

=== 7.1 Real-Time Dashboard ===

**Always visible**:

* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status

**Update frequency**: Every 30 seconds

=== 7.2 Daily Dashboard ===

**Reviewed daily**:

* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary

=== 7.3 Weekly Reports ===

**Reviewed weekly**:

* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues

=== 7.4 Monthly/Quarterly Reports ===

**Comprehensive analysis**:

* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress

== 8. Alert System ==

=== 8.1 Alert Levels ===

**Info**: Metric outside target, but within acceptable range

* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)

**Warning**: Metric outside acceptable range

* Action: Investigate within 24 hours
* Example: Success rate 97% (acceptable >98%)

**Critical**: Metric severely degraded

* Action: Investigate immediately
* Example: Error rate 2% (acceptable <0.5%)

**Emergency**: System failure or severe degradation

* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s

=== 8.2 Alert Channels ===

**Slack/Discord**: All alerts
**Email**: Warning and above
**SMS**: Critical and emergency only
**PagerDuty**: Emergency only
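
The channel rules above amount to a minimum-severity routing table; a sketch (the table representation and function names are assumptions, channel assignments follow this page):

```python
SEVERITY = ["info", "warning", "critical", "emergency"]

# Minimum severity each channel receives, per the list above.
CHANNELS = {"chat": "info", "email": "warning",
            "sms": "critical", "pagerduty": "emergency"}

def channels_for(level):
    """Every channel whose minimum severity is at or below `level`."""
    rank = SEVERITY.index(level)
    return sorted(name for name, minimum in CHANNELS.items()
                  if SEVERITY.index(minimum) <= rank)

print(channels_for("warning"))    # chat and email
print(channels_for("emergency"))  # all four channels
```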

=== 8.3 On-Call Rotation ===

**Technical Coordinator**: Primary on-call
**Backup**: Designated team member

**Responsibilities**:

* Respond to alerts within SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents

== 9. Metric-Driven Improvement ==

=== 9.1 Prioritization ===

**Focus improvements on**:

* Metrics furthest from target
* Metrics with biggest user impact
* Metrics easiest to improve
* Strategic priorities

=== 9.2 Success Criteria ===

**Every improvement project should**:

* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings

**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"

=== 9.3 A/B Testing ===

**When feasible**:

* Run two versions
* Measure metric differences
* Choose based on data
* Roll out winner
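
"Choose based on data" usually means a significance test. One common option (not prescribed by this page) is a two-proportion z-test on a rate metric; the 1.96 cutoff is the conventional ~95% level, and the sample counts are made up:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for the difference between two observed proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# e.g. helpful-rating rates: 700/1000 on variant A vs 745/1000 on variant B
z = two_proportion_z(700, 1000, 745, 1000)
print("significant at ~95%" if abs(z) > 1.96 else "inconclusive")
```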

== 10. Bias and Fairness Metrics ==

=== 10.1 Domain Balance ===

**Metric**: Confidence distribution by domain

**Target**: Similar distributions across domains

**Alert**: One domain consistently much lower/higher confidence

**Why it matters**: Ensure no systematic domain bias
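
One simple way to operationalize the alert above is to compare each domain's mean confidence against the overall mean; the tolerance and data below are assumptions for illustration:

```python
from statistics import mean

def domain_outliers(by_domain, max_gap=0.15):
    """Domains whose mean confidence strays from the overall mean.
    `max_gap` is an assumed tolerance, not a value from this page."""
    overall = mean(s for scores in by_domain.values() for s in scores)
    return sorted(d for d, scores in by_domain.items()
                  if abs(mean(scores) - overall) > max_gap)

data = {  # illustrative confidence scores per domain
    "health": [0.70, 0.75],
    "politics": [0.30, 0.35],
    "science": [0.75, 0.70],
}
print(domain_outliers(data))  # politics stands out
```

Comparing full distributions (not just means) would catch more subtle skew, but the mean-gap check is a cheap first alert.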

=== 10.2 Source Type Balance ===

**Metric**: Evidence distribution by source type

**Target**: Diverse source types represented

**Alert**: Over-reliance on one source type

**Why it matters**: Prevent source type bias

=== 10.3 Geographic Balance ===

**Metric**: Source geographic distribution

**Target**: Multiple regions represented

**Alert**: Over-concentration in one region

**Why it matters**: Reduce geographic/cultural bias

== 11. Experimental Metrics ==

**New metrics to test**:

* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs desktop usage

**Process**:

1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard

== 12. Anti-Patterns ==

**Don't**:

* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient
* ❌ Game metrics (destroys their value)
* ❌ Blame individuals for metric failures
* ❌ Let metrics become the goal (they're tools)

**Do**:

* ✅ Focus on actionable metrics
* ✅ Set ambitious but achievable targets
* ✅ Respond to metric signals
* ✅ Continuously validate metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around

== 13. Related Pages ==

* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>Archive.FactHarbor 2026\.02\.08.Organisation.Governance.WebHome]] - Quarterly performance reviews

---

**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.