Last modified by Robert Schaub on 2026/02/08 08:32
= System Performance Metrics =

**What we monitor to ensure AKEL performs well.**

== 1. Purpose ==

These metrics tell us:

* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?

**Principle**: Measure to improve, not to judge.

== 2. Metric Categories ==

=== 2.1 AKEL Performance ===

**Processing speed and reliability**

=== 2.2 Content Quality ===

**Output quality and user satisfaction**

=== 2.3 System Health ===

**Infrastructure and operational metrics**

=== 2.4 User Experience ===

**How users interact with the system**

== 3. AKEL Performance Metrics ==

=== 3.1 Processing Time ===

**Metric**: Time from claim submission to verdict publication

**Measurements**:

* P50 (median): 50% of claims processed within X seconds
* P95: 95% of claims processed within Y seconds
* P99: 99% of claims processed within Z seconds

**Targets**:

* P50: ≤ 12 seconds
* P95: ≤ 18 seconds
* P99: ≤ 25 seconds

**Alert thresholds**:

* P95 > 20 seconds: Monitor closely
* P95 > 25 seconds: Investigate immediately
* P95 > 30 seconds: Emergency - intervention required

**Why it matters**: Slow processing = poor user experience

**Improvement ideas**:

* Optimize evidence extraction
* Better caching
* Parallel processing
* Database query optimization
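
As an illustrative sketch (function names and sample data are ours, not part of AKEL), the percentile targets and P95 alert bands above might be computed like this:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def p95_alert(p95_seconds):
    """Map P95 processing time onto the alert bands defined above."""
    if p95_seconds > 30:
        return "emergency"
    if p95_seconds > 25:
        return "investigate"
    if p95_seconds > 20:
        return "monitor"
    return "ok"

# Example: ten latency samples from a (hypothetical) monitoring window.
latencies = [8, 9, 11, 12, 13, 14, 15, 16, 19, 22]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
status = p95_alert(p95)
```

Production systems would typically pull these percentiles from the metrics backend rather than recompute them, but the banding logic is the same.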

=== 3.2 Success Rate ===

**Metric**: % of claims successfully processed without errors

**Target**: ≥ 99%

**Alert thresholds**:

* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency

**Common failure causes**:

* Timeout (evidence extraction took too long)
* Parse error (claim text unparsable)
* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)

**Why it matters**: Errors frustrate users and reduce trust
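
The success-rate bands above translate directly into a small classifier; this is a hedged sketch, not AKEL's actual alerting code:

```python
def success_rate_alert(succeeded, total):
    """Classify an hourly success rate against the bands above (target >= 99%)."""
    rate = 100 * succeeded / total
    if rate < 95:
        return rate, "emergency"
    if rate < 98:
        return rate, "investigate"
    if rate < 99:
        return rate, "monitor"
    return rate, "ok"
```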

=== 3.3 Evidence Completeness ===

**Metric**: % of claims where AKEL found sufficient evidence

**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources

**Target**: ≥ 80%

**Alert thresholds**:

* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed

**Why it matters**: Incomplete evidence = low confidence verdicts

**Improvement ideas**:

* Better search algorithms
* More source integrations
* Improved relevance scoring
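
The "≥3 pieces from ≥2 distinct sources" rule can be stated in a few lines; the data shape (a list of evidence dicts with a `source` key) is our assumption for illustration:

```python
def is_complete(evidence):
    """True when a claim has >= 3 evidence items from >= 2 distinct sources."""
    return len(evidence) >= 3 and len({e["source"] for e in evidence}) >= 2

def completeness_rate(claims):
    """% of claims meeting the completeness bar (target >= 80%)."""
    if not claims:
        return 0.0
    return 100 * sum(is_complete(c) for c in claims) / len(claims)
```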

=== 3.4 Source Diversity ===

**Metric**: Average number of distinct sources per claim

**Target**: ≥ 3.0 sources per claim

**Alert thresholds**:

* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed

**Why it matters**: Multiple sources increase confidence and reduce bias

=== 3.5 Scenario Coverage ===

**Metric**: % of claims with at least one scenario extracted

**Target**: ≥ 75%

**Why it matters**: Scenarios provide context for verdicts

== 4. Content Quality Metrics ==

=== 4.1 Confidence Distribution ===

**Metric**: Distribution of confidence scores across claims

**Target**: Roughly normal distribution

* 10% very low confidence (0.0-0.3)
* 20% low confidence (0.3-0.5)
* 40% medium confidence (0.5-0.7)
* 20% high confidence (0.7-0.9)
* 10% very high confidence (0.9-1.0)

**Alert thresholds**:

* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias

**Why it matters**: Confidence should reflect actual uncertainty
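
A minimal sketch of the bucketing and skew checks above (band edges taken from this page; each score falls in the band where lower ≤ score < upper, with 1.0 landing in the top band):

```python
def confidence_buckets(scores):
    """Fraction of scores per band: very low, low, medium, high, very high."""
    edges = [(0.0, 0.3), (0.3, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.01)]
    counts = [sum(lo <= s < hi for s in scores) for lo, hi in edges]
    return [c / len(scores) for c in counts]

def distribution_alerts(fractions):
    """Flag the two skew conditions described above."""
    alerts = []
    if fractions[0] > 0.30:
        alerts.append("too many very-low-confidence verdicts")
    if fractions[4] > 0.30:
        alerts.append("possibly overconfident")
    return alerts
```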

=== 4.2 Contradiction Rate ===

**Metric**: % of claims with internal contradictions detected

**Target**: ≤ 5%

**Alert thresholds**:

* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed

**Why it matters**: A high contradiction rate suggests poor evidence quality or logic errors

=== 4.3 User Feedback Ratio ===

**Metric**: Helpful vs. unhelpful user ratings

**Target**: ≥ 70% helpful

**Alert thresholds**:

* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency

**Why it matters**: Direct measure of user satisfaction

=== 4.4 False Positive/Negative Rate ===

**Metric**: When humans review flagged items, how often was AKEL right?

**Measurement**:

* False positive: AKEL flagged an item for review that was actually fine
* False negative: AKEL missed something that should have been flagged

**Targets**:

* False positive rate: ≤ 20%
* False negative rate: ≤ 5%

**Why it matters**: Balance between catching problems and not crying wolf

== 5. System Health Metrics ==

=== 5.1 Uptime ===

**Metric**: % of time the system is available and functional

**Target**: ≥ 99.9% (about 43 minutes of downtime per month)

**Alert**: Immediate notification on any downtime

**Why it matters**: Users expect 24/7 availability

=== 5.2 Error Rate ===

**Metric**: Errors per 1000 requests

**Target**: ≤ 1 error per 1000 requests (0.1%)

**Alert thresholds**:

* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency

**Why it matters**: Errors disrupt user experience
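
The errors-per-1000 bands above follow the same classifier pattern as the other metrics; this is an illustrative sketch, not production code:

```python
def error_alert(errors, requests):
    """Errors per 1000 requests, mapped onto the bands above (target <= 1)."""
    per_1000 = 1000 * errors / requests
    if per_1000 > 10:
        return per_1000, "emergency"
    if per_1000 > 5:
        return per_1000, "investigate"
    if per_1000 > 1:
        return per_1000, "monitor"
    return per_1000, "ok"
```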
| 178 | |||
| 179 | === 5.3 Database Performance === | ||
| 180 | |||
| 181 | **Metrics**: | ||
| 182 | |||
| 183 | * Query response time (P95) | ||
| 184 | * Connection pool utilization | ||
| 185 | * Slow query frequency | ||
| 186 | **Targets**: | ||
| 187 | * P95 query time: ≤ 50ms | ||
| 188 | * Connection pool: ≤ 80% utilized | ||
| 189 | * Slow queries (>1s): ≤ 10 per hour | ||
| 190 | **Why it matters**: Database bottlenecks slow entire system | ||
| 191 | |||
| 192 | === 5.4 Cache Hit Rate === | ||
| 193 | |||
| 194 | **Metric**: % of requests served from cache vs. database | ||
| 195 | **Target**: ≥ 80% | ||
| 196 | **Why it matters**: Higher cache hit rate = faster responses, less DB load | ||
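
Tracking the hit rate against the 80% target can be as simple as a pair of counters; the class name and shape here are ours:

```python
class CacheStats:
    """Hit/miss counters with a check against the >= 80% hit-rate target."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return 100 * self.hits / total if total else 0.0

    def below_target(self):
        return self.hit_rate < 80
```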
| 197 | |||
| 198 | === 5.5 Resource Utilization === | ||
| 199 | |||
| 200 | **Metrics**: | ||
| 201 | |||
| 202 | * CPU utilization | ||
| 203 | * Memory utilization | ||
| 204 | * Disk I/O | ||
| 205 | * Network bandwidth | ||
| 206 | **Targets**: | ||
| 207 | * Average CPU: ≤ 60% | ||
| 208 | * Peak CPU: ≤ 85% | ||
| 209 | * Memory: ≤ 80% | ||
| 210 | * Disk I/O: ≤ 70% | ||
| 211 | **Alert**: Any metric consistently >85% | ||
| 212 | **Why it matters**: Headroom for traffic spikes, prevents resource exhaustion | ||
| 213 | |||
== 6. User Experience Metrics ==

=== 6.1 Time to First Verdict ===

**Metric**: Time from a user submitting a claim to seeing the initial verdict

**Target**: ≤ 15 seconds

**Why it matters**: Shapes the user's perception of speed

=== 6.2 Claim Submission Rate ===

**Metric**: Claims submitted per day/hour

**Monitoring**: Track trends, detect anomalies

**Why it matters**: Reveals usage patterns and informs capacity planning

=== 6.3 User Retention ===

**Metric**: % of users who return after their first visit

**Target**: ≥ 30% (1-week retention)

**Why it matters**: Indicates system usefulness

=== 6.4 Feature Usage ===

**Metrics**:

* % of users who explore evidence
* % who check scenarios
* % who view source track records

**Why it matters**: Shows how users interact with the system

== 7. Metric Dashboard ==

=== 7.1 Real-Time Dashboard ===

**Always visible**:

* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status

**Update frequency**: Every 30 seconds

=== 7.2 Daily Dashboard ===

**Reviewed daily**:

* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary

=== 7.3 Weekly Reports ===

**Reviewed weekly**:

* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues

=== 7.4 Monthly/Quarterly Reports ===

**Comprehensive analysis**:

* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress

== 8. Alert System ==

=== 8.1 Alert Levels ===

**Info**: Metric outside target, but within acceptable range

* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)

**Warning**: Metric outside acceptable range

* Action: Investigate within 24 hours
* Example: Success rate 97% (acceptable >98%)

**Critical**: Metric severely degraded

* Action: Investigate immediately
* Example: Error rate 2% (acceptable <0.5%)

**Emergency**: System failure or severe degradation

* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s

=== 8.2 Alert Channels ===

**Slack/Discord**: All alerts
**Email**: Warning and above
**SMS**: Critical and emergency only
**PagerDuty**: Emergency only
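
The level-to-channel routing above is a simple threshold table; this sketch uses our own names and treats each channel as having a minimum severity it receives:

```python
LEVELS = ["info", "warning", "critical", "emergency"]

CHANNEL_MIN_LEVEL = {            # per the routing rules above
    "slack": "info",             # all alerts
    "email": "warning",          # warning and above
    "sms": "critical",           # critical and emergency only
    "pagerduty": "emergency",    # emergency only
}

def channels_for(level):
    """Channels that should receive an alert of the given level."""
    rank = LEVELS.index(level)
    return sorted(ch for ch, minimum in CHANNEL_MIN_LEVEL.items()
                  if rank >= LEVELS.index(minimum))
```

Expressing the routing as data rather than branching keeps channel changes to a one-line edit.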

=== 8.3 On-Call Rotation ===

**Technical Coordinator**: Primary on-call
**Backup**: Designated team member

**Responsibilities**:

* Respond to alerts within SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents

== 9. Metric-Driven Improvement ==

=== 9.1 Prioritization ===

**Focus improvements on**:

* Metrics furthest from target
* Metrics with the biggest user impact
* Metrics easiest to improve
* Strategic priorities

=== 9.2 Success Criteria ===

**Every improvement project should**:

* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings

**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"

=== 9.3 A/B Testing ===

**When feasible**:

* Run two versions
* Measure metric differences
* Choose based on data
* Roll out the winner
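
"Choose based on data" usually means a significance check before rolling out the winner. One common approach (an assumption on our part; the page does not prescribe a test) is a two-proportion z-test, e.g. comparing helpful-rating rates between variants:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic comparing two success rates (normal approximation).
    |z| > 1.96 is roughly significant at the 5% level."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```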

== 10. Bias and Fairness Metrics ==

=== 10.1 Domain Balance ===

**Metric**: Confidence distribution by domain

**Target**: Similar distributions across domains

**Alert**: One domain shows consistently much lower or higher confidence

**Why it matters**: Guards against systematic domain bias

=== 10.2 Source Type Balance ===

**Metric**: Evidence distribution by source type

**Target**: Diverse source types represented

**Alert**: Over-reliance on one source type

**Why it matters**: Prevents source-type bias

=== 10.3 Geographic Balance ===

**Metric**: Geographic distribution of sources

**Target**: Multiple regions represented

**Alert**: Over-concentration in one region

**Why it matters**: Reduces geographic/cultural bias

== 11. Experimental Metrics ==

**New metrics to test**:

* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs. desktop usage

**Process**:

1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard

== 12. Anti-Patterns ==

**Don't**:

* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient
* ❌ Game metrics (destroys their value)
* ❌ Blame individuals for metric failures
* ❌ Let metrics become the goal (they're tools)

**Do**:

* ✅ Focus on actionable metrics
* ✅ Set ambitious but achievable targets
* ✅ Respond to metric signals
* ✅ Continuously validate that metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around

== 13. Related Pages ==

* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>Archive.FactHarbor 2026\.02\.08.Organisation.Governance.WebHome]] - Quarterly performance reviews

---

**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.