= System Performance Metrics =
**What we monitor to ensure AKEL performs well.**
== 1. Purpose ==
These metrics tell us:
* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?
**Principle**: Measure to improve, not to judge.
== 2. Metric Categories ==
=== 2.1 AKEL Performance ===
**Processing speed and reliability**
=== 2.2 Content Quality ===
**Output quality and user satisfaction**
=== 2.3 System Health ===
**Infrastructure and operational metrics**
=== 2.4 User Experience ===
**How users interact with the system**
== 3. AKEL Performance Metrics ==
=== 3.1 Processing Time ===
**Metric**: Time from claim submission to verdict publication
**Measurements** (see the sketch at the end of this subsection):
* P50 (median): time within which 50% of claims are processed
* P95: time within which 95% of claims are processed
* P99: time within which 99% of claims are processed
**Targets**:
* P50: ≤ 12 seconds
* P95: ≤ 18 seconds
* P99: ≤ 25 seconds
**Alert thresholds**:
* P95 > 20 seconds: Monitor closely
* P95 > 25 seconds: Investigate immediately
* P95 > 30 seconds: Emergency - intervention required
**Why it matters**: Slow processing = poor user experience
**Improvement ideas**:
* Optimize evidence extraction
* Better caching
* Parallel processing
* Database query optimization
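A minimal sketch (illustrative only) of how P50/P95/P99 could be computed from recent processing durations and checked against the thresholds above. The nearest-rank percentile helper and the function names are our own assumptions, not an existing AKEL API.

{{code language="python"}}
# Minimal sketch: compute P50/P95/P99 of claim processing times (seconds)
# and map P95 against the alert thresholds listed above. Names are illustrative.
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; samples must be non-empty."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def processing_time_report(durations_s: list[float]) -> dict:
    p50, p95, p99 = (percentile(durations_s, p) for p in (50, 95, 99))
    if p95 > 30:
        alert = "emergency"      # intervention required
    elif p95 > 25:
        alert = "investigate"    # investigate immediately
    elif p95 > 20:
        alert = "monitor"        # monitor closely
    else:
        alert = "ok"
    return {"p50": p50, "p95": p95, "p99": p99, "alert": alert}

# Example: durations from the last hour of processed claims
print(processing_time_report([8.2, 11.5, 9.9, 14.0, 17.3, 21.8, 12.1]))
{{/code}}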
=== 3.2 Success Rate ===
**Metric**: % of claims successfully processed without errors
**Target**: ≥ 99%
**Alert thresholds**:
* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency
**Common failure causes**:
* Timeout (evidence extraction took too long)
* Parse error (claim text unparsable)
* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)
**Why it matters**: Errors frustrate users and reduce trust
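A minimal sketch of how the success rate and a failure-cause tally could be derived from processing outcomes. The record shape ({"status": ..., "failure_cause": ...}) is an assumption for illustration.

{{code language="python"}}
# Minimal sketch: success rate and failure-cause tally from processing outcomes.
from collections import Counter

def success_report(outcomes: list[dict]) -> dict:
    total = len(outcomes)
    failures = [o for o in outcomes if o["status"] != "ok"]
    causes = Counter(o.get("failure_cause", "unknown") for o in failures)
    rate = 100.0 * (total - len(failures)) / total if total else 100.0
    return {"success_rate_pct": round(rate, 2), "failure_causes": dict(causes)}

print(success_report([
    {"status": "ok"},
    {"status": "error", "failure_cause": "timeout"},
    {"status": "ok"},
]))
{{/code}}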
=== 3.3 Evidence Completeness ===
**Metric**: % of claims where AKEL found sufficient evidence
**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources (see the sketch below)
**Target**: ≥ 80%
**Alert thresholds**:
* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed
**Why it matters**: Incomplete evidence = low confidence verdicts
**Improvement ideas**:
* Better search algorithms
* More source integrations
* Improved relevance scoring
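A minimal sketch of the completeness rule above (≥3 pieces of evidence from ≥2 distinct sources). The evidence record shape and field names are illustrative assumptions.

{{code language="python"}}
# Minimal sketch: share of claims meeting the completeness rule
# (>=3 pieces of evidence from >=2 distinct sources). Field names are illustrative.
def is_complete(evidence: list[dict]) -> bool:
    distinct_sources = {e["source_id"] for e in evidence}
    return len(evidence) >= 3 and len(distinct_sources) >= 2

def completeness_pct(claims: list[list[dict]]) -> float:
    if not claims:
        return 0.0
    complete = sum(1 for evidence in claims if is_complete(evidence))
    return 100.0 * complete / len(claims)

# Example: two claims, one complete and one not
claims = [
    [{"source_id": "a"}, {"source_id": "b"}, {"source_id": "a"}],  # complete
    [{"source_id": "a"}, {"source_id": "a"}],                      # incomplete
]
print(completeness_pct(claims))  # 50.0
{{/code}}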
=== 3.4 Source Diversity ===
**Metric**: Average number of distinct sources per claim
**Target**: ≥ 3.0 sources per claim
**Alert thresholds**:
* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed
**Why it matters**: Multiple sources increase confidence and reduce bias
=== 3.5 Scenario Coverage ===
**Metric**: % of claims with at least one scenario extracted
**Target**: ≥ 75%
**Why it matters**: Scenarios provide context for verdicts
== 4. Content Quality Metrics ==
=== 4.1 Confidence Distribution ===
**Metric**: Distribution of confidence scores across claims
**Target**: Roughly bell-shaped distribution across confidence bands:
* ~10% very low confidence (0.0-0.3)
* ~20% low confidence (0.3-0.5)
* ~40% medium confidence (0.5-0.7)
* ~20% high confidence (0.7-0.9)
* ~10% very high confidence (0.9-1.0)
**Alert thresholds**:
* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias
**Why it matters**: Confidence should reflect actual uncertainty
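A minimal sketch that buckets confidence scores into the five bands above and checks the two percentage alerts. Band edges follow the list above; the function itself is illustrative, not an existing AKEL component.

{{code language="python"}}
# Minimal sketch: bucket confidence scores into the five bands above and
# flag the two percentage-based alert conditions.
def confidence_distribution(scores: list[float]) -> dict:
    bands = {"very_low": 0, "low": 0, "medium": 0, "high": 0, "very_high": 0}
    for s in scores:
        if s < 0.3:
            bands["very_low"] += 1
        elif s < 0.5:
            bands["low"] += 1
        elif s < 0.7:
            bands["medium"] += 1
        elif s < 0.9:
            bands["high"] += 1
        else:
            bands["very_high"] += 1
    total = len(scores) or 1
    shares = {k: 100.0 * v / total for k, v in bands.items()}
    alerts = []
    if shares["very_low"] > 30:
        alerts.append("evidence extraction issues")
    if shares["very_high"] > 30:
        alerts.append("possibly overconfident")
    return {"shares_pct": shares, "alerts": alerts}

print(confidence_distribution([0.2, 0.55, 0.62, 0.71, 0.95, 0.45]))
{{/code}}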
=== 4.2 Contradiction Rate ===
**Metric**: % of claims with internal contradictions detected
**Target**: ≤ 5%
**Alert thresholds**:
* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed
**Why it matters**: High contradiction rate suggests poor evidence quality or logic errors
=== 4.3 User Feedback Ratio ===
**Metric**: Helpful vs unhelpful user ratings
**Target**: ≥ 70% helpful
**Alert thresholds**:
* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency
**Why it matters**: Direct measure of user satisfaction
=== 4.4 False Positive/Negative Rate ===
**Metric**: When humans review flagged items, how often was AKEL right?
**Measurement**:
* False positive: AKEL flagged an item for review, but it was actually fine
* False negative: AKEL missed something that should have been flagged
**Target**:
* False positive rate: ≤ 20%
* False negative rate: ≤ 5%
**Why it matters**: Balance between catching problems and not crying wolf
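A minimal sketch of how the two rates could be computed from human review outcomes, assuming each reviewed item records whether AKEL flagged it and whether the reviewer judged a flag necessary. Field names are illustrative.

{{code language="python"}}
# Minimal sketch: false positive/negative rates from human review outcomes.
# Denominators are reviewed items that were / were not flagged by AKEL.
def review_rates(reviews: list[dict]) -> dict:
    flagged = [r for r in reviews if r["akel_flagged"]]
    unflagged = [r for r in reviews if not r["akel_flagged"]]
    fp = sum(1 for r in flagged if not r["needs_flag"])
    fn = sum(1 for r in unflagged if r["needs_flag"])
    return {
        "false_positive_rate_pct": 100.0 * fp / len(flagged) if flagged else 0.0,
        "false_negative_rate_pct": 100.0 * fn / len(unflagged) if unflagged else 0.0,
    }

print(review_rates([
    {"akel_flagged": True, "needs_flag": True},
    {"akel_flagged": True, "needs_flag": False},   # false positive
    {"akel_flagged": False, "needs_flag": False},
    {"akel_flagged": False, "needs_flag": True},   # false negative
]))
{{/code}}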
== 5. System Health Metrics ==
=== 5.1 Uptime ===
**Metric**: % of time system is available and functional
**Target**: ≥ 99.9% (about 43 minutes of allowed downtime in a 30-day month)
**Alert**: Immediate notification on any downtime
**Why it matters**: Users expect 24/7 availability
=== 5.2 Error Rate ===
**Metric**: Errors per 1000 requests
**Target**: ≤ 1 error per 1000 requests (0.1%)
**Alert thresholds**:
* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency
**Why it matters**: Errors disrupt user experience
=== 5.3 Database Performance ===
**Metrics**:
* Query response time (P95)
* Connection pool utilization
* Slow query frequency
**Targets**:
* P95 query time: ≤ 50ms
* Connection pool: ≤ 80% utilized
* Slow queries (>1s): ≤ 10 per hour
**Why it matters**: Database bottlenecks slow the entire system
=== 5.4 Cache Hit Rate ===
**Metric**: % of requests served from cache vs. database
**Target**: ≥ 80%
**Why it matters**: Higher cache hit rate = faster responses, less DB load
=== 5.5 Resource Utilization ===
**Metrics** (see the sketch below):
* CPU utilization
* Memory utilization
* Disk I/O
* Network bandwidth
**Targets**:
* Average CPU: ≤ 60%
* Peak CPU: ≤ 85%
* Memory: ≤ 80%
* Disk I/O: ≤ 70%
**Alert**: Any metric consistently >85%
**Why it matters**: Headroom for traffic spikes, prevents resource exhaustion
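A minimal sketch of the "consistently >85%" alert, interpreting "consistently" (as an assumption) as a run of consecutive samples above the threshold.

{{code language="python"}}
# Minimal sketch: flag a resource metric that stays above 85%,
# interpreted here as N consecutive samples over the threshold.
def consistently_high(samples_pct: list[float], threshold: float = 85.0, run: int = 5) -> bool:
    streak = 0
    for value in samples_pct:
        streak = streak + 1 if value > threshold else 0
        if streak >= run:
            return True
    return False

cpu = [72, 88, 91, 87, 90, 89, 86]   # percent utilization samples
print(consistently_high(cpu))        # True: six consecutive samples above 85%
{{/code}}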
== 6. User Experience Metrics ==
=== 6.1 Time to First Verdict ===
**Metric**: Time from claim submission to the user seeing an initial verdict
**Target**: ≤ 15 seconds
**Why it matters**: User perception of speed
=== 6.2 Claim Submission Rate ===
**Metric**: Claims submitted per day/hour
**Monitoring**: Track trends, detect anomalies
**Why it matters**: Understand usage patterns, capacity planning
=== 6.3 User Retention ===
**Metric**: % of users who return after first visit
**Target**: ≥ 30% (1-week retention)
**Why it matters**: Indicates system usefulness
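A minimal sketch of 1-week retention, read here as the share of new users who return within 7 days of their first visit. The data shapes and names are illustrative assumptions.

{{code language="python"}}
# Minimal sketch: 1-week retention = share of new users who come back
# within 7 days of their first visit.
from datetime import date

def one_week_retention(first_visits: dict[str, date], visits: list[tuple[str, date]]) -> float:
    returned = set()
    for user, day in visits:
        first = first_visits.get(user)
        if first and 0 < (day - first).days <= 7:
            returned.add(user)
    return 100.0 * len(returned) / len(first_visits) if first_visits else 0.0

first = {"u1": date(2025, 6, 1), "u2": date(2025, 6, 1)}
later = [("u1", date(2025, 6, 4)), ("u2", date(2025, 6, 20))]
print(one_week_retention(first, later))  # 50.0
{{/code}}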
=== 6.4 Feature Usage ===
**Metrics**:
* % of users who explore evidence
* % who check scenarios
* % who view source track records
**Why it matters**: Understand how users interact with the system
== 7. Metric Dashboard ==
=== 7.1 Real-Time Dashboard ===
**Always visible**:
* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status
**Update frequency**: Every 30 seconds
=== 7.2 Daily Dashboard ===
**Reviewed daily**:
* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary
=== 7.3 Weekly Reports ===
**Reviewed weekly**:
* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues
=== 7.4 Monthly/Quarterly Reports ===
**Comprehensive analysis**:
* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress
== 8. Alert System ==
=== 8.1 Alert Levels ===
**Info**: Metric outside target, but within acceptable range
* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)
**Warning**: Metric outside acceptable range
* Action: Investigate within 24 hours
* Example: Success rate 97% (acceptable >98%)
**Critical**: Metric severely degraded
* Action: Investigate immediately
* Example: Error rate 0.8% (acceptable <0.5%)
**Emergency**: System failure or severe degradation
* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s
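A minimal sketch of mapping a metric reading onto these levels for a "higher is worse" metric. The threshold dictionary is an illustrative assumption, shown here with the P95 processing-time bands from section 3.1.

{{code language="python"}}
# Minimal sketch: classify a metric reading into the alert levels above.
def alert_level(value: float, bands: dict[str, float]) -> str:
    """bands maps level name -> lower bound that triggers it (higher is worse)."""
    for level in ("emergency", "critical", "warning", "info"):
        if level in bands and value > bands[level]:
            return level
    return "ok"

# P95 processing-time bands from section 3.1 (seconds)
p95_bands = {"info": 18, "warning": 20, "critical": 25, "emergency": 30}
print(alert_level(19.0, p95_bands))   # info
print(alert_level(27.5, p95_bands))   # critical
{{/code}}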
=== 8.2 Alert Channels ===
**Slack/Discord**: All alerts
**Email**: Warning and above
**SMS**: Critical and emergency only
**PagerDuty**: Emergency only
=== 8.3 On-Call Rotation ===
**Technical Coordinator**: Primary on-call
**Backup**: Designated team member
**Responsibilities**:
* Respond to alerts within SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents
== 9. Metric-Driven Improvement ==
=== 9.1 Prioritization ===
**Focus improvements on**:
* Metrics furthest from target
* Metrics with biggest user impact
* Metrics easiest to improve
* Strategic priorities
=== 9.2 Success Criteria ===
**Every improvement project should**:
* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings
**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"
=== 9.3 A/B Testing ===
**When feasible**:
* Run two versions
* Measure metric differences
* Choose based on data
* Roll out winner
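When the metric is a rate (e.g. % helpful ratings), a two-proportion z-test is one common way to check whether the difference between variants is more than noise. This is a generic statistical sketch, not a prescribed FactHarbor procedure; the counts are made up.

{{code language="python"}}
# Minimal sketch: compare a rate metric between variants A and B
# with a two-proportion z-test (|z| > 1.96 ~ significant at the 5% level).
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(success_a=700, n_a=1000, success_b=745, n_b=1000)
print(round(z, 2), "significant" if abs(z) > 1.96 else "not significant")
{{/code}}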
== 10. Bias and Fairness Metrics ==
=== 10.1 Domain Balance ===
**Metric**: Confidence distribution by domain
**Target**: Similar distributions across domains
**Alert**: One domain consistently much lower/higher confidence
**Why it matters**: Ensure no systematic domain bias
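A minimal sketch of one possible check: flag domains whose mean confidence deviates from the overall mean by more than a margin. The 0.1 margin and the data shape are illustrative assumptions.

{{code language="python"}}
# Minimal sketch: flag domains whose mean confidence deviates from the
# overall mean by more than a chosen margin.
from collections import defaultdict

def domain_balance(records: list[tuple[str, float]], margin: float = 0.1) -> list[str]:
    if not records:
        return []
    by_domain = defaultdict(list)
    for domain, confidence in records:
        by_domain[domain].append(confidence)
    overall = sum(c for _, c in records) / len(records)
    return [
        d for d, scores in by_domain.items()
        if abs(sum(scores) / len(scores) - overall) > margin
    ]

data = [("health", 0.72), ("health", 0.68), ("politics", 0.41), ("politics", 0.45)]
print(domain_balance(data))  # ['health', 'politics'] - both far from the 0.565 overall mean
{{/code}}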
=== 10.2 Source Type Balance ===
**Metric**: Evidence distribution by source type
**Target**: Diverse source types represented
**Alert**: Over-reliance on one source type
**Why it matters**: Prevent source type bias
=== 10.3 Geographic Balance ===
**Metric**: Source geographic distribution
**Target**: Multiple regions represented
**Alert**: Over-concentration in one region
**Why it matters**: Reduce geographic/cultural bias
== 11. Experimental Metrics ==
**New metrics to test**:
* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs desktop usage
**Process**:
1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard
== 12. Anti-Patterns ==
**Don't**:
* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient
* ❌ Game metrics (destroys their value)
* ❌ Blame individuals for metric failures
* ❌ Let metrics become the goal (they're tools)
**Do**:
* ✅ Focus on actionable metrics
* ✅ Set ambitious but achievable targets
* ✅ Respond to metric signals
* ✅ Continuously validate metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around
== 13. Related Pages ==
* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>FactHarbor.Organisation.Governance.WebHome]] - Quarterly performance reviews
----
**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.