Changes for page System Performance Metrics
Last modified by Robert Schaub on 2026/02/08 08:32
From version 1.3
edited by Robert Schaub
on 2026/02/08 08:32
Change comment:
Update document after refactoring.
Summary

Page properties (2 modified, 0 added, 0 removed)

Details

* Parent: changed from WebHome to FactHarbor.Specification.WebHome
* Content: updated after refactoring (new version shown below)
= System Performance Metrics =

**What we monitor to ensure AKEL performs well.**

== 1. Purpose ==

These metrics tell us:

* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?

**Principle**: Measure to improve, not to judge.

== 2. Metric Categories ==

=== 2.1 AKEL Performance ===

**Processing speed and reliability**

=== 2.2 Content Quality ===

**Output quality and user satisfaction**

=== 2.3 System Health ===

**Infrastructure and operational metrics**

=== 2.4 User Experience ===

**How users interact with the system**

== 3. AKEL Performance Metrics ==

=== 3.1 Processing Time ===

**Metric**: Time from claim submission to verdict publication
**Measurements**:

* P50 (median): 50% of claims processed within X seconds
* P95: 95% of claims processed within Y seconds
* P99: 99% of claims processed within Z seconds

...

* Better caching
* Parallel processing
* Database query optimization

=== 3.2 Success Rate ===

**Metric**: % of claims successfully processed without errors
**Target**: ≥ 99%
**Alert thresholds**:

* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency
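The success-rate thresholds above map naturally to discrete alert levels. A minimal Python sketch (the function name and level labels are illustrative, not part of the specification; the cutoffs are taken from the table above):

```python
def classify_success_rate(rate_percent: float) -> str:
    """Map a success-rate percentage to the alert levels of section 3.2."""
    if rate_percent >= 99.0:
        return "ok"            # meets the >= 99% target
    if rate_percent >= 98.0:
        return "monitor"       # 98-99%
    if rate_percent >= 95.0:
        return "investigate"   # 95-98%
    return "emergency"         # < 95%
```

For example, a success rate of 97.2% would be classified as "investigate".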
...

* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)

**Why it matters**: Errors frustrate users and reduce trust

=== 3.3 Evidence Completeness ===

**Metric**: % of claims where AKEL found sufficient evidence
**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources
**Target**: ≥ 80%
**Alert thresholds**:

* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed

...

* Better search algorithms
* More source integrations
* Improved relevance scoring

=== 3.4 Source Diversity ===

**Metric**: Average number of distinct sources per claim
**Target**: ≥ 3.0 sources per claim
**Alert thresholds**:

* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed

**Why it matters**: Multiple sources increase confidence and reduce bias

=== 3.5 Scenario Coverage ===

**Metric**: % of claims with at least one scenario extracted
**Target**: ≥ 75%
**Why it matters**: Scenarios provide context for verdicts
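The evidence-completeness rule in section 3.3 (≥3 pieces of evidence from ≥2 distinct sources) can be sketched as follows. This is an illustrative Python fragment, not AKEL's actual implementation; a claim is represented simply as the list of source names backing its evidence:

```python
def is_evidence_sufficient(evidence_sources: list[str]) -> bool:
    """Section 3.3 rule: at least 3 pieces of evidence,
    drawn from at least 2 distinct sources."""
    return len(evidence_sources) >= 3 and len(set(evidence_sources)) >= 2

def completeness_rate(claims: list[list[str]]) -> float:
    """Percentage of claims meeting the sufficiency rule (target: >= 80%)."""
    if not claims:
        return 0.0
    return 100.0 * sum(is_evidence_sufficient(c) for c in claims) / len(claims)
```

Note that three pieces of evidence from a single source do not qualify: `is_evidence_sufficient(["a", "a", "a"])` is false.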
== 4. Content Quality Metrics ==

=== 4.1 Confidence Distribution ===

**Metric**: Distribution of confidence scores across claims
**Target**: Roughly normal distribution

* ~10% very low confidence (0.0-0.3)
* ~20% low confidence (0.3-0.5)
* ~40% medium confidence (0.5-0.7)
* ~20% high confidence (0.7-0.9)
* ~10% very high confidence (0.9-1.0)

**Alert thresholds**:

* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias

**Why it matters**: Confidence should reflect actual uncertainty

=== 4.2 Contradiction Rate ===

**Metric**: % of claims with internal contradictions detected
**Target**: ≤ 5%
**Alert thresholds**:

* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed

**Why it matters**: A high contradiction rate suggests poor evidence quality or logic errors

=== 4.3 User Feedback Ratio ===

**Metric**: Helpful vs. unhelpful user ratings
**Target**: ≥ 70% helpful
**Alert thresholds**:

* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency

**Why it matters**: Direct measure of user satisfaction

=== 4.4 False Positive/Negative Rate ===

**Metric**: When humans review flagged items, how often was AKEL right?
**Measurement**:

* False positive: AKEL flagged for review, but the item was actually fine
* False negative: AKEL missed something that should have been flagged

**Target**:
* False positive rate: ≤ 20%
* False negative rate: ≤ 5%

**Why it matters**: Balance between catching problems and not crying wolf

== 5. System Health Metrics ==

=== 5.1 Uptime ===

**Metric**: % of time the system is available and functional
**Target**: ≥ 99.9% (less than 45 minutes of downtime per month)
**Alert**: Immediate notification on any downtime
**Why it matters**: Users expect 24/7 availability

=== 5.2 Error Rate ===

**Metric**: Errors per 1000 requests
**Target**: ≤ 1 error per 1000 requests (0.1%)
**Alert thresholds**:

* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency

**Why it matters**: Errors disrupt the user experience

=== 5.3 Database Performance ===

**Metrics**:

* Query response time (P95)
* Connection pool utilization
* Slow query frequency

...

* Connection pool: ≤ 80% utilized
* Slow queries (>1s): ≤ 10 per hour

**Why it matters**: Database bottlenecks slow the entire system

=== 5.4 Cache Hit Rate ===

**Metric**: % of requests served from cache vs. database
**Target**: ≥ 80%
**Why it matters**: A higher cache hit rate means faster responses and less database load

=== 5.5 Resource Utilization ===

**Metrics**:

* CPU utilization
* Memory utilization
* Disk I/O

...

* Disk I/O: ≤ 70%

**Alert**: Any metric consistently >85%
**Why it matters**: Headroom for traffic spikes prevents resource exhaustion
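The uptime target in section 5.1 above implies a concrete monthly downtime budget. A quick Python sketch of the arithmetic (the function name is illustrative; a 30-day month is assumed):

```python
def monthly_downtime_budget_minutes(uptime_target: float, days: int = 30) -> float:
    """Allowed downtime per month for a given uptime target (section 5.1)."""
    return (1.0 - uptime_target) * days * 24 * 60
```

A 99.9% target over a 30-day month leaves about 43.2 minutes of allowed downtime, consistent with the "less than 45 minutes per month" figure above.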
== 6. User Experience Metrics ==

=== 6.1 Time to First Verdict ===

**Metric**: Time from a user submitting a claim to seeing an initial verdict
**Target**: ≤ 15 seconds
**Why it matters**: User perception of speed

=== 6.2 Claim Submission Rate ===

**Metric**: Claims submitted per day/hour
**Monitoring**: Track trends, detect anomalies
**Why it matters**: Understand usage patterns, capacity planning

=== 6.3 User Retention ===

**Metric**: % of users who return after their first visit
**Target**: ≥ 30% (1-week retention)
**Why it matters**: Indicates system usefulness

=== 6.4 Feature Usage ===

**Metrics**:

* % of users who explore evidence
* % who check scenarios
* % who view source track records

**Why it matters**: Understand how users interact with the system

== 7. Metric Dashboard ==

=== 7.1 Real-Time Dashboard ===

**Always visible**:

* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status

**Update frequency**: Every 30 seconds

=== 7.2 Daily Dashboard ===

**Reviewed daily**:

* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary

=== 7.3 Weekly Reports ===

**Reviewed weekly**:

* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues

=== 7.4 Monthly/Quarterly Reports ===

**Comprehensive analysis**:

* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress
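The 1-week retention metric in section 6.3 above reduces to simple set arithmetic. A minimal sketch, assuming users are identified by opaque IDs (the function name and data shapes are illustrative):

```python
def one_week_retention(first_time_users: set[str], returned_within_week: set[str]) -> float:
    """Section 6.3: % of first-time users who return within one week
    (target: >= 30%)."""
    if not first_time_users:
        return 0.0
    retained = first_time_users & returned_within_week
    return 100.0 * len(retained) / len(first_time_users)
```

For example, if 2 of 4 first-time users come back within a week, the 1-week retention is 50%.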
== 8. Alert System ==

=== 8.1 Alert Levels ===

**Info**: Metric outside target, but within acceptable range

* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)

**Warning**: Metric outside acceptable range

...

**Emergency**: System failure or severe degradation

* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s

=== 8.2 Alert Channels ===

**Slack/Discord**: All alerts
**Email**: Warning and above
**SMS**: Critical and emergency only
**PagerDuty**: Emergency only

=== 8.3 On-Call Rotation ===

**Technical Coordinator**: Primary on-call
**Backup**: Designated team member
**Responsibilities**:

* Respond to alerts within the SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents

== 9. Metric-Driven Improvement ==

=== 9.1 Prioritization ===

**Focus improvements on**:

* Metrics furthest from target
* Metrics with the biggest user impact
* Metrics easiest to improve
* Strategic priorities

=== 9.2 Success Criteria ===

**Every improvement project should**:

* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings

**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"

=== 9.3 A/B Testing ===

**When feasible**:

* Run two versions
* Measure metric differences
* Choose based on data
* Roll out the winner
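The channel routing in section 8.2 above is cumulative by severity: every level reaches Slack/Discord, and each higher level adds channels. A Python sketch of that routing (level ordering and channel names follow the table; the function itself is illustrative):

```python
# Severity order, least to most severe, per section 8.1.
LEVELS = ["info", "warning", "critical", "emergency"]

def channels_for(level: str) -> list[str]:
    """Return the notification channels for an alert level (section 8.2)."""
    severity = LEVELS.index(level)
    channels = ["slack"]              # Slack/Discord: all alerts
    if severity >= 1:
        channels.append("email")      # Email: warning and above
    if severity >= 2:
        channels.append("sms")        # SMS: critical and emergency only
    if severity >= 3:
        channels.append("pagerduty")  # PagerDuty: emergency only
    return channels
```

So a warning goes to Slack and email, while an emergency fans out to all four channels.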
== 10. Bias and Fairness Metrics ==

=== 10.1 Domain Balance ===

**Metric**: Confidence distribution by domain
**Target**: Similar distributions across domains
**Alert**: One domain consistently much lower/higher confidence
**Why it matters**: Ensure no systematic domain bias

=== 10.2 Source Type Balance ===

**Metric**: Evidence distribution by source type
**Target**: Diverse source types represented
**Alert**: Over-reliance on one source type
**Why it matters**: Prevent source type bias

=== 10.3 Geographic Balance ===

**Metric**: Source geographic distribution
**Target**: Multiple regions represented
**Alert**: Over-concentration in one region
**Why it matters**: Reduce geographic/cultural bias

== 11. Experimental Metrics ==

**New metrics to test**:

* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs. desktop usage

**Process**:

1. Define the metric hypothesis
1. Implement tracking
1. Collect data for 1 month
1. Evaluate usefulness
1. Add to the standard set or discard

== 12. Anti-Patterns ==

**Don't**:

* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient

...

* ✅ Continuously validate metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around
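The domain-balance alert in section 10.1 above ("one domain consistently much lower/higher confidence") can be approximated by comparing each domain's mean confidence against the overall mean. A minimal sketch; the 0.15 gap threshold is an illustrative assumption, not a value from this specification:

```python
from statistics import mean

def flag_domain_bias(conf_by_domain: dict[str, list[float]],
                     max_gap: float = 0.15) -> list[str]:
    """Flag domains whose mean confidence deviates from the overall mean
    by more than max_gap (illustrative threshold), per section 10.1."""
    all_scores = [c for scores in conf_by_domain.values() for c in scores]
    overall = mean(all_scores)
    return [domain for domain, scores in conf_by_domain.items()
            if abs(mean(scores) - overall) > max_gap]
```

In practice one would compare full distributions rather than means (for example with a two-sample test), but the mean-gap check is a cheap first alarm.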
== 13. Related Pages ==

* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>FactHarbor.Organisation.Governance.WebHome]] - Quarterly performance reviews

---

**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.