= System Performance Metrics =
**What we monitor to ensure AKEL performs well.**
== 1. Purpose ==
These metrics tell us:
* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?
**Principle**: Measure to improve, not to judge.
== 2. Metric Categories ==
=== 2.1 AKEL Performance ===
**Processing speed and reliability**
=== 2.2 Content Quality ===
**Output quality and user satisfaction**
=== 2.3 System Health ===
**Infrastructure and operational metrics**
=== 2.4 User Experience ===
**How users interact with the system**
== 3. AKEL Performance Metrics ==
=== 3.1 Processing Time ===
**Metric**: Time from claim submission to verdict publication
**Measurements**:
* P50 (median): 50% of claims processed within X seconds
* P95: 95% of claims processed within Y seconds
* P99: 99% of claims processed within Z seconds
**Targets**:
* P50: ≤ 12 seconds
* P95: ≤ 18 seconds
* P99: ≤ 25 seconds
**Alert thresholds**:
* P95 > 20 seconds: Monitor closely
* P95 > 25 seconds: Investigate immediately
* P95 > 30 seconds: Emergency - intervention required
**Why it matters**: Slow processing = poor user experience
**Improvement ideas**:
* Optimize evidence extraction
* Better caching
* Parallel processing
* Database query optimization
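The P50/P95/P99 figures above could be computed from raw processing times roughly as follows. This is a minimal sketch using the nearest-rank convention; the function and the sample data are illustrative, not part of AKEL.

{{code language="python"}}
import math

def percentile(values, p):
    """Nearest-rank percentile: smallest value such that at least p% of samples are <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical processing times (seconds) for recently completed claims
processing_times = [8.2, 9.1, 10.4, 11.0, 12.3, 13.8, 15.2, 17.9, 19.5, 24.1]

p50 = percentile(processing_times, 50)
p95 = percentile(processing_times, 95)
p99 = percentile(processing_times, 99)

print(f"P50={p50}s  P95={p95}s  P99={p99}s  (targets: 12 / 18 / 25)")
{{/code}}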
=== 3.2 Success Rate ===
**Metric**: % of claims successfully processed without errors
**Target**: ≥ 99%
**Alert thresholds**:
* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency
**Common failure causes**:
* Timeout (evidence extraction took too long)
* Parse error (claim text unparsable)
* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)
**Why it matters**: Errors frustrate users and reduce trust
=== 3.3 Evidence Completeness ===
**Metric**: % of claims where AKEL found sufficient evidence
**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources
**Target**: ≥ 80%
**Alert thresholds**:
* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed
**Why it matters**: Incomplete evidence = low confidence verdicts
**Improvement ideas**:
* Better search algorithms
* More source integrations
* Improved relevance scoring
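As a rough illustration of the completeness rule above (≥3 pieces of evidence from ≥2 distinct sources), a single claim could be checked like this; the field names and data shape are assumptions made for the sketch, not AKEL's actual model.

{{code language="python"}}
def has_sufficient_evidence(evidence_items, min_items=3, min_sources=2):
    """Completeness rule: at least 3 evidence items drawn from at least 2 distinct sources."""
    distinct_sources = {item["source"] for item in evidence_items}
    return len(evidence_items) >= min_items and len(distinct_sources) >= min_sources

# Hypothetical evidence list for one claim
claim_evidence = [
    {"source": "journal-a", "excerpt": "..."},
    {"source": "newswire-b", "excerpt": "..."},
    {"source": "journal-a", "excerpt": "..."},
]

print(has_sufficient_evidence(claim_evidence))  # True: 3 items, 2 distinct sources
{{/code}}

The Evidence Completeness metric is then simply the share of processed claims for which this check passes.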
=== 3.4 Source Diversity ===
**Metric**: Average number of distinct sources per claim
**Target**: ≥ 3.0 sources per claim
**Alert thresholds**:
* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed
**Why it matters**: Multiple sources increase confidence and reduce bias
=== 3.5 Scenario Coverage ===
**Metric**: % of claims with at least one scenario extracted
**Target**: ≥ 75%
**Why it matters**: Scenarios provide context for verdicts
== 4. Content Quality Metrics ==
=== 4.1 Confidence Distribution ===
**Metric**: Distribution of confidence scores across claims
**Target**: Roughly normal distribution
* ~10% very low confidence (0.0-0.3)
* ~20% low confidence (0.3-0.5)
* ~40% medium confidence (0.5-0.7)
* ~20% high confidence (0.7-0.9)
* ~10% very high confidence (0.9-1.0)
**Alert thresholds**:
* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias
**Why it matters**: Confidence should reflect actual uncertainty
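One way to monitor this is to bin confidence scores into the five buckets above and alert when a bucket's share drifts past the thresholds. A minimal sketch; the bucket boundaries and the skew check follow the values listed above, everything else is illustrative.

{{code language="python"}}
# Bucket boundaries follow the list above; scores on a boundary fall into the higher-named bucket
BUCKETS = [
    ("very low", 0.0, 0.3),
    ("low", 0.3, 0.5),
    ("medium", 0.5, 0.7),
    ("high", 0.7, 0.9),
    ("very high", 0.9, 1.01),  # upper bound slightly above 1.0 so a score of 1.0 is counted
]

def confidence_distribution(scores):
    """Return the share of scores falling into each confidence bucket."""
    total = len(scores)
    shares = {}
    for name, low, high in BUCKETS:
        count = sum(1 for s in scores if low <= s < high)
        shares[name] = count / total if total else 0.0
    return shares

scores = [0.12, 0.45, 0.55, 0.62, 0.68, 0.71, 0.74, 0.83, 0.91, 0.66]
dist = confidence_distribution(scores)
if dist["very low"] > 0.30 or dist["very high"] > 0.30:
    print("Alert: confidence distribution looks skewed", dist)
{{/code}}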
=== 4.2 Contradiction Rate ===
**Metric**: % of claims with internal contradictions detected
**Target**: ≤ 5%
**Alert thresholds**:
* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed
**Why it matters**: High contradiction rate suggests poor evidence quality or logic errors
=== 4.3 User Feedback Ratio ===
**Metric**: Share of user ratings marked helpful vs. unhelpful
**Target**: ≥ 70% helpful
**Alert thresholds**:
* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency
**Why it matters**: Direct measure of user satisfaction
=== 4.4 False Positive/Negative Rate ===
**Metric**: How accurately AKEL flags items for human review, judged against human review outcomes
**Measurement**:
* False positive: AKEL flagged an item for review that turned out to be fine
* False negative: AKEL missed an item that should have been flagged
**Target**:
* False positive rate: ≤ 20%
* False negative rate: ≤ 5%
**Why it matters**: Balance between catching problems and not crying wolf
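A minimal sketch of how these rates could be derived from human review outcomes. The record fields and the choice of denominators (flagged items for the false positive rate, unflagged reviewed items for the false negative rate) are assumptions made for illustration.

{{code language="python"}}
def flagging_error_rates(reviews):
    """Compute false positive and false negative rates from human review outcomes.

    Each review record says whether AKEL flagged the item and whether the human
    reviewer judged that it actually needed attention (hypothetical field names).
    """
    flagged = [r for r in reviews if r["akel_flagged"]]
    not_flagged = [r for r in reviews if not r["akel_flagged"]]

    false_positives = sum(1 for r in flagged if not r["needed_attention"])
    false_negatives = sum(1 for r in not_flagged if r["needed_attention"])

    fp_rate = false_positives / len(flagged) if flagged else 0.0
    fn_rate = false_negatives / len(not_flagged) if not_flagged else 0.0
    return fp_rate, fn_rate  # targets above: <= 0.20 and <= 0.05
{{/code}}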
== 5. System Health Metrics ==
=== 5.1 Uptime ===
**Metric**: % of time system is available and functional
**Target**: ≥ 99.9% (less than 45 minutes downtime per month)
**Alert**: Immediate notification on any downtime
**Why it matters**: Users expect 24/7 availability
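For reference, the 99.9% target translates into a monthly downtime budget of roughly 43 minutes for a 30-day month, which is where "less than 45 minutes" comes from:

{{code language="python"}}
def monthly_downtime_budget_minutes(uptime_target, days_in_month=30):
    """Allowed downtime per month for a given uptime target."""
    minutes_in_month = days_in_month * 24 * 60
    return (1 - uptime_target) * minutes_in_month

print(monthly_downtime_budget_minutes(0.999))  # ~43.2 minutes
{{/code}}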
=== 5.2 Error Rate ===
**Metric**: Errors per 1000 requests
**Target**: ≤ 1 error per 1000 requests (0.1%)
**Alert thresholds**:
* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency
**Why it matters**: Errors disrupt user experience
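A minimal sketch of the errors-per-1000-requests calculation and the alert bands above; the function names and the example counts are illustrative.

{{code language="python"}}
def errors_per_thousand(error_count, request_count):
    """Errors per 1000 requests over a monitoring window."""
    if request_count == 0:
        return 0.0
    return error_count / request_count * 1000

def error_rate_band(rate):
    """Map the rate onto the alert bands listed above."""
    if rate <= 1:
        return "ok"
    if rate <= 5:
        return "monitor"
    if rate <= 10:
        return "investigate"
    return "emergency"

rate = errors_per_thousand(error_count=42, request_count=18_000)
print(round(rate, 2), error_rate_band(rate))  # 2.33 -> "monitor"
{{/code}}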
=== 5.3 Database Performance ===
**Metrics**:
* Query response time (P95)
* Connection pool utilization
* Slow query frequency
**Targets**:
* P95 query time: ≤ 50ms
* Connection pool: ≤ 80% utilized
* Slow queries (>1s): ≤ 10 per hour
**Why it matters**: Database bottlenecks slow entire system
=== 5.4 Cache Hit Rate ===
**Metric**: % of requests served from cache vs. database
**Target**: ≥ 80%
**Why it matters**: Higher cache hit rate = faster responses, less DB load
=== 5.5 Resource Utilization ===
**Metrics**:
* CPU utilization
* Memory utilization
* Disk I/O
* Network bandwidth
**Targets**:
* Average CPU: ≤ 60%
* Peak CPU: ≤ 85%
* Memory: ≤ 80%
* Disk I/O: ≤ 70%
**Alert**: Any metric consistently >85%
**Why it matters**: Headroom for traffic spikes, prevents resource exhaustion
== 6. User Experience Metrics ==
=== 6.1 Time to First Verdict ===
**Metric**: Time from user submitting claim to seeing initial verdict
**Target**: ≤ 15 seconds
**Why it matters**: User perception of speed
=== 6.2 Claim Submission Rate ===
**Metric**: Claims submitted per day/hour
**Monitoring**: Track trends, detect anomalies
**Why it matters**: Understand usage patterns, capacity planning
=== 6.3 User Retention ===
**Metric**: % of users who return after first visit
**Target**: ≥ 30% (1-week retention)
**Why it matters**: Indicates system usefulness
=== 6.4 Feature Usage ===
**Metrics**:
* % of users who explore evidence
* % who check scenarios
* % who view source track records
**Why it matters**: Understand how users interact with system
== 7. Metric Dashboard ==
=== 7.1 Real-Time Dashboard ===
**Always visible**:
* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status
**Update frequency**: Every 30 seconds
=== 7.2 Daily Dashboard ===
**Reviewed daily**:
* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary
=== 7.3 Weekly Reports ===
**Reviewed weekly**:
* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues
=== 7.4 Monthly/Quarterly Reports ===
**Comprehensive analysis**:
* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress
== 8. Alert System ==
=== 8.1 Alert Levels ===
**Info**: Metric outside target, but within acceptable range
* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)
**Warning**: Metric outside acceptable range
* Action: Investigate within 24 hours
* Example: Success rate 97% (acceptable >98%)
**Critical**: Metric severely degraded
* Action: Investigate immediately
* Example: Error rate 2% (acceptable <0.5%)
**Emergency**: System failure or severe degradation
* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s
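As a sketch of how these levels could be applied mechanically to a single metric, here is P95 processing time mapped onto the levels above using the thresholds from section 3.1; the function is illustrative, and the level assigned to the 20-25s band is an assumption.

{{code language="python"}}
def p95_alert_level(p95_seconds):
    """Map P95 processing time onto the alert levels above (thresholds from section 3.1)."""
    if p95_seconds <= 18:
        return None          # within target, no alert
    if p95_seconds <= 20:
        return "info"        # outside target, still acceptable (the 19s example above)
    if p95_seconds <= 25:
        return "warning"     # 3.1: "monitor closely"
    if p95_seconds <= 30:
        return "critical"    # 3.1: "investigate immediately"
    return "emergency"       # 3.1: intervention required (the P95 >30s example above)

print(p95_alert_level(19))   # "info"
print(p95_alert_level(32))   # "emergency"
{{/code}}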
=== 8.2 Alert Channels ===
**Slack/Discord**: All alerts
**Email**: Warning and above
**SMS**: Critical and emergency only
**PagerDuty**: Emergency only
=== 8.3 On-Call Rotation ===
**Technical Coordinator**: Primary on-call
**Backup**: Designated team member
**Responsibilities**:
* Respond to alerts within SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents
== 9. Metric-Driven Improvement ==
=== 9.1 Prioritization ===
**Focus improvements on**:
* Metrics furthest from target
* Metrics with biggest user impact
* Metrics easiest to improve
* Strategic priorities
=== 9.2 Success Criteria ===
**Every improvement project should**:
* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings
**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"
=== 9.3 A/B Testing ===
**When feasible**:
* Run two versions
* Measure metric differences
* Choose based on data
* Roll out winner
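When such a comparison runs, the decision can come down to comparing the target metric per variant, for example the helpful-rating share from section 4.3. A minimal sketch with hypothetical data; a real rollout decision should also consider sample size and statistical significance.

{{code language="python"}}
def helpful_share(ratings):
    """Share of user ratings marked helpful for one variant."""
    return sum(ratings) / len(ratings) if ratings else 0.0

# 1 = rated helpful, 0 = rated unhelpful (hypothetical data)
variant_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
variant_b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

share_a, share_b = helpful_share(variant_a), helpful_share(variant_b)
winner = "B" if share_b > share_a else "A"
print(f"A: {share_a:.0%}  B: {share_b:.0%}  -> roll out variant {winner}")
{{/code}}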
== 10. Bias and Fairness Metrics ==
=== 10.1 Domain Balance ===
**Metric**: Confidence distribution by domain
**Target**: Similar distributions across domains
**Alert**: One domain consistently much lower/higher confidence
**Why it matters**: Ensure no systematic domain bias
=== 10.2 Source Type Balance ===
**Metric**: Evidence distribution by source type
**Target**: Diverse source types represented
**Alert**: Over-reliance on one source type
**Why it matters**: Prevent source type bias
=== 10.3 Geographic Balance ===
**Metric**: Source geographic distribution
**Target**: Multiple regions represented
**Alert**: Over-concentration in one region
**Why it matters**: Reduce geographic/cultural bias
== 11. Experimental Metrics ==
**New metrics to test**:
* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs desktop usage
**Process**:
1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard
== 12. Anti-Patterns ==
**Don't**:
* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient
* ❌ Game metrics (destroys their value)
* ❌ Blame individuals for metric failures
* ❌ Let metrics become the goal (they're tools)
**Do**:
* ✅ Focus on actionable metrics
* ✅ Set ambitious but achievable targets
* ✅ Respond to metric signals
* ✅ Continuously validate metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around
== 13. Related Pages ==
* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>FactHarbor.Organisation.Governance.WebHome]] - Quarterly performance reviews
---
**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.