System Performance Metrics
What we monitor to ensure AKEL performs well.
1. Purpose
These metrics tell us:
- ✅ Is AKEL performing within acceptable ranges?
- ✅ Where should we focus improvement efforts?
- ✅ When do humans need to intervene?
- ✅ Are our changes improving things?
Principle: Measure to improve, not to judge.
2. Metric Categories
2.1 AKEL Performance
Processing speed and reliability
2.2 Content Quality
Output quality and user satisfaction
2.3 System Health
Infrastructure and operational metrics
2.4 User Experience
How users interact with the system
3. AKEL Performance Metrics
3.1 Processing Time
Metric: Time from claim submission to verdict publication
Measurements:
- P50 (median): 50% of claims processed within X seconds
- P95: 95% of claims processed within Y seconds
- P99: 99% of claims processed within Z seconds
Targets:
- P50: ≤ 12 seconds
- P95: ≤ 18 seconds
- P99: ≤ 25 seconds
Alert thresholds:
- P95 > 20 seconds: Monitor closely
- P95 > 25 seconds: Investigate immediately
- P95 > 30 seconds: Emergency - intervention required
Why it matters: Slow processing = poor user experience
Improvement ideas:
- Optimize evidence extraction
- Better caching
- Parallel processing
- Database query optimization
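For illustration, the percentile targets above could be computed from a log of per-claim processing times. A minimal sketch in Python; the function name and sample values are ours, not part of AKEL's actual schema:

```python
import statistics

def processing_time_percentiles(durations_s: list[float]) -> dict[str, float]:
    """Compute P50/P95/P99 from a list of per-claim processing times (seconds)."""
    if not durations_s:
        return {}
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99
    cuts = statistics.quantiles(durations_s, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Example: check the P95 target from section 3.1
percentiles = processing_time_percentiles([8.2, 11.5, 9.7, 14.1, 22.3, 10.4])
if percentiles.get("p95", 0) > 18:
    print(f"P95 {percentiles['p95']:.1f}s exceeds the 18s target")
```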
3.2 Success Rate
Metric: % of claims successfully processed without errors
Target: ≥ 99%
Alert thresholds:
- 98-99%: Monitor
- 95-98%: Investigate
- <95%: Emergency
Common failure causes:
- Timeout (evidence extraction took too long)
- Parse error (claim text unparsable)
- External API failure (source unavailable)
- Resource exhaustion (memory/CPU)
Why it matters: Errors frustrate users and reduce trust
3.3 Evidence Completeness
Metric: % of claims where AKEL found sufficient evidence
Measurement: Claims with ≥3 pieces of evidence from ≥2 distinct sources
Target: ≥ 80%
Alert thresholds:
- 75-80%: Monitor
- 70-75%: Investigate
- <70%: Intervention needed
Why it matters: Incomplete evidence = low confidence verdicts
Improvement ideas:
- Better search algorithms
- More source integrations
- Improved relevance scoring
3.4 Source Diversity
Metric: Average number of distinct sources per claim
Target: ≥ 3.0 sources per claim
Alert thresholds:
- 2.5-3.0: Monitor
- 2.0-2.5: Investigate
- <2.0: Intervention needed
Why it matters: Multiple sources increase confidence and reduce bias
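The two evidence metrics above (completeness in 3.3, source diversity in 3.4) can be derived from the same per-claim records. A sketch, assuming each claim carries a list of evidence items tagged with their source; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ClaimEvidence:
    claim_id: str
    sources: list[str]  # one entry per evidence item, naming its source

def evidence_completeness(claims: list[ClaimEvidence]) -> float:
    """% of claims with >=3 evidence items from >=2 distinct sources (section 3.3)."""
    complete = sum(
        1 for c in claims if len(c.sources) >= 3 and len(set(c.sources)) >= 2
    )
    return 100.0 * complete / len(claims) if claims else 0.0

def avg_source_diversity(claims: list[ClaimEvidence]) -> float:
    """Average number of distinct sources per claim (section 3.4)."""
    if not claims:
        return 0.0
    return sum(len(set(c.sources)) for c in claims) / len(claims)
```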
3.5 Scenario Coverage
Metric: % of claims with at least one scenario extracted
Target: ≥ 75%
Why it matters: Scenarios provide context for verdicts
4. Content Quality Metrics
4.1 Confidence Distribution
Metric: Distribution of confidence scores across claims
Target: Roughly normal distribution
- 10% very low confidence (0.0-0.3)
- 20% low confidence (0.3-0.5)
- 40% medium confidence (0.5-0.7)
- 20% high confidence (0.7-0.9)
- 10% very high confidence (0.9-1.0)
Alert thresholds:
- >30% very low confidence: Evidence extraction issues
- >30% very high confidence: Too aggressive/overconfident
- Heavily skewed distribution: Systematic bias
Why it matters: Confidence should reflect actual uncertainty
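One way to monitor this is to bucket confidence scores into the five bands above and compare shares against the targets. A sketch; the band edges follow the list above, and the alert wording is ours:

```python
BANDS = [
    ("very_low", 0.0, 0.3),
    ("low", 0.3, 0.5),
    ("medium", 0.5, 0.7),
    ("high", 0.7, 0.9),
    ("very_high", 0.9, 1.01),  # upper edge padded so a score of 1.0 is counted
]

def confidence_distribution(scores: list[float]) -> dict[str, float]:
    """Share of claims falling in each confidence band, as percentages."""
    counts = {name: 0 for name, _, _ in BANDS}
    for s in scores:
        for name, lo, hi in BANDS:
            if lo <= s < hi:
                counts[name] += 1
                break
    total = len(scores) or 1
    return {name: 100.0 * n / total for name, n in counts.items()}

def distribution_alerts(dist: dict[str, float]) -> list[str]:
    """Apply the two >30% alert thresholds from section 4.1."""
    alerts = []
    if dist.get("very_low", 0) > 30:
        alerts.append("Evidence extraction issues (>30% very low confidence)")
    if dist.get("very_high", 0) > 30:
        alerts.append("Possible overconfidence (>30% very high confidence)")
    return alerts
```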
4.2 Contradiction Rate
Metric: % of claims with internal contradictions detected
Target: ≤ 5%
Alert thresholds:
- 5-10%: Monitor
- 10-15%: Investigate
- >15%: Intervention needed
Why it matters: High contradiction rate suggests poor evidence quality or logic errors
4.3 User Feedback Ratio
Metric: Helpful vs unhelpful user ratings
Target: ≥ 70% helpful
Alert thresholds:
- 60-70%: Monitor
- 50-60%: Investigate
- <50%: Emergency
Why it matters: Direct measure of user satisfaction
4.4 False Positive/Negative Rate
Metric: How often AKEL's review flags match human reviewers' judgments
Measurement:
- False positive: AKEL flagged for review, but actually fine
- False negative: Missed something that should've been flagged
Targets:
- False positive rate: ≤ 20%
- False negative rate: ≤ 5%
Why it matters: Balance between catching problems and not crying wolf
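A sketch of how the two rates could be computed from human review outcomes; the record format and field names are assumptions:

```python
def review_error_rates(reviews: list[dict]) -> tuple[float, float]:
    """
    Each review record is assumed to carry two booleans:
      flagged          - AKEL flagged the item for human review
      needed_attention - the human reviewer judged it actually problematic
    Returns (false_positive_rate, false_negative_rate) as percentages.
    """
    flagged = [r for r in reviews if r["flagged"]]
    unflagged = [r for r in reviews if not r["flagged"]]

    # False positive: flagged, but the reviewer found nothing wrong
    fp_rate = (
        100.0 * sum(1 for r in flagged if not r["needed_attention"]) / len(flagged)
        if flagged else 0.0
    )
    # False negative: not flagged, but the reviewer found a real problem
    fn_rate = (
        100.0 * sum(1 for r in unflagged if r["needed_attention"]) / len(unflagged)
        if unflagged else 0.0
    )
    return fp_rate, fn_rate
```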
5. System Health Metrics
5.1 Uptime
Metric: % of time system is available and functional
Target: ≥ 99.9% (less than 45 minutes downtime per month)
Alert: Immediate notification on any downtime
Why it matters: Users expect 24/7 availability
5.2 Error Rate
Metric: Errors per 1000 requests
Target: ≤ 1 error per 1000 requests (0.1%)
Alert thresholds:
- 1-5 per 1000: Monitor
- 5-10 per 1000: Investigate
- >10 per 1000: Emergency
Why it matters: Errors disrupt user experience
5.3 Database Performance
Metrics:
- Query response time (P95)
- Connection pool utilization
- Slow query frequency
Targets:
- P95 query time: ≤ 50ms
- Connection pool: ≤ 80% utilized
- Slow queries (>1s): ≤ 10 per hour
Why it matters: Database bottlenecks slow entire system
5.4 Cache Hit Rate
Metric: % of requests served from cache vs. database
Target: ≥ 80%
Why it matters: Higher cache hit rate = faster responses, less DB load
5.5 Resource Utilization
Metrics:
- CPU utilization
- Memory utilization
- Disk I/O
- Network bandwidth
Targets:
- Average CPU: ≤ 60%
- Peak CPU: ≤ 85%
- Memory: ≤ 80%
- Disk I/O: ≤ 70%
Alert: Any metric consistently >85%
Why it matters: Headroom for traffic spikes, prevents resource exhaustion
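As an illustration, the CPU and memory limits could be sampled with the psutil library; the thresholds mirror the list above, and the function is a placeholder rather than AKEL's actual monitoring hook:

```python
import psutil

RESOURCE_LIMITS = {"cpu_percent": 85.0, "memory_percent": 80.0}

def resource_check() -> list[str]:
    """Return the resource metrics that currently exceed their limits."""
    readings = {
        "cpu_percent": psutil.cpu_percent(interval=1),      # sampled over 1 second
        "memory_percent": psutil.virtual_memory().percent,  # % of RAM in use
    }
    return [
        f"{name} at {value:.0f}% (limit {RESOURCE_LIMITS[name]:.0f}%)"
        for name, value in readings.items()
        if value > RESOURCE_LIMITS[name]
    ]
```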
6. User Experience Metrics
6.1 Time to First Verdict
Metric: Time from user submitting claim to seeing initial verdict
Target: ≤ 15 seconds
Why it matters: User perception of speed
6.2 Claim Submission Rate
Metric: Claims submitted per day/hour
Monitoring: Track trends, detect anomalies
Why it matters: Understand usage patterns, capacity planning
6.3 User Retention
Metric: % of users who return after first visit
Target: ≥ 30% (1-week retention)
Why it matters: Indicates system usefulness
6.4 Feature Usage
Metrics:
- % of users who explore evidence
- % who check scenarios
- % who view source track records
Why it matters: Understand how users interact with system
7. Metric Dashboard
7.1 Real-Time Dashboard
Always visible:
- Current processing time (P95)
- Success rate (last hour)
- Error rate (last hour)
- System health status
Update frequency: Every 30 seconds
7.2 Daily Dashboard
Reviewed daily:
- All AKEL performance metrics
- Content quality metrics
- System health trends
- User feedback summary
7.3 Weekly Reports
Reviewed weekly:
- Trends over time
- Week-over-week comparisons
- Improvement priorities
- Outstanding issues
7.4 Monthly/Quarterly Reports
Comprehensive analysis:
- Long-term trends
- Seasonal patterns
- Strategic metrics
- Goal progress
8. Alert System
8.1 Alert Levels
Info: Metric outside target, but within acceptable range
- Action: Note in daily review
- Example: P95 processing time 19s (target 18s, acceptable <20s)
Warning: Metric outside acceptable range
- Action: Investigate within 24 hours
- Example: Success rate 97% (acceptable >98%)
Critical: Metric severely degraded
- Action: Investigate immediately
- Example: Error rate 2% (acceptable <0.5%)
Emergency: System failure or severe degradation
- Action: Page on-call, all hands
- Example: Uptime <95%, P95 >30s
8.2 Alert Channels
Slack/Discord: All alerts
Email: Warning and above
SMS: Critical and emergency only
PagerDuty: Emergency only
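Taken together, 8.1 and 8.2 amount to two small lookup tables: ordered severity bands per metric, and a minimum severity per channel. A minimal sketch, with thresholds borrowed from the processing-time example in 3.1; the exact structure and values are assumptions:

```python
from enum import IntEnum

class AlertLevel(IntEnum):
    OK = 0
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4

# Ascending (threshold, level) bands for a "higher is worse" metric,
# here P95 processing time in seconds (section 3.1).
P95_PROCESSING_BANDS = [
    (18, AlertLevel.INFO),
    (20, AlertLevel.WARNING),
    (25, AlertLevel.CRITICAL),
    (30, AlertLevel.EMERGENCY),
]

# Minimum severity each channel subscribes to (section 8.2).
CHANNEL_MIN_LEVEL = {
    "slack": AlertLevel.INFO,           # all alerts
    "email": AlertLevel.WARNING,        # warning and above
    "sms": AlertLevel.CRITICAL,         # critical and emergency only
    "pagerduty": AlertLevel.EMERGENCY,  # emergency only
}

def classify(value: float, bands: list[tuple[float, AlertLevel]]) -> AlertLevel:
    """Return the highest alert level whose threshold the value exceeds."""
    level = AlertLevel.OK
    for threshold, band_level in bands:
        if value > threshold:
            level = band_level
    return level

def channels_for(level: AlertLevel) -> list[str]:
    """Channels that should receive an alert of the given severity."""
    return [ch for ch, minimum in CHANNEL_MIN_LEVEL.items() if level >= minimum]

# Example: a P95 of 26 seconds is Critical and goes to Slack, email, and SMS.
# channels_for(classify(26, P95_PROCESSING_BANDS)) -> ["slack", "email", "sms"]
```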
8.3 On-Call Rotation
Technical Coordinator: Primary on-call
Backup: Designated team member
Responsibilities:
- Respond to alerts within SLA
- Investigate and diagnose issues
- Implement fixes or escalate
- Document incidents
9. Metric-Driven Improvement
9.1 Prioritization
Focus improvements on:
- Metrics furthest from target
- Metrics with biggest user impact
- Metrics easiest to improve
- Strategic priorities
9.2 Success Criteria
Every improvement project should:
- Target specific metrics
- Set concrete improvement goals
- Measure before and after
- Document learnings
Example: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"
9.3 A/B Testing
When feasible:
- Run two versions
- Measure metric differences
- Choose based on data
- Roll out winner
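For a binary outcome such as the helpful-rating ratio (section 4.3), a two-proportion z-test is one simple way to judge whether the difference between two versions is more than noise. A standard-library sketch; the variant counts are invented for the example:

```python
from math import sqrt, erf

def two_proportion_ztest(success_a: int, n_a: int,
                         success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference in proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Example: version A rated helpful 720/1000 times, version B 745/1000 times.
z, p = two_proportion_ztest(720, 1000, 745, 1000)
# Roll out B only if the difference is both meaningful and statistically convincing.
```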
10. Bias and Fairness Metrics
10.1 Domain Balance
Metric: Confidence distribution by domain
Target: Similar distributions across domains
Alert: One domain consistently much lower/higher confidence
Why it matters: Ensure no systematic domain bias
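A simple first check is to compare per-domain summary statistics of confidence; large, persistent gaps warrant a closer look. A sketch, where the record format and the 0.15 gap threshold are assumptions:

```python
from collections import defaultdict
from statistics import mean

def confidence_by_domain(records: list[tuple[str, float]]) -> dict[str, float]:
    """Mean confidence per domain, from (domain, confidence) pairs."""
    by_domain: dict[str, list[float]] = defaultdict(list)
    for domain, confidence in records:
        by_domain[domain].append(confidence)
    return {domain: mean(scores) for domain, scores in by_domain.items()}

def imbalance_alert(means: dict[str, float], max_gap: float = 0.15) -> bool:
    """Flag when the highest and lowest domain means differ by more than max_gap."""
    return bool(means) and (max(means.values()) - min(means.values()) > max_gap)
```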
10.2 Source Type Balance
Metric: Evidence distribution by source type
Target: Diverse source types represented
Alert: Over-reliance on one source type
Why it matters: Prevent source type bias
10.3 Geographic Balance
Metric: Source geographic distribution
Target: Multiple regions represented
Alert: Over-concentration in one region
Why it matters: Reduce geographic/cultural bias
11. Experimental Metrics
New metrics to test:
- User engagement time
- Evidence exploration depth
- Cross-reference usage
- Mobile vs desktop usage
Process:
1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard
12. Anti-Patterns
Don't:
- ❌ Measure too many things (focus on what matters)
- ❌ Set unrealistic targets (demotivating)
- ❌ Ignore metrics when inconvenient
- ❌ Game metrics (destroys their value)
- ❌ Blame individuals for metric failures
- ❌ Let metrics become the goal (they're tools)
Do:
- ✅ Focus on actionable metrics
- ✅ Set ambitious but achievable targets
- ✅ Respond to metric signals
- ✅ Continuously validate metrics still matter
- ✅ Use metrics for system improvement, not people evaluation
- ✅ Remember: metrics serve users, not the other way around
13. Related Pages
- Automation Philosophy - Why we monitor systems, not outputs
- Continuous Improvement - How we use metrics to improve
- Governance - Quarterly performance reviews
Remember: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.