Last modified by Robert Schaub on 2026/02/08 08:32
= System Performance Metrics =

**What we monitor to ensure AKEL performs well.**

== 1. Purpose ==

These metrics tell us:

* ✅ Is AKEL performing within acceptable ranges?
* ✅ Where should we focus improvement efforts?
* ✅ When do humans need to intervene?
* ✅ Are our changes improving things?

**Principle**: Measure to improve, not to judge.

== 2. Metric Categories ==

=== 2.1 AKEL Performance ===

**Processing speed and reliability**

=== 2.2 Content Quality ===

**Output quality and user satisfaction**

=== 2.3 System Health ===

**Infrastructure and operational metrics**

=== 2.4 User Experience ===

**How users interact with the system**

== 3. AKEL Performance Metrics ==

=== 3.1 Processing Time ===

**Metric**: Time from claim submission to verdict publication

**Measurements**:

* P50 (median): 50% of claims processed within X seconds
* P95: 95% of claims processed within Y seconds
* P99: 99% of claims processed within Z seconds

**Targets**:

* P50: ≤ 12 seconds
* P95: ≤ 18 seconds
* P99: ≤ 25 seconds

**Alert thresholds**:

* P95 > 20 seconds: Monitor closely
* P95 > 25 seconds: Investigate immediately
* P95 > 30 seconds: Emergency - intervention required

**Why it matters**: Slow processing = poor user experience

**Improvement ideas**:

* Optimize evidence extraction
* Better caching
* Parallel processing
* Database query optimization
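
As an illustrative sketch (function names and sample data are ours, not part of AKEL), the percentile targets and P95 alert bands above might be computed like this:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def p95_alert(p95_seconds):
    """Map P95 processing time onto the alert bands defined above."""
    if p95_seconds > 30:
        return "emergency"
    if p95_seconds > 25:
        return "investigate"
    if p95_seconds > 20:
        return "monitor"
    return "ok"

# Example: ten latency samples from a (hypothetical) monitoring window.
latencies = [8, 9, 11, 12, 13, 14, 15, 16, 19, 22]
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
status = p95_alert(p95)
```

Production systems would typically pull these percentiles from the metrics backend rather than recompute them, but the banding logic is the same.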

=== 3.2 Success Rate ===

**Metric**: % of claims successfully processed without errors

**Target**: ≥ 99%

**Alert thresholds**:

* 98-99%: Monitor
* 95-98%: Investigate
* <95%: Emergency

**Common failure causes**:

* Timeout (evidence extraction took too long)
* Parse error (claim text unparsable)
* External API failure (source unavailable)
* Resource exhaustion (memory/CPU)

**Why it matters**: Errors frustrate users and reduce trust
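
The success-rate bands above translate directly into a small classifier; this is a hedged sketch, not AKEL's actual alerting code:

```python
def success_rate_alert(succeeded, total):
    """Classify an hourly success rate against the bands above (target >= 99%)."""
    rate = 100 * succeeded / total
    if rate < 95:
        return rate, "emergency"
    if rate < 98:
        return rate, "investigate"
    if rate < 99:
        return rate, "monitor"
    return rate, "ok"
```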

=== 3.3 Evidence Completeness ===

**Metric**: % of claims where AKEL found sufficient evidence

**Measurement**: Claims with ≥3 pieces of evidence from ≥2 distinct sources

**Target**: ≥ 80%

**Alert thresholds**:

* 75-80%: Monitor
* 70-75%: Investigate
* <70%: Intervention needed

**Why it matters**: Incomplete evidence = low confidence verdicts

**Improvement ideas**:

* Better search algorithms
* More source integrations
* Improved relevance scoring
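
The "≥3 pieces from ≥2 distinct sources" rule can be stated in a few lines; the data shape (a list of evidence dicts with a `source` key) is our assumption for illustration:

```python
def is_complete(evidence):
    """True when a claim has >= 3 evidence items from >= 2 distinct sources."""
    return len(evidence) >= 3 and len({e["source"] for e in evidence}) >= 2

def completeness_rate(claims):
    """% of claims meeting the completeness bar (target >= 80%)."""
    if not claims:
        return 0.0
    return 100 * sum(is_complete(c) for c in claims) / len(claims)
```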

=== 3.4 Source Diversity ===

**Metric**: Average number of distinct sources per claim

**Target**: ≥ 3.0 sources per claim

**Alert thresholds**:

* 2.5-3.0: Monitor
* 2.0-2.5: Investigate
* <2.0: Intervention needed

**Why it matters**: Multiple sources increase confidence and reduce bias

=== 3.5 Scenario Coverage ===

**Metric**: % of claims with at least one scenario extracted

**Target**: ≥ 75%

**Why it matters**: Scenarios provide context for verdicts

== 4. Content Quality Metrics ==

=== 4.1 Confidence Distribution ===

**Metric**: Distribution of confidence scores across claims

**Target**: Roughly normal distribution

* 10% very low confidence (0.0-0.3)
* 20% low confidence (0.3-0.5)
* 40% medium confidence (0.5-0.7)
* 20% high confidence (0.7-0.9)
* 10% very high confidence (0.9-1.0)

**Alert thresholds**:

* >30% very low confidence: Evidence extraction issues
* >30% very high confidence: Too aggressive/overconfident
* Heavily skewed distribution: Systematic bias

**Why it matters**: Confidence should reflect actual uncertainty
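
A minimal sketch of the bucketing and skew checks above (band edges taken from this page; each score falls in the band where lower ≤ score < upper, with 1.0 landing in the top band):

```python
def confidence_buckets(scores):
    """Fraction of scores per band: very low, low, medium, high, very high."""
    edges = [(0.0, 0.3), (0.3, 0.5), (0.5, 0.7), (0.7, 0.9), (0.9, 1.01)]
    counts = [sum(lo <= s < hi for s in scores) for lo, hi in edges]
    return [c / len(scores) for c in counts]

def distribution_alerts(fractions):
    """Flag the two skew conditions described above."""
    alerts = []
    if fractions[0] > 0.30:
        alerts.append("too many very-low-confidence verdicts")
    if fractions[4] > 0.30:
        alerts.append("possibly overconfident")
    return alerts
```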

=== 4.2 Contradiction Rate ===

**Metric**: % of claims with internal contradictions detected

**Target**: ≤ 5%

**Alert thresholds**:

* 5-10%: Monitor
* 10-15%: Investigate
* >15%: Intervention needed

**Why it matters**: A high contradiction rate suggests poor evidence quality or logic errors

=== 4.3 User Feedback Ratio ===

**Metric**: Helpful vs. unhelpful user ratings

**Target**: ≥ 70% helpful

**Alert thresholds**:

* 60-70%: Monitor
* 50-60%: Investigate
* <50%: Emergency

**Why it matters**: Direct measure of user satisfaction

=== 4.4 False Positive/Negative Rate ===

**Metric**: When humans review flagged items, how often was AKEL right?

**Measurement**:

* False positive: AKEL flagged an item for review that was actually fine
* False negative: AKEL missed something that should have been flagged

**Targets**:

* False positive rate: ≤ 20%
* False negative rate: ≤ 5%

**Why it matters**: Balance between catching problems and not crying wolf

== 5. System Health Metrics ==

=== 5.1 Uptime ===

**Metric**: % of time the system is available and functional

**Target**: ≥ 99.9% (about 43 minutes of downtime per month)

**Alert**: Immediate notification on any downtime

**Why it matters**: Users expect 24/7 availability

=== 5.2 Error Rate ===

**Metric**: Errors per 1000 requests

**Target**: ≤ 1 error per 1000 requests (0.1%)

**Alert thresholds**:

* 1-5 per 1000: Monitor
* 5-10 per 1000: Investigate
* >10 per 1000: Emergency

**Why it matters**: Errors disrupt user experience
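
The errors-per-1000 bands above follow the same classifier pattern as the other metrics; this is an illustrative sketch, not production code:

```python
def error_alert(errors, requests):
    """Errors per 1000 requests, mapped onto the bands above (target <= 1)."""
    per_1000 = 1000 * errors / requests
    if per_1000 > 10:
        return per_1000, "emergency"
    if per_1000 > 5:
        return per_1000, "investigate"
    if per_1000 > 1:
        return per_1000, "monitor"
    return per_1000, "ok"
```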
| 178 | |||
| 179 | === 5.3 Database Performance === | ||
| 180 | |||
| 181 | **Metrics**: | ||
| 182 | |||
| 183 | * Query response time (P95) | ||
| 184 | * Connection pool utilization | ||
| 185 | * Slow query frequency | ||
| 186 | **Targets**: | ||
| 187 | * P95 query time: ≤ 50ms | ||
| 188 | * Connection pool: ≤ 80% utilized | ||
| 189 | * Slow queries (>1s): ≤ 10 per hour | ||
| 190 | **Why it matters**: Database bottlenecks slow entire system | ||
| 191 | |||
| 192 | === 5.4 Cache Hit Rate === | ||
| 193 | |||
| 194 | **Metric**: % of requests served from cache vs. database | ||
| 195 | **Target**: ≥ 80% | ||
| 196 | **Why it matters**: Higher cache hit rate = faster responses, less DB load | ||
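
Tracking the hit rate against the 80% target can be as simple as a pair of counters; the class name and shape here are ours:

```python
class CacheStats:
    """Hit/miss counters with a check against the >= 80% hit-rate target."""

    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return 100 * self.hits / total if total else 0.0

    def below_target(self):
        return self.hit_rate < 80
```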
| 197 | |||
| 198 | === 5.5 Resource Utilization === | ||
| 199 | |||
| 200 | **Metrics**: | ||
| 201 | |||
| 202 | * CPU utilization | ||
| 203 | * Memory utilization | ||
| 204 | * Disk I/O | ||
| 205 | * Network bandwidth | ||
| 206 | **Targets**: | ||
| 207 | * Average CPU: ≤ 60% | ||
| 208 | * Peak CPU: ≤ 85% | ||
| 209 | * Memory: ≤ 80% | ||
| 210 | * Disk I/O: ≤ 70% | ||
| 211 | **Alert**: Any metric consistently >85% | ||
| 212 | **Why it matters**: Headroom for traffic spikes, prevents resource exhaustion | ||
| 213 | |||
== 6. User Experience Metrics ==

=== 6.1 Time to First Verdict ===

**Metric**: Time from a user submitting a claim to seeing the initial verdict

**Target**: ≤ 15 seconds

**Why it matters**: Shapes the user's perception of speed

=== 6.2 Claim Submission Rate ===

**Metric**: Claims submitted per day/hour

**Monitoring**: Track trends, detect anomalies

**Why it matters**: Reveals usage patterns and informs capacity planning

=== 6.3 User Retention ===

**Metric**: % of users who return after their first visit

**Target**: ≥ 30% (1-week retention)

**Why it matters**: Indicates system usefulness

=== 6.4 Feature Usage ===

**Metrics**:

* % of users who explore evidence
* % who check scenarios
* % who view source track records

**Why it matters**: Shows how users interact with the system

== 7. Metric Dashboard ==

=== 7.1 Real-Time Dashboard ===

**Always visible**:

* Current processing time (P95)
* Success rate (last hour)
* Error rate (last hour)
* System health status

**Update frequency**: Every 30 seconds

=== 7.2 Daily Dashboard ===

**Reviewed daily**:

* All AKEL performance metrics
* Content quality metrics
* System health trends
* User feedback summary

=== 7.3 Weekly Reports ===

**Reviewed weekly**:

* Trends over time
* Week-over-week comparisons
* Improvement priorities
* Outstanding issues

=== 7.4 Monthly/Quarterly Reports ===

**Comprehensive analysis**:

* Long-term trends
* Seasonal patterns
* Strategic metrics
* Goal progress

== 8. Alert System ==

=== 8.1 Alert Levels ===

**Info**: Metric outside target, but within acceptable range

* Action: Note in daily review
* Example: P95 processing time 19s (target 18s, acceptable <20s)

**Warning**: Metric outside acceptable range

* Action: Investigate within 24 hours
* Example: Success rate 97% (acceptable >98%)

**Critical**: Metric severely degraded

* Action: Investigate immediately
* Example: Error rate 2% (acceptable <0.5%)

**Emergency**: System failure or severe degradation

* Action: Page on-call, all hands
* Example: Uptime <95%, P95 >30s

=== 8.2 Alert Channels ===

**Slack/Discord**: All alerts
**Email**: Warning and above
**SMS**: Critical and emergency only
**PagerDuty**: Emergency only
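
The level-to-channel routing above is a simple threshold table; this sketch uses our own names and treats each channel as having a minimum severity it receives:

```python
LEVELS = ["info", "warning", "critical", "emergency"]

CHANNEL_MIN_LEVEL = {            # per the routing rules above
    "slack": "info",             # all alerts
    "email": "warning",          # warning and above
    "sms": "critical",           # critical and emergency only
    "pagerduty": "emergency",    # emergency only
}

def channels_for(level):
    """Channels that should receive an alert of the given level."""
    rank = LEVELS.index(level)
    return sorted(ch for ch, minimum in CHANNEL_MIN_LEVEL.items()
                  if rank >= LEVELS.index(minimum))
```

Expressing the routing as data rather than branching keeps channel changes to a one-line edit.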

=== 8.3 On-Call Rotation ===

**Technical Coordinator**: Primary on-call
**Backup**: Designated team member

**Responsibilities**:

* Respond to alerts within SLA
* Investigate and diagnose issues
* Implement fixes or escalate
* Document incidents

== 9. Metric-Driven Improvement ==

=== 9.1 Prioritization ===

**Focus improvements on**:

* Metrics furthest from target
* Metrics with the biggest user impact
* Metrics easiest to improve
* Strategic priorities

=== 9.2 Success Criteria ===

**Every improvement project should**:

* Target specific metrics
* Set concrete improvement goals
* Measure before and after
* Document learnings

**Example**: "Reduce P95 processing time from 20s to 16s by optimizing evidence extraction"

=== 9.3 A/B Testing ===

**When feasible**:

* Run two versions
* Measure metric differences
* Choose based on data
* Roll out the winner
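
"Choose based on data" usually means a significance check before rolling out the winner. One common approach (an assumption on our part; the page does not prescribe a test) is a two-proportion z-test, e.g. comparing helpful-rating rates between variants:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic comparing two success rates (normal approximation).
    |z| > 1.96 is roughly significant at the 5% level."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```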

== 10. Bias and Fairness Metrics ==

=== 10.1 Domain Balance ===

**Metric**: Confidence distribution by domain

**Target**: Similar distributions across domains

**Alert**: One domain shows consistently much lower or higher confidence

**Why it matters**: Guards against systematic domain bias

=== 10.2 Source Type Balance ===

**Metric**: Evidence distribution by source type

**Target**: Diverse source types represented

**Alert**: Over-reliance on one source type

**Why it matters**: Prevents source-type bias

=== 10.3 Geographic Balance ===

**Metric**: Geographic distribution of sources

**Target**: Multiple regions represented

**Alert**: Over-concentration in one region

**Why it matters**: Reduces geographic/cultural bias

== 11. Experimental Metrics ==

**New metrics to test**:

* User engagement time
* Evidence exploration depth
* Cross-reference usage
* Mobile vs. desktop usage

**Process**:

1. Define metric hypothesis
2. Implement tracking
3. Collect data for 1 month
4. Evaluate usefulness
5. Add to standard set or discard

== 12. Anti-Patterns ==

**Don't**:

* ❌ Measure too many things (focus on what matters)
* ❌ Set unrealistic targets (demotivating)
* ❌ Ignore metrics when inconvenient
* ❌ Game metrics (destroys their value)
* ❌ Blame individuals for metric failures
* ❌ Let metrics become the goal (they're tools)

**Do**:

* ✅ Focus on actionable metrics
* ✅ Set ambitious but achievable targets
* ✅ Respond to metric signals
* ✅ Continuously validate that metrics still matter
* ✅ Use metrics for system improvement, not people evaluation
* ✅ Remember: metrics serve users, not the other way around

== 13. Related Pages ==

* [[Automation Philosophy>>FactHarbor.Organisation.Automation-Philosophy]] - Why we monitor systems, not outputs
* [[Continuous Improvement>>FactHarbor.Organisation.How-We-Work-Together.Continuous-Improvement]] - How we use metrics to improve
* [[Governance>>Archive.FactHarbor 2026\.02\.08.Organisation.Governance.WebHome]] - Quarterly performance reviews

---

**Remember**: We measure the SYSTEM, not individual outputs. Metrics drive IMPROVEMENT, not judgment.