Changes for page POC1 API & Schemas Specification
Last modified by Robert Schaub on 2025/12/24 18:26
From version 2.2
edited by Robert Schaub
on 2025/12/24 16:28
on 2025/12/24 16:28
Change comment:
There is no comment for this version
Summary
-
Page properties (1 modified, 0 added, 0 removed)
Details
- Page properties
-
- Content
-
... ... @@ -41,23 +41,24 @@ 41 41 42 42 FactHarbor POC1 uses a **3-stage architecture** designed for claim-level caching and cost efficiency: 43 43 44 -{{{graph TD 45 - A[Article Input] --> B[Stage 1: Extract Claims] 46 - B --> C{For Each Claim} 47 - C --> D[Check Cache] 48 - D -->|Cache HIT| E[Return Cached Verdict] 49 - D -->|Cache MISS| F[Stage 2: Analyze Claim] 50 - F --> G[Store in Cache] 51 - G --> E 52 - E --> H[Stage 3: Holistic Assessment] 53 - H --> I[Final Report] 54 -}}} 44 +{{mermaid}} 45 +graph TD 46 + A[Article Input] --> B[Stage 1: Extract Claims] 47 + B --> C{For Each Claim} 48 + C --> D[Check Cache] 49 + D -->|Cache HIT| E[Return Cached Verdict] 50 + D -->|Cache MISS| F[Stage 2: Analyze Claim] 51 + F --> G[Store in Cache] 52 + G --> E 53 + E --> H[Stage 3: Holistic Assessment] 54 + H --> I[Final Report] 55 +{{/mermaid}} 55 55 56 56 ==== Stage 1: Claim Extraction (Haiku, no cache) ==== 57 57 58 58 * **Input:** Article text 59 59 * **Output:** 5 canonical claims (normalized, deduplicated) 60 -* **Model:** Claude Haiku 4 61 +* **Model:** Claude Haiku 4 (default, configurable via LLM abstraction layer) 61 61 * **Cost:** $0.003 per article 62 62 * **Cache strategy:** No caching (article-specific) 63 63 ... ... @@ -65,7 +65,7 @@ 65 65 66 66 * **Input:** Single canonical claim 67 67 * **Output:** Scenarios + Evidence + Verdicts 68 -* **Model:** Claude Sonnet 3.5 69 +* **Model:** Claude Sonnet 3.5 (default, configurable via LLM abstraction layer) 69 69 * **Cost:** $0.081 per NEW claim 70 70 * **Cache strategy:** Redis, 90-day TTL 71 71 * **Cache key:** claim:v1norm1:{language}:{sha256(canonical_claim)} ... ... @@ -74,10 +74,14 @@ 74 74 75 75 * **Input:** Article + Claim verdicts (from cache or Stage 2) 76 76 * **Output:** Article verdict + Fallacies + Logic quality 77 -* **Model:** Claude Sonnet 3.5 78 +* **Model:** Claude Sonnet 3.5 (default, configurable via LLM abstraction layer) 78 78 * **Cost:** $0.030 per article 79 79 * **Cache strategy:** No caching (article-specific) 80 80 82 + 83 + 84 +**Note:** Stage 3 implements **Approach 1 (Single-Pass Holistic Analysis)** from the [[Article Verdict Problem>>Test.FactHarbor.Specification.POC.Article-Verdict-Problem]]. While claim analysis (Stage 2) is cached for efficiency, the holistic assessment maintains the integrated evaluation philosophy of Approach 1. 85 + 81 81 === Total Cost Formula: === 82 82 83 83 {{{Cost = $0.003 (extraction) + (N_new_claims × $0.081) + $0.030 (holistic) ... ... @@ -145,27 +145,27 @@ 145 145 ==== User Experience Example: ==== 146 146 147 147 {{{{ 148 - "status": "cache_only_mode",149 - "message": "Monthly credit limit reached. Showing cached results only.",150 - "cache_coverage": {151 - "claims_total": 5,152 - "claims_cached": 3,153 - "claims_missing": 2,154 - "coverage_percent": 60155 - },156 - "cached_claims": [157 - {"claim_id": "C1", "verdict": "Likely", "confidence": 0.82},158 - {"claim_id": "C2", "verdict": "Highly Likely", "confidence": 0.91},159 - {"claim_id": "C4", "verdict": "Unclear", "confidence": 0.55}160 - ],161 - "missing_claims": [162 - {"claim_id": "C3", "claim_text": "...", "estimated_cost": "$0.081"},163 - {"claim_id": "C5", "claim_text": "...", "estimated_cost": "$0.081"}164 - ],165 - "upgrade_options": {166 - "top_up": "$5 for 20-70 more articles",167 - "pro_tier": "$50/month unlimited"168 - }153 + "status": "cache_only_mode", 154 + "message": "Monthly credit limit reached. Showing cached results only.", 155 + "cache_coverage": { 156 + "claims_total": 5, 157 + "claims_cached": 3, 158 + "claims_missing": 2, 159 + "coverage_percent": 60 160 + }, 161 + "cached_claims": [ 162 + {"claim_id": "C1", "verdict": "Likely", "confidence": 0.82}, 163 + {"claim_id": "C2", "verdict": "Highly Likely", "confidence": 0.91}, 164 + {"claim_id": "C4", "verdict": "Unclear", "confidence": 0.55} 165 + ], 166 + "missing_claims": [ 167 + {"claim_id": "C3", "claim_text": "...", "estimated_cost": "$0.081"}, 168 + {"claim_id": "C5", "claim_text": "...", "estimated_cost": "$0.081"} 169 + ], 170 + "upgrade_options": { 171 + "top_up": "$5 for 20-70 more articles", 172 + "pro_tier": "$50/month unlimited" 173 + } 169 169 } 170 170 }}} 171 171 ... ... @@ -178,6 +178,328 @@ 178 178 179 179 ---- 180 180 186 + 187 + 188 +== 6. LLM Abstraction Layer == 189 + 190 +=== 6.1 Design Principle === 191 + 192 +**FactHarbor uses provider-agnostic LLM abstraction** to avoid vendor lock-in and enable: 193 + 194 +* **Provider switching:** Change LLM providers without code changes 195 +* **Cost optimization:** Use different providers for different stages 196 +* **Resilience:** Automatic fallback if primary provider fails 197 +* **Cross-checking:** Compare outputs from multiple providers 198 +* **A/B testing:** Test new models without deployment changes 199 + 200 +**Implementation:** All LLM calls go through an abstraction layer that routes to configured providers. 201 + 202 +---- 203 + 204 +=== 6.2 LLM Provider Interface === 205 + 206 +**Abstract Interface:** 207 + 208 +{{{ 209 +interface LLMProvider { 210 + // Core methods 211 + complete(prompt: string, options: CompletionOptions): Promise<CompletionResponse> 212 + stream(prompt: string, options: CompletionOptions): AsyncIterator<StreamChunk> 213 + 214 + // Provider metadata 215 + getName(): string 216 + getMaxTokens(): number 217 + getCostPer1kTokens(): { input: number, output: number } 218 + 219 + // Health check 220 + isAvailable(): Promise<boolean> 221 +} 222 + 223 +interface CompletionOptions { 224 + model?: string 225 + maxTokens?: number 226 + temperature?: number 227 + stopSequences?: string[] 228 + systemPrompt?: string 229 +} 230 +}}} 231 + 232 +---- 233 + 234 +=== 6.3 Supported Providers (POC1) === 235 + 236 +**Primary Provider (Default):** 237 + 238 +* **Anthropic Claude API** 239 + * Models: Claude Haiku 4, Claude Sonnet 3.5, Claude Opus 4 240 + * Used by default in POC1 241 + * Best quality for holistic analysis 242 + 243 +**Secondary Providers (Future):** 244 + 245 +* **OpenAI API** 246 + * Models: GPT-4o, GPT-4o-mini 247 + * For cost comparison 248 + 249 +* **Google Vertex AI** 250 + * Models: Gemini 1.5 Pro, Gemini 1.5 Flash 251 + * For diversity in evidence gathering 252 + 253 +* **Local Models** (Post-POC) 254 + * Models: Llama 3.1, Mistral 255 + * For privacy-sensitive deployments 256 + 257 +---- 258 + 259 +=== 6.4 Provider Configuration === 260 + 261 +**Environment Variables:** 262 + 263 +{{{ 264 +# Primary provider 265 +LLM_PRIMARY_PROVIDER=anthropic 266 +ANTHROPIC_API_KEY=sk-ant-... 267 + 268 +# Fallback provider 269 +LLM_FALLBACK_PROVIDER=openai 270 +OPENAI_API_KEY=sk-... 271 + 272 +# Provider selection per stage 273 +LLM_STAGE1_PROVIDER=anthropic 274 +LLM_STAGE1_MODEL=claude-haiku-4 275 +LLM_STAGE2_PROVIDER=anthropic 276 +LLM_STAGE2_MODEL=claude-sonnet-3-5 277 +LLM_STAGE3_PROVIDER=anthropic 278 +LLM_STAGE3_MODEL=claude-sonnet-3-5 279 + 280 +# Cost limits 281 +LLM_MAX_COST_PER_REQUEST=1.00 282 +}}} 283 + 284 +**Database Configuration (Alternative):** 285 + 286 +{{{{ 287 +{ 288 + "providers": [ 289 + { 290 + "name": "anthropic", 291 + "api_key_ref": "vault://anthropic-api-key", 292 + "enabled": true, 293 + "priority": 1 294 + }, 295 + { 296 + "name": "openai", 297 + "api_key_ref": "vault://openai-api-key", 298 + "enabled": true, 299 + "priority": 2 300 + } 301 + ], 302 + "stage_config": { 303 + "stage1": { 304 + "provider": "anthropic", 305 + "model": "claude-haiku-4", 306 + "max_tokens": 4096, 307 + "temperature": 0.0 308 + }, 309 + "stage2": { 310 + "provider": "anthropic", 311 + "model": "claude-sonnet-3-5", 312 + "max_tokens": 16384, 313 + "temperature": 0.3 314 + }, 315 + "stage3": { 316 + "provider": "anthropic", 317 + "model": "claude-sonnet-3-5", 318 + "max_tokens": 8192, 319 + "temperature": 0.2 320 + } 321 + } 322 +} 323 +}}} 324 + 325 +---- 326 + 327 +=== 6.5 Stage-Specific Models (POC1 Defaults) === 328 + 329 +**Stage 1: Claim Extraction** 330 + 331 +* **Default:** Anthropic Claude Haiku 4 332 +* **Alternative:** OpenAI GPT-4o-mini, Google Gemini 1.5 Flash 333 +* **Rationale:** Fast, cheap, simple task 334 +* **Cost:** ~$0.003 per article 335 + 336 +**Stage 2: Claim Analysis** (CACHEABLE) 337 + 338 +* **Default:** Anthropic Claude Sonnet 3.5 339 +* **Alternative:** OpenAI GPT-4o, Google Gemini 1.5 Pro 340 +* **Rationale:** High-quality analysis, cached 90 days 341 +* **Cost:** ~$0.081 per NEW claim 342 + 343 +**Stage 3: Holistic Assessment** 344 + 345 +* **Default:** Anthropic Claude Sonnet 3.5 346 +* **Alternative:** OpenAI GPT-4o, Claude Opus 4 (for high-stakes) 347 +* **Rationale:** Complex reasoning, logical fallacy detection 348 +* **Cost:** ~$0.030 per article 349 + 350 +**Cost Comparison (Example):** 351 + 352 +|=Stage|=Anthropic (Default)|=OpenAI Alternative|=Google Alternative 353 +|Stage 1|Claude Haiku 4 ($0.003)|GPT-4o-mini ($0.002)|Gemini Flash ($0.002) 354 +|Stage 2|Claude Sonnet 3.5 ($0.081)|GPT-4o ($0.045)|Gemini Pro ($0.050) 355 +|Stage 3|Claude Sonnet 3.5 ($0.030)|GPT-4o ($0.018)|Gemini Pro ($0.020) 356 +|**Total (0% cache)**|**$0.114**|**$0.065**|**$0.072** 357 + 358 +**Note:** POC1 uses Anthropic exclusively for consistency. Multi-provider support planned for POC2. 359 + 360 +---- 361 + 362 +=== 6.6 Failover Strategy === 363 + 364 +**Automatic Failover:** 365 + 366 +{{{ 367 +async function completeLLM(stage: string, prompt: string): Promise<string> { 368 + const primaryProvider = getProviderForStage(stage) 369 + const fallbackProvider = getFallbackProvider() 370 + 371 + try { 372 + return await primaryProvider.complete(prompt) 373 + } catch (error) { 374 + if (error.type === 'rate_limit' || error.type === 'service_unavailable') { 375 + logger.warn(`Primary provider failed, using fallback`) 376 + return await fallbackProvider.complete(prompt) 377 + } 378 + throw error 379 + } 380 +} 381 +}}} 382 + 383 +**Fallback Priority:** 384 + 385 +1. **Primary:** Configured provider for stage 386 +2. **Secondary:** Fallback provider (if configured) 387 +3. **Cache:** Return cached result (if available for Stage 2) 388 +4. **Error:** Return 503 Service Unavailable 389 + 390 +---- 391 + 392 +=== 6.7 Provider Selection API === 393 + 394 +**Admin Endpoint:** POST /admin/v1/llm/configure 395 + 396 +**Update provider for specific stage:** 397 + 398 +{{{{ 399 +{ 400 + "stage": "stage2", 401 + "provider": "openai", 402 + "model": "gpt-4o", 403 + "max_tokens": 16384, 404 + "temperature": 0.3 405 +} 406 +}}} 407 + 408 +**Response:** 200 OK 409 + 410 +{{{{ 411 +{ 412 + "message": "LLM configuration updated", 413 + "stage": "stage2", 414 + "previous": { 415 + "provider": "anthropic", 416 + "model": "claude-sonnet-3-5" 417 + }, 418 + "current": { 419 + "provider": "openai", 420 + "model": "gpt-4o" 421 + }, 422 + "cost_impact": { 423 + "previous_cost_per_claim": 0.081, 424 + "new_cost_per_claim": 0.045, 425 + "savings_percent": 44 426 + } 427 +} 428 +}}} 429 + 430 +**Get current configuration:** 431 + 432 +GET /admin/v1/llm/config 433 + 434 +{{{{ 435 +{ 436 + "providers": ["anthropic", "openai"], 437 + "primary": "anthropic", 438 + "fallback": "openai", 439 + "stages": { 440 + "stage1": { 441 + "provider": "anthropic", 442 + "model": "claude-haiku-4", 443 + "cost_per_request": 0.003 444 + }, 445 + "stage2": { 446 + "provider": "anthropic", 447 + "model": "claude-sonnet-3-5", 448 + "cost_per_new_claim": 0.081 449 + }, 450 + "stage3": { 451 + "provider": "anthropic", 452 + "model": "claude-sonnet-3-5", 453 + "cost_per_request": 0.030 454 + } 455 + } 456 +} 457 +}}} 458 + 459 +---- 460 + 461 +=== 6.8 Implementation Notes === 462 + 463 +**Provider Adapter Pattern:** 464 + 465 +{{{ 466 +class AnthropicProvider implements LLMProvider { 467 + async complete(prompt: string, options: CompletionOptions) { 468 + const response = await anthropic.messages.create({ 469 + model: options.model || 'claude-sonnet-3-5', 470 + max_tokens: options.maxTokens || 4096, 471 + messages: [{ role: 'user', content: prompt }], 472 + system: options.systemPrompt 473 + }) 474 + return response.content[0].text 475 + } 476 +} 477 + 478 +class OpenAIProvider implements LLMProvider { 479 + async complete(prompt: string, options: CompletionOptions) { 480 + const response = await openai.chat.completions.create({ 481 + model: options.model || 'gpt-4o', 482 + max_tokens: options.maxTokens || 4096, 483 + messages: [ 484 + { role: 'system', content: options.systemPrompt }, 485 + { role: 'user', content: prompt } 486 + ] 487 + }) 488 + return response.choices[0].message.content 489 + } 490 +} 491 +}}} 492 + 493 +**Provider Registry:** 494 + 495 +{{{ 496 +const providers = new Map<string, LLMProvider>() 497 +providers.set('anthropic', new AnthropicProvider()) 498 +providers.set('openai', new OpenAIProvider()) 499 +providers.set('google', new GoogleProvider()) 500 + 501 +function getProvider(name: string): LLMProvider { 502 + return providers.get(name) || providers.get(config.primaryProvider) 503 +} 504 +}}} 505 + 506 +---- 507 + 181 181 == 3. REST API Contract == 182 182 183 183 === 3.1 User Credit Tracking === ... ... @@ -187,19 +187,19 @@ 187 187 **Response:** 200 OK 188 188 189 189 {{{{ 190 - "user_id": "user_abc123",191 - "tier": "free",192 - "credit_limit": 10.00,193 - "credit_used": 7.42,194 - "credit_remaining": 2.58,195 - "reset_date": "2025-02-01T00:00:00Z",196 - "cache_only_mode": false,197 - "usage_stats": {198 - "articles_analyzed": 67,199 - "claims_from_cache": 189,200 - "claims_newly_analyzed": 113,201 - "cache_hit_rate": 0.626202 - }517 + "user_id": "user_abc123", 518 + "tier": "free", 519 + "credit_limit": 10.00, 520 + "credit_used": 7.42, 521 + "credit_remaining": 2.58, 522 + "reset_date": "2025-02-01T00:00:00Z", 523 + "cache_only_mode": false, 524 + "usage_stats": { 525 + "articles_analyzed": 67, 526 + "claims_from_cache": 189, 527 + "claims_newly_analyzed": 113, 528 + "cache_hit_rate": 0.626 529 + } 203 203 } 204 204 }}} 205 205 ... ... @@ -220,11 +220,11 @@ 220 220 OR use the client.request_id field: 221 221 222 222 {{{{ 223 - "input_url": "...",224 - "client": {225 - "request_id": "client-uuid-12345",226 - "source_label": "optional"227 - }550 + "input_url": "...", 551 + "client": { 552 + "request_id": "client-uuid-12345", 553 + "source_label": "optional" 554 + } 228 228 } 229 229 }}} 230 230 ... ... @@ -238,11 +238,11 @@ 238 238 **Example Response (Idempotent):** 239 239 240 240 {{{{ 241 - "job_id": "01J...ULID",242 - "status": "RUNNING",243 - "idempotent": true,244 - "original_request_at": "2025-12-24T10:31:00Z",245 - "message": "Returning existing job (idempotency key matched)"568 + "job_id": "01J...ULID", 569 + "status": "RUNNING", 570 + "idempotent": true, 571 + "original_request_at": "2025-12-24T10:31:00Z", 572 + "message": "Returning existing job (idempotency key matched)" 246 246 } 247 247 }}} 248 248 ... ... @@ -249,21 +249,21 @@ 249 249 ==== Request Body: ==== 250 250 251 251 {{{{ 252 - "input_type": "url",253 - "input_url": "https://example.com/medical-report-01",254 - "input_text": null,255 - "options": {256 - "browsing": "on",257 - "depth": "standard",258 - "max_claims": 5,259 - "scenarios_per_claim": 2,260 - "max_evidence_per_scenario": 6,261 - "context_aware_analysis": true262 - },263 - "client": {264 - "request_id": "optional-client-tracking-id",265 - "source_label": "optional"266 - }579 + "input_type": "url", 580 + "input_url": "https://example.com/medical-report-01", 581 + "input_text": null, 582 + "options": { 583 + "browsing": "on", 584 + "depth": "standard", 585 + "max_claims": 5, 586 + "scenarios_per_claim": 2, 587 + "max_evidence_per_scenario": 6, 588 + "context_aware_analysis": true 589 + }, 590 + "client": { 591 + "request_id": "optional-client-tracking-id", 592 + "source_label": "optional" 593 + } 267 267 } 268 268 }}} 269 269 ... ... @@ -279,27 +279,27 @@ 279 279 **Response:** 202 Accepted 280 280 281 281 {{{{ 282 - "job_id": "01J...ULID",283 - "status": "QUEUED",284 - "created_at": "2025-12-24T10:31:00Z",285 - "estimated_cost": 0.114,286 - "cost_breakdown": {287 - "stage1_extraction": 0.003,288 - "stage2_new_claims": 0.081,289 - "stage2_cached_claims": 0.000,290 - "stage3_holistic": 0.030291 - },292 - "cache_info": {293 - "claims_to_extract": 5,294 - "estimated_cache_hits": 4,295 - "estimated_new_claims": 1296 - },297 - "links": {298 - "self": "/v1/jobs/01J...ULID",299 - "result": "/v1/jobs/01J...ULID/result",300 - "report": "/v1/jobs/01J...ULID/report",301 - "events": "/v1/jobs/01J...ULID/events"302 - }609 + "job_id": "01J...ULID", 610 + "status": "QUEUED", 611 + "created_at": "2025-12-24T10:31:00Z", 612 + "estimated_cost": 0.114, 613 + "cost_breakdown": { 614 + "stage1_extraction": 0.003, 615 + "stage2_new_claims": 0.081, 616 + "stage2_cached_claims": 0.000, 617 + "stage3_holistic": 0.030 618 + }, 619 + "cache_info": { 620 + "claims_to_extract": 5, 621 + "estimated_cache_hits": 4, 622 + "estimated_new_claims": 1 623 + }, 624 + "links": { 625 + "self": "/v1/jobs/01J...ULID", 626 + "result": "/v1/jobs/01J...ULID/result", 627 + "report": "/v1/jobs/01J...ULID/report", 628 + "events": "/v1/jobs/01J...ULID/events" 629 + } 303 303 } 304 304 }}} 305 305 ... ... @@ -308,12 +308,12 @@ 308 308 402 Payment Required - Free tier limit reached, cache-only mode 309 309 310 310 {{{{ 311 - "error": "credit_limit_reached",312 - "message": "Monthly credit limit reached. Entering cache-only mode.",313 - "cache_only_mode": true,314 - "credit_remaining": 0.00,315 - "reset_date": "2025-02-01T00:00:00Z",316 - "action": "Resubmit with cache_preference=allow_partial for cached results"638 + "error": "credit_limit_reached", 639 + "message": "Monthly credit limit reached. Entering cache-only mode.", 640 + "cache_only_mode": true, 641 + "credit_remaining": 0.00, 642 + "reset_date": "2025-02-01T00:00:00Z", 643 + "action": "Resubmit with cache_preference=allow_partial for cached results" 317 317 } 318 318 }}} 319 319 ... ... @@ -324,29 +324,29 @@ 324 324 === 4.1 Stage 1 Output: ClaimExtraction === 325 325 326 326 {{{{ 327 - "job_id": "01J...ULID",328 - "stage": "stage1_extraction",329 - "article_metadata": {330 - "title": "Article title",331 - "source_url": "https://example.com/article",332 - "extracted_text_length": 5234,333 - "language": "en"334 - },335 - "claims": [336 - {337 - "claim_id": "C1",338 - "claim_text": "Original claim text from article",339 - "canonical_claim": "Normalized, deduplicated phrasing",340 - "claim_hash": "sha256:abc123...",341 - "is_central_to_thesis": true,342 - "claim_type": "causal",343 - "evaluability": "evaluable",344 - "risk_tier": "B",345 - "domain": "public_health"346 - }347 - ],348 - "article_thesis": "Main argument detected",349 - "cost": 0.003654 + "job_id": "01J...ULID", 655 + "stage": "stage1_extraction", 656 + "article_metadata": { 657 + "title": "Article title", 658 + "source_url": "https://example.com/article", 659 + "extracted_text_length": 5234, 660 + "language": "en" 661 + }, 662 + "claims": [ 663 + { 664 + "claim_id": "C1", 665 + "claim_text": "Original claim text from article", 666 + "canonical_claim": "Normalized, deduplicated phrasing", 667 + "claim_hash": "sha256:abc123...", 668 + "is_central_to_thesis": true, 669 + "claim_type": "causal", 670 + "evaluability": "evaluable", 671 + "risk_tier": "B", 672 + "domain": "public_health" 673 + } 674 + ], 675 + "article_thesis": "Main argument detected", 676 + "cost": 0.003 350 350 } 351 351 }}} 352 352 ... ... @@ -432,7 +432,7 @@ 432 432 **Data Structure:** 433 433 434 434 {{{SET claim:v1norm1:en:abc123...def456 '{...ClaimAnalysis JSON...}' 435 -EXPIRE claim:v1norm1:en:abc123...def456 7776000 # 90 days762 +EXPIRE claim:v1norm1:en:abc123...def456 7776000 # 90 days 436 436 }}} 437 437 438 438 ---- ... ... @@ -444,44 +444,44 @@ 444 444 **Algorithm: Canonical Claim Normalization v1** 445 445 446 446 {{{def normalize_claim_v1(claim_text: str, language: str) -> str: 447 - """448 - Normalizes claim to canonical form for cache key generation.449 - Version: v1norm1 (POC1)450 - """451 - import re452 - import unicodedata453 - 454 - # Step 1: Unicode normalization (NFC)455 - text = unicodedata.normalize('NFC', claim_text)456 - 457 - # Step 2: Lowercase458 - text = text.lower()459 - 460 - # Step 3: Remove punctuation (except hyphens in words)461 - text = re.sub(r'[^\w\s-]', '', text)462 - 463 - # Step 4: Normalize whitespace (collapse multiple spaces)464 - text = re.sub(r'\s+', ' ', text).strip()465 - 466 - # Step 5: Numeric normalization467 - text = text.replace('%', ' percent')468 - # Spell out single-digit numbers469 - num_to_word = {'0':'zero', '1':'one', '2':'two', '3':'three',470 - '4':'four', '5':'five', '6':'six', '7':'seven',471 - '8':'eight', '9':'nine'}472 - for num, word in num_to_word.items():473 - text = re.sub(rf'\b{num}\b', word, text)474 - 475 - # Step 6: Common abbreviations (English only in v1)476 - if language == 'en':477 - text = text.replace('covid-19', 'covid')478 - text = text.replace('u.s.', 'us')479 - text = text.replace('u.k.', 'uk')480 - 481 - # Step 7: NO entity normalization in v1482 - # (Trump vs Donald Trump vs President Trump remain distinct)483 - 484 - return text774 + """ 775 + Normalizes claim to canonical form for cache key generation. 776 + Version: v1norm1 (POC1) 777 + """ 778 + import re 779 + import unicodedata 780 + 781 + # Step 1: Unicode normalization (NFC) 782 + text = unicodedata.normalize('NFC', claim_text) 783 + 784 + # Step 2: Lowercase 785 + text = text.lower() 786 + 787 + # Step 3: Remove punctuation (except hyphens in words) 788 + text = re.sub(r'[^\w\s-]', '', text) 789 + 790 + # Step 4: Normalize whitespace (collapse multiple spaces) 791 + text = re.sub(r'\s+', ' ', text).strip() 792 + 793 + # Step 5: Numeric normalization 794 + text = text.replace('%', ' percent') 795 + # Spell out single-digit numbers 796 + num_to_word = {'0':'zero', '1':'one', '2':'two', '3':'three', 797 + '4':'four', '5':'five', '6':'six', '7':'seven', 798 + '8':'eight', '9':'nine'} 799 + for num, word in num_to_word.items(): 800 + text = re.sub(rf'\b{num}\b', word, text) 801 + 802 + # Step 6: Common abbreviations (English only in v1) 803 + if language == 'en': 804 + text = text.replace('covid-19', 'covid') 805 + text = text.replace('u.s.', 'us') 806 + text = text.replace('u.k.', 'uk') 807 + 808 + # Step 7: NO entity normalization in v1 809 + # (Trump vs Donald Trump vs President Trump remain distinct) 810 + 811 + return text 485 485 486 486 # Version identifier (include in cache namespace) 487 487 CANONICALIZER_VERSION = "v1norm1" ... ... @@ -494,19 +494,19 @@ 494 494 cache_key = f"claim:{CANONICALIZER_VERSION}:{language}:{sha256(canonical)}" 495 495 496 496 Example: 497 - claim: "COVID-19 vaccines are 95% effective"498 - canonical: "covid vaccines are 95 percent effective"499 - sha256: abc123...def456500 - key: "claim:v1norm1:en:abc123...def456"824 + claim: "COVID-19 vaccines are 95% effective" 825 + canonical: "covid vaccines are 95 percent effective" 826 + sha256: abc123...def456 827 + key: "claim:v1norm1:en:abc123...def456" 501 501 }}} 502 502 503 503 **Cache Metadata MUST Include:** 504 504 505 505 {{{{ 506 - "canonical_claim": "covid vaccines are 95 percent effective",507 - "canonicalizer_version": "v1norm1",508 - "language": "en",509 - "original_claim_samples": ["COVID-19 vaccines are 95% effective"]833 + "canonical_claim": "covid vaccines are 95 percent effective", 834 + "canonicalizer_version": "v1norm1", 835 + "language": "en", 836 + "original_claim_samples": ["COVID-19 vaccines are 95% effective"] 510 510 } 511 511 }}} 512 512