Last modified by Robert Schaub on 2025/12/24 18:26

= POC1 API & Schemas Specification =

----

== Version History ==

|=Version|=Date|=Changes
|0.4.1|2025-12-24|Applied 9 critical fixes: file format notice, verdict taxonomy, canonicalization algorithm, Stage 1 cost policy, BullMQ fix, language in cache key, historical claims TTL, idempotency, copyright policy
|0.4|2025-12-24|**BREAKING:** 3-stage pipeline with claim-level caching, user tier system, cache-only mode for free users, Redis cache architecture
|0.3.1|2025-12-24|Fixed single-prompt strategy, SSE clarification, schema canonicalization, cost constraints
|0.3|2025-12-24|Added complete API endpoints, LLM config, risk tiers, scraping details

----

== 1. Core Objective (POC1) ==

The primary technical goal of POC1 is to validate **Approach 1 (Single-Pass Holistic Analysis)** while implementing **claim-level caching** to achieve cost sustainability.

The system must prove that AI can identify an article's **Main Thesis** and determine whether its supporting claims logically support that thesis without committing fallacies.

=== Success Criteria: ===

* Test with 30 diverse articles
* Target: ≥70% accuracy detecting misleading articles
* Cost: <$0.25 per NEW analysis (uncached)
* Cost: $0.00 for cached claim reuse
* Cache hit rate: ≥50% after 1,000 articles
* Processing time: <2 minutes (standard depth)

=== Economic Model: ===

* **Free tier:** $10 credit per month (~~40-140 articles depending on cache hits)
* **After limit:** Cache-only mode (instant, free access to cached claims)
* **Paid tier:** Unlimited new analyses

----

== 2. Architecture Overview ==

=== 2.1 3-Stage Pipeline with Caching ===

FactHarbor POC1 uses a **3-stage architecture** designed for claim-level caching and cost efficiency:

{{mermaid}}
graph TD
A[Article Input] --> B[Stage 1: Extract Claims]
B --> C{For Each Claim}
C --> D[Check Cache]
D -->|Cache HIT| E[Return Cached Verdict]
D -->|Cache MISS| F[Stage 2: Analyze Claim]
F --> G[Store in Cache]
G --> E
E --> H[Stage 3: Holistic Assessment]
H --> I[Final Report]
{{/mermaid}}

==== Stage 1: Claim Extraction (Haiku, no cache) ====

* **Input:** Article text
* **Output:** 5 canonical claims (normalized, deduplicated)
* **Model:** Claude Haiku 4 (default, configurable via LLM abstraction layer)
* **Cost:** $0.003 per article
* **Cache strategy:** No caching (article-specific)

==== Stage 2: Claim Analysis (Sonnet, CACHED) ====

* **Input:** Single canonical claim
* **Output:** Scenarios + Evidence + Verdicts
* **Model:** Claude Sonnet 3.5 (default, configurable via LLM abstraction layer)
* **Cost:** $0.081 per NEW claim
* **Cache strategy:** Redis, 90-day TTL
* **Cache key:** claim:v1norm1:{language}:{sha256(canonical_claim)}

==== Stage 3: Holistic Assessment (Sonnet, no cache) ====

* **Input:** Article + Claim verdicts (from cache or Stage 2)
* **Output:** Article verdict + Fallacies + Logic quality
* **Model:** Claude Sonnet 3.5 (default, configurable via LLM abstraction layer)
* **Cost:** $0.030 per article
* **Cache strategy:** No caching (article-specific)

**Note:** Stage 3 implements **Approach 1 (Single-Pass Holistic Analysis)** from the [[Article Verdict Problem>>Test.FactHarbor.Specification.POC.Article-Verdict-Problem]]. While claim analysis (Stage 2) is cached for efficiency, the holistic assessment maintains the integrated evaluation philosophy of Approach 1.
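The stage flow above can be sketched as a single dispatch loop. This is an illustrative sketch, not normative: `extract_claims`, `analyze_claim`, and `assess` are hypothetical stand-ins for the Stage 1/2/3 LLM calls, and a plain dict stands in for the Redis cache (the real key derivation normalizes the claim first, per Section 5.1.1):

```python
import hashlib

def cache_key(canonical_claim: str, language: str = "en") -> str:
    # Key schema from Section 5: claim:v1norm1:{language}:{sha256(canonical_claim)}
    digest = hashlib.sha256(canonical_claim.encode("utf-8")).hexdigest()
    return f"claim:v1norm1:{language}:{digest}"

def run_pipeline(article, cache, extract_claims, analyze_claim, assess):
    verdicts = {}
    for claim in extract_claims(article):        # Stage 1: never cached
        key = cache_key(claim)
        if key in cache:                         # cache HIT: $0.00, no LLM call
            verdicts[claim] = cache[key]
        else:                                    # cache MISS: Stage 2 LLM call
            verdicts[claim] = cache[key] = analyze_claim(claim)
    return assess(article, verdicts)             # Stage 3: never cached
```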

=== Total Cost Formula: ===

{{{Cost = $0.003 (extraction) + (N_new_claims × $0.081) + $0.030 (holistic)

Examples:
- 0 new claims (100% cache hit): $0.033
- 1 new claim (80% cache hit): $0.114
- 3 new claims (40% cache hit): $0.276
- 5 new claims (0% cache hit): $0.438
}}}
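The formula transcribes directly into code (illustrative only; `analysis_cost` is not part of the API surface):

```python
EXTRACTION_COST = 0.003   # Stage 1, per article
NEW_CLAIM_COST = 0.081    # Stage 2, per uncached claim
HOLISTIC_COST = 0.030     # Stage 3, per article

def analysis_cost(n_new_claims: int) -> float:
    """Per-article cost in USD for a run with n_new_claims cache misses."""
    return round(EXTRACTION_COST + n_new_claims * NEW_CLAIM_COST + HOLISTIC_COST, 3)
```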

----

=== 2.2 User Tier System ===

|=Tier|=Monthly Credit|=After Limit|=Cache Access|=Analytics
|**Free**|$10|Cache-only mode|✅ Full|Basic
|**Pro** (future)|$50|Continues|✅ Full|Advanced
|**Enterprise** (future)|Custom|Continues|✅ Full + Priority|Full

**Free Tier Economics:**

* $10 credit = 40-140 articles analyzed (depending on cache hit rate)
* Roughly 65 articles/month at a 70% cache hit rate (≈$0.155 per article under the cost formula above)
* After limit: Cache-only mode

----

=== 2.3 Cache-Only Mode (Free Tier Feature) ===

When free users reach their $10 monthly limit, they enter **Cache-Only Mode**:

==== What Cache-Only Mode Provides: ====

✅ **Claim Extraction (Platform-Funded):**

* Stage 1 extraction runs at $0.003 per article
* **Cost: Absorbed by platform** (not charged to user credit)
* Rationale: Extraction is necessary to check the cache, and the cost is negligible
* Rate limit: Max 50 extractions/day in cache-only mode (prevents abuse)

✅ **Instant Access to Cached Claims:**

* Any claim that exists in cache → Full verdict returned
* Cost: $0 (no LLM calls)
* Response time: <100ms

✅ **Partial Article Analysis:**

* Check each claim against the cache
* Return verdicts for ALL cached claims
* For uncached claims: return a "status": "cache_miss" stub

✅ **Cache Coverage Report:**

* "3 of 5 claims available in cache (60% coverage)"
* Links to cached analyses
* Estimated cost to complete: $0.162 (2 new claims)

❌ **Not Available in Cache-Only Mode:**

* New claim analysis (Stage 2 LLM calls blocked)
* Full holistic assessment (Stage 3 blocked if any claims are missing)

==== User Experience Example: ====

{{{{
  "status": "cache_only_mode",
  "message": "Monthly credit limit reached. Showing cached results only.",
  "cache_coverage": {
    "claims_total": 5,
    "claims_cached": 3,
    "claims_missing": 2,
    "coverage_percent": 60
  },
  "cached_claims": [
    {"claim_id": "C1", "verdict": "Likely", "confidence": 0.82},
    {"claim_id": "C2", "verdict": "Highly Likely", "confidence": 0.91},
    {"claim_id": "C4", "verdict": "Unclear", "confidence": 0.55}
  ],
  "missing_claims": [
    {"claim_id": "C3", "claim_text": "...", "estimated_cost": "$0.081"},
    {"claim_id": "C5", "claim_text": "...", "estimated_cost": "$0.081"}
  ],
  "upgrade_options": {
    "top_up": "$5 for 20-70 more articles",
    "pro_tier": "$50/month unlimited"
  }
}
}}}

**Design Rationale:**

* Free users still get value (cached claims often answer their question)
* Demonstrates FactHarbor's value (partial results encourage upgrade)
* Sustainable for the platform (no additional cost)
* Fair to all users (everyone contributes to the cache)
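The coverage report and completion estimate in the example reduce to simple arithmetic. A minimal sketch, assuming claims have already been canonicalized to their cache keys (`coverage_report` is a hypothetical helper, not a spec-defined endpoint):

```python
NEW_CLAIM_COST = 0.081  # Stage 2 cost per uncached claim

def coverage_report(claim_keys, cache):
    """Builds the cache-coverage skeleton shown in the example response."""
    cached = [k for k in claim_keys if k in cache]
    missing = [k for k in claim_keys if k not in cache]
    return {
        "claims_total": len(claim_keys),
        "claims_cached": len(cached),
        "claims_missing": len(missing),
        "coverage_percent": round(100 * len(cached) / len(claim_keys)) if claim_keys else 0,
        "estimated_cost_to_complete": round(len(missing) * NEW_CLAIM_COST, 3),
    }
```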

----

== 6. LLM Abstraction Layer ==

=== 6.1 Design Principle ===

**FactHarbor uses a provider-agnostic LLM abstraction** to avoid vendor lock-in and enable:

* **Provider switching:** Change LLM providers without code changes
* **Cost optimization:** Use different providers for different stages
* **Resilience:** Automatic fallback if the primary provider fails
* **Cross-checking:** Compare outputs from multiple providers
* **A/B testing:** Test new models without deployment changes

**Implementation:** All LLM calls go through an abstraction layer that routes to configured providers.

----

=== 6.2 LLM Provider Interface ===

**Abstract Interface:**

{{{
interface LLMProvider {
  // Core methods
  complete(prompt: string, options: CompletionOptions): Promise<CompletionResponse>
  stream(prompt: string, options: CompletionOptions): AsyncIterator<StreamChunk>

  // Provider metadata
  getName(): string
  getMaxTokens(): number
  getCostPer1kTokens(): { input: number, output: number }

  // Health check
  isAvailable(): Promise<boolean>
}

interface CompletionOptions {
  model?: string
  maxTokens?: number
  temperature?: number
  stopSequences?: string[]
  systemPrompt?: string
}
}}}

----

=== 6.3 Supported Providers (POC1) ===

**Primary Provider (Default):**

* **Anthropic Claude API**
* Models: Claude Haiku 4, Claude Sonnet 3.5, Claude Opus 4
* Used by default in POC1
* Best quality for holistic analysis

**Secondary Providers (Future):**

* **OpenAI API**
* Models: GPT-4o, GPT-4o-mini
* For cost comparison

* **Google Vertex AI**
* Models: Gemini 1.5 Pro, Gemini 1.5 Flash
* For diversity in evidence gathering

* **Local Models** (Post-POC)
* Models: Llama 3.1, Mistral
* For privacy-sensitive deployments

----

=== 6.4 Provider Configuration ===

**Environment Variables:**

{{{
# Primary provider
LLM_PRIMARY_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...

# Fallback provider
LLM_FALLBACK_PROVIDER=openai
OPENAI_API_KEY=sk-...

# Provider selection per stage
LLM_STAGE1_PROVIDER=anthropic
LLM_STAGE1_MODEL=claude-haiku-4
LLM_STAGE2_PROVIDER=anthropic
LLM_STAGE2_MODEL=claude-sonnet-3-5
LLM_STAGE3_PROVIDER=anthropic
LLM_STAGE3_MODEL=claude-sonnet-3-5

# Cost limits
LLM_MAX_COST_PER_REQUEST=1.00
}}}
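The variables above could be resolved per stage along these lines (a sketch; the function name, the `env` parameter, and the fallback defaults are illustrative assumptions, not spec-defined behavior):

```python
import os

def stage_llm_config(stage: int, env=os.environ) -> dict:
    """Resolves provider/model for a pipeline stage, falling back to POC1 defaults."""
    prefix = f"LLM_STAGE{stage}_"
    default_model = "claude-haiku-4" if stage == 1 else "claude-sonnet-3-5"
    return {
        "provider": env.get(prefix + "PROVIDER", "anthropic"),
        "model": env.get(prefix + "MODEL", default_model),
        "max_cost_per_request": float(env.get("LLM_MAX_COST_PER_REQUEST", "1.00")),
    }
```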

**Database Configuration (Alternative):**

{{{{
  "providers": [
    {
      "name": "anthropic",
      "api_key_ref": "vault://anthropic-api-key",
      "enabled": true,
      "priority": 1
    },
    {
      "name": "openai",
      "api_key_ref": "vault://openai-api-key",
      "enabled": true,
      "priority": 2
    }
  ],
  "stage_config": {
    "stage1": {
      "provider": "anthropic",
      "model": "claude-haiku-4",
      "max_tokens": 4096,
      "temperature": 0.0
    },
    "stage2": {
      "provider": "anthropic",
      "model": "claude-sonnet-3-5",
      "max_tokens": 16384,
      "temperature": 0.3
    },
    "stage3": {
      "provider": "anthropic",
      "model": "claude-sonnet-3-5",
      "max_tokens": 8192,
      "temperature": 0.2
    }
  }
}
}}}

----

=== 6.5 Stage-Specific Models (POC1 Defaults) ===

**Stage 1: Claim Extraction**

* **Default:** Anthropic Claude Haiku 4
* **Alternative:** OpenAI GPT-4o-mini, Google Gemini 1.5 Flash
* **Rationale:** Fast, cheap, simple task
* **Cost:** ~~$0.003 per article

**Stage 2: Claim Analysis** (CACHEABLE)

* **Default:** Anthropic Claude Sonnet 3.5
* **Alternative:** OpenAI GPT-4o, Google Gemini 1.5 Pro
* **Rationale:** High-quality analysis, cached 90 days
* **Cost:** ~~$0.081 per NEW claim

**Stage 3: Holistic Assessment**

* **Default:** Anthropic Claude Sonnet 3.5
* **Alternative:** OpenAI GPT-4o, Claude Opus 4 (for high-stakes)
* **Rationale:** Complex reasoning, logical fallacy detection
* **Cost:** ~~$0.030 per article

**Cost Comparison (Example):**

|=Stage|=Anthropic (Default)|=OpenAI Alternative|=Google Alternative
|Stage 1|Claude Haiku 4 ($0.003)|GPT-4o-mini ($0.002)|Gemini Flash ($0.002)
|Stage 2|Claude Sonnet 3.5 ($0.081)|GPT-4o ($0.045)|Gemini Pro ($0.050)
|Stage 3|Claude Sonnet 3.5 ($0.030)|GPT-4o ($0.018)|Gemini Pro ($0.020)
|**Total (0% cache)**|**$0.114**|**$0.065**|**$0.072**

**Note:** POC1 uses Anthropic exclusively for consistency. Multi-provider support is planned for POC2.

----

=== 6.6 Failover Strategy ===

**Automatic Failover:**

{{{
async function completeLLM(stage: string, prompt: string, options: CompletionOptions): Promise<string> {
  const primaryProvider = getProviderForStage(stage)
  const fallbackProvider = getFallbackProvider()

  try {
    return await primaryProvider.complete(prompt, options)
  } catch (error) {
    if (error.type === 'rate_limit' || error.type === 'service_unavailable') {
      logger.warn('Primary provider failed, using fallback')
      return await fallbackProvider.complete(prompt, options)
    }
    throw error
  }
}
}}}

**Fallback Priority:**

1. **Primary:** Configured provider for the stage
2. **Secondary:** Fallback provider (if configured)
3. **Cache:** Return cached result (if available for Stage 2)
4. **Error:** Return 503 Service Unavailable

----

=== 6.7 Provider Selection API ===

**Admin Endpoint:** POST /admin/v1/llm/configure

**Update provider for a specific stage:**

{{{{
  "stage": "stage2",
  "provider": "openai",
  "model": "gpt-4o",
  "max_tokens": 16384,
  "temperature": 0.3
}
}}}

**Response:** 200 OK

{{{{
  "message": "LLM configuration updated",
  "stage": "stage2",
  "previous": {
    "provider": "anthropic",
    "model": "claude-sonnet-3-5"
  },
  "current": {
    "provider": "openai",
    "model": "gpt-4o"
  },
  "cost_impact": {
    "previous_cost_per_claim": 0.081,
    "new_cost_per_claim": 0.045,
    "savings_percent": 44
  }
}
}}}

**Get current configuration:**

GET /admin/v1/llm/config

{{{{
  "providers": ["anthropic", "openai"],
  "primary": "anthropic",
  "fallback": "openai",
  "stages": {
    "stage1": {
      "provider": "anthropic",
      "model": "claude-haiku-4",
      "cost_per_request": 0.003
    },
    "stage2": {
      "provider": "anthropic",
      "model": "claude-sonnet-3-5",
      "cost_per_new_claim": 0.081
    },
    "stage3": {
      "provider": "anthropic",
      "model": "claude-sonnet-3-5",
      "cost_per_request": 0.030
    }
  }
}
}}}

----

=== 6.8 Implementation Notes ===

**Provider Adapter Pattern:**

{{{
class AnthropicProvider implements LLMProvider {
  async complete(prompt: string, options: CompletionOptions) {
    const response = await anthropic.messages.create({
      model: options.model || 'claude-sonnet-3-5',
      max_tokens: options.maxTokens || 4096,
      messages: [{ role: 'user', content: prompt }],
      system: options.systemPrompt
    })
    return response.content[0].text
  }
}

class OpenAIProvider implements LLMProvider {
  async complete(prompt: string, options: CompletionOptions) {
    // Omit the system message entirely when no systemPrompt is configured
    const messages = options.systemPrompt
      ? [{ role: 'system', content: options.systemPrompt }, { role: 'user', content: prompt }]
      : [{ role: 'user', content: prompt }]
    const response = await openai.chat.completions.create({
      model: options.model || 'gpt-4o',
      max_tokens: options.maxTokens || 4096,
      messages
    })
    return response.choices[0].message.content
  }
}
}}}

**Provider Registry:**

{{{
const providers = new Map<string, LLMProvider>()
providers.set('anthropic', new AnthropicProvider())
providers.set('openai', new OpenAIProvider())
providers.set('google', new GoogleProvider())

function getProvider(name: string): LLMProvider {
  const provider = providers.get(name) ?? providers.get(config.primaryProvider)
  if (!provider) throw new Error(`No LLM provider registered for '${name}'`)
  return provider
}
}}}

----

== 3. REST API Contract ==

=== 3.1 User Credit Tracking ===

**Endpoint:** GET /v1/user/credit

**Response:** 200 OK

{{{{
  "user_id": "user_abc123",
  "tier": "free",
  "credit_limit": 10.00,
  "credit_used": 7.42,
  "credit_remaining": 2.58,
  "reset_date": "2026-01-01T00:00:00Z",
  "cache_only_mode": false,
  "usage_stats": {
    "articles_analyzed": 67,
    "claims_from_cache": 189,
    "claims_newly_analyzed": 113,
    "cache_hit_rate": 0.626
  }
}
}}}
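The derived fields in this response are pure arithmetic over the stored counters; a sketch (`credit_summary` is an illustrative name, not a spec-defined function):

```python
def credit_summary(credit_limit, credit_used, claims_from_cache, claims_new):
    """Derives the computed fields of GET /v1/user/credit from raw counters."""
    return {
        "credit_remaining": round(credit_limit - credit_used, 2),
        "cache_only_mode": credit_used >= credit_limit,
        "cache_hit_rate": round(claims_from_cache / (claims_from_cache + claims_new), 3),
    }
```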

----

=== 3.2 Create Analysis Job (3-Stage) ===

**Endpoint:** POST /v1/analyze

==== Idempotency Support: ====

To prevent duplicate job creation on network retries, clients SHOULD include:

{{{POST /v1/analyze
Idempotency-Key: {client-generated-uuid}
}}}

OR use the client.request_id field:

{{{{
  "input_url": "...",
  "client": {
    "request_id": "client-uuid-12345",
    "source_label": "optional"
  }
}
}}}

**Server Behavior:**

* If the Idempotency-Key or request_id has been seen before (within 24 hours):
** Return the existing job (200 OK, not 202 Accepted)
** Do NOT create a duplicate job or charge twice
* Idempotency keys expire after 24 hours (matches job retention)
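The behavior above can be sketched with an expiring key table. In production this would live in Redis (an atomic SET with NX and a 24-hour expiry); here a module-level dict keeps the sketch self-contained, and `create_job` is a hypothetical handler, not the actual server code:

```python
import time
import uuid

IDEMPOTENCY_TTL = 24 * 3600  # seconds; matches job retention

_seen: dict = {}  # idempotency_key -> {"job_id": ..., "expires_at": ...}

def create_job(idempotency_key: str, now: float = None) -> dict:
    now = time.time() if now is None else now
    entry = _seen.get(idempotency_key)
    if entry is not None and entry["expires_at"] > now:
        # Key seen within 24h: return the existing job, 200 OK, no double charge
        return {"job_id": entry["job_id"], "http_status": 200, "idempotent": True}
    job_id = str(uuid.uuid4())
    _seen[idempotency_key] = {"job_id": job_id, "expires_at": now + IDEMPOTENCY_TTL}
    return {"job_id": job_id, "http_status": 202, "idempotent": False}
```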

**Example Response (Idempotent):**

{{{{
  "job_id": "01J...ULID",
  "status": "RUNNING",
  "idempotent": true,
  "original_request_at": "2025-12-24T10:31:00Z",
  "message": "Returning existing job (idempotency key matched)"
}
}}}

==== Request Body: ====

{{{{
  "input_type": "url",
  "input_url": "https://example.com/medical-report-01",
  "input_text": null,
  "options": {
    "browsing": "on",
    "depth": "standard",
    "max_claims": 5,
    "scenarios_per_claim": 2,
    "max_evidence_per_scenario": 6,
    "context_aware_analysis": true
  },
  "client": {
    "request_id": "optional-client-tracking-id",
    "source_label": "optional"
  }
}
}}}

**Options:**

* browsing: on | off (retrieve web sources or just output queries)
* depth: standard | deep (evidence thoroughness)
* max_claims: 1-10 (default: **5** for cost control)
* scenarios_per_claim: 1-5 (default: **2** for cost control)
* max_evidence_per_scenario: 3-10 (default: **6**)
* context_aware_analysis: true | false (experimental)
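Server-side validation of these bounds might look as follows (a sketch; `resolve_options` and the assumed default of `false` for context_aware_analysis are illustrative, not normative):

```python
OPTION_BOUNDS = {
    # name: (min, max, default) per the Options list above
    "max_claims": (1, 10, 5),
    "scenarios_per_claim": (1, 5, 2),
    "max_evidence_per_scenario": (3, 10, 6),
}

def resolve_options(requested: dict) -> dict:
    """Applies defaults and range checks to an /v1/analyze options object."""
    resolved = {
        "browsing": requested.get("browsing", "on"),
        "depth": requested.get("depth", "standard"),
        # Default assumed here; the spec marks this option experimental
        "context_aware_analysis": requested.get("context_aware_analysis", False),
    }
    for name, (lo, hi, default) in OPTION_BOUNDS.items():
        value = requested.get(name, default)
        if not (isinstance(value, int) and lo <= value <= hi):
            raise ValueError(f"{name} must be an integer in [{lo}, {hi}]")
        resolved[name] = value
    return resolved
```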

**Response:** 202 Accepted

{{{{
  "job_id": "01J...ULID",
  "status": "QUEUED",
  "created_at": "2025-12-24T10:31:00Z",
  "estimated_cost": 0.114,
  "cost_breakdown": {
    "stage1_extraction": 0.003,
    "stage2_new_claims": 0.081,
    "stage2_cached_claims": 0.000,
    "stage3_holistic": 0.030
  },
  "cache_info": {
    "claims_to_extract": 5,
    "estimated_cache_hits": 4,
    "estimated_new_claims": 1
  },
  "links": {
    "self": "/v1/jobs/01J...ULID",
    "result": "/v1/jobs/01J...ULID/result",
    "report": "/v1/jobs/01J...ULID/report",
    "events": "/v1/jobs/01J...ULID/events"
  }
}
}}}

**Error Responses:**

402 Payment Required - Free tier limit reached, cache-only mode

{{{{
  "error": "credit_limit_reached",
  "message": "Monthly credit limit reached. Entering cache-only mode.",
  "cache_only_mode": true,
  "credit_remaining": 0.00,
  "reset_date": "2026-01-01T00:00:00Z",
  "action": "Resubmit with cache_preference=allow_partial for cached results"
}
}}}

----

== 4. Data Schemas ==

=== 4.1 Stage 1 Output: ClaimExtraction ===

{{{{
  "job_id": "01J...ULID",
  "stage": "stage1_extraction",
  "article_metadata": {
    "title": "Article title",
    "source_url": "https://example.com/article",
    "extracted_text_length": 5234,
    "language": "en"
  },
  "claims": [
    {
      "claim_id": "C1",
      "claim_text": "Original claim text from article",
      "canonical_claim": "Normalized, deduplicated phrasing",
      "claim_hash": "sha256:abc123...",
      "is_central_to_thesis": true,
      "claim_type": "causal",
      "evaluability": "evaluable",
      "risk_tier": "B",
      "domain": "public_health"
    }
  ],
  "article_thesis": "Main argument detected",
  "cost": 0.003
}
}}}

----

=== 4.5 Verdict Label Taxonomy ===

FactHarbor uses **three distinct verdict taxonomies** depending on the analysis level:

==== 4.5.1 Scenario Verdict Labels (Stage 2) ====

Used for individual scenario verdicts within a claim.

**Enum Values:**

* Highly Likely - Probability 0.85-1.0, high confidence
* Likely - Probability 0.65-0.84, moderate-high confidence
* Unclear - Probability 0.35-0.64, or low confidence
* Unlikely - Probability 0.16-0.34, moderate-high confidence
* Highly Unlikely - Probability 0.0-0.15, high confidence
* Unsubstantiated - Insufficient evidence to determine probability
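A sketch of the band-to-label mapping (illustrative; the real verdict logic also folds in confidence, which this sketch reduces to a single evidence-sufficiency flag):

```python
def scenario_label(probability: float, sufficient_evidence: bool = True) -> str:
    """Maps a scenario probability to its verdict label per the bands above."""
    if not sufficient_evidence:
        return "Unsubstantiated"
    if probability >= 0.85:
        return "Highly Likely"
    if probability >= 0.65:
        return "Likely"
    if probability >= 0.35:
        return "Unclear"
    if probability >= 0.16:
        return "Unlikely"
    return "Highly Unlikely"
```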

==== 4.5.2 Claim Verdict Labels (Rollup) ====

Used when summarizing a claim across all scenarios.

**Enum Values:**

* Supported - Majority of scenarios are Likely or Highly Likely
* Refuted - Majority of scenarios are Unlikely or Highly Unlikely
* Inconclusive - Mixed scenarios or majority Unclear/Unsubstantiated

**Mapping Logic:**

* If ≥60% of scenarios are (Highly Likely | Likely) → Supported
* If ≥60% of scenarios are (Highly Unlikely | Unlikely) → Refuted
* Otherwise → Inconclusive
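The mapping logic translates directly (a sketch; `claim_rollup` is a hypothetical name for the rollup step):

```python
def claim_rollup(scenario_labels: list) -> str:
    """Rolls scenario verdict labels up to a claim verdict per the 60% rule above."""
    n = len(scenario_labels)
    supported = sum(lbl in ("Highly Likely", "Likely") for lbl in scenario_labels)
    refuted = sum(lbl in ("Highly Unlikely", "Unlikely") for lbl in scenario_labels)
    if supported / n >= 0.6:
        return "Supported"
    if refuted / n >= 0.6:
        return "Refuted"
    return "Inconclusive"
```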

==== 4.5.3 Article Verdict Labels (Stage 3) ====

Used for holistic article-level assessment.

**Enum Values:**

* WELL-SUPPORTED - Article thesis logically follows from supported claims
* MISLEADING - Claims may be true, but the article commits logical fallacies
* REFUTED - Central claims are refuted, invalidating the thesis
* UNCERTAIN - Insufficient evidence or highly mixed claim verdicts

**Note:** The article verdict considers **claim centrality** (central claims override supporting claims).

==== 4.5.4 API Field Mapping ====

|=Level|=API Field|=Enum Name
|Scenario|scenarios[].verdict.label|scenario_verdict_label
|Claim|claims[].rollup_verdict (optional)|claim_verdict_label
|Article|article_holistic_assessment.overall_verdict|article_verdict_label

----

== 5. Cache Architecture ==

=== 5.1 Redis Cache Design ===

**Technology:** Redis 7.0+ (in-memory key-value store)

**Cache Key Schema:**

{{{claim:v1norm1:{language}:{sha256(canonical_claim)}
}}}

**Example:**

{{{Claim (English): "COVID vaccines are 95% effective"
Canonical: "covid vaccines are 95 percent effective"
Language: "en"
SHA256: abc123...def456
Key: claim:v1norm1:en:abc123...def456
}}}

**Rationale:** Prevents cross-language collisions and enables per-language cache analytics.

**Data Structure:**

{{{SET claim:v1norm1:en:abc123...def456 '{...ClaimAnalysis JSON...}'
EXPIRE claim:v1norm1:en:abc123...def456 7776000 # 90 days
}}}
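The SET/EXPIRE pair above can also be issued as a single SETEX. A thin wrapper sketch (`ClaimCache` is illustrative; `client` is assumed to be any object with redis-py style `setex`/`get` methods, so an in-memory stub can stand in for Redis during tests):

```python
import hashlib
import json

TTL_SECONDS = 90 * 24 * 3600  # 7776000 = 90 days

class ClaimCache:
    def __init__(self, client, language: str = "en"):
        self.client = client
        self.language = language

    def key(self, canonical_claim: str) -> str:
        digest = hashlib.sha256(canonical_claim.encode("utf-8")).hexdigest()
        return f"claim:v1norm1:{self.language}:{digest}"

    def store(self, canonical_claim: str, analysis: dict) -> None:
        # SETEX = SET + EXPIRE in one call
        self.client.setex(self.key(canonical_claim), TTL_SECONDS, json.dumps(analysis))

    def lookup(self, canonical_claim: str):
        raw = self.client.get(self.key(canonical_claim))
        return None if raw is None else json.loads(raw)
```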

----

=== 5.1.1 Canonical Claim Normalization (v1) ===

The cache key depends on deterministic claim normalization. All implementations MUST follow this algorithm exactly.

**Algorithm: Canonical Claim Normalization v1**

{{{def normalize_claim_v1(claim_text: str, language: str) -> str:
    """
    Normalizes a claim to canonical form for cache key generation.
    Version: v1norm1 (POC1)
    """
    import re
    import unicodedata

    # Step 1: Unicode normalization (NFC)
    text = unicodedata.normalize('NFC', claim_text)

    # Step 2: Lowercase
    text = text.lower()

    # Step 3: Common abbreviations (English only in v1).
    # Runs BEFORE punctuation removal; otherwise patterns such as
    # "u.s." could never match.
    if language == 'en':
        text = text.replace('covid-19', 'covid')
        text = text.replace('u.s.', 'us')
        text = text.replace('u.k.', 'uk')

    # Step 4: Numeric symbols. Also BEFORE punctuation removal,
    # which would otherwise strip '%' and lose the token.
    text = text.replace('%', ' percent')

    # Step 5: Remove punctuation (except hyphens in words)
    text = re.sub(r'[^\w\s-]', '', text)

    # Step 6: Normalize whitespace (collapse multiple spaces)
    text = re.sub(r'\s+', ' ', text).strip()

    # Step 7: Spell out standalone single-digit numbers
    num_to_word = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three',
                   '4': 'four', '5': 'five', '6': 'six', '7': 'seven',
                   '8': 'eight', '9': 'nine'}
    for num, word in num_to_word.items():
        text = re.sub(rf'\b{num}\b', word, text)

    # Step 8: NO entity normalization in v1
    # (Trump vs Donald Trump vs President Trump remain distinct)

    return text

# Version identifier (include in cache namespace)
CANONICALIZER_VERSION = "v1norm1"
}}}

**Cache Key Formula (Updated):**

{{{language = "en"
canonical = normalize_claim_v1(claim_text, language)
cache_key = f"claim:{CANONICALIZER_VERSION}:{language}:{sha256(canonical)}"

Example:
claim: "COVID-19 vaccines are 95% effective"
canonical: "covid vaccines are 95 percent effective"
sha256: abc123...def456
key: "claim:v1norm1:en:abc123...def456"
}}}

**Cache Metadata MUST Include:**

{{{{
  "canonical_claim": "covid vaccines are 95 percent effective",
  "canonicalizer_version": "v1norm1",
  "language": "en",
  "original_claim_samples": ["COVID-19 vaccines are 95% effective"]
}
}}}

**Version Upgrade Path:**

* v1norm1 → v1norm2: Cache namespace changes; old keys remain valid until TTL expiry
* v1normN → v2norm1: Major version bump; invalidate all v1 caches

----

=== 5.1.2 Copyright & Data Retention Policy ===

**Evidence Excerpt Storage:**

To comply with copyright law and fair-use principles:

**What We Store:**

* **Metadata only:** Title, author, publisher, URL, publication date
* **Short excerpts:** Max 25 words per quote, max 3 quotes per evidence item
* **Summaries:** AI-generated bullet points (not verbatim text)
* **No full articles:** Never store complete article text beyond job processing

**Total per Cached Claim:**

* Scenarios: 2 per claim
* Evidence items: 6 per scenario (12 total)
* Quotes: 3 per evidence item × 25 words = 75 words per item
* **Maximum stored verbatim text:** ~~900 words per claim (12 × 75)
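These storage limits can be enforced mechanically at write time (a sketch; `limit_excerpts` is an illustrative helper, not part of the spec):

```python
MAX_QUOTES_PER_EVIDENCE = 3
MAX_WORDS_PER_QUOTE = 25

def limit_excerpts(quotes: list) -> list:
    """Truncates verbatim quotes to the fair-use storage policy above."""
    kept = quotes[:MAX_QUOTES_PER_EVIDENCE]
    return [" ".join(q.split()[:MAX_WORDS_PER_QUOTE]) for q in kept]
```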

**Retention:**

* Cache TTL: 90 days
* Job outputs: 24 hours (then archived or deleted)
* No persistent full-text article storage

**Rationale:**

* Short excerpts for citation = fair use
* Summaries are transformative (no verbatim reproduction)
* Limited retention (90 days max)
* No commercial republication of excerpts

**DMCA Compliance:**

* Cache invalidation endpoint available for rights holders
* Contact: dmca@factharbor.org

----

== Summary ==

This WYSIWYG preview shows the **structure and key sections** of the 1,515-line API specification.

**Full specification includes:**

* Complete API endpoints (7 total)
* All data schemas (ClaimExtraction, ClaimAnalysis, HolisticAssessment, Complete)
* Quality gates & validation rules
* LLM configuration for all 3 stages
* Implementation notes with code samples
* Testing strategy
* Cross-references to other pages

**The complete specification is available in:**

* FactHarbor_POC1_API_and_Schemas_Spec_v0_4_1_PATCHED.md (45 KB standalone)
* Export files (TEST/PRODUCTION) for xWiki import