Version 2.2 by Robert Schaub on 2025/12/24 16:28

1 = POC1 API & Schemas Specification =
2
3 ----
4
5 == Version History ==
6
7 |=Version|=Date|=Changes
8 |0.4.1|2025-12-24|Applied 9 critical fixes: file format notice, verdict taxonomy, canonicalization algorithm, Stage 1 cost policy, BullMQ fix, language in cache key, historical claims TTL, idempotency, copyright policy
9 |0.4|2025-12-24|**BREAKING:** 3-stage pipeline with claim-level caching, user tier system, cache-only mode for free users, Redis cache architecture
10 |0.3.1|2025-12-24|Fixed single-prompt strategy, SSE clarification, schema canonicalization, cost constraints
11 |0.3|2025-12-24|Added complete API endpoints, LLM config, risk tiers, scraping details
12
13 ----
14
15 == 1. Core Objective (POC1) ==
16
17 The primary technical goal of POC1 is to validate **Approach 1 (Single-Pass Holistic Analysis)** while implementing **claim-level caching** to achieve cost sustainability.
18
19 The system must prove that AI can identify an article's **Main Thesis** and determine if supporting claims logically support that thesis without committing fallacies.
20
21 === Success Criteria: ===
22
23 * Test with 30 diverse articles
24 * Target: ≥70% accuracy detecting misleading articles
25 * Cost: <$0.25 per NEW analysis (uncached)
26 * Cost: $0.00 for cached claim reuse
27 * Cache hit rate: ≥50% after 1,000 articles
28 * Processing time: <2 minutes (standard depth)
29
30 === Economic Model: ===
31
32 * **Free tier:** $10 credit per month (~~40-140 articles depending on cache hits)
33 * **After limit:** Cache-only mode (instant, free access to cached claims)
34 * **Paid tier:** Unlimited new analyses
35
36 ----
37
38 == 2. Architecture Overview ==
39
40 === 2.1 3-Stage Pipeline with Caching ===
41
42 FactHarbor POC1 uses a **3-stage architecture** designed for claim-level caching and cost efficiency:
43
{{{graph TD
    A[Article Input] --> B[Stage 1: Extract Claims]
    B --> C{For Each Claim}
    C --> D[Check Cache]
    D -->|Cache HIT| E[Return Cached Verdict]
    D -->|Cache MISS| F[Stage 2: Analyze Claim]
    F --> G[Store in Cache]
    G --> E
    E --> H[Stage 3: Holistic Assessment]
    H --> I[Final Report]
}}}
55
56 ==== Stage 1: Claim Extraction (Haiku, no cache) ====
57
58 * **Input:** Article text
59 * **Output:** 5 canonical claims (normalized, deduplicated)
60 * **Model:** Claude Haiku 4
61 * **Cost:** $0.003 per article
62 * **Cache strategy:** No caching (article-specific)
63
64 ==== Stage 2: Claim Analysis (Sonnet, CACHED) ====
65
66 * **Input:** Single canonical claim
67 * **Output:** Scenarios + Evidence + Verdicts
68 * **Model:** Claude Sonnet 3.5
69 * **Cost:** $0.081 per NEW claim
70 * **Cache strategy:** Redis, 90-day TTL
71 * **Cache key:** claim:v1norm1:{language}:{sha256(canonical_claim)}
72
73 ==== Stage 3: Holistic Assessment (Sonnet, no cache) ====
74
75 * **Input:** Article + Claim verdicts (from cache or Stage 2)
76 * **Output:** Article verdict + Fallacies + Logic quality
77 * **Model:** Claude Sonnet 3.5
78 * **Cost:** $0.030 per article
79 * **Cache strategy:** No caching (article-specific)
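
The control flow of the three stages can be sketched as follows. This is a minimal illustration, not the production pipeline: the function name `analyze_article`, the stage callables, and the use of a plain dict keyed by the canonical claim text (instead of the Redis key schema) are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def analyze_article(
    article: str,
    extract_claims: Callable[[str], List[str]],
    analyze_claim: Callable[[str], dict],
    assess: Callable[[str, List[dict]], dict],
    cache: Dict[str, dict],
) -> dict:
    """Run the three stages, calling Stage 2 only on cache misses."""
    # Stage 1: claim extraction (article-specific, never cached)
    claims = extract_claims(article)

    verdicts: List[dict] = []
    new_claims = 0
    for claim in claims:
        if claim not in cache:
            # Cache MISS: run Stage 2 and store the result
            cache[claim] = analyze_claim(claim)
            new_claims += 1
        # Cache HIT, or the entry that was just stored
        verdicts.append(cache[claim])

    # Stage 3: holistic assessment (article-specific, never cached)
    report = assess(article, verdicts)
    report["new_claims"] = new_claims
    return report
```

Running the same article twice against a shared cache should trigger Stage 2 only on the first run, which is the property the cost model relies on.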
80
81 === Total Cost Formula: ===
82
{{{Cost = $0.003 (extraction) + (N_new_claims × $0.081) + $0.030 (holistic)

Examples (assuming 5 extracted claims per article):
- 0 new claims (100% cache hit): $0.033
- 1 new claim (80% cache hit): $0.114
- 3 new claims (40% cache hit): $0.276
- 5 new claims (0% cache hit): $0.438
}}}
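
The formula above can be checked with a few lines of Python. The constant and function names are illustrative assumptions; the per-stage prices are the ones stated in this section.

```python
STAGE1_EXTRACTION = 0.003      # per article
STAGE2_PER_NEW_CLAIM = 0.081   # per uncached claim
STAGE3_HOLISTIC = 0.030        # per article


def estimate_cost(new_claims: int) -> float:
    """Estimated USD cost of one article given the number of uncached claims."""
    if new_claims < 0:
        raise ValueError("new_claims must be non-negative")
    total = STAGE1_EXTRACTION + new_claims * STAGE2_PER_NEW_CLAIM + STAGE3_HOLISTIC
    # Round to tenths of a cent to hide floating-point noise
    return round(total, 3)
```

Only the Stage 2 term varies with the cache hit rate; the remaining $0.033 is a fixed per-article cost.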
91
92 ----
93
94 === 2.2 User Tier System ===
95
96 |=Tier|=Monthly Credit|=After Limit|=Cache Access|=Analytics
97 |**Free**|$10|Cache-only mode|✅ Full|Basic
98 |**Pro** (future)|$50|Continues|✅ Full|Advanced
99 |**Enterprise** (future)|Custom|Continues|✅ Full + Priority|Full
100
101 **Free Tier Economics:**
102
103 * $10 credit = 40-140 articles analyzed (depending on cache hit rate)
104 * Average 70 articles/month at 70% cache hit rate
105 * After limit: Cache-only mode
106
107 ----
108
109 === 2.3 Cache-Only Mode (Free Tier Feature) ===
110
111 When free users reach their $10 monthly limit, they enter **Cache-Only Mode**:
112
113 ==== What Cache-Only Mode Provides: ====
114
115 ✅ **Claim Extraction (Platform-Funded):**
116
117 * Stage 1 extraction runs at $0.003 per article
118 * **Cost: Absorbed by platform** (not charged to user credit)
119 * Rationale: Extraction is necessary to check cache, and cost is negligible
120 * Rate limit: Max 50 extractions/day in cache-only mode (prevents abuse)
121
122 ✅ **Instant Access to Cached Claims:**
123
124 * Any claim that exists in cache → Full verdict returned
125 * Cost: $0 (no LLM calls)
126 * Response time: <100ms
127
128 ✅ **Partial Article Analysis:**
129
130 * Check each claim against cache
131 * Return verdicts for ALL cached claims
132 * For uncached claims: Return "status": "cache_miss"
133
134 ✅ **Cache Coverage Report:**
135
136 * "3 of 5 claims available in cache (60% coverage)"
137 * Links to cached analyses
138 * Estimated cost to complete: $0.162 (2 new claims)
139
140 ❌ **Not Available in Cache-Only Mode:**
141
142 * New claim analysis (Stage 2 LLM calls blocked)
143 * Full holistic assessment (Stage 3 blocked if any claims missing)
144
145 ==== User Experience Example: ====
146
{{{{
  "status": "cache_only_mode",
  "message": "Monthly credit limit reached. Showing cached results only.",
  "cache_coverage": {
    "claims_total": 5,
    "claims_cached": 3,
    "claims_missing": 2,
    "coverage_percent": 60
  },
  "cached_claims": [
    {"claim_id": "C1", "verdict": "Likely", "confidence": 0.82},
    {"claim_id": "C2", "verdict": "Highly Likely", "confidence": 0.91},
    {"claim_id": "C4", "verdict": "Unclear", "confidence": 0.55}
  ],
  "missing_claims": [
    {"claim_id": "C3", "claim_text": "...", "estimated_cost": "$0.081"},
    {"claim_id": "C5", "claim_text": "...", "estimated_cost": "$0.081"}
  ],
  "upgrade_options": {
    "top_up": "$5 for 20-70 more articles",
    "pro_tier": "$50/month unlimited"
  }
}
}}}
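
A coverage report of this shape could be assembled roughly as shown below. This is a sketch under assumptions: the function name `cache_coverage` and its inputs (a list of claim IDs plus a dict of cached verdicts) are hypothetical, and the `message`, `claim_text`, and `upgrade_options` fields are omitted for brevity.

```python
from typing import Dict, List

STAGE2_COST_PER_CLAIM = 0.081  # USD, from the Stage 2 pricing above


def cache_coverage(claim_ids: List[str], cached_verdicts: Dict[str, dict]) -> dict:
    """Split extracted claims into cached and missing, and report coverage."""
    cached = [cid for cid in claim_ids if cid in cached_verdicts]
    missing = [cid for cid in claim_ids if cid not in cached_verdicts]
    total = len(claim_ids)
    return {
        "status": "cache_only_mode",
        "cache_coverage": {
            "claims_total": total,
            "claims_cached": len(cached),
            "claims_missing": len(missing),
            "coverage_percent": round(100 * len(cached) / total) if total else 0,
        },
        "cached_claims": [
            {"claim_id": cid, **cached_verdicts[cid]} for cid in cached
        ],
        "missing_claims": [
            {"claim_id": cid, "estimated_cost": f"${STAGE2_COST_PER_CLAIM:.3f}"}
            for cid in missing
        ],
    }
```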
171
172 **Design Rationale:**
173
174 * Free users still get value (cached claims often answer their question)
175 * Demonstrates FactHarbor's value (partial results encourage upgrade)
176 * Sustainable for platform (no additional cost)
177 * Fair to all users (everyone contributes to cache)
178
179 ----
180
181 == 3. REST API Contract ==
182
183 === 3.1 User Credit Tracking ===
184
185 **Endpoint:** GET /v1/user/credit
186
187 **Response:** 200 OK
188
{{{{
  "user_id": "user_abc123",
  "tier": "free",
  "credit_limit": 10.00,
  "credit_used": 7.42,
  "credit_remaining": 2.58,
  "reset_date": "2025-02-01T00:00:00Z",
  "cache_only_mode": false,
  "usage_stats": {
    "articles_analyzed": 67,
    "claims_from_cache": 189,
    "claims_newly_analyzed": 113,
    "cache_hit_rate": 0.626
  }
}
}}}
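
The derived fields in this response follow from the raw counters. The sketch below shows one plausible derivation; the function name `credit_summary` and the rule that cache-only mode starts when the remaining credit reaches zero are assumptions for illustration.

```python
def credit_summary(
    credit_limit: float,
    credit_used: float,
    claims_from_cache: int,
    claims_newly_analyzed: int,
) -> dict:
    """Derive remaining credit, cache-only flag, and cache hit rate."""
    remaining = max(credit_limit - credit_used, 0.0)
    total_claims = claims_from_cache + claims_newly_analyzed
    hit_rate = claims_from_cache / total_claims if total_claims else 0.0
    return {
        "credit_remaining": round(remaining, 2),
        # Assumption: cache-only mode begins once no credit is left
        "cache_only_mode": remaining <= 0,
        "cache_hit_rate": round(hit_rate, 3),
    }
```

With the example values above (limit 10.00, used 7.42, 189 cached and 113 newly analyzed claims) this yields a remaining credit of 2.58 and a hit rate of 0.626.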
205
206 ----
207
208 === 3.2 Create Analysis Job (3-Stage) ===
209
210 **Endpoint:** POST /v1/analyze
211
212 ==== Idempotency Support: ====
213
214 To prevent duplicate job creation on network retries, clients SHOULD include:
215
{{{POST /v1/analyze
Idempotency-Key: {client-generated-uuid}
}}}
219
220 OR use the client.request_id field:
221
{{{{
  "input_url": "...",
  "client": {
    "request_id": "client-uuid-12345",
    "source_label": "optional"
  }
}
}}}
230
231 **Server Behavior:**
232
233 * If Idempotency-Key or request_id seen before (within 24 hours):
234 ** Return existing job (200 OK, not 202 Accepted)
235 ** Do NOT create duplicate job or charge twice
236 * Idempotency keys expire after 24 hours (matches job retention)
237
238 **Example Response (Idempotent):**
239
{{{{
  "job_id": "01J...ULID",
  "status": "RUNNING",
  "idempotent": true,
  "original_request_at": "2025-12-24T10:31:00Z",
  "message": "Returning existing job (idempotency key matched)"
}
}}}
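
The server behavior described above can be sketched with an in-memory store. This is only an illustration: the class name `IdempotencyStore`, the injectable clock, and the use of a process-local dict are assumptions, and a real deployment would need shared storage with expiry.

```python
import time
from typing import Callable, Dict, Optional, Tuple

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # keys expire after 24 hours


class IdempotencyStore:
    """Remembers which keys already produced a job (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: Dict[str, Tuple[float, dict]] = {}

    def submit(
        self, key: Optional[str], create_job: Callable[[], dict]
    ) -> Tuple[int, dict]:
        """Return (HTTP status, job). A repeated key yields the existing job."""
        now = self._clock()
        if key is not None:
            entry = self._seen.get(key)
            if entry is not None and now - entry[0] < IDEMPOTENCY_TTL_SECONDS:
                # Seen within 24 hours: 200 OK, no new job, no second charge
                return 200, {**entry[1], "idempotent": True}
        job = create_job()
        if key is not None:
            self._seen[key] = (now, job)
        # First sighting (or expired key): 202 Accepted with a new job
        return 202, job
```

The key passed to `submit` would be either the `Idempotency-Key` header or `client.request_id`; requests without either are never deduplicated in this sketch.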
248
249 ==== Request Body: ====
250
{{{{
  "input_type": "url",
  "input_url": "https://example.com/medical-report-01",
  "input_text": null,
  "options": {
    "browsing": "on",
    "depth": "standard",
    "max_claims": 5,
    "scenarios_per_claim": 2,
    "max_evidence_per_scenario": 6,
    "context_aware_analysis": true
  },
  "client": {
    "request_id": "optional-client-tracking-id",
    "source_label": "optional"
  }
}
}}}
269
270 **Options:**
271
272 * browsing: on | off (retrieve web sources or just output queries)
273 * depth: standard | deep (evidence thoroughness)
274 * max_claims: 1-10 (default: **5** for cost control)
275 * scenarios_per_claim: 1-5 (default: **2** for cost control)
276 * max_evidence_per_scenario: 3-10 (default: **6**)
277 * context_aware_analysis: true | false (experimental)
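
The ranges and defaults listed above could be enforced with a small validator like the one below. It is a sketch under assumptions: the function name `validate_options` is hypothetical, and the defaults for `browsing`, `depth`, and `context_aware_analysis` are taken from the example request body because the list does not state them.

```python
from typing import Any, Dict

DEFAULTS: Dict[str, Any] = {
    "browsing": "on",                # assumed default (from the example request)
    "depth": "standard",             # assumed default (from the example request)
    "max_claims": 5,
    "scenarios_per_claim": 2,
    "max_evidence_per_scenario": 6,
    "context_aware_analysis": True,  # assumed default (from the example request)
}
ENUMS = {"browsing": ("on", "off"), "depth": ("standard", "deep")}
RANGES = {
    "max_claims": (1, 10),
    "scenarios_per_claim": (1, 5),
    "max_evidence_per_scenario": (3, 10),
}


def validate_options(options: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults and reject unknown or out-of-range options."""
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    merged = {**DEFAULTS, **options}
    for name, allowed in ENUMS.items():
        if merged[name] not in allowed:
            raise ValueError(f"{name} must be one of {allowed}")
    for name, (low, high) in RANGES.items():
        value = merged[name]
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"{name} must be an integer")
        if not low <= value <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if not isinstance(merged["context_aware_analysis"], bool):
        raise ValueError("context_aware_analysis must be true or false")
    return merged
```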
278
279 **Response:** 202 Accepted
280
{{{{
  "job_id": "01J...ULID",
  "status": "QUEUED",
  "created_at": "2025-12-24T10:31:00Z",
  "estimated_cost": 0.114,
  "cost_breakdown": {
    "stage1_extraction": 0.003,
    "stage2_new_claims": 0.081,
    "stage2_cached_claims": 0.000,
    "stage3_holistic": 0.030
  },
  "cache_info": {
    "claims_to_extract": 5,
    "estimated_cache_hits": 4,
    "estimated_new_claims": 1
  },
  "links": {
    "self": "/v1/jobs/01J...ULID",
    "result": "/v1/jobs/01J...ULID/result",
    "report": "/v1/jobs/01J...ULID/report",
    "events": "/v1/jobs/01J...ULID/events"
  }
}
}}}
305
306 **Error Responses:**
307
308 402 Payment Required - Free tier limit reached, cache-only mode
309
{{{{
  "error": "credit_limit_reached",
  "message": "Monthly credit limit reached. Entering cache-only mode.",
  "cache_only_mode": true,
  "credit_remaining": 0.00,
  "reset_date": "2025-02-01T00:00:00Z",
  "action": "Resubmit with cache_preference=allow_partial for cached results"
}
}}}
319
320 ----
321
322 == 4. Data Schemas ==
323
324 === 4.1 Stage 1 Output: ClaimExtraction ===
325
{{{{
  "job_id": "01J...ULID",
  "stage": "stage1_extraction",
  "article_metadata": {
    "title": "Article title",
    "source_url": "https://example.com/article",
    "extracted_text_length": 5234,
    "language": "en"
  },
  "claims": [
    {
      "claim_id": "C1",
      "claim_text": "Original claim text from article",
      "canonical_claim": "Normalized, deduplicated phrasing",
      "claim_hash": "sha256:abc123...",
      "is_central_to_thesis": true,
      "claim_type": "causal",
      "evaluability": "evaluable",
      "risk_tier": "B",
      "domain": "public_health"
    }
  ],
  "article_thesis": "Main argument detected",
  "cost": 0.003
}
}}}
352
353 ----
354
355 === 4.5 Verdict Label Taxonomy ===
356
357 FactHarbor uses **three distinct verdict taxonomies** depending on analysis level:
358
359 ==== 4.5.1 Scenario Verdict Labels (Stage 2) ====
360
361 Used for individual scenario verdicts within a claim.
362
363 **Enum Values:**
364
365 * Highly Likely - Probability 0.85-1.0, high confidence
366 * Likely - Probability 0.65-0.84, moderate-high confidence
367 * Unclear - Probability 0.35-0.64, or low confidence
368 * Unlikely - Probability 0.16-0.34, moderate-high confidence
369 * Highly Unlikely - Probability 0.0-0.15, high confidence
370 * Unsubstantiated - Insufficient evidence to determine probability
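
One way to turn the ranges above into code is shown below. It is a sketch, not the normative mapping: the function name `scenario_label` and the `MIN_CONFIDENCE` cut-off for "low confidence" are hypothetical, and each range's lower bound is used as the threshold, so values that fall between two listed ranges (for example 0.845) go to the lower label.

```python
from typing import Optional

MIN_CONFIDENCE = 0.5  # hypothetical cut-off; the spec only says "low confidence"


def scenario_label(probability: Optional[float], confidence: float = 1.0) -> str:
    """Map a scenario probability (and confidence) to a verdict label."""
    if probability is None:
        # No usable evidence to estimate a probability
        return "Unsubstantiated"
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if confidence < MIN_CONFIDENCE:
        return "Unclear"
    if probability >= 0.85:
        return "Highly Likely"
    if probability >= 0.65:
        return "Likely"
    if probability >= 0.35:
        return "Unclear"
    if probability >= 0.16:
        return "Unlikely"
    return "Highly Unlikely"
```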
371
372 ==== 4.5.2 Claim Verdict Labels (Rollup) ====
373
374 Used when summarizing a claim across all scenarios.
375
376 **Enum Values:**
377
* Supported - At least 60% of scenarios are Likely or Highly Likely
* Refuted - At least 60% of scenarios are Unlikely or Highly Unlikely
* Inconclusive - Neither threshold is met (mixed scenarios, or mostly Unclear/Unsubstantiated)
381
382 **Mapping Logic:**
383
384 * If ≥60% scenarios are (Highly Likely | Likely) → Supported
385 * If ≥60% scenarios are (Highly Unlikely | Unlikely) → Refuted
386 * Otherwise → Inconclusive
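
The mapping logic can be expressed as a short function. The name `rollup_verdict` is illustrative, and returning Inconclusive for a claim with no scenarios is an assumption that the rules above do not cover.

```python
from typing import Sequence

POSITIVE = {"Highly Likely", "Likely"}
NEGATIVE = {"Highly Unlikely", "Unlikely"}
THRESHOLD_PERCENT = 60


def rollup_verdict(scenario_labels: Sequence[str]) -> str:
    """Summarize scenario verdict labels into a single claim verdict."""
    total = len(scenario_labels)
    if total == 0:
        return "Inconclusive"  # assumption: no scenarios means no conclusion
    positive = sum(label in POSITIVE for label in scenario_labels)
    negative = sum(label in NEGATIVE for label in scenario_labels)
    # Integer comparison avoids floating-point edge cases at exactly 60%
    if positive * 100 >= THRESHOLD_PERCENT * total:
        return "Supported"
    if negative * 100 >= THRESHOLD_PERCENT * total:
        return "Refuted"
    return "Inconclusive"
```

With the default of two scenarios per claim, both scenarios must agree for a claim to be Supported or Refuted, since one of two is only 50%.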
387
388 ==== 4.5.3 Article Verdict Labels (Stage 3) ====
389
390 Used for holistic article-level assessment.
391
392 **Enum Values:**
393
394 * WELL-SUPPORTED - Article thesis logically follows from supported claims
395 * MISLEADING - Claims may be true but article commits logical fallacies
396 * REFUTED - Central claims are refuted, invalidating thesis
397 * UNCERTAIN - Insufficient evidence or highly mixed claim verdicts
398
399 **Note:** Article verdict considers **claim centrality** (central claims override supporting claims).
400
401 ==== 4.5.4 API Field Mapping ====
402
403 |=Level|=API Field|=Enum Name
404 |Scenario|scenarios[].verdict.label|scenario_verdict_label
405 |Claim|claims[].rollup_verdict (optional)|claim_verdict_label
406 |Article|article_holistic_assessment.overall_verdict|article_verdict_label
407
408 ----
409
410 == 5. Cache Architecture ==
411
412 === 5.1 Redis Cache Design ===
413
414 **Technology:** Redis 7.0+ (in-memory key-value store)
415
416 **Cache Key Schema:**
417
{{{claim:v1norm1:{language}:{sha256(canonical_claim)}
}}}
420
421 **Example:**
422
{{{Claim (English): "COVID vaccines are 95% effective"
Canonical: "covid vaccines are 95 percent effective"
Language: "en"
SHA256: abc123...def456
Key: claim:v1norm1:en:abc123...def456
}}}
429
430 **Rationale:** Prevents cross-language collisions and enables per-language cache analytics.
431
432 **Data Structure:**
433
{{{SET claim:v1norm1:en:abc123...def456 '{...ClaimAnalysis JSON...}'
EXPIRE claim:v1norm1:en:abc123...def456 7776000  # 90 days
}}}
437
438 ----
439
==== 5.1.1 Canonical Claim Normalization (v1) ====
441
442 The cache key depends on deterministic claim normalization. All implementations MUST follow this algorithm exactly.
443
444 **Algorithm: Canonical Claim Normalization v1**
445
{{{import re
import unicodedata

# Version identifier (include in cache namespace)
CANONICALIZER_VERSION = "v1norm1"

# Spelled-out forms for standalone single-digit numbers
NUM_TO_WORD = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three',
               '4': 'four', '5': 'five', '6': 'six', '7': 'seven',
               '8': 'eight', '9': 'nine'}


def normalize_claim_v1(claim_text: str, language: str) -> str:
    """
    Normalizes claim to canonical form for cache key generation.
    Version: v1norm1 (POC1)
    """
    # Step 1: Unicode normalization (NFC)
    text = unicodedata.normalize('NFC', claim_text)

    # Step 2: Lowercase
    text = text.lower()

    # Step 3: Common abbreviations (English only in v1).
    # Must run before punctuation removal, which would strip the periods in 'u.s.'
    if language == 'en':
        text = text.replace('covid-19', 'covid')
        text = text.replace('u.s.', 'us')
        text = text.replace('u.k.', 'uk')

    # Step 4: Percent signs (also before punctuation removal, which would strip '%')
    text = text.replace('%', ' percent')

    # Step 5: Remove punctuation (hyphens are kept)
    text = re.sub(r'[^\w\s-]', '', text)

    # Step 6: Spell out standalone single-digit numbers
    text = re.sub(r'\b[0-9]\b', lambda m: NUM_TO_WORD[m.group()], text)

    # Step 7: Normalize whitespace (collapse runs, trim ends)
    text = re.sub(r'\s+', ' ', text).strip()

    # NO entity normalization in v1
    # (Trump vs Donald Trump vs President Trump remain distinct)
    return text
}}}
489
490 **Cache Key Formula (Updated):**
491
{{{import hashlib

language = "en"
claim_text = "COVID-19 vaccines are 95% effective"
canonical = normalize_claim_v1(claim_text, language)
# Assumption: hash the UTF-8 bytes and use the hex digest
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
cache_key = f"claim:{CANONICALIZER_VERSION}:{language}:{digest}"

# Example:
#   claim:     "COVID-19 vaccines are 95% effective"
#   canonical: "covid vaccines are 95 percent effective"
#   sha256:    abc123...def456
#   key:       "claim:v1norm1:en:abc123...def456"
}}}
502
503 **Cache Metadata MUST Include:**
504
{{{{
  "canonical_claim": "covid vaccines are 95 percent effective",
  "canonicalizer_version": "v1norm1",
  "language": "en",
  "original_claim_samples": ["COVID-19 vaccines are 95% effective"]
}
}}}
512
513 **Version Upgrade Path:**
514
515 * v1norm1 → v1norm2: Cache namespace changes, old keys remain valid until TTL
516 * v1normN → v2norm1: Major version bump, invalidate all v1 caches
517
518 ----
519
==== 5.1.2 Copyright & Data Retention Policy ====
521
522 **Evidence Excerpt Storage:**
523
524 To comply with copyright law and fair use principles:
525
526 **What We Store:**
527
528 * **Metadata only:** Title, author, publisher, URL, publication date
529 * **Short excerpts:** Max 25 words per quote, max 3 quotes per evidence item
530 * **Summaries:** AI-generated bullet points (not verbatim text)
531 * **No full articles:** Never store complete article text beyond job processing
532
533 **Total per Cached Claim:**
534
535 * Scenarios: 2 per claim
536 * Evidence items: 6 per scenario (12 total)
537 * Quotes: 3 per evidence × 25 words = 75 words per item
538 * **Maximum stored verbatim text:** ~~900 words per claim (12 × 75)
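
The excerpt limits above can be applied before an evidence item is written to the cache. The sketch below is one possible approach: the helper names are hypothetical, and truncating over-long quotes (rather than rejecting them) is an assumption.

```python
from typing import List

MAX_WORDS_PER_QUOTE = 25
MAX_QUOTES_PER_EVIDENCE = 3


def trim_quotes(quotes: List[str]) -> List[str]:
    """Keep at most 3 quotes per evidence item, each cut to 25 words."""
    trimmed = []
    for quote in quotes[:MAX_QUOTES_PER_EVIDENCE]:
        words = quote.split()
        trimmed.append(" ".join(words[:MAX_WORDS_PER_QUOTE]))
    return trimmed


def max_verbatim_words(scenarios: int = 2, evidence_per_scenario: int = 6) -> int:
    """Upper bound on verbatim words stored for one cached claim."""
    items = scenarios * evidence_per_scenario
    return items * MAX_QUOTES_PER_EVIDENCE * MAX_WORDS_PER_QUOTE
```

With the default request options (2 scenarios, 6 evidence items each), `max_verbatim_words()` reproduces the 900-word ceiling stated above.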
539
540 **Retention:**
541
542 * Cache TTL: 90 days
543 * Job outputs: 24 hours (then archived or deleted)
544 * No persistent full-text article storage
545
546 **Rationale:**
547
* Short excerpts used for citation are intended to fall within fair use
* Summaries are transformative (AI-generated, not verbatim reproductions)
550 * Limited retention (90 days max)
551 * No commercial republication of excerpts
552
553 **DMCA Compliance:**
554
555 * Cache invalidation endpoint available for rights holders
556 * Contact: dmca@factharbor.org
557
558 ----
559
560 == Summary ==
561
562 This WYSIWYG preview shows the **structure and key sections** of the 1,515-line API specification.
563
564 **Full specification includes:**
565
566 * Complete API endpoints (7 total)
567 * All data schemas (ClaimExtraction, ClaimAnalysis, HolisticAssessment, Complete)
568 * Quality gates & validation rules
569 * LLM configuration for all 3 stages
570 * Implementation notes with code samples
571 * Testing strategy
572 * Cross-references to other pages
573
574 **The complete specification is available in:**
575
576 * FactHarbor_POC1_API_and_Schemas_Spec_v0_4_1_PATCHED.md (45 KB standalone)
577 * Export files (TEST/PRODUCTION) for xWiki import