Wiki source code of POC1 API & Schemas Specification

Version 2.2 by Robert Schaub on 2025/12/24 16:28

author	version	line-number	content
		1	= POC1 API & Schemas Specification =
		2
		3	----
		4
		5	== Version History ==
		6
		7	|=Version|=Date|=Changes
		8	|0.4.1|2025-12-24|Applied 9 critical fixes: file format notice, verdict taxonomy, canonicalization algorithm, Stage 1 cost policy, BullMQ fix, language in cache key, historical claims TTL, idempotency, copyright policy
		9	|0.4|2025-12-24|BREAKING: 3-stage pipeline with claim-level caching, user tier system, cache-only mode for free users, Redis cache architecture
		10	|0.3.1|2025-12-24|Fixed single-prompt strategy, SSE clarification, schema canonicalization, cost constraints
		11	|0.3|2025-12-24|Added complete API endpoints, LLM config, risk tiers, scraping details
		12
		13	----
		14
		15	== 1. Core Objective (POC1) ==
		16
		17	The primary technical goal of POC1 is to validate Approach 1 (Single-Pass Holistic Analysis) while implementing claim-level caching to achieve cost sustainability.
		18
		19	The system must prove that AI can identify an article's Main Thesis and determine whether the supporting claims logically support that thesis without committing fallacies.
		20
		21	=== Success Criteria: ===
		22
		23	* Test with 30 diverse articles
		24	* Target: ≥70% accuracy detecting misleading articles
		25	* Cost: <$0.25 per NEW analysis (uncached)
		26	* Cost: $0.00 for cached claim reuse
		27	* Cache hit rate: ≥50% after 1,000 articles
		28	* Processing time: <2 minutes (standard depth)
		29
		30	=== Economic Model: ===
		31
		32	* Free tier: $10 credit per month (~~40-140 articles depending on cache hits)
		33	* After limit: Cache-only mode (instant, free access to cached claims)
		34	* Paid tier: Unlimited new analyses
		35
		36	----
		37
		38	== 2. Architecture Overview ==
		39
		40	=== 2.1 3-Stage Pipeline with Caching ===
		41
		42	FactHarbor POC1 uses a 3-stage architecture designed for claim-level caching and cost efficiency:
		43
		44	{{{graph TD
		45	A[Article Input] --> B[Stage 1: Extract Claims]
		46	B --> C{For Each Claim}
		47	C --> D[Check Cache]
		48	D -->|Cache HIT| E[Return Cached Verdict]
		49	D -->|Cache MISS| F[Stage 2: Analyze Claim]
		50	F --> G[Store in Cache]
		51	G --> E
		52	E --> H[Stage 3: Holistic Assessment]
		53	H --> I[Final Report]
		54	}}}
		55
		56	==== Stage 1: Claim Extraction (Haiku, no cache) ====
		57
		58	* Input: Article text
		59	* Output: Up to 5 canonical claims by default (normalized, deduplicated; see max_claims)
		60	* Model: Claude Haiku 4
		61	* Cost: $0.003 per article
		62	* Cache strategy: No caching (article-specific)
		63
		64	==== Stage 2: Claim Analysis (Sonnet, CACHED) ====
		65
		66	* Input: Single canonical claim
		67	* Output: Scenarios + Evidence + Verdicts
		68	* Model: Claude Sonnet 3.5
		69	* Cost: $0.081 per NEW claim
		70	* Cache strategy: Redis, 90-day TTL
		71	* Cache key: claim:v1norm1:{language}:{sha256(canonical_claim)}
		72
		73	==== Stage 3: Holistic Assessment (Sonnet, no cache) ====
		74
		75	* Input: Article + Claim verdicts (from cache or Stage 2)
		76	* Output: Article verdict + Fallacies + Logic quality
		77	* Model: Claude Sonnet 3.5
		78	* Cost: $0.030 per article
		79	* Cache strategy: No caching (article-specific)
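The control flow of the three stages can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name `run_pipeline`, the plain `dict` standing in for Redis, and the three stage callables are all hypothetical, and the claims passed around are assumed to be already canonicalized.

```python
import hashlib
from typing import Callable, Dict, List


def run_pipeline(
    article: str,
    language: str,
    cache: Dict[str, dict],
    extract_claims: Callable[[str], List[str]],
    analyze_claim: Callable[[str], dict],
    assess_article: Callable[[str, List[dict]], dict],
) -> dict:
    """Run Stage 1, then a per-claim cache lookup or Stage 2, then Stage 3."""
    claims = extract_claims(article)  # Stage 1: never cached
    verdicts, new_claims = [], 0
    for canonical in claims:
        # Assumption: the hash is taken over the UTF-8 bytes of the canonical claim.
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        key = f"claim:v1norm1:{language}:{digest}"
        if key not in cache:  # cache MISS: run Stage 2 and store the result
            cache[key] = analyze_claim(canonical)
            new_claims += 1
        verdicts.append(cache[key])
    report = assess_article(article, verdicts)  # Stage 3: never cached
    return {"report": report, "new_claims": new_claims}


# Usage with stand-in stage functions (hypothetical; real stages call the LLMs).
cache: Dict[str, dict] = {}
stages = dict(
    extract_claims=lambda text: ["claim a", "claim b"],
    analyze_claim=lambda claim: {"claim": claim, "verdict": "Unclear"},
    assess_article=lambda text, verdicts: {"overall_verdict": "UNCERTAIN"},
)
first = run_pipeline("article text", "en", cache, **stages)
second = run_pipeline("article text", "en", cache, **stages)
print(first["new_claims"], second["new_claims"])  # 2 0
```

The second run pays for no Stage 2 calls because both claims are already in the cache, which is the effect the caching design relies on.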
		80
		81	=== Total Cost Formula: ===
		82
		83	{{{Cost = $0.003 (extraction) + (N_new_claims × $0.081) + $0.030 (holistic)
		84
		85	Examples:
		86	- 0 new claims (100% cache hit): $0.033
		87	- 1 new claim (80% cache hit): $0.114
		88	- 3 new claims (40% cache hit): $0.276
		89	- 5 new claims (0% cache hit): $0.438
		90	}}}
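The formula can be checked with a few lines of Python. The sketch below uses the per-stage prices from section 2.1; the function name `estimate_article_cost` is illustrative only, and `Decimal` is used so the results match the listed figures exactly.

```python
from decimal import Decimal

# Per-stage prices copied from the stage descriptions (USD).
EXTRACTION_COST = Decimal("0.003")
NEW_CLAIM_COST = Decimal("0.081")
HOLISTIC_COST = Decimal("0.030")


def estimate_article_cost(new_claims: int) -> Decimal:
    """Return the estimated cost of one article given its uncached claim count."""
    if new_claims < 0:
        raise ValueError("new_claims must be non-negative")
    return EXTRACTION_COST + new_claims * NEW_CLAIM_COST + HOLISTIC_COST


for n in (0, 1, 3, 5):
    print(n, estimate_article_cost(n))
```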
		91
		92	----
		93
		94	=== 2.2 User Tier System ===
		95
		96	|=Tier|=Monthly Credit|=After Limit|=Cache Access|=Analytics
		97	|Free|$10|Cache-only mode|✅ Full|Basic
		98	|Pro (future)|$50|Continues|✅ Full|Advanced
		99	|Enterprise (future)|Custom|Continues|✅ Full + Priority|Full
		100
		101	Free Tier Economics:
		102
		103	* $10 credit = 40-140 articles analyzed (depending on cache hit rate)
		104	* Average 70 articles/month at 70% cache hit rate
		105	* After limit: Cache-only mode
		106
		107	----
		108
		109	=== 2.3 Cache-Only Mode (Free Tier Feature) ===
		110
		111	When free users reach their $10 monthly limit, they enter Cache-Only Mode:
		112
		113	==== What Cache-Only Mode Provides: ====
		114
		115	✅ Claim Extraction (Platform-Funded):
		116
		117	* Stage 1 extraction runs at $0.003 per article
		118	* Cost: Absorbed by platform (not charged to user credit)
		119	* Rationale: Extraction is necessary to check cache, and cost is negligible
		120	* Rate limit: Max 50 extractions/day in cache-only mode (prevents abuse)
		121
		122	✅ Instant Access to Cached Claims:
		123
		124	* Any claim that exists in cache → Full verdict returned
		125	* Cost: $0 (no LLM calls)
		126	* Response time: <100ms
		127
		128	✅ Partial Article Analysis:
		129
		130	* Check each claim against cache
		131	* Return verdicts for ALL cached claims
		132	* For uncached claims: Return "status": "cache_miss"
		133
		134	✅ Cache Coverage Report:
		135
		136	* "3 of 5 claims available in cache (60% coverage)"
		137	* Links to cached analyses
		138	* Estimated cost to complete: $0.162 (2 new claims)
		139
		140	❌ Not Available in Cache-Only Mode:
		141
		142	* New claim analysis (Stage 2 LLM calls blocked)
		143	* Full holistic assessment (Stage 3 blocked if any claims missing)
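One way the partial analysis and coverage report could be assembled is sketched below. The helper name `build_coverage_report`, the per-claim `cache_key` field, and the `estimated_cost_to_complete` field are assumptions made for illustration; a plain `dict` stands in for the Redis cache.

```python
from typing import Dict, List

NEW_CLAIM_COST = 0.081  # Stage 2 price per uncached claim, from section 2.1


def build_coverage_report(claims: List[dict], cache: Dict[str, dict]) -> dict:
    """Split extracted claims into cached and missing, as in cache-only mode."""
    cached, missing = [], []
    for claim in claims:
        hit = cache.get(claim["cache_key"])  # hypothetical field holding the cache key
        if hit is not None:
            cached.append({"claim_id": claim["claim_id"], **hit})
        else:
            missing.append({"claim_id": claim["claim_id"], "status": "cache_miss"})
    total = len(claims)
    return {
        "status": "cache_only_mode",
        "cache_coverage": {
            "claims_total": total,
            "claims_cached": len(cached),
            "claims_missing": len(missing),
            "coverage_percent": round(100 * len(cached) / total) if total else 0,
        },
        "cached_claims": cached,
        "missing_claims": missing,
        "estimated_cost_to_complete": round(len(missing) * NEW_CLAIM_COST, 3),
    }
```

No LLM call is made anywhere in this path, which is why cached results can be served at no cost to the user.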
		144
		145	==== User Experience Example: ====
		146
		147	{{{{
		148	"status": "cache_only_mode",
		149	"message": "Monthly credit limit reached. Showing cached results only.",
		150	"cache_coverage": {
		151	"claims_total": 5,
		152	"claims_cached": 3,
		153	"claims_missing": 2,
		154	"coverage_percent": 60
		155	},
		156	"cached_claims": [
		157	{"claim_id": "C1", "verdict": "Likely", "confidence": 0.82},
		158	{"claim_id": "C2", "verdict": "Highly Likely", "confidence": 0.91},
		159	{"claim_id": "C4", "verdict": "Unclear", "confidence": 0.55}
		160	],
		161	"missing_claims": [
		162	{"claim_id": "C3", "claim_text": "...", "estimated_cost": "$0.081"},
		163	{"claim_id": "C5", "claim_text": "...", "estimated_cost": "$0.081"}
		164	],
		165	"upgrade_options": {
		166	"top_up": "$5 for 20-70 more articles",
		167	"pro_tier": "$50/month unlimited"
		168	}
		169	}
		170	}}}
		171
		172	Design Rationale:
		173
		174	* Free users still get value (cached claims often answer their question)
		175	* Demonstrates FactHarbor's value (partial results encourage upgrade)
		176	* Sustainable for platform (no additional cost)
		177	* Fair to all users (everyone contributes to cache)
		178
		179	----
		180
		181	== 3. REST API Contract ==
		182
		183	=== 3.1 User Credit Tracking ===
		184
		185	Endpoint: GET /v1/user/credit
		186
		187	Response: 200 OK
		188
		189	{{{{
		190	"user_id": "user_abc123",
		191	"tier": "free",
		192	"credit_limit": 10.00,
		193	"credit_used": 7.42,
		194	"credit_remaining": 2.58,
		195	"reset_date": "2025-02-01T00:00:00Z",
		196	"cache_only_mode": false,
		197	"usage_stats": {
		198	"articles_analyzed": 67,
		199	"claims_from_cache": 189,
		200	"claims_newly_analyzed": 113,
		201	"cache_hit_rate": 0.626
		202	}
		203	}
		204	}}}
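The derived fields in this response follow from the raw counters. The sketch below shows one plausible derivation; the function name `credit_summary` and the rule that only the free tier enters cache-only mode once the remaining credit reaches zero are assumptions based on section 2.3, not a stated server algorithm.

```python
def credit_summary(
    tier: str,
    credit_limit: float,
    credit_used: float,
    claims_from_cache: int,
    claims_newly_analyzed: int,
) -> dict:
    """Derive remaining credit, cache-only flag, and cache hit rate."""
    remaining = max(round(credit_limit - credit_used, 2), 0.0)
    total_claims = claims_from_cache + claims_newly_analyzed
    hit_rate = round(claims_from_cache / total_claims, 3) if total_claims else 0.0
    return {
        "credit_remaining": remaining,
        # Assumption: only free-tier users are switched to cache-only mode.
        "cache_only_mode": tier == "free" and remaining <= 0,
        "cache_hit_rate": hit_rate,
    }


print(credit_summary("free", 10.00, 7.42, 189, 113))
```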
		205
		206	----
		207
		208	=== 3.2 Create Analysis Job (3-Stage) ===
		209
		210	Endpoint: POST /v1/analyze
		211
		212	==== Idempotency Support: ====
		213
		214	To prevent duplicate job creation on network retries, clients SHOULD include:
		215
		216	{{{POST /v1/analyze
		217	Idempotency-Key: {client-generated-uuid}
		218	}}}
		219
		220	OR use the client.request_id field:
		221
		222	{{{{
		223	"input_url": "...",
		224	"client": {
		225	"request_id": "client-uuid-12345",
		226	"source_label": "optional"
		227	}
		228	}
		229	}}}
		230
		231	Server Behavior:
		232
		233	* If the Idempotency-Key or request_id has been seen within the last 24 hours:
		234	** Return the existing job (200 OK, not 202 Accepted)
		235	** Do NOT create a duplicate job or charge the user twice
		236	* Idempotency keys expire after 24 hours (matching job retention)
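The server behavior can be illustrated with a small in-memory sketch. The class `IdempotencyStore` and its `submit` method are hypothetical names; a real deployment would presumably keep the keys in a shared store with a 24-hour expiry rather than in process memory.

```python
import time
from typing import Callable, Dict, Optional, Tuple

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # keys expire after 24 hours


class IdempotencyStore:
    """In-memory stand-in for whatever store the server actually uses."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._seen: Dict[str, Tuple[float, dict]] = {}

    def submit(self, key: Optional[str], create_job: Callable[[], dict]) -> Tuple[int, dict]:
        """Return (http_status, job): 202 for a new job, 200 for a replay."""
        now = self._clock()
        if key is not None:
            entry = self._seen.get(key)
            if entry is not None and now - entry[0] < IDEMPOTENCY_TTL_SECONDS:
                # Key seen within 24 hours: return the existing job, create nothing.
                return 200, {**entry[1], "idempotent": True}
        job = create_job()
        if key is not None:
            self._seen[key] = (now, job)
        return 202, job
```

Because `create_job` is only called on the 202 path, a retried request cannot create a second job or trigger a second charge.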
		237
		238	Example Response (Idempotent):
		239
		240	{{{{
		241	"job_id": "01J...ULID",
		242	"status": "RUNNING",
		243	"idempotent": true,
		244	"original_request_at": "2025-12-24T10:31:00Z",
		245	"message": "Returning existing job (idempotency key matched)"
		246	}
		247	}}}
		248
		249	==== Request Body: ====
		250
		251	{{{{
		252	"input_type": "url",
		253	"input_url": "https://example.com/medical-report-01",
		254	"input_text": null,
		255	"options": {
		256	"browsing": "on",
		257	"depth": "standard",
		258	"max_claims": 5,
		259	"scenarios_per_claim": 2,
		260	"max_evidence_per_scenario": 6,
		261	"context_aware_analysis": true
		262	},
		263	"client": {
		264	"request_id": "optional-client-tracking-id",
		265	"source_label": "optional"
		266	}
		267	}
		268	}}}
		269
		270	Options:
		271
		272	* browsing: on | off (retrieve web sources or just output queries)
		273	* depth: standard | deep (evidence thoroughness)
		274	* max_claims: 1-10 (default: 5 for cost control)
		275	* scenarios_per_claim: 1-5 (default: 2 for cost control)
		276	* max_evidence_per_scenario: 3-10 (default: 6)
		277	* context_aware_analysis: true | false (experimental)
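The ranges and defaults above lend themselves to validation at the API boundary. The following is a sketch only: the class name `AnalyzeOptions` is hypothetical, and the defaults for `browsing`, `depth`, and `context_aware_analysis` are assumptions taken from the example request body, since the list does not state them.

```python
from dataclasses import dataclass

_CHOICES = {"browsing": ("on", "off"), "depth": ("standard", "deep")}
_RANGES = {
    "max_claims": (1, 10),
    "scenarios_per_claim": (1, 5),
    "max_evidence_per_scenario": (3, 10),
}


@dataclass(frozen=True)
class AnalyzeOptions:
    browsing: str = "on"  # assumed default (from the example request)
    depth: str = "standard"  # assumed default (from the example request)
    max_claims: int = 5
    scenarios_per_claim: int = 2
    max_evidence_per_scenario: int = 6
    context_aware_analysis: bool = True  # assumed default (from the example request)

    def __post_init__(self) -> None:
        for name, allowed in _CHOICES.items():
            if getattr(self, name) not in allowed:
                raise ValueError(f"{name} must be one of {allowed}")
        for name, (low, high) in _RANGES.items():
            value = getattr(self, name)
            if not isinstance(value, int) or isinstance(value, bool) or not low <= value <= high:
                raise ValueError(f"{name} must be an integer in {low}-{high}")
        if not isinstance(self.context_aware_analysis, bool):
            raise ValueError("context_aware_analysis must be true or false")
```

For example, `AnalyzeOptions(max_claims=11)` raises `ValueError`, while `AnalyzeOptions()` yields the cost-controlled defaults.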
		278
		279	Response: 202 Accepted
		280
		281	{{{{
		282	"job_id": "01J...ULID",
		283	"status": "QUEUED",
		284	"created_at": "2025-12-24T10:31:00Z",
		285	"estimated_cost": 0.114,
		286	"cost_breakdown": {
		287	"stage1_extraction": 0.003,
		288	"stage2_new_claims": 0.081,
		289	"stage2_cached_claims": 0.000,
		290	"stage3_holistic": 0.030
		291	},
		292	"cache_info": {
		293	"claims_to_extract": 5,
		294	"estimated_cache_hits": 4,
		295	"estimated_new_claims": 1
		296	},
		297	"links": {
		298	"self": "/v1/jobs/01J...ULID",
		299	"result": "/v1/jobs/01J...ULID/result",
		300	"report": "/v1/jobs/01J...ULID/report",
		301	"events": "/v1/jobs/01J...ULID/events"
		302	}
		303	}
		304	}}}
		305
		306	Error Responses:
		307
		308	402 Payment Required - Free tier limit reached, cache-only mode
		309
		310	{{{{
		311	"error": "credit_limit_reached",
		312	"message": "Monthly credit limit reached. Entering cache-only mode.",
		313	"cache_only_mode": true,
		314	"credit_remaining": 0.00,
		315	"reset_date": "2025-02-01T00:00:00Z",
		316	"action": "Resubmit with cache_preference=allow_partial for cached results"
		317	}
		318	}}}
		319
		320	----
		321
		322	== 4. Data Schemas ==
		323
		324	=== 4.1 Stage 1 Output: ClaimExtraction ===
		325
		326	{{{{
		327	"job_id": "01J...ULID",
		328	"stage": "stage1_extraction",
		329	"article_metadata": {
		330	"title": "Article title",
		331	"source_url": "https://example.com/article",
		332	"extracted_text_length": 5234,
		333	"language": "en"
		334	},
		335	"claims": [
		336	{
		337	"claim_id": "C1",
		338	"claim_text": "Original claim text from article",
		339	"canonical_claim": "Normalized, deduplicated phrasing",
		340	"claim_hash": "sha256:abc123...",
		341	"is_central_to_thesis": true,
		342	"claim_type": "causal",
		343	"evaluability": "evaluable",
		344	"risk_tier": "B",
		345	"domain": "public_health"
		346	}
		347	],
		348	"article_thesis": "Main argument detected",
		349	"cost": 0.003
		350	}
		351	}}}
		352
		353	----
		354
		355	=== 4.5 Verdict Label Taxonomy ===
		356
		357	FactHarbor uses three distinct verdict taxonomies depending on analysis level:
		358
		359	==== 4.5.1 Scenario Verdict Labels (Stage 2) ====
		360
		361	Used for individual scenario verdicts within a claim.
		362
		363	Enum Values:
		364
		365	* Highly Likely - Probability 0.85-1.0, high confidence
		366	* Likely - Probability 0.65-0.84, moderate-high confidence
		367	* Unclear - Probability 0.35-0.64, or low confidence
		368	* Unlikely - Probability 0.16-0.34, moderate-high confidence
		369	* Highly Unlikely - Probability 0.0-0.15, high confidence
		370	* Unsubstantiated - Insufficient evidence to determine probability
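A possible mapping from probability to label is sketched below. Several details are assumptions: the function name `scenario_label`, rounding the probability to two decimals so the listed ranges have no gaps, representing "insufficient evidence" as `None`, and passing low confidence as a boolean flag (the text does not define how confidence is measured).

```python
from typing import Optional


def scenario_label(probability: Optional[float], low_confidence: bool = False) -> str:
    """Map a scenario probability to one of the six scenario verdict labels."""
    if probability is None:  # assumption: None means evidence was insufficient
        return "Unsubstantiated"
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")
    if low_confidence:
        return "Unclear"
    p = round(probability, 2)  # assumption: ranges are read at two-decimal precision
    if p >= 0.85:
        return "Highly Likely"
    if p >= 0.65:
        return "Likely"
    if p >= 0.35:
        return "Unclear"
    if p >= 0.16:
        return "Unlikely"
    return "Highly Unlikely"
```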
		371
		372	==== 4.5.2 Claim Verdict Labels (Rollup) ====
		373
		374	Used when summarizing a claim across all scenarios.
		375
		376	Enum Values:
		377
		378	* Supported - At least 60% of scenarios are Likely or Highly Likely
		379	* Refuted - At least 60% of scenarios are Unlikely or Highly Unlikely
		380	* Inconclusive - Neither threshold is met (mixed scenarios, or mostly Unclear/Unsubstantiated)
		381
		382	Mapping Logic:
		383
		384	* If ≥60% scenarios are (Highly Likely | Likely) → Supported
		385	* If ≥60% scenarios are (Highly Unlikely | Unlikely) → Refuted
		386	* Otherwise → Inconclusive
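The mapping logic translates directly into code. In this sketch the function name `rollup_claim_verdict` is illustrative, and returning Inconclusive for a claim with no scenarios is an assumption, since the rules do not cover that case. Integer arithmetic is used for the 60% test to avoid floating-point edge cases.

```python
from typing import List

POSITIVE = {"Highly Likely", "Likely"}
NEGATIVE = {"Highly Unlikely", "Unlikely"}
THRESHOLD_PERCENT = 60


def rollup_claim_verdict(scenario_labels: List[str]) -> str:
    """Summarize scenario verdict labels into a single claim verdict label."""
    total = len(scenario_labels)
    if total == 0:
        return "Inconclusive"  # assumption: no scenarios means no basis for a verdict
    positive = sum(label in POSITIVE for label in scenario_labels)
    negative = sum(label in NEGATIVE for label in scenario_labels)
    if positive * 100 >= THRESHOLD_PERCENT * total:
        return "Supported"
    if negative * 100 >= THRESHOLD_PERCENT * total:
        return "Refuted"
    return "Inconclusive"
```

With the default of two scenarios per claim, both scenarios must agree for a Supported or Refuted rollup, because one of two is only 50%.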
		387
		388	==== 4.5.3 Article Verdict Labels (Stage 3) ====
		389
		390	Used for holistic article-level assessment.
		391
		392	Enum Values:
		393
		394	* WELL-SUPPORTED - Article thesis logically follows from supported claims
		395	* MISLEADING - Claims may be true but article commits logical fallacies
		396	* REFUTED - Central claims are refuted, invalidating thesis
		397	* UNCERTAIN - Insufficient evidence or highly mixed claim verdicts
		398
		399	Note: Article verdict considers claim centrality (central claims override supporting claims).
		400
		401	==== 4.5.4 API Field Mapping ====
		402
		403	|=Level|=API Field|=Enum Name
		404	|Scenario|scenarios[].verdict.label|scenario_verdict_label
		405	|Claim|claims[].rollup_verdict (optional)|claim_verdict_label
		406	|Article|article_holistic_assessment.overall_verdict|article_verdict_label
		407
		408	----
		409
		410	== 5. Cache Architecture ==
		411
		412	=== 5.1 Redis Cache Design ===
		413
		414	Technology: Redis 7.0+ (in-memory key-value store)
		415
		416	Cache Key Schema:
		417
		418	{{{claim:v1norm1:{language}:{sha256(canonical_claim)}
		419	}}}
		420
		421	Example:
		422
		423	{{{Claim (English): "COVID vaccines are 95% effective"
		424	Canonical: "covid vaccines are 95 percent effective"
		425	Language: "en"
		426	SHA256: abc123...def456
		427	Key: claim:v1norm1:en:abc123...def456
		428	}}}
		429
		430	Rationale: Prevents cross-language collisions and enables per-language cache analytics.
		431
		432	Data Structure:
		433
		434	{{{SET claim:v1norm1:en:abc123...def456 '{...ClaimAnalysis JSON...}'
		435	EXPIRE claim:v1norm1:en:abc123...def456 7776000 # 90 days
		436	}}}
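From application code, the same write can be issued as a single `SET` with an expiry, which avoids a window in which the key exists without a TTL. The sketch below assumes a client object exposing `set(name, value, ex=...)` and `get(name)` as redis-py's `Redis` class does; the helper names are hypothetical and JSON serialization is an assumption about the stored format.

```python
import json
from typing import Any, Optional

CACHE_TTL_SECONDS = 90 * 24 * 60 * 60  # 7,776,000 seconds = 90 days


def store_claim_analysis(client: Any, key: str, analysis: dict) -> None:
    """Write the analysis and its TTL in one SET ... EX call."""
    client.set(key, json.dumps(analysis), ex=CACHE_TTL_SECONDS)


def load_claim_analysis(client: Any, key: str) -> Optional[dict]:
    """Return the cached analysis, or None on a cache miss."""
    raw = client.get(key)
    return json.loads(raw) if raw is not None else None
```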
		437
		438	----
		439
		440	=== 5.1.1 Canonical Claim Normalization (v1) ===
		441
		442	The cache key depends on deterministic claim normalization. All implementations MUST follow this algorithm exactly.
		443
		444	Algorithm: Canonical Claim Normalization v1
		445
{{{import re
import unicodedata


def normalize_claim_v1(claim_text: str, language: str) -> str:
    """
    Normalizes claim to canonical form for cache key generation.
    Version: v1norm1 (POC1)
    """
    # Step 1: Unicode normalization (NFC)
    text = unicodedata.normalize('NFC', claim_text)

    # Step 2: Lowercase
    text = text.lower()

    # Step 3: Common abbreviations (English only in v1).
    # Runs before punctuation removal so 'u.s.' and 'covid-19' still match.
    if language == 'en':
        text = text.replace('covid-19', 'covid')
        text = text.replace('u.s.', 'us')
        text = text.replace('u.k.', 'uk')

    # Step 4: Percent signs. Must also run before punctuation removal,
    # which would otherwise strip '%' and lose the word "percent".
    text = text.replace('%', ' percent')

    # Step 5: Remove punctuation (hyphens are kept)
    text = re.sub(r'[^\w\s-]', '', text)

    # Step 6: Spell out standalone single-digit numbers
    num_to_word = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three',
                   '4': 'four', '5': 'five', '6': 'six', '7': 'seven',
                   '8': 'eight', '9': 'nine'}
    for num, word in num_to_word.items():
        text = re.sub(rf'\b{num}\b', word, text)

    # Step 7: Normalize whitespace (collapse runs, trim ends)
    text = re.sub(r'\s+', ' ', text).strip()

    # NO entity normalization in v1
    # (Trump vs Donald Trump vs President Trump remain distinct)

    return text


# Version identifier (include in cache namespace)
CANONICALIZER_VERSION = "v1norm1"
}}}
		489
		490	Cache Key Formula (Updated):
		491
{{{import hashlib

language = "en"
canonical = normalize_claim_v1(claim_text, language)
# Assumption: the hash is taken over the UTF-8 bytes of the canonical claim
digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
cache_key = f"claim:{CANONICALIZER_VERSION}:{language}:{digest}"

# Example:
#   claim:     "COVID-19 vaccines are 95% effective"
#   canonical: "covid vaccines are 95 percent effective"
#   sha256:    abc123...def456
#   key:       "claim:v1norm1:en:abc123...def456"
}}}
		502
		503	Cache Metadata MUST Include:
		504
		505	{{{{
		506	"canonical_claim": "covid vaccines are 95 percent effective",
		507	"canonicalizer_version": "v1norm1",
		508	"language": "en",
		509	"original_claim_samples": ["COVID-19 vaccines are 95% effective"]
		510	}
		511	}}}
		512
		513	Version Upgrade Path:
		514
		515	* v1norm1 → v1norm2: Cache namespace changes, old keys remain valid until TTL
		516	* v1normN → v2norm1: Major version bump, invalidate all v1 caches
		517
		518	----
		519
		520	=== 5.1.2 Copyright & Data Retention Policy ===
		521
		522	Evidence Excerpt Storage:
		523
		524	To comply with copyright law and fair use principles:
		525
		526	What We Store:
		527
		528	* Metadata only: Title, author, publisher, URL, publication date
		529	* Short excerpts: Max 25 words per quote, max 3 quotes per evidence item
		530	* Summaries: AI-generated bullet points (not verbatim text)
		531	* No full articles: Never store complete article text beyond job processing
		532
		533	Total per Cached Claim:
		534
		535	* Scenarios: 2 per claim
		536	* Evidence items: 6 per scenario (12 total)
		537	* Quotes: 3 per evidence × 25 words = 75 words per item
		538	* Maximum stored verbatim text: ~~900 words per claim (12 × 75)
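The per-quote and per-item limits could be enforced before anything is written to the cache. This is a sketch under stated assumptions: the helper name `clip_quotes` is hypothetical, words are split on whitespace, and over-long quotes are truncated rather than rejected.

```python
from typing import List

MAX_WORDS_PER_QUOTE = 25
MAX_QUOTES_PER_EVIDENCE = 3


def clip_quotes(quotes: List[str]) -> List[str]:
    """Keep at most 3 quotes per evidence item, each cut to at most 25 words."""
    clipped = []
    for quote in quotes[:MAX_QUOTES_PER_EVIDENCE]:
        words = quote.split()
        clipped.append(" ".join(words[:MAX_WORDS_PER_QUOTE]))
    return clipped
```

With 12 evidence items per claim, this caps verbatim text at 12 × 3 × 25 = 900 words, matching the maximum stated above.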
		539
		540	Retention:
		541
		542	* Cache TTL: 90 days
		543	* Job outputs: 24 hours (then archived or deleted)
		544	* No persistent full-text article storage
		545
		546	Rationale:
		547
		548	* Short excerpts used for citation are intended to fall within fair use
		549	* Summaries are transformative (AI-generated paraphrase, not verbatim reproduction)
		550	* Limited retention (90 days max)
		551	* No commercial republication of excerpts
		552
		553	DMCA Compliance:
		554
		555	* Cache invalidation endpoint available for rights holders
		556	* Contact: dmca@factharbor.org
		557
		558	----
		559
		560	== Summary ==
		561
		562	This WYSIWYG preview shows the structure and key sections of the 1,515-line API specification.
		563
		564	Full specification includes:
		565
		566	* Complete API endpoints (7 total)
		567	* All data schemas (ClaimExtraction, ClaimAnalysis, HolisticAssessment, Complete)
		568	* Quality gates & validation rules
		569	* LLM configuration for all 3 stages
		570	* Implementation notes with code samples
		571	* Testing strategy
		572	* Cross-references to other pages
		573
		574	The complete specification is available in:
		575
		576	* FactHarbor_POC1_API_and_Schemas_Spec_v0_4_1_PATCHED.md (45 KB standalone)
		577	* Export files (TEST/PRODUCTION) for xWiki import