POC1 API & Schemas Specification
Version: 0.4.1 (POC1)
Scope: POC1 (Test) — URL/Text → Stage 1 Claim Extraction → Stage 2 Claim Analysis (cached) → Stage 3 Article Assessment → result.json + report.md
Version History
| Version | Date | Changes |
|---|---|---|
| 0.4.1 | 2025-12-24 | Codegen-ready canonical contract: locked enums + mapping, deterministic normalization (v1norm1), idempotency, cache_preference semantics, minimal OpenAPI 3.1, URL extraction + SSRF rules, evidence excerpt policy |
| 0.4 | 2025-12-24 | 3-stage pipeline with claim-level caching; cache-only mode for free users; Redis cache architecture |
| 0.3.1 | 2025-12-24 | Reduced schema naming drift; clarified SSE (progress only); added cost knobs and constraints |
| 0.3 | 2025-12-24 | Initial endpoint set + draft schemas |
POC1 Codegen Contract (Canonical)
1) Goals
POC1 provides:
- A REST API to submit either a URL or raw text
- Asynchronous execution (job-based) with progress/status and optional SSE events
- Deterministic outputs:
  - ``result.json`` (machine-readable; schema-validated)
  - ``report.md`` (human-readable; rendered from JSON via template)
- Mandatory counter-evidence attempt per scenario (or explicit “not found despite search” uncertainty note)
- Claim-level caching to validate cost sustainability
2) Non-goals (POC1)
- Full editorial workflow / human review UI
- Full provenance ledger beyond minimal job metadata
- Complex billing/rate-limit systems (simple caps are fine)
- Multi-language i18n beyond best-effort language detection
3) Pipeline model (locked)
- Stage 1: Claim Extraction
  - Input: text extracted from a URL, or pasted text
  - Output: claim candidates, each with a canonical claim and ``claim_hash``
- Stage 2: Claim Analysis (cached by ``claim_hash``)
  - Input: canonical claim(s)
  - Output: scenarios, evidence/counter-evidence, scenario verdicts, plus a claim-level verdict summary
- Stage 3: Article Assessment
  - Input: article text + Stage 2 analyses (from cache and/or freshly computed)
  - Output: article-level assessment (main thesis, reasoning quality, key risks, context)
4) Locked verdict taxonomies (two enums + mapping)
Scenario verdict enum (``ScenarioVerdict.verdict_label``):
- ``Highly likely`` | ``Likely`` | ``Unclear`` | ``Unlikely`` | ``Highly unlikely`` | ``Unsubstantiated``
Claim verdict enum (``ClaimVerdict.verdict_label``):
- ``Supported`` | ``Refuted`` | ``Inconclusive``
Mapping rule (for summaries):
- If the claim’s primary-interpretation scenario verdict is:
  - ``Highly likely`` / ``Likely`` ⇒ ``Supported``
  - ``Highly unlikely`` / ``Unlikely`` ⇒ ``Refuted``
  - ``Unclear`` / ``Unsubstantiated`` ⇒ ``Inconclusive``
- If scenarios materially disagree (different assumptions lead to different outcomes) ⇒ ``Inconclusive``, and explain why.
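The mapping rule above can be sketched as a small lookup. This is only an illustration: the function name `map_claim_verdict` and the `scenarios_disagree` flag are hypothetical, and how "material disagreement" is detected is not defined by this spec, so the sketch takes it as an input.

```python
# Sketch of the scenario-verdict to claim-verdict mapping.
# `map_claim_verdict` and `scenarios_disagree` are illustrative names,
# not part of the contract.
SCENARIO_TO_CLAIM = {
    "Highly likely": "Supported",
    "Likely": "Supported",
    "Highly unlikely": "Refuted",
    "Unlikely": "Refuted",
    "Unclear": "Inconclusive",
    "Unsubstantiated": "Inconclusive",
}


def map_claim_verdict(primary_label: str, scenarios_disagree: bool = False) -> str:
    """Return the claim verdict for the primary-interpretation scenario verdict."""
    if scenarios_disagree:
        # Material disagreement between scenarios always yields Inconclusive.
        return "Inconclusive"
    # A KeyError here means the label is outside the locked enum.
    return SCENARIO_TO_CLAIM[primary_label]
```

When the disagreement branch is taken, the caller is still expected to explain why in the rationale bullets.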
5) Deterministic canonical claim normalization (locked)
Normalization version: ``v1norm1``
Cache key namespace: ``claim:v1norm1:{language}:{sha256(canonical_claim_text)}``
Rules (MUST be implemented exactly; any change requires bumping normalization version):
1. Unicode normalize: NFD
2. Lowercase
3. Strip diacritics
4. Normalize apostrophes: ``’`` and ``‘`` → ``'``
5. Normalize percent: ``%`` → `` percent``
6. Normalize whitespace (collapse)
7. Remove punctuation except apostrophes
8. Expand contractions (fixed list for v1norm1)
9. Normalize whitespace again
Contractions list (v1norm1):
- don't→do not, doesn't→does not, didn't→did not, can't→cannot, won't→will not,
- shouldn't→should not, wouldn't→would not, isn't→is not, aren't→are not,
- wasn't→was not, weren't→were not
Reference implementation (normative):
import re
import unicodedata

# Canonical claim normalization for deduplication.
# Version: v1norm1
#
# IMPORTANT:
# - Any change to these rules REQUIRES a new normalization version.
# - Cache keys MUST include the normalization version to avoid collisions.

CONTRACTIONS_V1NORM1 = {
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "can't": "cannot",
    "won't": "will not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
}


def normalize_claim(text: str) -> str:
    """Canonical claim normalization (v1norm1)."""
    if text is None:
        return ""
    # 1) Unicode normalization
    text = unicodedata.normalize("NFD", text)
    # 2) Lowercase
    text = text.lower()
    # 3) Remove diacritics
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")
    # 4) Normalize apostrophes (common unicode variants)
    text = text.replace("’", "'").replace("‘", "'")
    # 5) Normalize percent sign
    text = text.replace("%", " percent")
    # 6) Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    # 7) Remove punctuation except apostrophes
    text = re.sub(r"[^\w\s']", "", text)
    # 8) Expand contractions (fixed list for v1norm1)
    for k, v in CONTRACTIONS_V1NORM1.items():
        text = re.sub(rf"\b{re.escape(k)}\b", v, text)
    # 9) Final whitespace normalization
    text = re.sub(r"\s+", " ", text).strip()
    return text
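As a companion to the reference implementation, the cache key namespace given in this section could be built roughly as follows. This is a sketch, not normative: the helper name `claim_cache_key` is hypothetical, and hashing the UTF-8 encoding of the canonical text is an assumption, since the spec does not name an encoding.

```python
import hashlib

NORMALIZATION_VERSION = "v1norm1"


def claim_cache_key(canonical_claim_text: str, language: str) -> str:
    """Build `claim:v1norm1:{language}:{sha256(canonical_claim_text)}`.

    Assumption: the sha256 digest is taken over the UTF-8 bytes of the
    already-normalized claim text and rendered as lowercase hex.
    """
    digest = hashlib.sha256(canonical_claim_text.encode("utf-8")).hexdigest()
    return f"claim:{NORMALIZATION_VERSION}:{language}:{digest}"
```

Because the normalization version is part of the key, entries written under a future rule set would not collide with ``v1norm1`` entries.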
6) Report generation (locked)
- ``report.md`` MUST be generated from ``result.json`` using a deterministic template (server-side).
- The LLM MUST NOT be the sole renderer of report markdown.
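A deterministic renderer can be as simple as a pure function over the parsed ``result.json``. The sketch below is one possible shape, assuming the field names from the AnalysisResult schema in this document; the function name `render_report` and the markdown layout are hypothetical, not the actual template.

```python
def render_report(result: dict) -> str:
    """Render report markdown from a parsed result.json (illustrative layout).

    Pure function: no clock, randomness, or model calls, so the same
    input always yields byte-identical output.
    """
    assessment = result["article_assessment"]
    lines = [
        f"# Analysis report for job {result['job_id']}",
        "",
        "## Article assessment",
        "",
        f"Main thesis: {assessment.get('main_thesis', '')}",
        f"Reasoning quality: {assessment.get('overall_reasoning_quality', '')}",
        "",
        assessment.get("summary", ""),
        "",
        "## Key risks",
        "",
    ]
    lines += [f"- {risk}" for risk in assessment.get("key_risks", [])]
    lines += ["", "## Claims", ""]
    for analysis in result.get("claim_analyses", []):
        verdict = analysis["claim_verdict"]
        lines.append(f"- `{analysis['claim_hash']}`: {verdict['verdict_label']}")
        # Only the short, user-visible rationale bullets are rendered.
        lines += [f"  - {bullet}" for bullet in verdict.get("rationale_bullets", [])]
    return "\n".join(lines) + "\n"
```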
7) No chain-of-thought storage/exposure (locked)
- Store/expose only short, user-visible rationale bullets.
- Never store or expose internal reasoning traces.
REST API (POC1)
1) API Versioning
Base path: ``/v1``
2) Content types
- Requests: ``application/json``
- JSON responses: ``application/json``
- Report download: ``text/markdown; charset=utf-8``
- SSE events: ``text/event-stream``
3) Time & IDs
- All timestamps: ISO 8601 UTC (e.g., ``2025-12-24T19:31:30Z``)
- ``job_id``: ULID (string)
- ``claim_hash``: sha256 hex string (lowercase), computed over canonical claim + version + language as specified
4) Authentication
All ``/v1`` endpoints require:
- Header: ``Authorization: Bearer <API_KEY>``
5) Error Envelope (all non-2xx)
{
"error": {
"code": "CACHE_MISS | VALIDATION_ERROR | UNAUTHORIZED | FORBIDDEN | NOT_FOUND | RATE_LIMITED | UPSTREAM_FETCH_ERROR | INTERNAL_ERROR",
"message": "Human readable message",
"details": {
"field_errors": [
{"field": "input_url", "issue": "Must be a valid https URL"}
]
}
}
}
Endpoints
1) POST /v1/analyze
Creates an asynchronous job. Exactly one of ``input_url`` or ``input_text`` MUST be provided.
Request (AnalyzeRequest)
{
"input_url": "https://example.org/article",
"input_text": null,
"options": {
"max_claims": 5,
"cache_preference": "prefer_cache",
"browsing": "on",
"output_report": true
},
"client": {
"request_id": "optional-idempotency-key"
}
}
Options
- ``max_claims``: integer, 1..50 (default 5)
- ``cache_preference``: ``prefer_cache`` | ``allow_partial`` | ``cache_only`` | ``skip_cache``
  - ``prefer_cache``: use cache when available; otherwise run full pipeline
  - ``allow_partial``: if Stage 2 cached analyses exist, server MAY skip Stage 1+2 and rerun only Stage 3 using cached analyses
  - ``cache_only``: do not run Stage 2 for uncached claims; fail with CACHE_MISS (402) if required claim analyses are missing
  - ``skip_cache``: ignore cache and recompute
- ``browsing``: ``on`` | ``off``
  - ``off``: do not retrieve evidence; mark evidence items as ``NEEDS_RETRIEVAL`` and output retrieval queries
- ``output_report``: boolean (default true)
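The request constraints above (exactly one input, option ranges and enums) could be checked along these lines. This is a minimal sketch: `validate_analyze_request` is a hypothetical helper, the issue strings are illustrative, and the returned list is shaped to fit ``details.field_errors`` in the error envelope.

```python
CACHE_PREFERENCES = {"prefer_cache", "allow_partial", "cache_only", "skip_cache"}
BROWSING_MODES = {"on", "off"}


def validate_analyze_request(body: dict) -> list:
    """Return a list of {"field", "issue"} dicts; an empty list means valid."""
    errors = []
    has_url = body.get("input_url") is not None
    has_text = body.get("input_text") is not None
    if has_url == has_text:  # both present or both missing
        errors.append({"field": "input_url",
                       "issue": "Provide exactly one of input_url or input_text"})

    options = body.get("options") or {}
    max_claims = options.get("max_claims", 5)
    # `type(...) is int` deliberately rejects booleans, which are ints in Python.
    if type(max_claims) is not int or not 1 <= max_claims <= 50:
        errors.append({"field": "options.max_claims",
                       "issue": "Must be an integer in 1..50"})
    if options.get("cache_preference", "prefer_cache") not in CACHE_PREFERENCES:
        errors.append({"field": "options.cache_preference",
                       "issue": "Unknown cache preference"})
    if options.get("browsing", "on") not in BROWSING_MODES:
        errors.append({"field": "options.browsing",
                       "issue": "Must be 'on' or 'off'"})
    if not isinstance(options.get("output_report", True), bool):
        errors.append({"field": "options.output_report",
                       "issue": "Must be a boolean"})
    return errors
```

A non-empty result would typically be returned to the client with the ``VALIDATION_ERROR`` code.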
Response: 202 Accepted (JobCreated)
{
"job_id": "01J8Y9K6M2Q1J0JZ7E5P8H7Y9C",
"status": "QUEUED",
"created_at": "2025-12-24T19:30:00Z",
"estimated_cost": {
"credits": 420,
"explain": "Estimate may change after Stage 1 (claim count)."
},
"links": {
"self": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C",
"events": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C/events",
"result": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C/result",
"report": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C/report"
}
}
Cache-only failure: 402 (CACHE_MISS)
{
"error": {
"code": "CACHE_MISS",
"message": "cache_only requested but a required claim analysis is not cached.",
"details": {
"missing_claim_hash": "9f2a...c01",
"normalization_version": "v1norm1"
}
}
}
2) GET /v1/jobs/{job_id}
Returns job status and progress.
Response: 200 OK (Job)
{
"job_id": "01J...",
"status": "RUNNING",
"created_at": "2025-12-24T19:30:00Z",
"updated_at": "2025-12-24T19:31:12Z",
"progress": {
"stage": "STAGE2_CLAIM_ANALYSIS",
"stage_progress": 0.4,
"message": "Analyzing claim 3/8"
},
"links": {
"events": "/v1/jobs/01J.../events",
"result": "/v1/jobs/01J.../result",
"report": "/v1/jobs/01J.../report"
}
}
Statuses:
- ``QUEUED`` → ``RUNNING`` → ``SUCCEEDED`` | ``FAILED`` | ``CANCELED``
Stages:
- ``STAGE1_CLAIM_EXTRACT``
- ``STAGE2_CLAIM_ANALYSIS``
- ``STAGE3_ARTICLE_ASSESSMENT``
3) GET /v1/jobs/{job_id}/events
Server-Sent Events (SSE) for job progress reporting (no token streaming).
Event types (minimum):
- ``job.created``
- ``stage.started``
- ``stage.progress``
- ``stage.completed``
- ``job.succeeded``
- ``job.failed``
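On the client side, the event stream can be consumed with a small parser for the standard SSE wire format (``event:`` and ``data:`` fields, with a blank line ending each event). The sketch below is illustrative only: `iter_sse_events` is a hypothetical helper, and the payload shape shown in the usage note is an assumption, since this spec lists event names but not payloads.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event_name = "message"  # SSE default when no `event:` field is sent
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it carries data.
            if data_lines:
                yield event_name, "\n".join(data_lines)
            event_name, data_lines = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is stripped
        if field == "event":
            event_name = value
        elif field == "data":
            data_lines.append(value)
```

For example, feeding it the lines ``event: stage.progress``, ``data: {"stage_progress": 0.4}`` and an empty line yields one ``("stage.progress", '{"stage_progress": 0.4}')`` pair, which the client could then decode as JSON if the server sends JSON payloads.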
4) GET /v1/jobs/{job_id}/result
Returns JSON result.
- ``200`` if job ``SUCCEEDED``
- ``409`` if not finished
5) GET /v1/jobs/{job_id}/report
Returns ``text/markdown`` report.
- ``200`` if job ``SUCCEEDED``
- ``409`` if not finished
6) DELETE /v1/jobs/{job_id}
Cancels a job (best-effort) and deletes stored artifacts if supported.
- ``204`` success
- ``404`` if not found
7) GET /v1/health
Response: 200 OK
{
"status": "ok",
"service": "factharbor-poc1",
"version": "v0.9.105",
"time": "2025-12-24T19:31:30Z"
}
Idempotency (Required)
Clients SHOULD send:
- Header: ``Idempotency-Key: <string>`` (preferred)
- Or, in the body: ``client.request_id``
Server behavior:
- Same key + same request body → return existing job (``200``) with:
  - ``idempotent=true``
  - ``original_request_at``
- Same key + different request body → ``409`` with ``VALIDATION_ERROR`` and mismatch details
Idempotency TTL: 24 hours (matches minimum job retention)
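The server behavior above could be sketched with an in-memory store. Everything here is an assumption made for illustration: the class and method names are hypothetical, request bodies are compared via a sha256 fingerprint of their sorted-key JSON form, and a production server would more likely keep these entries in a shared store with a native TTL.

```python
import hashlib
import json
import time


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (maps to 409)."""


class IdempotencyStore:
    TTL_SECONDS = 24 * 60 * 60  # 24 hours, per the spec

    def __init__(self, now=time.time):
        self._now = now
        self._entries = {}  # key -> (body fingerprint, job_id, created_at)

    @staticmethod
    def _fingerprint(body: dict) -> str:
        # Sorted keys make the fingerprint independent of JSON key order.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def submit(self, key: str, body: dict, new_job_id: str):
        """Return (job_id, idempotent); raise IdempotencyConflict on mismatch."""
        now = self._now()
        fingerprint = self._fingerprint(body)
        entry = self._entries.get(key)
        if entry is not None and now - entry[2] < self.TTL_SECONDS:
            if entry[0] != fingerprint:
                raise IdempotencyConflict(key)
            return entry[1], True  # replay: hand back the existing job
        # First use of the key, or the previous entry has expired.
        self._entries[key] = (fingerprint, new_job_id, now)
        return new_job_id, False
```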
Schemas (result.json)
Top-level AnalysisResult
{
"job_id": "01J...",
"input": {
"source_type": "url|text",
"source": "https://example.org/article",
"language": "en",
"retrieved_at_utc": "2025-12-24T19:30:12Z",
"extraction": {
"method": "jina|trafilatura|manual",
"word_count": 1234
}
},
"claim_extraction": {
"normalization_version": "v1norm1",
"claims": [
{
"claim_hash": "sha256hex...",
"claim_text": "as found in source",
"canonical_claim_text": "normalized claim",
"confidence": 0.72
}
]
},
"claim_analyses": [
{
"claim_hash": "sha256hex...",
"claim_verdict": {
"verdict_label": "Supported|Refuted|Inconclusive",
"confidence": 0.63,
"rationale_bullets": [
"Short bullet 1",
"Short bullet 2"
]
},
"scenarios": [
{
"scenario_id": "01J...ULID",
"scenario_title": "Interpretation / boundary",
"definitions": { "term": "definition" },
"assumptions": ["..."],
"boundaries": { "time": "...", "geography": "...", "population": "...", "conditions": "..." },
"retrieval_plan": {
"queries": [
{"q": "support query", "purpose": "support"},
{"q": "counter query", "purpose": "counter"}
]
},
"evidence": [
{
"evidence_id": "01J...ULID",
"stance": "supports|undermines|mixed|context_dependent",
"relevance": 0.0,
"summary_bullets": ["..."],
"citation": {
"title": "Source title",
"publisher": "Publisher",
"author_or_org": "Org/Author",
"publication_date": "YYYY-MM-DD",
"url": "https://...",
"retrieved_at_utc": "2025-12-24T19:31:01Z"
},
"excerpt": "optional short quote (max 25 words)",
"reliability_rating": "high|medium|low",
"limitations": ["..."],
"retrieval_status": "OK|NEEDS_RETRIEVAL|FAILED"
}
],
"verdict": {
"verdict_label": "Highly likely|Likely|Unclear|Unlikely|Highly unlikely|Unsubstantiated",
"probability_range": [0.0, 1.0],
"confidence": 0.0,
"rationale_bullets": ["..."],
"key_supporting_evidence_ids": ["01J...ULID"],
"key_counter_evidence_ids": ["01J...ULID"],
"uncertainty_factors": ["..."],
"what_would_change_my_mind": ["..."]
}
}
],
"quality_gates": {
"gate1_claim_validation": "pass|partial|fail",
"gate2_contradiction_search": "pass|partial|fail",
"gate3_uncertainty_disclosure": "pass|partial|fail",
"gate4_verdict_confidence": "pass|partial|fail",
"fail_reasons": []
}
}
],
"article_assessment": {
"main_thesis": "string",
"thesis_support": "supported|challenged|mixed|unclear",
"overall_reasoning_quality": "high|medium|low",
"summary": "string",
"key_risks": [
"missing evidence",
"cherry-picking",
"correlation/causation",
"time window mismatch"
],
"how_claims_connect_to_thesis": [
"Short bullets connecting claim analyses to the thesis assessment"
]
},
"global_notes": {
"limitations": ["..."],
"policy_notes": []
}
}
Mandatory counter-evidence rule (POC1)
For each scenario, attempt to include at least one evidence item with:
- ``stance`` ∈ {``undermines``, ``mixed``, ``context_dependent``}
If not found:
- include explicit “not found despite targeted search” note in ``uncertainty_factors``
- and/or include evidence items with ``retrieval_status=FAILED`` indicating the attempted search
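A quality gate for this rule might look like the check below. It is a sketch over the scenario shape from the AnalysisResult schema; the function name `counter_evidence_attempted` is hypothetical, and matching the uncertainty note by the substring "not found" is an assumption, because the spec does not fix the note's wording.

```python
COUNTER_STANCES = {"undermines", "mixed", "context_dependent"}


def counter_evidence_attempted(scenario: dict) -> bool:
    """True if the scenario shows a counter-evidence item or a documented attempt."""
    evidence = scenario.get("evidence", [])
    has_counter = any(item.get("stance") in COUNTER_STANCES for item in evidence)
    # A FAILED retrieval records that a search was attempted.
    has_failed_search = any(item.get("retrieval_status") == "FAILED" for item in evidence)
    factors = scenario.get("verdict", {}).get("uncertainty_factors", [])
    # Assumption: the explicit note contains the phrase "not found".
    has_note = any("not found" in factor.lower() for factor in factors)
    return has_counter or has_failed_search or has_note
```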
URL Extraction Rules (POC1)
- Primary extraction: Jina Reader (if enabled)
- Fallback: Trafilatura (or equivalent)
SSRF protections (required):
- Block local IPs, metadata IPs, file:// URLs, and internal hostnames
- If blocked: return ``UPSTREAM_FETCH_ERROR`` with reason
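A first-pass URL screen for these rules could look like the sketch below. It only inspects the URL string: a real deployment would also need to resolve the hostname and apply the same IP checks to every resolved address and every redirect hop. The helper name `ssrf_block_reason`, the blocked-hostname lists, and the https-only scheme policy (taken from the error envelope example) are assumptions.

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"https"}  # assumption, based on the field_errors example
BLOCKED_HOSTNAMES = {"localhost"}  # illustrative
BLOCKED_HOST_SUFFIXES = (".localhost", ".local", ".internal")  # illustrative


def ssrf_block_reason(url: str) -> Optional[str]:
    """Return a reason string if the URL must be blocked, else None."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return f"scheme not allowed: {parts.scheme or '(none)'}"  # covers file://
    host = (parts.hostname or "").lower()
    if not host:
        return "missing host"
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        ip = None  # not an IP literal, treat as a hostname
    if ip is not None:
        # is_global is False for loopback, private, and link-local ranges,
        # which includes the 169.254.169.254 metadata address.
        return None if ip.is_global else "non-public IP address"
    if host in BLOCKED_HOSTNAMES or host.endswith(BLOCKED_HOST_SUFFIXES):
        return "internal hostname"
    return None
```

A non-None reason would be surfaced to the client in an ``UPSTREAM_FETCH_ERROR`` response.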
Copyright/ToS safe storage policy (POC1):
- Store only metadata + short excerpts
- Excerpts: max 25 words per quote, and cap total quotes per source
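The 25-word excerpt cap can be enforced before storage with a small helper. This is a sketch under the assumption that "words" means whitespace-separated tokens; `clip_excerpt` is a hypothetical name, and the per-source quote cap is left out because the spec does not give a number.

```python
MAX_EXCERPT_WORDS = 25


def clip_excerpt(quote: str, max_words: int = MAX_EXCERPT_WORDS) -> str:
    """Collapse whitespace and keep at most `max_words` words of the quote."""
    words = quote.split()  # splits on any run of whitespace
    return " ".join(words[:max_words])
```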
Cost Control Knobs (POC1 defaults)
Defaults:
- ``max_claims = 5``
- ``scenarios_per_claim = 2..3`` (internal Stage 2 policy)
- Cap evidence items per scenario (recommended: 6 total; at least 1 counter)
- Keep rationales concise (bullets)
Minimal OpenAPI 3.1 (POC1)
openapi: 3.1.0
info:
  title: FactHarbor POC1 API
  version: 0.9.105
paths:
  /v1/analyze:
    post:
      summary: Create analysis job
      parameters:
        - in: header
          name: Idempotency-Key
          required: false
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnalyzeRequest'
      responses:
        '202':
          description: Accepted
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/JobCreated'
        '4XX':
          description: Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/jobs/{job_id}:
    get:
      summary: Get job status
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Job'
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
    delete:
      summary: Cancel job (best-effort) and delete artifacts
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '204': { description: No Content }
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/jobs/{job_id}/events:
    get:
      summary: Job progress via SSE
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: text/event-stream
  /v1/jobs/{job_id}/result:
    get:
      summary: Get final JSON result
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/AnalysisResult'
        '409':
          description: Not ready
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/jobs/{job_id}/report:
    get:
      summary: Download report (markdown)
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: text/markdown
        '409':
          description: Not ready
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/health:
    get:
      summary: Health check
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                type: object
                properties:
                  status: { type: string }
components:
  schemas:
    AnalyzeRequest:
      type: object
      required: [options]
      properties:
        input_url: { type: ['string', 'null'] }
        input_text: { type: ['string', 'null'] }
        options:
          $ref: '#/components/schemas/AnalyzeOptions'
        client:
          type: object
          properties:
            request_id: { type: string }
    AnalyzeOptions:
      type: object
      properties:
        max_claims: { type: integer, minimum: 1, maximum: 50, default: 5 }
        cache_preference:
          type: string
          enum: [prefer_cache, allow_partial, cache_only, skip_cache]
          default: prefer_cache
        browsing:
          type: string
          # Quoted so that YAML 1.1 parsers do not read on/off as booleans.
          enum: ['on', 'off']
          default: 'on'
        output_report: { type: boolean, default: true }
    JobCreated:
      type: object
      required: [job_id, status, created_at, links]
      properties:
        job_id: { type: string }
        status: { type: string }
        created_at: { type: string }
        estimated_cost:
          type: object
          properties:
            credits: { type: integer }
            explain: { type: string }
        links:
          type: object
          properties:
            self: { type: string }
            events: { type: string }
            result: { type: string }
            report: { type: string }
    Job:
      type: object
      required: [job_id, status, created_at, updated_at, links]
      properties:
        job_id: { type: string }
        status: { type: string, enum: [QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELED] }
        progress:
          type: object
          properties:
            stage: { type: string }
            stage_progress: { type: number, minimum: 0, maximum: 1 }
            message: { type: string }
        created_at: { type: string }
        updated_at: { type: string }
        links:
          type: object
          properties:
            events: { type: string }
            result: { type: string }
            report: { type: string }
    AnalysisResult:
      type: object
      required: [job_id, claim_extraction, claim_analyses, article_assessment]
      properties:
        job_id: { type: string }
        claim_extraction:
          type: object
          properties:
            normalization_version: { type: string }
            claims:
              type: array
              items:
                type: object
                properties:
                  claim_hash: { type: string }
                  claim_text: { type: string }
                  canonical_claim_text: { type: string }
                  confidence: { type: number }
        claim_analyses:
          type: array
          items:
            type: object
            properties:
              claim_hash: { type: string }
              scenarios:
                type: array
                items:
                  type: object
                  properties:
                    scenario_id: { type: string }
                    scenario_title: { type: string }
                    verdict:
                      type: object
                      properties:
                        verdict_label: { type: string }
                        confidence: { type: number }
                        rationale_bullets:
                          type: array
                          items: { type: string }
        article_assessment:
          type: object
          properties:
            overall_reasoning_quality: { type: string, enum: [high, medium, low] }
            summary: { type: string }
            key_risks:
              type: array
              items: { type: string }
    ErrorEnvelope:
      type: object
      properties:
        error:
          type: object
          properties:
            code: { type: string }
            message: { type: string }
            details: { type: object }