POC1 API & Schemas Specification

Version 2.1 by Robert Schaub on 2025/12/24 21:08

Version: 0.4.1 (POC1)
Scope: POC1 (Test) — URL/Text → Stage 1 Claim Extraction → Stage 2 Claim Analysis (cached) → Stage 3 Article Assessment → result.json + report.md

Version History

Version	Date	Changes
0.4.1	2025-12-24	Codegen-ready canonical contract: locked enums + mapping, deterministic normalization (v1norm1), idempotency, cache_preference semantics, minimal OpenAPI 3.1, URL extraction + SSRF rules, evidence excerpt policy
0.4	2025-12-24	3-stage pipeline with claim-level caching; cache-only mode for free users; Redis cache architecture
0.3.1	2025-12-24	Reduced schema naming drift; clarified SSE (progress only); added cost knobs and constraints
0.3	2025-12-24	Initial endpoint set + draft schemas

POC1 Codegen Contract (Canonical)

This section is the authoritative, code-generation-ready contract for POC1.
If any other page conflicts with this section, this section wins.

Format note: This page is authored in xWiki 2.1 syntax. If exchanged as a transport ``.md`` file, import it as xWiki content, not Markdown.

1) Goals

POC1 provides:

A REST API to submit either a URL or raw text
Asynchronous execution (job-based) with progress/status and optional SSE events
Deterministic outputs:
- ``result.json`` (machine-readable; schema-validated)
- ``report.md`` (human-readable; rendered from JSON via template)
Mandatory counter-evidence attempt per scenario (or explicit “not found despite search” uncertainty note)
Claim-level caching to validate cost sustainability

2) Non-goals (POC1)

Full editorial workflow / human review UI
Full provenance ledger beyond minimal job metadata
Complex billing/rate-limit systems (simple caps are fine)
Multi-language i18n beyond best-effort language detection

3) Pipeline model (locked)

Stage 1 — Claim Extraction
- Input: text extracted from a URL, or pasted text
- Output: claim candidates including canonical claim + ``claim_hash``
Stage 2 — Claim Analysis (cached by claim_hash)
- Input: canonical claim(s)
- Output: scenarios, evidence/counter-evidence, scenario verdicts, plus a claim-level verdict summary
Stage 3 — Article Assessment
- Input: article text + Stage 2 analyses (from cache and/or freshly computed)
- Output: article-level assessment (main thesis, reasoning quality, key risks, context)

4) Locked verdict taxonomies (two enums + mapping)

Scenario verdict enum (``ScenarioVerdict.verdict_label``):

``Highly likely`` | ``Likely`` | ``Unclear`` | ``Unlikely`` | ``Highly unlikely`` | ``Unsubstantiated``

Claim verdict enum (``ClaimVerdict.verdict_label``):

``Supported`` | ``Refuted`` | ``Inconclusive``

Mapping rule (for summaries):

If the claim’s primary-interpretation scenario verdict is:
- ``Highly likely`` / ``Likely`` ⇒ ``Supported``
- ``Highly unlikely`` / ``Unlikely`` ⇒ ``Refuted``
- ``Unclear`` / ``Unsubstantiated`` ⇒ ``Inconclusive``
If scenarios materially disagree (different assumptions lead to different outcomes) ⇒ ``Inconclusive``, and explain why.
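
A minimal sketch of the mapping rule in Python. The function name ``claim_verdict_label`` is hypothetical, and the test for "materially disagree" (any other scenario mapping to a different claim label) is an assumption; the spec does not define that test precisely.

```python
from typing import Sequence

# Locked mapping from ScenarioVerdict.verdict_label to ClaimVerdict.verdict_label.
SCENARIO_TO_CLAIM = {
    "Highly likely": "Supported",
    "Likely": "Supported",
    "Highly unlikely": "Refuted",
    "Unlikely": "Refuted",
    "Unclear": "Inconclusive",
    "Unsubstantiated": "Inconclusive",
}


def claim_verdict_label(primary: str, others: Sequence[str] = ()) -> str:
    """Map the primary-interpretation scenario verdict to a claim verdict.

    `others` holds the verdict labels of the remaining scenarios.
    """
    mapped = SCENARIO_TO_CLAIM[primary]
    # Assumption: scenarios "materially disagree" when any other scenario
    # would map to a different claim-level label than the primary one.
    if any(SCENARIO_TO_CLAIM[label] != mapped for label in others):
        return "Inconclusive"
    return mapped
```

When the result is ``Inconclusive`` because of disagreement, the caller still has to add the explanation to the rationale bullets; this sketch only returns the label.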

5) Deterministic canonical claim normalization (locked)

Normalization version: ``v1norm1``
Cache key namespace: ``claim:v1norm1:{language}:{sha256(canonical_claim_text)}``

Rules (MUST be implemented exactly; any change requires bumping normalization version):

1. Unicode normalize: NFD
2. Lowercase
3. Strip diacritics
4. Normalize apostrophes: ``’`` and ``‘`` → ``'``
5. Normalize percent: ``%`` → `` percent``
6. Normalize whitespace (collapse)
7. Remove punctuation except apostrophes
8. Expand contractions (fixed list for v1norm1)
9. Normalize whitespace again

Contractions list (v1norm1):

don't→do not, doesn't→does not, didn't→did not, can't→cannot, won't→will not,
shouldn't→should not, wouldn't→would not, isn't→is not, aren't→are not,
wasn't→was not, weren't→were not

Reference implementation (normative):
import re
import unicodedata

# Canonical claim normalization for deduplication.
# Version: v1norm1
#
# IMPORTANT:
# - Any change to these rules REQUIRES a new normalization version.
# - Cache keys MUST include the normalization version to avoid collisions.

CONTRACTIONS_V1NORM1 = {
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "can't": "cannot",
    "won't": "will not",
    "shouldn't": "should not",
    "wouldn't": "would not",
    "isn't": "is not",
    "aren't": "are not",
    "wasn't": "was not",
    "weren't": "were not",
}


def normalize_claim(text: str) -> str:
    """Canonical claim normalization (v1norm1)."""
    if text is None:
        return ""

    # 1) Unicode normalization (decomposes accented characters)
    text = unicodedata.normalize("NFD", text)

    # 2) Lowercase
    text = text.lower()

    # 3) Remove diacritics (combining marks left over from NFD)
    text = "".join(c for c in text if unicodedata.category(c) != "Mn")

    # 4) Normalize apostrophes (common unicode variants)
    text = text.replace("’", "'").replace("‘", "'")

    # 5) Normalize percent sign
    text = text.replace("%", " percent")

    # 6) Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()

    # 7) Remove punctuation except apostrophes
    text = re.sub(r"[^\w\s']", "", text)

    # 8) Expand contractions (fixed list for v1norm1)
    for k, v in CONTRACTIONS_V1NORM1.items():
        text = re.sub(rf"\b{re.escape(k)}\b", v, text)

    # 9) Final whitespace normalization
    text = re.sub(r"\s+", " ", text).strip()

    return text
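
The cache key namespace above can be sketched as follows. This follows the key format ``claim:v1norm1:{language}:{sha256(canonical_claim_text)}`` literally, so only the canonical claim text is hashed. The helper names ``claim_hash`` and ``cache_key`` and the UTF-8 encoding of the input are assumptions, not part of the locked contract.

```python
import hashlib

NORMALIZATION_VERSION = "v1norm1"


def claim_hash(canonical_claim_text: str) -> str:
    """Lowercase sha256 hex digest of the already-normalized claim text.

    Assumption: the text is encoded as UTF-8 before hashing.
    """
    return hashlib.sha256(canonical_claim_text.encode("utf-8")).hexdigest()


def cache_key(canonical_claim_text: str, language: str) -> str:
    """Build the cache key: claim:v1norm1:{language}:{sha256(...)}."""
    digest = claim_hash(canonical_claim_text)
    return f"claim:{NORMALIZATION_VERSION}:{language}:{digest}"
```

Because the normalization version is part of the key, entries written under a future version cannot collide with ``v1norm1`` entries.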

6) Report generation (locked)

``report.md`` MUST be generated from ``result.json`` using a deterministic template (server-side).
The LLM MUST NOT be the sole renderer of report markdown.
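
A minimal sketch of deterministic rendering, assuming a hand-written template in code. The report layout and the function name ``render_report`` are hypothetical; only the field names come from the ``result.json`` schema.

```python
def render_report(result: dict) -> str:
    """Render report markdown from a result.json dict.

    Pure function: the same input always yields the same output,
    and no LLM call is involved.
    """
    lines = [f"# Report for job {result['job_id']}", ""]

    assessment = result["article_assessment"]
    lines += ["## Article assessment", "", assessment["summary"], ""]

    lines += ["## Claims", ""]
    for analysis in result["claim_analyses"]:
        verdict = analysis["claim_verdict"]
        lines.append(f"- {analysis['claim_hash']}: {verdict['verdict_label']}")
        for bullet in verdict["rationale_bullets"]:
            lines.append(f"  - {bullet}")

    return "\n".join(lines) + "\n"
```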

7) No chain-of-thought storage/exposure (locked)

Store/expose only short, user-visible rationale bullets.
Never store or expose internal reasoning traces.

REST API (POC1)

1) API Versioning

Base path: ``/v1``

2) Content types

Requests: ``application/json``
JSON responses: ``application/json``
Report download: ``text/markdown; charset=utf-8``
SSE events: ``text/event-stream``

3) Time & IDs

All timestamps: ISO 8601 UTC (e.g., ``2025-12-24T19:31:30Z``)
``job_id``: ULID (string)
``claim_hash``: sha256 hex string (lowercase) of the canonical claim text; the normalization version and language are carried in the cache key namespace (see Codegen Contract, section 5)

4) Authentication

All ``/v1`` endpoints require:

Header: ``Authorization: Bearer <API_KEY>``
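
A minimal client-side sketch that attaches the bearer header to a ``POST /v1/analyze`` request using only the standard library. The function name ``build_analyze_request`` and the base URL passed in are hypothetical; the request is built but not sent.

```python
import json
import urllib.request


def build_analyze_request(base_url: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST /v1/analyze request."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/analyze",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to ``urllib.request.urlopen`` would send it.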

5) Error Envelope (all non-2xx)

{
  "error": {
    "code": "CACHE_MISS | VALIDATION_ERROR | UNAUTHORIZED | FORBIDDEN | NOT_FOUND | RATE_LIMITED | UPSTREAM_FETCH_ERROR | INTERNAL_ERROR",
    "message": "Human readable message",
    "details": {
      "field_errors": [
        {"field": "input_url", "issue": "Must be a valid https URL"}
      ]
    }
  }
}
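
A small helper sketch for producing this envelope on the server side. The name ``error_envelope`` is hypothetical, and defaulting ``details`` to an empty object is an assumption.

```python
from typing import Any, Optional

# Error codes listed in the envelope definition.
ERROR_CODES = (
    "CACHE_MISS", "VALIDATION_ERROR", "UNAUTHORIZED", "FORBIDDEN",
    "NOT_FOUND", "RATE_LIMITED", "UPSTREAM_FETCH_ERROR", "INTERNAL_ERROR",
)


def error_envelope(code: str, message: str,
                   details: Optional[dict[str, Any]] = None) -> dict:
    """Build the body returned with every non-2xx response."""
    if code not in ERROR_CODES:
        raise ValueError(f"unknown error code: {code}")
    # Assumption: an absent `details` is serialized as an empty object.
    return {"error": {"code": code, "message": message, "details": details or {}}}
```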

Endpoints

1) POST /v1/analyze

Creates an asynchronous job. Exactly one of ``input_url`` or ``input_text`` MUST be provided.

Request (AnalyzeRequest)
{
  "input_url": "https://example.org/article",
  "input_text": null,
  "options": {
    "max_claims": 5,
    "cache_preference": "prefer_cache",
    "browsing": "on",
    "output_report": true
  },
  "client": {
    "request_id": "optional-idempotency-key"
  }
}

Options

``max_claims``: integer, 1..50 (default 5)
``cache_preference``: ``prefer_cache`` | ``allow_partial`` | ``cache_only`` | ``skip_cache``
- ``prefer_cache``: use cache when available; otherwise run full pipeline
- ``allow_partial``: if Stage 2 cached analyses exist, server MAY skip Stage 1+2 and rerun only Stage 3 using cached analyses
- ``cache_only``: do not run Stage 2 for uncached claims; fail with CACHE_MISS (402) if required claim analyses are missing
- ``skip_cache``: ignore cache and recompute
``browsing``: ``on`` | ``off``
- ``off``: do not retrieve evidence; mark evidence items as ``NEEDS_RETRIEVAL`` and output retrieval queries
``output_report``: boolean (default true)
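
The request constraints above can be checked with a sketch like the following. The function name ``validate_analyze_request`` and the wording of the issue strings are hypothetical; the output matches the ``field_errors`` shape of the error envelope.

```python
CACHE_PREFERENCES = ("prefer_cache", "allow_partial", "cache_only", "skip_cache")
BROWSING_MODES = ("on", "off")


def validate_analyze_request(req: dict) -> list:
    """Return a list of field_errors; an empty list means the request is valid."""
    errors = []

    # Exactly one of input_url / input_text must be provided (non-null).
    has_url = req.get("input_url") is not None
    has_text = req.get("input_text") is not None
    if has_url == has_text:
        errors.append({"field": "input_url",
                       "issue": "Exactly one of input_url or input_text must be provided"})

    options = req.get("options") or {}

    max_claims = options.get("max_claims", 5)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_claims, bool) or not isinstance(max_claims, int) \
            or not 1 <= max_claims <= 50:
        errors.append({"field": "options.max_claims",
                       "issue": "Must be an integer in 1..50"})

    if options.get("cache_preference", "prefer_cache") not in CACHE_PREFERENCES:
        errors.append({"field": "options.cache_preference",
                       "issue": "Unknown cache_preference"})

    if options.get("browsing", "on") not in BROWSING_MODES:
        errors.append({"field": "options.browsing", "issue": "Must be on or off"})

    return errors
```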

Response: 202 Accepted (JobCreated)
{
  "job_id": "01J8Y9K6M2Q1J0JZ7E5P8H7Y9C",
  "status": "QUEUED",
  "created_at": "2025-12-24T19:30:00Z",
  "estimated_cost": {
    "credits": 420,
    "explain": "Estimate may change after Stage 1 (claim count)."
  },
  "links": {
    "self": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C",
    "events": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C/events",
    "result": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C/result",
    "report": "/v1/jobs/01J8Y9K6M2Q1J0JZ7E5P8H7Y9C/report"
  }
}

Cache-only failure: 402 (CACHE_MISS)
{
  "error": {
    "code": "CACHE_MISS",
    "message": "cache_only requested but a required claim analysis is not cached.",
    "details": {
      "missing_claim_hash": "9f2a...c01",
      "normalization_version": "v1norm1"
    }
  }
}

2) GET /v1/jobs/{job_id}

Returns job status and progress.

Response: 200 OK (Job)
{
  "job_id": "01J...",
  "status": "RUNNING",
  "created_at": "2025-12-24T19:30:00Z",
  "updated_at": "2025-12-24T19:31:12Z",
  "progress": {
    "stage": "STAGE2_CLAIM_ANALYSIS",
    "stage_progress": 0.4,
    "message": "Analyzing claim 3/8"
  },
  "links": {
    "events": "/v1/jobs/01J.../events",
    "result": "/v1/jobs/01J.../result",
    "report": "/v1/jobs/01J.../report"
  }
}

Statuses:

``QUEUED`` → ``RUNNING`` → ``SUCCEEDED`` | ``FAILED`` | ``CANCELED``
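
A sketch of the status lifecycle as a transition table. It follows the arrows above literally; whether a ``QUEUED`` job may move directly to ``CANCELED`` is not stated, so that transition is left out here as an assumption. The helper names are hypothetical.

```python
# Allowed next statuses for each job status.
ALLOWED_TRANSITIONS = {
    "QUEUED": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "CANCELED"},
    "SUCCEEDED": set(),
    "FAILED": set(),
    "CANCELED": set(),
}


def can_transition(current: str, new: str) -> bool:
    """True if a job may move from `current` to `new`."""
    return new in ALLOWED_TRANSITIONS[current]


def is_terminal(status: str) -> bool:
    """True for statuses with no outgoing transitions."""
    return not ALLOWED_TRANSITIONS[status]
```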

Stages:

``STAGE1_CLAIM_EXTRACT``
``STAGE2_CLAIM_ANALYSIS``
``STAGE3_ARTICLE_ASSESSMENT``

3) GET /v1/jobs/{job_id}/events

Server-Sent Events (SSE) for job progress reporting (no token streaming).

Event types (minimum):

``job.created``
``stage.started``
``stage.progress``
``stage.completed``
``job.succeeded``
``job.failed``
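
A minimal sketch of serializing one of these events in the standard SSE wire format (an ``event:`` line, a ``data:`` line, then a blank line). The payload fields are not defined by this spec; the ones used in the usage note are assumptions borrowed from the ``progress`` object of the Job response.

```python
import json


def format_sse_event(event_type: str, data: dict) -> str:
    """Serialize one SSE frame with a JSON payload on a single data line."""
    # Compact JSON never contains raw newlines, so one `data:` line suffices.
    payload = json.dumps(data, separators=(",", ":"))
    return f"event: {event_type}\ndata: {payload}\n\n"
```

For example, ``format_sse_event("stage.progress", {"stage": "STAGE2_CLAIM_ANALYSIS", "stage_progress": 0.4})`` yields a frame that a browser ``EventSource`` can dispatch to a ``stage.progress`` listener.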

4) GET /v1/jobs/{job_id}/result

Returns JSON result.

``200`` if job ``SUCCEEDED``
``409`` if not finished

5) GET /v1/jobs/{job_id}/report

Returns ``text/markdown`` report.

``200`` if job ``SUCCEEDED``
``409`` if not finished
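
Both artifact endpoints share the same readiness rule, which can be sketched as below. The function name ``artifact_response`` is hypothetical. The spec does not say what to return for ``FAILED`` or ``CANCELED`` jobs; this sketch treats them like unfinished jobs (``409``), which is an assumption.

```python
def artifact_response(job_status: str, artifact: str) -> tuple:
    """Return (http_status, content_type) for /result or /report.

    `artifact` is "result" or "report".
    """
    if job_status == "SUCCEEDED":
        if artifact == "result":
            return 200, "application/json"
        return 200, "text/markdown; charset=utf-8"
    # Not finished (or, by assumption, FAILED/CANCELED): error envelope as JSON.
    return 409, "application/json"
```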

6) DELETE /v1/jobs/{job_id}

Cancels a job (best-effort) and deletes stored artifacts if supported.

``204`` success
``404`` if not found

7) GET /v1/health

Response: 200 OK
{
  "status": "ok",
  "service": "factharbor-poc1",
  "version": "v0.9.105",
  "time": "2025-12-24T19:31:30Z"
}

Idempotency (Required)

Clients SHOULD send:

Header: ``Idempotency-Key: <string>`` (preferred)
or
Body: ``client.request_id``

Server behavior:

Same key + same request body → return existing job (``200``) with:
- ``idempotent=true``
- ``original_request_at``
Same key + different request body → ``409`` with ``VALIDATION_ERROR`` and mismatch details

Idempotency TTL: 24 hours (matches minimum job retention)
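
The server behavior above can be sketched with an in-memory store. The class name ``IdempotencyStore``, the body fingerprint (sha256 over key-sorted JSON), and the injected clock are assumptions for illustration; a real deployment would more likely keep this state in the Redis cache with a TTL.

```python
import hashlib
import json
import time

IDEMPOTENCY_TTL_SECONDS = 24 * 60 * 60  # 24 hours


class IdempotencyStore:
    def __init__(self, now=time.time):
        self._entries = {}  # key -> (body_hash, job_id, created_at)
        self._now = now

    @staticmethod
    def _body_hash(body: dict) -> str:
        # Sort keys so semantically equal bodies hash identically.
        canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def check(self, key: str, body: dict, new_job_id: str) -> tuple:
        """Return (http_status, job_id) for a request carrying `key`."""
        now = self._now()
        digest = self._body_hash(body)
        entry = self._entries.get(key)
        if entry is not None and now - entry[2] < IDEMPOTENCY_TTL_SECONDS:
            if entry[0] == digest:
                return 200, entry[1]   # same key + same body: existing job
            return 409, None           # same key + different body: mismatch
        # Unknown or expired key: register a new job.
        self._entries[key] = (digest, new_job_id, now)
        return 202, new_job_id
```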

Schemas (result.json)

Top-level AnalysisResult

{
  "job_id": "01J...",
  "input": {
    "source_type": "url|text",
    "source": "https://example.org/article",
    "language": "en",
    "retrieved_at_utc": "2025-12-24T19:30:12Z",
    "extraction": {
      "method": "jina|trafilatura|manual",
      "word_count": 1234
    }
  },
  "claim_extraction": {
    "normalization_version": "v1norm1",
    "claims": [
      {
        "claim_hash": "sha256hex...",
        "claim_text": "as found in source",
        "canonical_claim_text": "normalized claim",
        "confidence": 0.72
      }
    ]
  },
  "claim_analyses": [
    {
      "claim_hash": "sha256hex...",
      "claim_verdict": {
        "verdict_label": "Supported|Refuted|Inconclusive",
        "confidence": 0.63,
        "rationale_bullets": [
          "Short bullet 1",
          "Short bullet 2"
        ]
      },
      "scenarios": [
        {
          "scenario_id": "01J...ULID",
          "scenario_title": "Interpretation / boundary",
          "definitions": { "term": "definition" },
          "assumptions": ["..."],
          "boundaries": { "time": "...", "geography": "...", "population": "...", "conditions": "..." },
          "retrieval_plan": {
            "queries": [
              {"q": "support query", "purpose": "support"},
              {"q": "counter query", "purpose": "counter"}
            ]
          },
          "evidence": [
            {
              "evidence_id": "01J...ULID",
              "stance": "supports|undermines|mixed|context_dependent",
              "relevance": 0.0,
              "summary_bullets": ["..."],
              "citation": {
                "title": "Source title",
                "publisher": "Publisher",
                "author_or_org": "Org/Author",
                "publication_date": "YYYY-MM-DD",
                "url": "https://...",
                "retrieved_at_utc": "2025-12-24T19:31:01Z"
              },
              "excerpt": "optional short quote (max 25 words)",
              "reliability_rating": "high|medium|low",
              "limitations": ["..."],
              "retrieval_status": "OK|NEEDS_RETRIEVAL|FAILED"
            }
          ],
          "verdict": {
            "verdict_label": "Highly likely|Likely|Unclear|Unlikely|Highly unlikely|Unsubstantiated",
            "probability_range": [0.0, 1.0],
            "confidence": 0.0,
            "rationale_bullets": ["..."],
            "key_supporting_evidence_ids": ["E..."],
            "key_counter_evidence_ids": ["E..."],
            "uncertainty_factors": ["..."],
            "what_would_change_my_mind": ["..."]
          }
        }
      ],
      "quality_gates": {
        "gate1_claim_validation": "pass|partial|fail",
        "gate2_contradiction_search": "pass|partial|fail",
        "gate3_uncertainty_disclosure": "pass|partial|fail",
        "gate4_verdict_confidence": "pass|partial|fail",
        "fail_reasons": []
      }
    }
  ],
  "article_assessment": {
    "main_thesis": "string",
    "thesis_support": "supported|challenged|mixed|unclear",
    "overall_reasoning_quality": "high|medium|low",
    "summary": "string",
    "key_risks": [
      "missing evidence",
      "cherry-picking",
      "correlation/causation",
      "time window mismatch"
    ],
    "how_claims_connect_to_thesis": [
      "Short bullets connecting claim analyses to the thesis assessment"
    ]
  },
  "global_notes": {
    "limitations": ["..."],
    "policy_notes": []
  }
}

Mandatory counter-evidence rule (POC1)

For each scenario, attempt to include at least one evidence item with:

``stance`` ∈ {``undermines``, ``mixed``, ``context_dependent``}

If not found:

include explicit “not found despite targeted search” note in ``uncertainty_factors``
and/or include evidence items with ``retrieval_status=FAILED`` indicating the attempted search
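
A sketch of how a validator might check this rule on a single scenario object. The function name, the three return values, and the substring match on the uncertainty note are assumptions; the spec does not fix the exact wording of the note.

```python
COUNTER_STANCES = ("undermines", "mixed", "context_dependent")


def counter_evidence_status(scenario: dict) -> str:
    """Classify a scenario as 'found', 'documented_gap', or 'missing'."""
    evidence = scenario.get("evidence", [])
    if any(item.get("stance") in COUNTER_STANCES for item in evidence):
        return "found"

    # No counter-evidence: look for an explicit record of the attempt.
    factors = scenario.get("verdict", {}).get("uncertainty_factors", [])
    # Assumption: the note contains this phrase (case-insensitive).
    noted = any("not found despite targeted search" in f.lower() for f in factors)
    failed_search = any(item.get("retrieval_status") == "FAILED" for item in evidence)

    return "documented_gap" if noted or failed_search else "missing"
```

A result of ``missing`` would be a natural input to ``gate2_contradiction_search`` in ``quality_gates``, though that wiring is not specified here.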

URL Extraction Rules (POC1)

Primary extraction: Jina Reader (if enabled)
Fallback: Trafilatura (or equivalent)

SSRF protections (required):

Block local IPs, metadata IPs, file:// URLs, and internal hostnames
If blocked: return ``UPSTREAM_FETCH_ERROR`` with reason
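
A minimal sketch of the pre-fetch check using the standard library. It inspects only the literal URL; a production check would also resolve DNS and re-check every resolved address, and repeat the check on redirects. The function name, the list of internal hostname suffixes, and allowing both ``http`` and ``https`` are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: hostname suffixes treated as internal.
INTERNAL_SUFFIXES = (".internal", ".local", ".localhost")


def is_blocked_url(url: str) -> bool:
    """True if the URL must not be fetched under the SSRF rules."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return True  # malformed URL
    if parts.scheme not in ("http", "https"):
        return True  # blocks file:// and other schemes
    host = (parts.hostname or "").lower()
    if not host or host == "localhost" or host.endswith(INTERNAL_SUFFIXES):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not an IP literal; DNS check not shown
    # Link-local covers the 169.254.169.254 cloud metadata address.
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_unspecified)
```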

Copyright/ToS safe storage policy (POC1):

Store only metadata + short excerpts
Excerpts: max 25 words per quote, and cap total quotes per source
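
The excerpt policy can be enforced with a sketch like this. Words are split on whitespace, which is an assumption, and the per-source quote cap is passed in as a parameter because the spec does not give a number. The function names are hypothetical.

```python
MAX_EXCERPT_WORDS = 25


def clip_excerpt(text: str, max_words: int = MAX_EXCERPT_WORDS) -> str:
    """Truncate a quote to at most `max_words` whitespace-separated words."""
    words = text.split()
    return " ".join(words[:max_words])


def select_excerpts(quotes: list, max_quotes_per_source: int) -> list:
    """Keep at most `max_quotes_per_source` quotes, each clipped to the word cap."""
    return [clip_excerpt(q) for q in quotes[:max_quotes_per_source]]
```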

Cost Control Knobs (POC1 defaults)

Defaults:

``max_claims = 5``
``scenarios_per_claim = 2..3`` (internal Stage 2 policy)
Cap evidence items per scenario (recommended: 6 total; at least 1 counter)
Keep rationales concise (bullets)

Minimal OpenAPI 3.1 (POC1)

openapi: 3.1.0
info:
  title: FactHarbor POC1 API
  version: 0.9.105
paths:
  /v1/analyze:
    post:
      summary: Create analysis job
      parameters:
        - in: header
          name: Idempotency-Key
          required: false
          schema: { type: string }
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/AnalyzeRequest'
      responses:
        '202':
          description: Accepted
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/JobCreated'
        '4XX':
          description: Error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/jobs/{job_id}:
    get:
      summary: Get job status
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/Job'
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
    delete:
      summary: Cancel job (best-effort) and delete artifacts
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '204': { description: No Content }
        '404':
          description: Not Found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/jobs/{job_id}/events:
    get:
      summary: Job progress via SSE
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: text/event-stream
  /v1/jobs/{job_id}/result:
    get:
      summary: Get final JSON result
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/AnalysisResult'
        '409':
          description: Not ready
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/jobs/{job_id}/report:
    get:
      summary: Download report (markdown)
      parameters:
        - in: path
          name: job_id
          required: true
          schema: { type: string }
      responses:
        '200':
          description: text/markdown
        '409':
          description: Not ready
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ErrorEnvelope'
  /v1/health:
    get:
      summary: Health check
      responses:
        '200':
          description: OK
          content:
            application/json:
              schema:
                type: object
                properties:
                  status: { type: string }
components:
  schemas:
    AnalyzeRequest:
      type: object
      required: [options]
      properties:
        input_url: { type: ['string', 'null'] }
        input_text: { type: ['string', 'null'] }
        options:
          $ref: '#/components/schemas/AnalyzeOptions'
        client:
          type: object
          properties:
            request_id: { type: string }
    AnalyzeOptions:
      type: object
      properties:
        max_claims: { type: integer, minimum: 1, maximum: 50, default: 5 }
        cache_preference:
          type: string
          enum: [prefer_cache, allow_partial, cache_only, skip_cache]
          default: prefer_cache
        browsing:
          type: string
          # Quoted: YAML 1.1 parsers read bare on/off as booleans.
          enum: ['on', 'off']
          default: 'on'
        output_report: { type: boolean, default: true }
    JobCreated:
      type: object
      required: [job_id, status, created_at, links]
      properties:
        job_id: { type: string }
        status: { type: string }
        created_at: { type: string }
        estimated_cost:
          type: object
          properties:
            credits: { type: integer }
            explain: { type: string }
        links:
          type: object
          properties:
            self: { type: string }
            events: { type: string }
            result: { type: string }
            report: { type: string }
    Job:
      type: object
      required: [job_id, status, created_at, updated_at, links]
      properties:
        job_id: { type: string }
        status: { type: string, enum: [QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELED] }
        progress:
          type: object
          properties:
            stage: { type: string }
            stage_progress: { type: number, minimum: 0, maximum: 1 }
            message: { type: string }
        created_at: { type: string }
        updated_at: { type: string }
        links:
          type: object
          properties:
            events: { type: string }
            result: { type: string }
            report: { type: string }
    AnalysisResult:
      type: object
      required: [job_id, claim_extraction, claim_analyses, article_assessment]
      properties:
        job_id: { type: string }
        claim_extraction:
          type: object
          properties:
            normalization_version: { type: string }
            claims:
              type: array
              items:
                type: object
                properties:
                  claim_hash: { type: string }
                  claim_text: { type: string }
                  canonical_claim_text: { type: string }
                  confidence: { type: number }
        claim_analyses:
          type: array
          items:
            type: object
            properties:
              claim_hash: { type: string }
              scenarios:
                type: array
                items:
                  type: object
                  properties:
                    scenario_id: { type: string }
                    scenario_title: { type: string }
                    verdict:
                      type: object
                      properties:
                        verdict_label: { type: string }
                        confidence: { type: number }
                        rationale_bullets:
                          type: array
                          items: { type: string }
        article_assessment:
          type: object
          properties:
            overall_reasoning_quality: { type: string, enum: [high, medium, low] }
            summary: { type: string }
            key_risks:
              type: array
              items: { type: string }
    ErrorEnvelope:
      type: object
      properties:
        error:
          type: object
          properties:
            code: { type: string }
            message: { type: string }
            details: { type: object }

End of page.