Last modified by Robert Schaub on 2025/12/24 21:53

1 = POC1 API & Schemas Specification =
2
3 ----
4
5 == Version History ==
6
7 |=Version|=Date|=Changes
8 |0.4.1|2025-12-24|Applied 9 critical fixes: file format notice, verdict taxonomy, canonicalization algorithm, Stage 1 cost policy, BullMQ fix, language in cache key, historical claims TTL, idempotency, copyright policy
9 |0.4|2025-12-24|**BREAKING:** 3-stage pipeline with claim-level caching, user tier system, cache-only mode for free users, Redis cache architecture
10 |0.3.1|2025-12-24|Fixed single-prompt strategy, SSE clarification, schema canonicalization, cost constraints
11 |0.3|2025-12-24|Added complete API endpoints, LLM config, risk tiers, scraping details
12
13 ----
14
15 == POC1 Codegen Contract (Canonical) ==
16
17 {{info}}
18 This section is the **authoritative, code-generation-ready contract** for POC1.
19 If any other page conflicts with this section, **this section wins**.
20 {{/info}}
21
22 === Canonical outputs ===
23 * **result.json**: schema-validated, machine-readable output
24 * **report.md**: deterministic template rendering from ``result.json`` (LLM must not free-write the final report)
25
26 === Locked enums ===
27 **Scenario verdict** (``ScenarioVerdict.verdict_label``):
28 * ``Highly likely`` | ``Likely`` | ``Unclear`` | ``Unlikely`` | ``Highly unlikely`` | ``Unsubstantiated``
29
30 **Claim verdict** (``ClaimVerdict.verdict_label``):
31 * ``Supported`` | ``Refuted`` | ``Inconclusive``
32
33 **Mapping rule (summary):**
34 * Primary-interpretation scenario:
35 ** ``Highly likely`` / ``Likely`` ⇒ ``Supported``
36 ** ``Highly unlikely`` / ``Unlikely`` ⇒ ``Refuted``
37 ** ``Unclear`` / ``Unsubstantiated`` ⇒ ``Inconclusive``
38 * If scenarios materially disagree (assumption-dependent outcomes) ⇒ ``Inconclusive`` (explain why)
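The mapping rule above can be sketched as a lookup table plus a disagreement check (non-normative sketch; the locked enums remain authoritative, and ``claim_verdict`` is an illustrative helper name):

```python
# Non-normative sketch of the locked scenario -> claim verdict mapping.
SCENARIO_TO_CLAIM = {
    "Highly likely": "Supported",
    "Likely": "Supported",
    "Highly unlikely": "Refuted",
    "Unlikely": "Refuted",
    "Unclear": "Inconclusive",
    "Unsubstantiated": "Inconclusive",
}

def claim_verdict(primary_scenario: str, all_scenarios: list) -> str:
    """Follow the primary interpretation unless scenarios materially
    disagree (i.e. both Supported and Refuted outcomes are present)."""
    mapped = {SCENARIO_TO_CLAIM[s] for s in all_scenarios}
    if {"Supported", "Refuted"} <= mapped:
        return "Inconclusive"
    return SCENARIO_TO_CLAIM[primary_scenario]
```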
39
40 === Deterministic claim normalization (cache key) ===
41 * Normalization version: ``v1norm1``
42 * Cache namespace: ``claim:v1norm1:{language}:{sha256(canonical_claim_text)}``
43 * Normative reference implementation is defined in section **5.1.1** (no ellipses; must match exactly).
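Assuming the canonicalization from section 5.1.1 has already been applied, key assembly is mechanical (illustrative sketch; ``claim_cache_key`` is a hypothetical helper name):

```python
import hashlib

def claim_cache_key(language: str, canonical_claim_text: str) -> str:
    """Assemble the cache key for an already-canonicalized claim.
    Canonicalization itself is defined normatively in section 5.1.1;
    this sketch only builds claim:v1norm1:{language}:{sha256 hex digest}."""
    digest = hashlib.sha256(canonical_claim_text.encode("utf-8")).hexdigest()
    return f"claim:v1norm1:{language}:{digest}"
```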
44
45 === Idempotency ===
46 Clients SHOULD send:
47 * Header: ``Idempotency-Key: <client-generated-uuid>`` (preferred)
48 or
49 * Body: ``client.request_id``
50
51 Server rules:
52 * Same key + same request body ⇒ return existing job (``200``) and include ``idempotent=true``.
53 * Same key + different request body ⇒ ``409`` ``VALIDATION_ERROR``.
54
55 Idempotency TTL: 24 hours.
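A minimal sketch of these server rules, assuming an in-memory store keyed by idempotency key (all names here are illustrative, not part of the contract):

```python
import hashlib
import json
import time

IDEMPOTENCY_TTL = 24 * 3600  # 24-hour TTL, in seconds

_idempotency_store = {}  # key -> {"body_hash", "job", "expires_at"}

def handle_idempotent_create(key, request_body, create_job):
    """Return (status_code, payload) per the idempotency rules above."""
    body_hash = hashlib.sha256(
        json.dumps(request_body, sort_keys=True).encode("utf-8")
    ).hexdigest()
    entry = _idempotency_store.get(key)
    if entry and entry["expires_at"] > time.time():
        if entry["body_hash"] == body_hash:
            # Same key + same body: return the existing job
            return 200, {**entry["job"], "idempotent": True}
        # Same key + different body: reject
        return 409, {"error": {"code": "VALIDATION_ERROR",
                               "message": "Idempotency key reused with different body"}}
    job = create_job(request_body)
    _idempotency_store[key] = {
        "body_hash": body_hash,
        "job": job,
        "expires_at": time.time() + IDEMPOTENCY_TTL,
    }
    return 202, job
```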
56
57 === Minimal OpenAPI 3.1 (authoritative for codegen) ===
58 {{code language="yaml"}}
59 openapi: 3.1.0
60 info:
61 title: FactHarbor POC1 API
62 version: 0.9.106
63 servers:
64 - url: /
65 paths:
66 /v1/analyze:
67 post:
68 summary: Create analysis job
69 parameters:
70 - in: header
71 name: Authorization
72 required: true
73 schema: { type: string }
74 - in: header
75 name: Idempotency-Key
76 required: false
77 schema: { type: string }
78 requestBody:
79 required: true
80 content:
81 application/json:
82 schema:
83 $ref: '#/components/schemas/AnalyzeRequest'
84 responses:
85 '202':
86 description: Accepted
87 content:
88 application/json:
89 schema:
90 $ref: '#/components/schemas/JobCreated'
91 '4XX':
92 description: Error
93 content:
94 application/json:
95 schema:
96 $ref: '#/components/schemas/ErrorEnvelope'
97 /v1/jobs/{job_id}:
98 get:
99 summary: Get job status
100 parameters:
101 - in: path
102 name: job_id
103 required: true
104 schema: { type: string }
105 - in: header
106 name: Authorization
107 required: true
108 schema: { type: string }
109 responses:
110 '200':
111 description: OK
112 content:
113 application/json:
114 schema:
115 $ref: '#/components/schemas/Job'
116 '404':
117 description: Not Found
118 content:
119 application/json:
120 schema:
121 $ref: '#/components/schemas/ErrorEnvelope'
122 delete:
123 summary: Cancel job (best-effort) and delete artifacts
124 parameters:
125 - in: path
126 name: job_id
127 required: true
128 schema: { type: string }
129 - in: header
130 name: Authorization
131 required: true
132 schema: { type: string }
133 responses:
134 '204': { description: No Content }
135 '404':
136 description: Not Found
137 content:
138 application/json:
139 schema:
140 $ref: '#/components/schemas/ErrorEnvelope'
141 /v1/jobs/{job_id}/events:
142 get:
143 summary: Job progress via SSE (no token streaming)
144 parameters:
145 - in: path
146 name: job_id
147 required: true
148 schema: { type: string }
149 - in: header
150 name: Authorization
151 required: true
152 schema: { type: string }
153 responses:
154 '200':
155 description: text/event-stream
156 /v1/jobs/{job_id}/result:
157 get:
158 summary: Get final JSON result
159 parameters:
160 - in: path
161 name: job_id
162 required: true
163 schema: { type: string }
164 - in: header
165 name: Authorization
166 required: true
167 schema: { type: string }
168 responses:
169 '200':
170 description: OK
171 content:
172 application/json:
173 schema:
174 $ref: '#/components/schemas/AnalysisResult'
175 '409':
176 description: Not ready
177 content:
178 application/json:
179 schema:
180 $ref: '#/components/schemas/ErrorEnvelope'
181 /v1/jobs/{job_id}/report:
182 get:
183 summary: Download report (markdown)
184 parameters:
185 - in: path
186 name: job_id
187 required: true
188 schema: { type: string }
189 - in: header
190 name: Authorization
191 required: true
192 schema: { type: string }
193 responses:
194 '200':
195 description: text/markdown
196 '409':
197 description: Not ready
198 content:
199 application/json:
200 schema:
201 $ref: '#/components/schemas/ErrorEnvelope'
202 /v1/health:
203 get:
204 summary: Health check
205 responses:
206 '200':
207 description: OK
208 components:
209 schemas:
210 AnalyzeRequest:
211 type: object
212 properties:
213 input_url: { type: ['string', 'null'] }
214 input_text: { type: ['string', 'null'] }
215 options:
216 type: object
217 properties:
218 max_claims: { type: integer, minimum: 1, maximum: 50, default: 5 }
219 cache_preference:
220 type: string
221 enum: [prefer_cache, allow_partial, cache_only, skip_cache]
222 default: prefer_cache
223 browsing:
224 type: string
225 enum: [on, off]
226 default: on
227 output_report: { type: boolean, default: true }
228 client:
229 type: object
230 properties:
231 request_id: { type: string }
232 JobCreated:
233 type: object
234 required: [job_id, status, created_at, links]
235 properties:
236 job_id: { type: string }
237 status: { type: string }
238 created_at: { type: string }
239 links:
240 type: object
241 properties:
242 self: { type: string }
243 events: { type: string }
244 result: { type: string }
245 report: { type: string }
246 Job:
247 type: object
248 required: [job_id, status, created_at, updated_at]
249 properties:
250 job_id: { type: string }
251 status:
252 type: string
253 enum: [QUEUED, RUNNING, SUCCEEDED, FAILED, CANCELED]
254 created_at: { type: string }
255 updated_at: { type: string }
256 AnalysisResult:
257 type: object
258 properties:
259 job_id: { type: string }
260 ErrorEnvelope:
261 type: object
262 properties:
263 error:
264 type: object
265 properties:
266 code: { type: string }
267 message: { type: string }
268 details: { type: object }
269 {{/code}}
270
271 ----
272
273 == 1. Core Objective (POC1) ==
274
275 The primary technical goal of POC1 is to validate **Approach 1 (Single-Pass Holistic Analysis)** while implementing **claim-level caching** to achieve cost sustainability.
276
277 The system must prove that AI can identify an article's **Main Thesis** and determine if supporting claims logically support that thesis without committing fallacies.
278
279 === Success Criteria: ===
280
281 * Test with 30 diverse articles
282 * Target: ≥70% accuracy detecting misleading articles
283 * Cost: <$0.25 per NEW analysis (uncached)
284 * Cost: $0.00 for cached claim reuse
285 * Cache hit rate: ≥50% after 1,000 articles
286 * Processing time: <2 minutes (standard depth)
287
288 === Economic Model: ===
289
290 * **Free tier:** $10 credit per month (~~40-140 articles depending on cache hits)
291 * **After limit:** Cache-only mode (instant, free access to cached claims)
292 * **Paid tier:** Unlimited new analyses
293
294 ----
295
296 == 2. Architecture Overview ==
297
298 === 2.1 3-Stage Pipeline with Caching ===
299
300 FactHarbor POC1 uses a **3-stage architecture** designed for claim-level caching and cost efficiency:
301
302 {{mermaid}}
303 graph TD
304 A[Article Input] --> B[Stage 1: Extract Claims]
305 B --> C{For Each Claim}
306 C --> D[Check Cache]
307 D -->|Cache HIT| E[Return Cached Verdict]
308 D -->|Cache MISS| F[Stage 2: Analyze Claim]
309 F --> G[Store in Cache]
310 G --> E
311 E --> H[Stage 3: Holistic Assessment]
312 H --> I[Final Report]
313 {{/mermaid}}
314
315 ==== Stage 1: Claim Extraction (FAST model, no cache) ====
316
317 * **Input:** Article text
318 * **Output:** up to ``max_claims`` canonical claims (default 5; normalized, deduplicated)
319 * **Model:** Provider-default FAST model (default, configurable via LLM abstraction layer)
320 * **Cost:** $0.003 per article
321 * **Cache strategy:** No caching (article-specific)
322
323 ==== Stage 2: Claim Analysis (REASONING model, CACHED) ====
324
325 * **Input:** Single canonical claim
326 * **Output:** Scenarios + Evidence + Verdicts
327 * **Model:** Provider-default REASONING model (default, configurable via LLM abstraction layer)
328 * **Cost:** $0.081 per NEW claim
329 * **Cache strategy:** Redis, 90-day TTL
330 * **Cache key:** claim:v1norm1:{language}:{sha256(canonical_claim_text)}
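The Stage 2 cache interaction might look like the following sketch, where ``analyze_fn`` stands in for the REASONING-model call and ``cache`` is any client exposing Redis-style ``get``/``setex`` (names are illustrative):

```python
import hashlib
import json

CLAIM_TTL_SECONDS = 90 * 24 * 3600  # 90-day TTL

def analyze_claim_cached(cache, language, canonical_claim, analyze_fn):
    """Return (verdict, cache_hit) for one canonical claim."""
    digest = hashlib.sha256(canonical_claim.encode("utf-8")).hexdigest()
    key = f"claim:v1norm1:{language}:{digest}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached), True    # cache HIT: $0, no LLM call
    verdict = analyze_fn(canonical_claim)  # cache MISS: Stage 2 LLM call
    cache.setex(key, CLAIM_TTL_SECONDS, json.dumps(verdict))
    return verdict, False
```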
331
332 ==== Stage 3: Holistic Assessment (REASONING model, no cache) ====
333
334 * **Input:** Article + Claim verdicts (from cache or Stage 2)
335 * **Output:** Article verdict + Fallacies + Logic quality
336 * **Model:** Provider-default REASONING model (default, configurable via LLM abstraction layer)
337 * **Cost:** $0.030 per article
338 * **Cache strategy:** No caching (article-specific)
339
340
341
342 **Note:** Stage 3 implements **Approach 1 (Single-Pass Holistic Analysis)** from the [[Article Verdict Problem>>Test.FactHarbor.Specification.POC.Article-Verdict-Problem]]. While claim analysis (Stage 2) is cached for efficiency, the holistic assessment maintains the integrated evaluation philosophy of Approach 1.
343
344 === Total Cost Formula: ===
345
346 {{{Cost = $0.003 (extraction) + (N_new_claims × $0.081) + $0.030 (holistic)
347
348 Examples:
349 - 0 new claims (100% cache hit): $0.033
350 - 1 new claim (80% cache hit): $0.114
351 - 3 new claims (40% cache hit): $0.276
352 - 5 new claims (0% cache hit): $0.438
353 }}}
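The formula can be checked with a one-line helper (constants are the POC1 defaults above and would change with model pricing):

```python
def analysis_cost(new_claims: int,
                  extraction: float = 0.003,
                  per_new_claim: float = 0.081,
                  holistic: float = 0.030) -> float:
    """Cost in USD for one article with `new_claims` uncached claims."""
    return round(extraction + new_claims * per_new_claim + holistic, 3)
```

``analysis_cost(0)`` reproduces the 100% cache-hit case, ``analysis_cost(5)`` the 0% case.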
354
355 ----
356
357 === 2.2 User Tier System ===
358
359 |=Tier|=Monthly Credit|=After Limit|=Cache Access|=Analytics
360 |**Free**|$10|Cache-only mode|✅ Full|Basic
361 |**Pro** (future)|$50|Continues|✅ Full|Advanced
362 |**Enterprise** (future)|Custom|Continues|✅ Full + Priority|Full
363
364 **Free Tier Economics:**
365
366 * $10 credit = 40-140 articles analyzed (depending on cache hit rate)
367 * Average 70 articles/month at 70% cache hit rate
368 * After limit: Cache-only mode
369
370 ----
371
372 === 2.3 Cache-Only Mode (Free Tier Feature) ===
373
374 When free users reach their $10 monthly limit, they enter **Cache-Only Mode**:
375
376
377
378 ==== Stage 3: Holistic Assessment - Complete Specification ====
379
380 ===== 3.3.1 Overview =====
381
382 **Purpose:** Synthesize individual claim analyses into an overall article assessment, identifying logical fallacies, reasoning quality, and publication readiness.
383
384 **Approach:** **Single-Pass Holistic Analysis** (Approach 1 from Comparison Matrix)
385
386 **Why This Approach for POC1:**
387 * ✅ **1 API call** (vs 2 for Two-Pass or Judge)
388 * ✅ **Low cost** ($0.030 per article)
389 * ✅ **Fast** (4-6 seconds)
390 * ✅ **Low complexity** (simple implementation)
391 * ⚠️ **Medium reliability** (acceptable for POC1, will improve in POC2/Production)
392
393 **Alternative Approaches Considered:**
394
395 |= Approach |= API Calls |= Cost |= Speed |= Complexity |= Reliability |= Best For
396 | **1. Single-Pass** ⭐ | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | **POC1**
397 | 2. Two-Pass | 2 | 💰💰 Med | 🐢 Slow | 🟡 Med | ✅ High | POC2/Prod
398 | 3. Structured | 1 | 💰 Low | ⚡ Fast | 🟡 Med | ✅ High | POC1 (alternative)
399 | 4. Weighted | 1 | 💰 Low | ⚡ Fast | 🟢 Low | ⚠️ Medium | POC1 (alternative)
400 | 5. Heuristics | 1 | 💰 Lowest | ⚡⚡ Fastest | 🟡 Med | ⚠️ Medium | Any
401 | 6. Hybrid | 1 | 💰 Low | ⚡ Fast | 🔴 Med-High | ✅ High | POC2
402 | 7. Judge | 2 | 💰💰 Med | 🐢 Slow | 🟡 Med | ✅ High | Production
403
404 **POC1 Choice:** Approach 1 (Single-Pass) for speed and simplicity. Will upgrade to Approach 2 (Two-Pass) or 6 (Hybrid) in POC2 for higher reliability.
405
406 ===== 3.3.2 What Stage 3 Evaluates =====
407
408 Stage 3 performs **integrated holistic analysis** considering:
409
410 **1. Claim-Level Aggregation:**
411 * Verdict distribution (how many TRUE vs FALSE vs DISPUTED)
412 * Average confidence across all claims
413 * Claim interdependencies (do claims support/contradict each other?)
414 * Critical claim identification (which claims are most important?)
415
416 **2. Contextual Factors:**
417 * **Source credibility**: Is the article from a reputable publisher?
418 * **Author expertise**: Does the author have relevant credentials?
419 * **Publication date**: Is information current or outdated?
420 * **Claim coherence**: Do claims form a logical narrative?
421 * **Missing context**: Are important caveats or qualifications missing?
422
423 **3. Logical Fallacies:**
424 * **Cherry-picking**: Selective evidence presentation
425 * **False equivalence**: Treating unequal things as equal
426 * **Straw man**: Misrepresenting opposing arguments
427 * **Ad hominem**: Attacking person instead of argument
428 * **Slippery slope**: Assuming extreme consequences without justification
429 * **Circular reasoning**: Conclusion assumes premise
430 * **False dichotomy**: Presenting only two options when more exist
431
432 **4. Reasoning Quality:**
433 * **Evidence strength**: Quality and quantity of supporting evidence
434 * **Logical coherence**: Arguments follow logically
435 * **Transparency**: Assumptions and limitations acknowledged
436 * **Nuance**: Complexity and uncertainty appropriately addressed
437
438 **5. Publication Readiness:**
439 * **Risk tier assignment**: A (high risk), B (medium), or C (low risk)
440 * **Publication mode**: DRAFT_ONLY, AI_GENERATED, or HUMAN_REVIEWED
441 * **Required disclaimers**: What warnings should accompany this content?
442
443 ===== 3.3.3 Implementation: Single-Pass Approach =====
444
445 **Input:**
446 * Original article text (full content)
447 * Stage 2 claim analyses (array of ClaimAnalysis objects)
448 * Article metadata (URL, title, author, date, source)
449
450 **Processing:**
451
452 {{code language="python"}}
453 # Pseudo-code for Stage 3 (Single-Pass)
454
455 def stage3_holistic_assessment(article, claim_analyses, metadata):
456 """
457 Single-pass holistic assessment using Provider-default REASONING model.
458
459 Approach 1: One comprehensive prompt that asks the LLM to:
460 1. Review all claim verdicts
461 2. Identify patterns and dependencies
462 3. Detect logical fallacies
463 4. Assess reasoning quality
464 5. Determine credibility score and risk tier
465 6. Generate publication recommendations
466 """
467
468 # Construct comprehensive prompt
469 prompt = f"""
470 You are analyzing an article for factual accuracy and logical reasoning.
471
472 ARTICLE METADATA:
473 - Title: {metadata['title']}
474 - Source: {metadata['source']}
475 - Date: {metadata['date']}
476 - Author: {metadata['author']}
477
478 ARTICLE TEXT:
479 {article}
480
481 INDIVIDUAL CLAIM ANALYSES:
482 {format_claim_analyses(claim_analyses)}
483
484 YOUR TASK:
485 Perform a holistic assessment considering:
486
487 1. CLAIM AGGREGATION:
488 - Review the verdict for each claim
489 - Identify any interdependencies between claims
490 - Determine which claims are most critical to the article's thesis
491
492 2. CONTEXTUAL EVALUATION:
493 - Assess source credibility
494 - Evaluate author expertise
495 - Consider publication timeliness
496 - Identify missing context or important caveats
497
498 3. LOGICAL FALLACIES:
499 - Identify any logical fallacies present
500 - For each fallacy, provide:
501 * Type of fallacy
502 * Where it occurs in the article
503 * Why it's problematic
504 * Severity (minor/moderate/severe)
505
506 4. REASONING QUALITY:
507 - Evaluate evidence strength
508 - Assess logical coherence
509 - Check for transparency in assumptions
510 - Evaluate handling of nuance and uncertainty
511
512 5. CREDIBILITY SCORING:
513 - Calculate overall credibility score (0.0-1.0)
514 - Assign risk tier:
515 * A (high risk): ≤0.5 credibility OR severe fallacies
516 * B (medium risk): >0.5 and ≤0.8 credibility OR moderate issues
517 * C (low risk): >0.8 credibility AND no significant issues
518
519 6. PUBLICATION RECOMMENDATIONS:
520 - Determine publication mode:
521 * DRAFT_ONLY: Tier A, multiple severe issues
522 * AI_GENERATED: Tier B/C, acceptable quality with disclaimers
523 * HUMAN_REVIEWED: Complex or borderline cases
524 - List required disclaimers
525 - Explain decision rationale
526
527 OUTPUT FORMAT:
528 Return a JSON object matching the ArticleAssessment schema.
529 """
530
531 # Call LLM
532 response = llm_client.complete(
533 model=config.stage3_model,  # provider-default REASONING model (configurable)
534 prompt=prompt,
535 max_tokens=4000,
536 response_format="json"
537 )
538
539 # Parse and validate response
540 assessment = parse_json(response.content)
541 validate_article_assessment_schema(assessment)
542
543 return assessment
544 {{/code}}
545
546 **Prompt Engineering Notes:**
547
548 1. **Structured Instructions**: Break down task into 6 clear sections
549 2. **Context-Rich**: Provide article + all claim analyses + metadata
550 3. **Explicit Criteria**: Define credibility scoring and risk tiers precisely
551 4. **JSON Schema**: Request structured output matching ArticleAssessment schema
552 5. **Examples** (in production): Include 2-3 example assessments for consistency
553
554 ===== 3.3.4 Credibility Scoring Algorithm =====
555
556 **Base Score Calculation:**
557
558 {{code language="python"}}
559 def calculate_credibility_score(claim_analyses, fallacies, contextual_factors):
560 """
561 Calculate overall credibility score (0.0-1.0).
562
563 This is a GUIDELINE for the LLM, not strict code.
564 The LLM has flexibility to adjust based on context.
565 """
566
567 # 1. Claim Verdict Score (60% weight)
568 verdict_weights = {
569 "TRUE": 1.0,
570 "PARTIALLY_TRUE": 0.7,
571 "DISPUTED": 0.5,
572 "UNSUPPORTED": 0.3,
573 "FALSE": 0.0,
574 "UNVERIFIABLE": 0.4
575 }
576
577 claim_scores = [
578 verdict_weights[c.verdict.label] * c.verdict.confidence
579 for c in claim_analyses
580 ]
581 avg_claim_score = sum(claim_scores) / len(claim_scores)
582 claim_component = avg_claim_score * 0.6
583
584 # 2. Fallacy Penalty (20% weight)
585 fallacy_penalties = {
586 "minor": -0.05,
587 "moderate": -0.15,
588 "severe": -0.30
589 }
590
591 fallacy_score = 1.0
592 for fallacy in fallacies:
593 fallacy_score += fallacy_penalties[fallacy.severity]
594
595 fallacy_score = max(0.0, min(1.0, fallacy_score))
596 fallacy_component = fallacy_score * 0.2
597
598 # 3. Contextual Factors (20% weight)
599 context_adjustments = {
600 "source_credibility": {"positive": +0.1, "neutral": 0, "negative": -0.1},
601 "author_expertise": {"positive": +0.1, "neutral": 0, "negative": -0.1},
602 "timeliness": {"positive": +0.05, "neutral": 0, "negative": -0.05},
603 "transparency": {"positive": +0.05, "neutral": 0, "negative": -0.05}
604 }
605
606 context_score = 1.0
607 for factor in contextual_factors:
608 adjustment = context_adjustments.get(factor.factor, {}).get(factor.impact, 0)
609 context_score += adjustment
610
611 context_score = max(0.0, min(1.0, context_score))
612 context_component = context_score * 0.2
613
614 # 4. Combine components
615 final_score = claim_component + fallacy_component + context_component
616
617 # 5. Apply confidence modifier
618 avg_confidence = sum(c.verdict.confidence for c in claim_analyses) / len(claim_analyses)
619 final_score = final_score * (0.8 + 0.2 * avg_confidence)
620
621 return max(0.0, min(1.0, final_score))
622 {{/code}}
623
624 **Note:** This algorithm is a **guideline** provided to the LLM in the system prompt. The LLM has flexibility to adjust based on specific article context, but should generally follow this structure for consistency.
625
626 ===== 3.3.5 Risk Tier Assignment =====
627
628 **Automatic Risk Tier Rules:**
629
630 {{code}}
631 Risk Tier A (High Risk - Requires Review):
632 - Credibility score ≤ 0.5, OR
633 - Any severe fallacies detected, OR
634 - Multiple (3+) moderate fallacies, OR
635 - 50%+ of claims are FALSE or UNSUPPORTED
636
637 Risk Tier B (Medium Risk - May Publish with Disclaimers):
638 - Credibility score >0.5 and ≤0.8, OR
639 - 1-2 moderate fallacies, OR
640 - 20-49% of claims are DISPUTED or PARTIALLY_TRUE
641
642 Risk Tier C (Low Risk - Safe to Publish):
643 - Credibility score > 0.8, AND
644 - No severe or moderate fallacies, AND
645 - <20% disputed/problematic claims, AND
646 - No critical missing context
647 {{/code}}
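These rules can be expressed as a guard chain, most restrictive tier first (a sketch of the guidance given to the LLM, not production scoring code; parameter names are illustrative):

```python
def assign_risk_tier(score: float,
                     fallacy_severities: list,
                     frac_false_or_unsupported: float = 0.0,
                     frac_disputed: float = 0.0,
                     missing_critical_context: bool = False) -> str:
    """Apply the automatic risk tier rules, tier A checks first."""
    severe = fallacy_severities.count("severe")
    moderate = fallacy_severities.count("moderate")
    if (score <= 0.5 or severe >= 1 or moderate >= 3
            or frac_false_or_unsupported >= 0.5):
        return "A"
    if (score <= 0.8 or moderate >= 1 or frac_disputed >= 0.2
            or missing_critical_context):
        return "B"
    return "C"
```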
648
649 ===== 3.3.6 Output: ArticleAssessment Schema =====
650
651 (See Stage 3 Output Schema section above for complete JSON schema)
652
653 ===== 3.3.7 Performance Metrics =====
654
655 **POC1 Targets:**
656 * **Processing time**: 4-6 seconds per article
657 * **Cost**: $0.030 per article (REASONING-model tokens)
658 * **Quality**: 70-80% agreement with human reviewers (acceptable for POC)
659 * **API calls**: 1 per article
660
661 **Future Improvements (POC2/Production):**
662 * Upgrade to Two-Pass (Approach 2): +15% accuracy, +$0.020 cost
663 * Add human review sampling: 10% of Tier B articles
664 * Implement Judge approach (Approach 7) for Tier A: Highest quality
665
666 ===== 3.3.8 Example Stage 3 Execution =====
667
668 **Input:**
669 * Article: "Biden won the 2020 election"
670 * Claim analyses: [{claim: "Biden won", verdict: "TRUE", confidence: 0.95}]
671
672 **Stage 3 Processing:**
673 1. Analyzes single claim with high confidence
674 2. Checks for contextual factors (source credibility)
675 3. Searches for logical fallacies (none found)
676 4. Calculates credibility: 0.6 * 0.95 + 0.2 * 1.0 + 0.2 * 1.0 = 0.97 (confidence modifier omitted in this example)
677 5. Assigns risk tier: C (low risk)
678 6. Recommends: AI_GENERATED publication mode
679
680 **Output:**
681 {{code language="json"}}
682 {
683 "article_id": "a1",
684 "overall_assessment": {
685 "credibility_score": 0.97,
686 "risk_tier": "C",
687 "summary": "Article makes single verifiable claim with strong evidence support",
688 "confidence": 0.95
689 },
690 "claim_aggregation": {
691 "total_claims": 1,
692 "verdict_distribution": {"TRUE": 1},
693 "avg_confidence": 0.95
694 },
695 "contextual_factors": [
696 {"factor": "source_credibility", "impact": "positive", "description": "Reputable news source"}
697 ],
698 "recommendations": {
699 "publication_mode": "AI_GENERATED",
700 "requires_review": false,
701 "suggested_disclaimers": []
702 }
703 }
704 {{/code}}
705
706 ==== What Cache-Only Mode Provides: ====
707
708 ✅ **Claim Extraction (Platform-Funded):**
709
710 * Stage 1 extraction runs at $0.003 per article
711 * **Cost: Absorbed by platform** (not charged to user credit)
712 * Rationale: Extraction is necessary to check cache, and cost is negligible
713 * Rate limit: Max 50 extractions/day in cache-only mode (prevents abuse)
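The daily cap could be enforced with a per-user, per-day counter (sketch assuming a Redis-style client with ``incr``/``expire``; the key naming is illustrative):

```python
import datetime

DAILY_EXTRACTION_LIMIT = 50

def allow_extraction(cache, user_id: str, limit: int = DAILY_EXTRACTION_LIMIT) -> bool:
    """Increment a per-user, per-day counter; first hit sets a 24h expiry."""
    today = datetime.date.today().isoformat()
    key = f"ratelimit:extract:{user_id}:{today}"
    count = cache.incr(key)
    if count == 1:
        cache.expire(key, 86400)
    return count <= limit
```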
714
715 ✅ **Instant Access to Cached Claims:**
716
717 * Any claim that exists in cache → Full verdict returned
718 * Cost: $0 (no LLM calls)
719 * Response time: <100ms
720
721 ✅ **Partial Article Analysis:**
722
723 * Check each claim against cache
724 * Return verdicts for ALL cached claims
725 * For uncached claims: Return "status": "cache_miss"
726
727 ✅ **Cache Coverage Report:**
728
729 * "3 of 5 claims available in cache (60% coverage)"
730 * Links to cached analyses
731 * Estimated cost to complete: $0.162 (2 new claims)
732
733 ❌ **Not Available in Cache-Only Mode:**
734
735 * New claim analysis (Stage 2 LLM calls blocked)
736 * Full holistic assessment (Stage 3 blocked if any claims missing)
737
738 ==== User Experience Example: ====
739
740 {{{{
741 "status": "cache_only_mode",
742 "message": "Monthly credit limit reached. Showing cached results only.",
743 "cache_coverage": {
744 "claims_total": 5,
745 "claims_cached": 3,
746 "claims_missing": 2,
747 "coverage_percent": 60
748 },
749 "cached_claims": [
750 {"claim_id": "C1", "verdict": "Supported", "confidence": 0.82},
751 {"claim_id": "C2", "verdict": "Supported", "confidence": 0.91},
752 {"claim_id": "C4", "verdict": "Inconclusive", "confidence": 0.55}
753 ],
754 "missing_claims": [
755 {"claim_id": "C3", "claim_text": "...", "estimated_cost": "$0.081"},
756 {"claim_id": "C5", "claim_text": "...", "estimated_cost": "$0.081"}
757 ],
758 "upgrade_options": {
759 "top_up": "$5 for 20-70 more articles",
760 "pro_tier": "$50/month unlimited"
761 }
762 }
763 }}}
764
765 **Design Rationale:**
766
767 * Free users still get value (cached claims often answer their question)
768 * Demonstrates FactHarbor's value (partial results encourage upgrade)
769 * Sustainable for platform (no additional cost)
770 * Fair to all users (everyone contributes to cache)
771
772 ----
773
774
775
776 == 6. LLM Abstraction Layer ==
777
778 === 6.1 Design Principle ===
779
780 **FactHarbor uses provider-agnostic LLM abstraction** to avoid vendor lock-in and enable:
781
782 * **Provider switching:** Change LLM providers without code changes
783 * **Cost optimization:** Use different providers for different stages
784 * **Resilience:** Automatic fallback if primary provider fails
785 * **Cross-checking:** Compare outputs from multiple providers
786 * **A/B testing:** Test new models without deployment changes
787
788 **Implementation:** All LLM calls go through an abstraction layer that routes to configured providers.
789
790 ----
791
792 === 6.2 LLM Provider Interface ===
793
794 **Abstract Interface:**
795
796 {{{
797 interface LLMProvider {
798 // Core methods
799 complete(prompt: string, options: CompletionOptions): Promise<CompletionResponse>
800 stream(prompt: string, options: CompletionOptions): AsyncIterator<StreamChunk>
801
802 // Provider metadata
803 getName(): string
804 getMaxTokens(): number
805 getCostPer1kTokens(): { input: number, output: number }
806
807 // Health check
808 isAvailable(): Promise<boolean>
809 }
810
811 interface CompletionOptions {
812 model?: string
813 maxTokens?: number
814 temperature?: number
815 stopSequences?: string[]
816 systemPrompt?: string
817 }
818 }}}
819
820 ----
821
822 === 6.3 Supported Providers (POC1) ===
823
824 **Primary Provider (Default):**
825
826 * **Anthropic Claude API**
827 * Models (examples; not normative): Provider-default FAST model, Provider-default REASONING model, Provider-default HEAVY model (optional)
828 * Used by default in POC1
829 * Best quality for holistic analysis
830
831 **Secondary Providers (Future):**
832
833 * **OpenAI API**
834 * Models: GPT-4o, GPT-4o-mini
835 * For cost comparison
836
837 * **Google Vertex AI**
838 * Models: Gemini 1.5 Pro, Gemini 1.5 Flash
839 * For diversity in evidence gathering
840
841 * **Local Models** (Post-POC)
842 * Models: Llama 3.1, Mistral
843 * For privacy-sensitive deployments
844
845 ----
846
847 === 6.4 Provider Configuration ===
848
849 **Environment Variables:**
850
851 {{{
852 # Primary provider
853 LLM_PRIMARY_PROVIDER=anthropic
854 ANTHROPIC_API_KEY=sk-ant-...
855
856 # Fallback provider
857 LLM_FALLBACK_PROVIDER=openai
858 OPENAI_API_KEY=sk-...
859
860 # Provider selection per stage
861 LLM_STAGE1_PROVIDER=anthropic
862 LLM_STAGE1_MODEL=claude-haiku-4-5-20251001
863 LLM_STAGE2_PROVIDER=anthropic
864 LLM_STAGE2_MODEL=claude-sonnet-4-5-20250929
865 LLM_STAGE3_PROVIDER=anthropic
866 LLM_STAGE3_MODEL=claude-sonnet-4-5-20250929
867
868 # Cost limits
869 LLM_MAX_COST_PER_REQUEST=1.00
870 }}}
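A loader for these per-stage variables might look like this (sketch; it falls back to ``LLM_PRIMARY_PROVIDER`` when a stage-specific variable is unset):

```python
import os

def stage_llm_config(stage: str) -> dict:
    """Resolve provider/model for one stage from environment variables."""
    prefix = f"LLM_{stage.upper()}"
    return {
        "provider": os.environ.get(
            f"{prefix}_PROVIDER",
            os.environ.get("LLM_PRIMARY_PROVIDER", "anthropic"),
        ),
        "model": os.environ.get(f"{prefix}_MODEL"),
    }
```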
871
872 **Database Configuration (Alternative):**
873
874 {{{{
875 {
876 "providers": [
877 {
878 "name": "anthropic",
879 "api_key_ref": "vault://anthropic-api-key",
880 "enabled": true,
881 "priority": 1
882 },
883 {
884 "name": "openai",
885 "api_key_ref": "vault://openai-api-key",
886 "enabled": true,
887 "priority": 2
888 }
889 ],
890 "stage_config": {
891 "stage1": {
892 "provider": "anthropic",
893 "model": "claude-haiku-4-5-20251001",
894 "max_tokens": 4096,
895 "temperature": 0.0
896 },
897 "stage2": {
898 "provider": "anthropic",
899 "model": "claude-sonnet-4-5-20250929",
900 "max_tokens": 16384,
901 "temperature": 0.3
902 },
903 "stage3": {
904 "provider": "anthropic",
905 "model": "claude-sonnet-4-5-20250929",
906 "max_tokens": 8192,
907 "temperature": 0.2
908 }
909 }
910 }
911 }}}
912
913 ----
914
915 === 6.5 Stage-Specific Models (POC1 Defaults) ===
916
917 **Stage 1: Claim Extraction**
918
919 * **Default:** Anthropic Provider-default FAST model
920 * **Alternative:** OpenAI GPT-4o-mini, Google Gemini 1.5 Flash
921 * **Rationale:** Fast, cheap, simple task
922 * **Cost:** ~$0.003 per article
923
924 **Stage 2: Claim Analysis** (CACHEABLE)
925
926 * **Default:** Anthropic Provider-default REASONING model
927 * **Alternative:** OpenAI GPT-4o, Google Gemini 1.5 Pro
928 * **Rationale:** High-quality analysis, cached 90 days
929 * **Cost:** ~$0.081 per NEW claim
930
931 **Stage 3: Holistic Assessment**
932
933 * **Default:** Anthropic Provider-default REASONING model
934 * **Alternative:** OpenAI GPT-4o; optionally Provider-default HEAVY model for high-stakes content
935 * **Rationale:** Complex reasoning, logical fallacy detection
936 * **Cost:** ~$0.030 per article
937
938 **Cost Comparison (Example):**
939
940 |=Stage|=Anthropic (Default)|=OpenAI Alternative|=Google Alternative
941 |Stage 1|Provider-default FAST model ($0.003)|GPT-4o-mini ($0.002)|Gemini Flash ($0.002)
942 |Stage 2|Provider-default REASONING model ($0.081)|GPT-4o ($0.045)|Gemini Pro ($0.050)
943 |Stage 3|Provider-default REASONING model ($0.030)|GPT-4o ($0.018)|Gemini Pro ($0.020)
944 |**Total (0% cache)**|**$0.114**|**$0.065**|**$0.072**
945
946 **Note:** POC1 uses Anthropic exclusively for consistency. Multi-provider support planned for POC2.
947
948 ----
949
950 === 6.6 Failover Strategy ===
951
952 **Automatic Failover:**
953
954 {{{
955 async function completeLLM(stage: string, prompt: string): Promise<string> {
956 const primaryProvider = getProviderForStage(stage)
957 const fallbackProvider = getFallbackProvider()
958
959 try {
960 return await primaryProvider.complete(prompt)
961 } catch (error) {
962 if (error.type === 'rate_limit' || error.type === 'service_unavailable') {
963 logger.warn(`Primary provider failed, using fallback`)
964 return await fallbackProvider.complete(prompt)
965 }
966 throw error
967 }
968 }
969 }}}
970
971 **Fallback Priority:**
972
973 1. **Primary:** Configured provider for stage
974 2. **Secondary:** Fallback provider (if configured)
975 3. **Cache:** Return cached result (if available for Stage 2)
976 4. **Error:** Return 503 Service Unavailable
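The four-level priority can be sketched end to end (illustrative; the exception type and the cache hook are assumptions, not part of the contract):

```python
class ProviderUnavailable(Exception):
    """Raised on rate limits or provider outages."""

def complete_with_fallback(prompt, primary, fallback=None, cached_result=None):
    """Try primary, then fallback, then cache; otherwise signal 503."""
    for provider in (p for p in (primary, fallback) if p is not None):
        try:
            return provider(prompt)
        except ProviderUnavailable:
            continue  # move to the next priority level
    if cached_result is not None:
        return cached_result  # Stage 2 only: serve the cached verdict
    raise ProviderUnavailable("503 Service Unavailable")
```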
977
978 ----
979
980 === 6.7 Provider Selection API ===
981
982 **Admin Endpoint:** POST /admin/v1/llm/configure
983
984 **Update provider for specific stage:**
985
986 {{{{
987 {
988 "stage": "stage2",
989 "provider": "openai",
990 "model": "gpt-4o",
991 "max_tokens": 16384,
992 "temperature": 0.3
993 }
994 }}}
995
996 **Response:** 200 OK
997
998 {{{{
999 {
1000 "message": "LLM configuration updated",
1001 "stage": "stage2",
1002 "previous": {
1003 "provider": "anthropic",
1004 "model": "claude-sonnet-4-5-20250929"
1005 },
1006 "current": {
1007 "provider": "openai",
1008 "model": "gpt-4o"
1009 },
1010 "cost_impact": {
1011 "previous_cost_per_claim": 0.081,
1012 "new_cost_per_claim": 0.045,
1013 "savings_percent": 44
1014 }
1015 }
1016 }}}
1017
1018 **Get current configuration:**
1019
1020 GET /admin/v1/llm/config
1021
1022 {{{{
1023 {
1024 "providers": ["anthropic", "openai"],
1025 "primary": "anthropic",
1026 "fallback": "openai",
1027 "stages": {
1028 "stage1": {
1029 "provider": "anthropic",
1030 "model": "claude-haiku-4-5-20251001",
1031 "cost_per_request": 0.003
1032 },
1033 "stage2": {
1034 "provider": "anthropic",
1035 "model": "claude-sonnet-4-5-20250929",
1036 "cost_per_new_claim": 0.081
1037 },
1038 "stage3": {
1039 "provider": "anthropic",
1040 "model": "claude-sonnet-4-5-20250929",
1041 "cost_per_request": 0.030
1042 }
1043 }
1044 }
1045 }}}
1046
1047 ----
1048
1049 === 6.8 Implementation Notes ===
1050
1051 **Provider Adapter Pattern:**
1052
1053 {{{
1054 class AnthropicProvider implements LLMProvider {
1055 async complete(prompt: string, options: CompletionOptions) {
1056 const response = await anthropic.messages.create({
1057 model: options.model || 'claude-sonnet-4-5-20250929',
1058 max_tokens: options.maxTokens || 4096,
1059 messages: [{ role: 'user', content: prompt }],
1060 system: options.systemPrompt
1061 })
1062 return response.content[0].text
1063 }
1064 }
1065
1066 class OpenAIProvider implements LLMProvider {
1067 async complete(prompt: string, options: CompletionOptions) {
1068 const response = await openai.chat.completions.create({
1069 model: options.model || 'gpt-4o',
1070 max_tokens: options.maxTokens || 4096,
1071 messages: [
1072 { role: 'system', content: options.systemPrompt },
1073 { role: 'user', content: prompt }
1074 ]
1075 })
1076 return response.choices[0].message.content
1077 }
1078 }
1079 }}}
1080
1081 **Provider Registry:**
1082
1083 {{{
1084 const providers = new Map<string, LLMProvider>()
1085 providers.set('anthropic', new AnthropicProvider())
1086 providers.set('openai', new OpenAIProvider())
1087 providers.set('google', new GoogleProvider())
1088
1089 function getProvider(name: string): LLMProvider {
1090 return providers.get(name) ?? providers.get(config.primaryProvider)!
1091 }
1092 }}}
1093
1094 ----
1095
1096 == 3. REST API Contract ==
1097
1098 === 3.1 User Credit Tracking ===
1099
1100 **Endpoint:** GET /v1/user/credit
1101
1102 **Response:** 200 OK
1103
1104 {{{
{
1105 "user_id": "user_abc123",
1106 "tier": "free",
1107 "credit_limit": 10.00,
1108 "credit_used": 7.42,
1109 "credit_remaining": 2.58,
1110 "reset_date": "2025-02-01T00:00:00Z",
1111 "cache_only_mode": false,
1112 "usage_stats": {
1113 "articles_analyzed": 67,
1114 "claims_from_cache": 189,
1115 "claims_newly_analyzed": 113,
1116 "cache_hit_rate": 0.626
1117 }
1118 }
1119 }}}
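
The derived fields of this response follow directly from the raw counters. A small sketch of the arithmetic (the helper name is illustrative):

{{code language="python"}}
def credit_summary(credit_limit: float, credit_used: float,
                   claims_from_cache: int, claims_newly_analyzed: int) -> dict:
    """Recompute the derived fields of GET /v1/user/credit."""
    remaining = round(credit_limit - credit_used, 2)
    total_claims = claims_from_cache + claims_newly_analyzed
    return {
        "credit_remaining": remaining,
        "cache_only_mode": remaining <= 0.0,
        "cache_hit_rate": round(claims_from_cache / total_claims, 3),
    }

summary = credit_summary(10.00, 7.42, claims_from_cache=189, claims_newly_analyzed=113)
assert summary["credit_remaining"] == 2.58 and summary["cache_hit_rate"] == 0.626
{{/code}}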
1120
1121 ----
1122
1123
1124
1125 ==== Stage 2 Output Schema: ClaimAnalysis ====
1126
1127 **Complete schema for each claim's analysis result:**
1128
1129 {{code language="json"}}
1130 {
1131 "claim_id": "claim_abc123",
1132 "claim_text": "Biden won the 2020 election",
1133 "scenarios": [
1134 {
1135 "scenario_id": "scenario_1",
1136 "description": "Interpreting 'won' as Electoral College victory",
1137 "verdict": {
1138 "label": "TRUE",
1139 "confidence": 0.95,
1140 "explanation": "Joe Biden won 306 electoral votes vs Trump's 232"
1141 },
1142 "evidence": {
1143 "supporting": [
1144 {
1145 "text": "Biden certified with 306 electoral votes",
1146 "source_url": "https://www.archives.gov/electoral-college/2020",
1147 "source_title": "2020 Electoral College Results",
1148 "credibility_score": 0.98
1149 }
1150 ],
1151 "opposing": []
1152 }
1153 }
1154 ],
1155 "recommended_scenario": "scenario_1",
1156 "metadata": {
1157 "analysis_timestamp": "2024-12-24T18:00:00Z",
1158 "model_used": "claude-sonnet-4-5-20250929",
1159 "processing_time_seconds": 8.5
1160 }
1161 }
1162 {{/code}}
1163
1164 **Required Fields:**
1165 * **claim_id**: Unique identifier matching Stage 1 output
1166 * **claim_text**: The exact claim being analyzed
1167 * **scenarios**: Array of interpretation scenarios (minimum 1)
1168 ** **scenario_id**: Unique ID for this scenario
1169 ** **description**: Clear interpretation of the claim
1170 ** **verdict**: Verdict object with label, confidence, explanation
1171 ** **evidence**: Supporting and opposing evidence arrays
1172 * **recommended_scenario**: ID of the primary/recommended scenario
1173 * **metadata**: Processing metadata (timestamp, model, timing)
1174
1175 **Optional Fields:**
1176 * Additional context, warnings, or quality scores
1177
1178 **Minimum Viable Example:**
1179
1180 {{code language="json"}}
1181 {
1182 "claim_id": "c1",
1183 "claim_text": "The sky is blue",
1184 "scenarios": [{
1185 "scenario_id": "s1",
1186 "description": "Under clear daytime conditions",
1187 "verdict": {"label": "TRUE", "confidence": 0.99, "explanation": "Rayleigh scattering"},
1188 "evidence": {"supporting": [], "opposing": []}
1189 }],
1190 "recommended_scenario": "s1",
1191 "metadata": {"analysis_timestamp": "2024-12-24T18:00:00Z"}
1192 }
1193 {{/code}}
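
A minimal structural check against the required fields listed above can be sketched as follows (the validator name and error strings are illustrative, not part of the contract):

{{code language="python"}}
REQUIRED_CLAIM_FIELDS = {"claim_id", "claim_text", "scenarios",
                         "recommended_scenario", "metadata"}
REQUIRED_SCENARIO_FIELDS = {"scenario_id", "description", "verdict", "evidence"}

def validate_claim_analysis(doc: dict) -> list:
    """Return a list of problems; an empty list means the document passes."""
    problems = ["missing field: %s" % f for f in sorted(REQUIRED_CLAIM_FIELDS - set(doc))]
    scenarios = doc.get("scenarios") or []
    if not scenarios:
        problems.append("scenarios must contain at least one scenario")
    for i, scenario in enumerate(scenarios):
        problems += ["scenarios[%d] missing field: %s" % (i, f)
                     for f in sorted(REQUIRED_SCENARIO_FIELDS - set(scenario))]
    return problems

minimal = {
    "claim_id": "c1",
    "claim_text": "The sky is blue",
    "scenarios": [{"scenario_id": "s1", "description": "Under clear daytime conditions",
                   "verdict": {"label": "Highly likely", "confidence": 0.99,
                               "explanation": "Rayleigh scattering"},
                   "evidence": {"supporting": [], "opposing": []}}],
    "recommended_scenario": "s1",
    "metadata": {"analysis_timestamp": "2024-12-24T18:00:00Z"},
}
assert validate_claim_analysis(minimal) == []
{{/code}}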
1194
1195
1196
1197 ==== Stage 3 Output Schema: ArticleAssessment ====
1198
1199 **Complete schema for holistic article-level assessment:**
1200
1201 {{code language="json"}}
1202 {
1203 "article_id": "article_xyz789",
1204 "overall_assessment": {
1205 "credibility_score": 0.72,
1206 "risk_tier": "B",
1207 "summary": "Article contains mostly accurate claims with one disputed claim requiring expert review",
1208 "confidence": 0.85
1209 },
1210 "claim_aggregation": {
1211 "total_claims": 5,
1212 "verdict_distribution": {
1213 "TRUE": 3,
1214 "PARTIALLY_TRUE": 1,
1215 "DISPUTED": 1,
1216 "FALSE": 0,
1217 "UNSUPPORTED": 0,
1218 "UNVERIFIABLE": 0
1219 },
1220 "avg_confidence": 0.82
1221 },
1222 "contextual_factors": [
1223 {
1224 "factor": "Source credibility",
1225 "impact": "positive",
1226 "description": "Published by reputable news organization"
1227 },
1228 {
1229 "factor": "Claim interdependence",
1230 "impact": "neutral",
1231 "description": "Claims are independent; no logical chains"
1232 }
1233 ],
1234 "recommendations": {
1235 "publication_mode": "AI_GENERATED",
1236 "requires_review": false,
1237 "review_reason": null,
1238 "suggested_disclaimers": [
1239 "One claim (Claim 4) has conflicting expert opinions"
1240 ]
1241 },
1242 "metadata": {
1243 "holistic_timestamp": "2024-12-24T18:00:10Z",
1244 "model_used": "claude-sonnet-4-5-20250929",
1245 "processing_time_seconds": 4.2,
1246 "cache_used": false
1247 }
1248 }
1249 {{/code}}
1250
1251 **Required Fields:**
1252 * **article_id**: Unique identifier for this article
1253 * **overall_assessment**: Top-level assessment
1254 ** **credibility_score**: 0.0-1.0 composite score
1255 ** **risk_tier**: A, B, or C (per AKEL quality gates)
1256 ** **summary**: Human-readable assessment
1257 ** **confidence**: How confident the holistic assessment is
1258 * **claim_aggregation**: Statistics across all claims
1259 ** **total_claims**: Count of claims analyzed
1260 ** **verdict_distribution**: Count per verdict label
1261 ** **avg_confidence**: Average confidence across verdicts
1262 * **contextual_factors**: Array of contextual considerations
1263 * **recommendations**: Publication decision support
1264 ** **publication_mode**: DRAFT_ONLY, AI_GENERATED, or HUMAN_REVIEWED
1265 ** **requires_review**: Boolean flag
1266 ** **suggested_disclaimers**: Array of disclaimer texts
1267 * **metadata**: Processing metadata
1268
1269 **Minimum Viable Example:**
1270
1271 {{code language="json"}}
1272 {
1273 "article_id": "a1",
1274 "overall_assessment": {
1275 "credibility_score": 0.95,
1276 "risk_tier": "C",
1277 "summary": "All claims verified as true",
1278 "confidence": 0.98
1279 },
1280 "claim_aggregation": {
1281 "total_claims": 1,
1282 "verdict_distribution": {"TRUE": 1},
1283 "avg_confidence": 0.99
1284 },
1285 "contextual_factors": [],
1286 "recommendations": {
1287 "publication_mode": "AI_GENERATED",
1288 "requires_review": false,
1289 "suggested_disclaimers": []
1290 },
1291 "metadata": {"holistic_timestamp": "2024-12-24T18:00:00Z"}
1292 }
1293 {{/code}}
1294
1295 === 3.2 Create Analysis Job (3-Stage) ===
1296
1297 **Endpoint:** POST /v1/analyze
1298
1299 ==== Idempotency Support: ====
1300
1301 To prevent duplicate job creation on network retries, clients SHOULD include **either**:
1302
1303 * Header: ``Idempotency-Key: <client-generated-uuid>`` (preferred)
1304 * OR body: ``client.request_id``
1305
1306 **Example request (header):**
1307 {{code language="text"}}
1308 POST /v1/analyze
1309 Authorization: Bearer <API_KEY>
1310 Idempotency-Key: 0f3c6c0e-2d2b-4b4a-9d6f-1a1f6b0c9f7e
1311 Content-Type: application/json
1312 {{/code}}
1313
1314 **Example request (body):**
1315 {{code language="json"}}
1316 {
1317 "input_url": "https://example.org/article",
1318 "options": { "max_claims": 5, "cache_preference": "prefer_cache" },
1319 "client": { "request_id": "0f3c6c0e-2d2b-4b4a-9d6f-1a1f6b0c9f7e" }
1320 }
1321 {{/code}}
1322
1323 **Server behavior:**
1324 * Same idempotency key + same request body ⇒ return existing job (``200``) and include:
1325 ``idempotent=true`` and ``original_request_at``.
1326 * Same key + different body ⇒ ``409`` with ``VALIDATION_ERROR`` describing the mismatch.
1327
1328 **Idempotency TTL:** 24 hours (minimum).
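
The server behavior above can be sketched with an in-memory store (production would use Redis with the 24-hour TTL; all names and the job_id format here are illustrative):

{{code language="python"}}
import hashlib
import json
import time

_idempotency_store = {}  # key -> {fingerprint, response, created_at}
IDEMPOTENCY_TTL_SECONDS = 24 * 3600

def _fingerprint(body: dict) -> str:
    """Canonical fingerprint of the request body (sorted keys for determinism)."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def create_analysis_job(idempotency_key: str, body: dict):
    """Return (http_status, response_body) per the idempotency rules above."""
    entry = _idempotency_store.get(idempotency_key)
    now = time.time()
    if entry and now - entry["created_at"] < IDEMPOTENCY_TTL_SECONDS:
        if entry["fingerprint"] != _fingerprint(body):
            return 409, {"error": "VALIDATION_ERROR",
                         "message": "Idempotency key reused with a different request body"}
        return 200, dict(entry["response"], idempotent=True,
                         original_request_at=entry["created_at"])
    response = {"job_id": "01J-" + idempotency_key[:8], "status": "QUEUED"}
    _idempotency_store[idempotency_key] = {
        "fingerprint": _fingerprint(body), "response": response, "created_at": now}
    return 202, response
{{/code}}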
1329
1330 ==== Request Body: ====
1331
1332 {{{
1333 {
1334 "input_type": "url",
1335 "input_url": "https://example.com/medical-report-01",
1336 "input_text": null,
1337 "options": {
1338 "browsing": "on",
1339 "depth": "standard",
1340 "max_claims": 5,
1341 "cache_preference": "prefer_cache",
1342 "scenarios_per_claim": 2,
1343 "max_evidence_per_scenario": 6,
1344 "context_aware_analysis": true
1345 },
1346 "client": {
1347 "request_id": "optional-client-tracking-id",
1348 "source_label": "optional"
1349 }
1350 }
1351 }}}
1352
1353 **Options:**
1354
1355 * browsing: on | off (retrieve web sources or just output queries)
1356 * depth: standard | deep (evidence thoroughness)
1357 * max_claims: 1-10 (default: **5** for cost control)
1358 * scenarios_per_claim: 1-5 (default: **2** for cost control)
1359 * max_evidence_per_scenario: 3-10 (default: **6**)
1360 * context_aware_analysis: true | false (experimental)
1361 * cache_preference: prefer_cache | allow_partial | skip_cache (default: **prefer_cache**)
1362 ** {{code}}"prefer_cache"{{/code}}: Use full cache if available, otherwise run all stages
1363 ** {{code}}"allow_partial"{{/code}}: Use cached Stage 2 results if available, rerun only Stage 3
1364 ** {{code}}"skip_cache"{{/code}}: Always rerun all stages (ignore cache)
1365 ** When set to {{code}}"allow_partial"{{/code}} and Stage 2 cached results exist: Stages 1 & 2 are skipped, Stage 3 (holistic assessment) runs fresh with cached claim analyses, and the response includes {{code}}"cache_used": true{{/code}} and {{code}}"stages_cached": ["stage1", "stage2"]{{/code}}
1373
1374 **Response:** 202 Accepted
1375
1376 {{{
{
1377 "job_id": "01J...ULID",
1378 "status": "QUEUED",
1379 "created_at": "2025-12-24T10:31:00Z",
1380 "estimated_cost": 0.114,
1381 "cost_breakdown": {
1382 "stage1_extraction": 0.003,
1383 "stage2_new_claims": 0.081,
1384 "stage2_cached_claims": 0.000,
1385 "stage3_holistic": 0.030
1386 },
1387 "cache_info": {
1388 "claims_to_extract": 5,
1389 "estimated_cache_hits": 4,
1390 "estimated_new_claims": 1
1391 },
1392 "links": {
1393 "self": "/v1/jobs/01J...ULID",
1394 "result": "/v1/jobs/01J...ULID/result",
1395 "report": "/v1/jobs/01J...ULID/report",
1396 "events": "/v1/jobs/01J...ULID/events"
1397 }
1398 }
1399 }}}
1400
1401 **Error Responses:**
1402
1403 402 Payment Required - Free tier limit reached, cache-only mode
1404
1405 {{{
{
1406 "error": "credit_limit_reached",
1407 "message": "Monthly credit limit reached. Entering cache-only mode.",
1408 "cache_only_mode": true,
1409 "credit_remaining": 0.00,
1410 "reset_date": "2025-02-01T00:00:00Z",
1411 "action": "Resubmit with cache_preference=allow_partial for cached results"
1412 }
1413 }}}
1414
1415 ----
1416
1417 == 4. Data Schemas ==
1418
1419 === 4.1 Stage 1 Output: ClaimExtraction ===
1420
1421 {{{
{
1422 "job_id": "01J...ULID",
1423 "stage": "stage1_extraction",
1424 "article_metadata": {
1425 "title": "Article title",
1426 "source_url": "https://example.com/article",
1427 "extracted_text_length": 5234,
1428 "language": "en"
1429 },
1430 "claims": [
1431 {
1432 "claim_id": "C1",
1433 "claim_text": "Original claim text from article",
1434 "canonical_claim": "Normalized, deduplicated phrasing",
1435 "claim_hash": "sha256:abc123...",
1436 "is_central_to_thesis": true,
1437 "claim_type": "causal",
1438 "evaluability": "evaluable",
1439 "risk_tier": "B",
1440 "domain": "public_health"
1441 }
1442 ],
1443 "article_thesis": "Main argument detected",
1444 "cost": 0.003
1445 }
1446 }}}
1447
1448 ----
1449
1450 === 4.5 Verdict Label Taxonomy ===
1451
1452 FactHarbor uses **three distinct verdict taxonomies** depending on analysis level:
1453
1454 ==== 4.5.1 Scenario Verdict Labels (Stage 2) ====
1455
1456 Used for individual scenario verdicts within a claim.
1457
1458 **Enum Values:**
1459
1460 * Highly likely - Probability 0.85-1.0, high confidence
1461 * Likely - Probability 0.65-0.84, moderate-high confidence
1462 * Unclear - Probability 0.35-0.64, or low confidence
1463 * Unlikely - Probability 0.16-0.34, moderate-high confidence
1464 * Highly unlikely - Probability 0.0-0.15, high confidence
1465 * Unsubstantiated - Insufficient evidence to determine probability
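
These bands can be expressed as a small mapping function. A sketch only: inclusive lower bounds are an assumption, since the spec gives ranges but does not pin down edge-value behavior:

{{code language="python"}}
def scenario_verdict_label(probability: float, evidence_sufficient: bool = True) -> str:
    """Map an estimated probability to a Stage 2 scenario verdict label."""
    if not evidence_sufficient:
        return "Unsubstantiated"
    if probability >= 0.85:
        return "Highly likely"
    if probability >= 0.65:
        return "Likely"
    if probability >= 0.35:
        return "Unclear"
    if probability >= 0.16:
        return "Unlikely"
    return "Highly unlikely"

assert scenario_verdict_label(0.95) == "Highly likely"
assert scenario_verdict_label(0.50) == "Unclear"
assert scenario_verdict_label(0.70, evidence_sufficient=False) == "Unsubstantiated"
{{/code}}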
1466
1467 ==== 4.5.2 Claim Verdict Labels (Rollup) ====
1468
1469 Used when summarizing a claim across all scenarios.
1470
1471 **Enum Values:**
1472
1473 * Supported - Majority of scenarios are Likely or Highly likely
1474 * Refuted - Majority of scenarios are Unlikely or Highly unlikely
1475 * Inconclusive - Mixed scenarios or majority Unclear/Unsubstantiated
1476
1477 **Mapping Logic:**
1478
1479 * If ≥60% scenarios are (Highly likely | Likely) → Supported
1480 * If ≥60% scenarios are (Highly unlikely | Unlikely) → Refuted
1481 * Otherwise → Inconclusive
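
The mapping logic above can be sketched directly (assuming scenario labels use the locked enum casing):

{{code language="python"}}
SUPPORTING = {"Highly likely", "Likely"}
REFUTING = {"Highly unlikely", "Unlikely"}

def claim_rollup_verdict(scenario_labels: list) -> str:
    """Roll up scenario verdicts to a claim verdict using the >= 60% rule."""
    total = len(scenario_labels)
    if total == 0:
        return "Inconclusive"
    if sum(lbl in SUPPORTING for lbl in scenario_labels) / total >= 0.6:
        return "Supported"
    if sum(lbl in REFUTING for lbl in scenario_labels) / total >= 0.6:
        return "Refuted"
    return "Inconclusive"

assert claim_rollup_verdict(["Highly likely", "Likely"]) == "Supported"
assert claim_rollup_verdict(["Likely", "Unclear"]) == "Inconclusive"
assert claim_rollup_verdict(["Unlikely", "Highly unlikely", "Unclear"]) == "Refuted"
{{/code}}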
1482
1483 ==== 4.5.3 Article Verdict Labels (Stage 3) ====
1484
1485 Used for holistic article-level assessment.
1486
1487 **Enum Values:**
1488
1489 * WELL-SUPPORTED - Article thesis logically follows from supported claims
1490 * MISLEADING - Claims may be true but article commits logical fallacies
1491 * REFUTED - Central claims are refuted, invalidating thesis
1492 * UNCERTAIN - Insufficient evidence or highly mixed claim verdicts
1493
1494 **Note:** Article verdict considers **claim centrality** (central claims override supporting claims).
1495
1496 ==== 4.5.4 API Field Mapping ====
1497
1498 |=Level|=API Field|=Enum Name
1499 |Scenario|scenarios[].verdict.label|scenario_verdict_label
1500 |Claim|claims[].rollup_verdict (optional)|claim_verdict_label
1501 |Article|article_holistic_assessment.overall_verdict|article_verdict_label
1502
1503 ----
1504
1505 == 5. Cache Architecture ==
1506
1507 === 5.1 Redis Cache Design ===
1508
1509 **Technology:** Redis 7.0+ (in-memory key-value store)
1510
1511 **Cache Key Schema:**
1512
1513 {{{claim:v1norm1:{language}:{sha256(canonical_claim_text)}
1514 }}}
1515
1516 **Example:**
1517
1518 {{{Claim (English): "COVID vaccines are 95% effective"
1519 Canonical: "covid vaccines are 95 percent effective"
1520 Language: "en"
1521 SHA256: abc123...def456
1522 Key: claim:v1norm1:en:abc123...def456
1523 }}}
1524
1525 **Rationale:** Prevents cross-language collisions and enables per-language cache analytics.
1526
1527 **Data Structure:**
1528
1529 {{{SET claim:v1norm1:en:abc123...def456 '{...ClaimAnalysis JSON...}'
1530 EXPIRE claim:v1norm1:en:abc123...def456 7776000 # 90 days
1531 }}}
1532
1533 ----
1534
1535 === 5.1.1 Canonical Claim Normalization (v1norm1) ===
1536
1537 The cache key depends on deterministic claim normalization. **All implementations MUST follow this algorithm exactly.**
1538
1539 **Normalization version:** ``v1norm1``
1540
1541 **Algorithm (v1norm1):**
1542 1. Unicode normalize: NFD
1543 2. Lowercase
1544 3. Strip diacritics
1545 4. Normalize apostrophes: ``’`` and ``‘`` → ``'``
1546 5. Replace percent sign: ``%`` → `` percent``
1547 6. Collapse whitespace
1548 7. Remove punctuation **except apostrophes**
1549 8. Expand contractions (fixed list below)
1550 9. Remove remaining apostrophes
1551 10. Collapse whitespace again
1552
1553 {{code language="python"}}
1554 import re
1555 import unicodedata
1556
1557 # Canonical claim normalization for deduplication.
1558 # Version: v1norm1
1559 #
1560 # IMPORTANT:
1561 # - Any change to these rules REQUIRES a new normalization version.
1562 # - Cache keys MUST include the normalization version to avoid collisions.
1563
1564 CONTRACTIONS_V1NORM1 = {
1565 "don't": "do not",
1566 "doesn't": "does not",
1567 "didn't": "did not",
1568 "can't": "cannot",
1569 "won't": "will not",
1570 "shouldn't": "should not",
1571 "wouldn't": "would not",
1572 "isn't": "is not",
1573 "aren't": "are not",
1574 "wasn't": "was not",
1575 "weren't": "were not",
1576 "haven't": "have not",
1577 "hasn't": "has not",
1578 "hadn't": "had not",
1579 "it's": "it is",
1580 "that's": "that is",
1581 "there's": "there is",
1582 "i'm": "i am",
1583 "we're": "we are",
1584 "they're": "they are",
1585 "you're": "you are",
1586 "i've": "i have",
1587 "we've": "we have",
1588 "they've": "they have",
1589 "you've": "you have",
1590 "i'll": "i will",
1591 "we'll": "we will",
1592 "they'll": "they will",
1593 "you'll": "you will",
1594 }
1595
1596 def normalize_claim(text: str) -> str:
1597 if text is None:
1598 return ""
1599
1600 # 1) Unicode normalization (NFD)
1601 text = unicodedata.normalize("NFD", text)
1602
1603 # 2) Lowercase
1604 text = text.lower()
1605
1606 # 3) Strip diacritics
1607 text = "".join(c for c in text if unicodedata.category(c) != "Mn")
1608
1609 # 4) Normalize apostrophes
1610 text = text.replace("’", "'").replace("‘", "'")
1611
1612 # 5) Normalize percent sign
1613 text = text.replace("%", " percent")
1614
1615 # 6) Collapse whitespace
1616 text = re.sub(r"\s+", " ", text).strip()
1617
1618 # 7) Remove punctuation except apostrophes
1619 text = re.sub(r"[^\w\s']", "", text)
1620
1621 # 8) Expand contractions
1622 for k, v in CONTRACTIONS_V1NORM1.items():
1623 text = re.sub(rf"\b{re.escape(k)}\b", v, text)
1624
1625 # 9) Remove remaining apostrophes (after contraction expansion)
1626 text = text.replace("'", "")
1627
1628 # 10) Final whitespace normalization
1629 text = re.sub(r"\s+", " ", text).strip()
1630
1631 return text
1632 {{/code}}
1633
1634 **Canonical claim hash input (normative):**
1635 * ``claim_hash = sha256_hex_lower( "v1norm1|<language>|" + canonical_claim_text )``
1636 * Cache key: ``claim:v1norm1:<language>:<claim_hash>``
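
The normative rule above translates to a few lines of code. A sketch (structural assertions only; no specific digest value is implied):

{{code language="python"}}
import hashlib

def claim_cache_key(language: str, canonical_claim_text: str) -> str:
    """Build claim_hash and the Redis cache key per the normative rule above."""
    hash_input = "v1norm1|%s|%s" % (language, canonical_claim_text)
    claim_hash = hashlib.sha256(hash_input.encode("utf-8")).hexdigest()  # lowercase hex
    return "claim:v1norm1:%s:%s" % (language, claim_hash)

key = claim_cache_key("en", "covid vaccines are 95 percent effective")
assert key.startswith("claim:v1norm1:en:")
assert len(key.rsplit(":", 1)[1]) == 64  # sha256 hex digest length
{{/code}}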
1637
1638 **Normalization Examples:**
1639
1640 |= Input |= Normalized Output
1641 | "Biden won the 2020 election" | {{code}}biden won the 2020 election{{/code}}
1642 | "Biden won the 2020 election!" | {{code}}biden won the 2020 election{{/code}}
1643 | "Biden won the 2020 election" | {{code}}biden won the 2020 election{{/code}}
1644 | "Biden didn't win the 2020 election" | {{code}}biden did not win the 2020 election{{/code}}
1645 | "BIDEN WON THE 2020 ELECTION" | {{code}}biden won the 2020 election{{/code}}
1646
1647 **Versioning:** Algorithm version is {{code}}v1norm1{{/code}}. Changes to the algorithm require a new version identifier.
1648
1649 === 5.1.2 Copyright & Data Retention Policy ===
1650
1651 **Evidence Excerpt Storage:**
1652
1653 To comply with copyright law and fair use principles:
1654
1655 **What We Store:**
1656
1657 * **Metadata only:** Title, author, publisher, URL, publication date
1658 * **Short excerpts:** Max 25 words per quote, max 3 quotes per evidence item
1659 * **Summaries:** AI-generated bullet points (not verbatim text)
1660 * **No full articles:** Never store complete article text beyond job processing
1661
1662 **Total per Cached Claim:**
1663
1664 * Scenarios: 2 per claim
1665 * Evidence items: 6 per scenario (12 total)
1666 * Quotes: 3 per evidence × 25 words = 75 words per item
1667 * **Maximum stored verbatim text:** ~~900 words per claim (12 × 75)
1668
1669 **Retention:**
1670
1671 * Cache TTL: 90 days
1672 * Job outputs: 24 hours (then archived or deleted)
1673 * No persistent full-text article storage
1674
1675 **Rationale:**
1676
1677 * Short excerpts for citation = fair use
1678 * Summaries are transformative (not copyrightable)
1679 * Limited retention (90 days max)
1680 * No commercial republication of excerpts
1681
1682 **DMCA Compliance:**
1683
1684 * Cache invalidation endpoint available for rights holders
1685 * Contact: dmca@factharbor.org
1686
1687 ----
1688
1689 == Summary ==
1690
1691 This page summarizes the **structure and key sections** of the 1,515-line API specification.
1692
1693 **Full specification includes:**
1694
1695 * Complete API endpoints (7 total)
1696 * All data schemas (ClaimExtraction, ClaimAnalysis, HolisticAssessment, Complete)
1697 * Quality gates & validation rules
1698 * LLM configuration for all 3 stages
1699 * Implementation notes with code samples
1700 * Testing strategy
1701 * Cross-references to other pages
1702
1703 **The complete specification is available in:**
1704
1705 * this page, the authoritative canonical contract (45 KB standalone)
1706 * Export files (TEST/PRODUCTION) for xWiki import