Architecture

Version 1.1 by Robert Schaub on 2025/12/16 21:42

FactHarbor uses a modular-monolith architecture (POC → Beta 0) designed to evolve into a distributed, federated, multi-node system (Release 1.0+).
Modules are strongly separated, versioned, and auditable. All logic is transparent and deterministic.

High-Level System Architecture

FactHarbor is composed of the following major modules:

  • UI Frontend
  • REST API Layer
  • Core Logic Layer
    • Claim Processing  
    • Scenario Engine  
    • Evidence Repository  
    • Verdict Engine  
    • Re-evaluation Engine  
    • Roles / Identity / Reputation
  • AKEL (AI Knowledge Extraction Layer)
  • Federation Layer
  • Workers & Background Jobs
  • Storage Layer (Postgres + VectorDB + ObjectStore)
Information

Current Implementation (v2.10.2) - Two-service architecture: Next.js web app for UI and analysis, .NET API for job persistence.

High-Level Architecture


graph TB
    subgraph Client[Client Layer]
        BROWSER[Web Browser]
    end

    subgraph NextJS[Next.js Web App]
        ANALYZE[analyze page]
        JOBS[jobs page]
        JOBVIEW[jobs id page]
        ANALYZE_API[api fh analyze]
        JOBS_API[api fh jobs]
        RUN_JOB[api internal run-job]
        ORCH[orchestrated.ts]
        CANON[monolithic-canonical.ts]
        SHARED[Shared Modules]
        WEBSEARCH[web-search.ts]
        SR[source-reliability.ts]
    end

    subgraph DotNet[.NET API]
        DOTNET_API[ASP.NET Core API]
        JOBS_CTRL[JobsController]
        HEALTH_CTRL[HealthController]
        SQLITE[(SQLite factharbor.db)]
    end

    subgraph External[External Services]
        LLM[LLM Providers]
        SEARCH[Search Providers]
    end

    BROWSER --> ANALYZE
    BROWSER --> JOBS
    BROWSER --> JOBVIEW
    ANALYZE --> ANALYZE_API
    ANALYZE_API --> DOTNET_API
    DOTNET_API --> SQLITE
    RUN_JOB --> ORCH
    RUN_JOB --> CANON
    ORCH --> SHARED
    CANON --> SHARED
    ORCH --> LLM
    CANON --> LLM
    ORCH --> WEBSEARCH
    WEBSEARCH --> SEARCH
    SHARED --> SR

Component Summary

| Component | Technology | Purpose |
|---|---|---|
| Web App | Next.js 14+ | UI, API routes, AKEL pipeline |
| API | ASP.NET Core 8.0 | Job persistence, health checks |
| Database | SQLite (3 databases) | Jobs/events, UCM config, SR cache |
| LLM | AI SDK (Vercel) | Multi-provider LLM abstraction with model tiering |
| Search | Google CSE / SerpAPI | Web search for evidence |

Key Files

| File | Lines | Purpose |
|---|---|---|
| orchestrated.ts | 13,300 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1,500 | Monolithic canonical pipeline |
| analysis-contexts.ts | 600 | AnalysisContext pre-detection |
| aggregation.ts | 400 | Verdict aggregation + claim weighting |
| evidence-filter.ts | 300 | Deterministic evidence quality filtering |
| source-reliability.ts | 500 | LLM-based source reliability scoring |

Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
| FH_API_URL | localhost:5139 | .NET API endpoint |
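
A minimal sketch of how these variables might be read in the Next.js app, assuming Node's `process.env` and the defaults from the table above. The helper names (`envFlag`, `envString`) are illustrative, not the actual code.

```typescript
// Illustrative env-var parsing for the variables listed above.
// Helper names and parsing rules are assumptions, not the real implementation.

function envFlag(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  if (raw === undefined) return fallback; // absent -> documented default
  return raw.toLowerCase() === "true";
}

function envString(name: string, fallback: string): string {
  return process.env[name] ?? fallback;
}

const fhConfig = {
  searchEnabled: envFlag("FH_SEARCH_ENABLED", true),
  deterministic: envFlag("FH_DETERMINISTIC", true), // zero temperature when true
  apiUrl: envString("FH_API_URL", "localhost:5139"),
};
```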

Key ideas:

  • Core logic is deterministic, auditable, and versioned  
  • AKEL drafts structured outputs but never publishes directly  
  • Workers run long or asynchronous tasks  
  • Storage is separated for scalability and clarity  
  • Federation Layer provides optional distributed operation  

Storage Architecture

FactHarbor separates structured data, embeddings, and evidence files:

  • PostgreSQL — canonical structured entities, all versioning, lineage, signatures  
  • Vector DB (Qdrant or pgvector) — semantic search, duplication detection, cluster mapping  
  • Object Storage — PDFs, datasets, raw evidence, transcripts  
  • Optional (Release 1.0): Redis for caching, IPFS for decentralized object storage  

Storage Architecture

1. Current Implementation (v2.10.2)

1.1 Three-Database Architecture


graph TB
    subgraph NextJS[Next.js Web App]
        PIPELINE[Orchestrated Pipeline]
        CONFIG_SVC[Config Storage]
        SR_SVC[SR Cache]
    end

    subgraph DotNet[.NET API]
        CONTROLLERS[Controllers]
        EF[Entity Framework]
    end

    CONFIG_DB[(config.db)]
    SR_DB[(source-reliability.db)]
    FH_DB[(factharbor.db)]

    CONFIG_SVC --> CONFIG_DB
    SR_SVC --> SR_DB
    PIPELINE -->|via API| CONTROLLERS
    CONTROLLERS --> EF
    EF --> FH_DB

| Database | Purpose | Access Layer | Key Tables |
|---|---|---|---|
| factharbor.db | Jobs, events, analysis results | .NET API (Entity Framework) | Jobs, JobEvents, AnalysisMetrics |
| config.db | UCM configuration management | Next.js (better-sqlite3) | config_blobs, config_active, config_usage, job_config_snapshots |
| source-reliability.db | Source reliability cache | Next.js (better-sqlite3) | source_reliability |

1.2 Current Caching Mechanisms

| What | Mechanism | TTL | Status |
|---|---|---|---|
| Source reliability scores | SQLite + batch prefetch to in-memory Map | 90 days (configurable via UCM) | IMPLEMENTED |
| UCM config values | In-memory Map with TTL-based expiry | 60 seconds | IMPLEMENTED |
| URL content (fetched pages) | Not cached | N/A | NOT IMPLEMENTED |
| Claim-level analysis results | Not cached | N/A | NOT IMPLEMENTED |
| LLM responses | Not cached | N/A | NOT IMPLEMENTED |
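
The in-memory Map with TTL-based expiry used for UCM config values can be sketched as follows. The `TtlCache` class and its API are illustrative assumptions, not the actual implementation.

```typescript
// Sketch of the in-memory Map + TTL pattern described above.
// Class and method names are illustrative.

interface Entry<V> {
  value: V;
  expiresAt: number; // epoch millis
}

class TtlCache<V> {
  private store = new Map<string, Entry<V>>();
  constructor(private ttlMs: number) {}

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) { // TTL-based expiry
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// 60-second TTL, as listed for UCM config values.
const configCache = new TtlCache<string>(60_000);
```

A miss after expiry simply falls through to SQLite (or the source of truth), which is what makes the 60-second hot-reload behaviour cheap.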

1.3 Storage Patterns

  • Analysis results: JSON blob in ResultJson column (per job), stored once by .NET API
  • Config blobs: Content-addressable with SHA-256 hash as PK, history tracked
  • Job config snapshots: Pipeline + search + SR config captured per job for auditability
  • SR cache: Per-domain reliability assessment with multi-model consensus scores
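
The "content-addressable with SHA-256 hash as PK" pattern for config blobs can be sketched with Node's `crypto` module. The canonicalization step (sorted top-level keys) is an assumption; the real scheme may differ.

```typescript
import { createHash } from "node:crypto";

// Sketch: deriving a content address for a config blob, per the
// "SHA-256 hash as primary key" pattern above. Sorting keys before
// serialization is an assumed canonicalization, not the actual one.
function contentAddress(blob: Record<string, unknown>): string {
  const canonical = JSON.stringify(blob, Object.keys(blob).sort());
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the hash is derived from content, storing the same config twice deduplicates automatically, and any edit necessarily produces a new primary key.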

Current limitations:

  • No relational queries across claims, evidence, or sources from different analyses
  • No full-text search on analysis content
  • Single-writer limitation (SQLite) — fine for single-instance but blocks horizontal scaling
  • Every analysis re-fetches URL content and repeats every LLM call from scratch

2. What Is Worth Caching?

Warning

This section identifies caching opportunities. The EVALUATE items require deeper analysis during Alpha planning before committing to scope and timeline.

2.1 Caching Value Analysis

| Cacheable Item | Estimated Savings | Latency Impact | Complexity | Recommendation |
|---|---|---|---|---|
| Claim-level results | 30-50% LLM cost on duplicate claims | None (cache lookup) | MEDIUM — needs canonical claim hash + TTL + prompt-version awareness | EVALUATE in Alpha |
| URL content | $0 API cost but 5-15s latency per source | Major — eliminates re-fetch | LOW — URL hash + content + timestamp | EVALUATE in Alpha |
| LLM responses | Highest per-call savings | None | HIGH — prompt hash + input hash, invalidation on prompt change | DEFER — claim-level caching captures most benefit |
| Search query results | Marginal — search APIs are cheap | Minor | MEDIUM — results go stale quickly | NOT RECOMMENDED — volatile, low ROI |
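
The "canonical claim hash + prompt-version awareness" requirement for claim-level caching could look like the sketch below. The normalization steps and function name are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Illustrative claim-cache key: combines a canonicalized claim with the
// prompt version, so a prompt change invalidates all cached verdicts
// without explicit deletion. Normalization rules are assumptions.
function claimCacheKey(claimText: string, promptVersion: string): string {
  const canonical = claimText.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256")
    .update(`${promptVersion}\n${canonical}`)
    .digest("hex");
}
```

Bumping `promptVersion` is the invalidation mechanism: every key changes at once, and stale entries simply age out via TTL.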

2.2 Cost Impact Modeling

Assuming $0.10-$2.00 per analysis (depending on article complexity and model tier):

| Usage Level | Current Cost/day | With Claim Cache (-35%) | With URL Cache |
|---|---|---|---|
| 10 analyses/day | $1-$20 | $0.65-$13 | Same cost, 30-60s faster |
| 100 analyses/day | $10-$200 | $6.50-$130 | Same cost, 5-15 min faster |
| 1000 analyses/day | $100-$2,000 | $65-$1,300 | Same cost, 50-150 min faster |

Key insight: Claim caching saves money; URL caching saves time. Both follow the existing SQLite + in-memory Map pattern from source reliability.

3. Redis: Do We Still Need It?

3.1 Current Reality Assessment

| Original Redis Use Case | Current Solution | Gap? |
|---|---|---|
| Hot data caching | In-memory Map (config), SQLite (SR) | No gap at current scale |
| Session management | No user auth = no sessions | Not needed until Beta |
| Rate limiting | Not implemented | Can be in-process for single-instance |
| Pub/sub for real-time | SSE events work without Redis | No gap for single-instance |

3.2 When Redis Becomes Necessary

Redis adds value when:

  • Multiple application instances need shared cache/state (horizontal scaling)
  • Sub-millisecond cache lookups required (SQLite is 1-5ms, sufficient for current needs)
  • Distributed rate limiting needed across multiple servers

Trigger criteria (following When-to-Add-Complexity philosophy):

  • Single-instance SQLite cache latency >100ms
  • Need for >1 application instance
  • Rate limiting required across instances
Information

Decision: DEFER Redis. Not needed for current or near-term development. SQLite + in-memory Map handles all current caching needs.

4. PostgreSQL: When and Why?

4.1 Current SQLite Limitations

| Limitation | Impact | When It Hurts |
|---|---|---|
| JSON blob storage (no relational queries) | Cannot query across analyses | When browse/search is needed |
| Single-writer | No concurrent writes | When horizontal scaling is needed |
| No complex aggregation | Cannot run cross-analysis analytics | When quality dashboards need SQL |
| No full-text search | Cannot search claim text or evidence | When browse/search is needed |

4.2 What PostgreSQL Enables

  • Browse/search claims across all analyses
  • Quality metrics dashboards with SQL aggregation
  • Evidence deduplication (FR54) with relational queries
  • User accounts and permissions (Beta requirement)
  • Multi-instance deployments

4.3 Migration Path

The .NET API already has PostgreSQL support configured (appsettings.json). Switching is a configuration change, not a code rewrite.

Note: Keep SQLite for config.db (portable) and source-reliability.db (standalone). Only factharbor.db needs PostgreSQL.

Information

Decision: EVALUATE for Alpha/Beta. Add PostgreSQL when user accounts + search + evidence dedup needed. Requires deeper analysis during Alpha planning.

5. Vector Database Assessment

Information

Full assessment: Docs/WIP/Vector_DB_Assessment.md (February 2, 2026)

Conclusion: Vector search is not required for core functionality. Vectors add value only for approximate similarity (near-duplicate claim detection, edge case clustering) and should remain optional and offline to preserve pipeline performance and determinism.

When to add: Only after Shadow Mode data collection proves that near-duplicate detection needs exceed text-hash capability. Start with lightweight normalization + n-gram overlap (no vector DB needed).

Information

Decision: DEFER. Re-evaluate after Shadow Mode data collection.

6. Revised Storage Roadmap

Previous Roadmap (Superseded)

Phase 1: Add Redis for caching
Phase 2: Migrate to PostgreSQL for normalized data
Phase 3: Add S3 for archives and backups

Current Roadmap


graph LR
    subgraph Phase1[Phase 1: Alpha]
        P1A[Expand SQLite caching]
        P1B[Keep 3-DB architecture]
    end

    subgraph Phase2[Phase 2: Beta]
        P2A[Add PostgreSQL for factharbor.db]
        P2B[Add normalized claim/evidence tables]
        P2C[Keep SQLite for config + SR]
    end

    subgraph Phase3[Phase 3: V1.0]
        P3A[Add Redis IF multi-instance needed]
        P3B[PostgreSQL primary for production]
    end

    subgraph Phase4[Phase 4: V1.0+]
        P4A[Vector DB IF Shadow Mode proves value]
        P4B[S3 IF storage exceeds 50GB]
    end

    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4

Phase 1 (Alpha): Evaluate and potentially add URL content cache + claim-level cache in SQLite. Keep 3-DB architecture and in-memory Map caches.

Phase 2 (Beta): Add PostgreSQL for factharbor.db (user data, normalized claims, search). Keep SQLite for config.db (portable) and source-reliability.db (standalone).

Phase 3 (V1.0): Add Redis ONLY IF multi-instance deployment required. PostgreSQL becomes primary for all production data.

Phase 4 (V1.0+): Add vector DB ONLY IF Shadow Mode data proves value. Add S3 ONLY IF storage exceeds 50GB.

7. Decision Summary

| Technology | Decision | When | Status |
|---|---|---|---|
| SQLite URL cache | EVALUATE | Alpha planning | Needs further analysis |
| SQLite claim cache | EVALUATE | Alpha planning | Needs further analysis |
| Redis | DEFER | Multi-instance | Agreed |
| PostgreSQL | EVALUATE | Alpha/Beta | Needs further analysis |
| Vector DB | DEFER | Post-Shadow Mode | Agreed |
| S3 | DEFER | V1.0+ | Agreed |
Information

DEFER items are agreed. EVALUATE items (URL cache, claim cache, PostgreSQL) require deeper analysis during Alpha release planning — scope, dependencies, and prioritization to be determined as part of Alpha milestones.

Related Pages

Document Status: PARTIALLY APPROVED (February 2026) — DEFER decisions agreed; EVALUATE items need Alpha-phase analysis

Core Backend Module Architecture

Each module has a clear responsibility and versioned boundaries to allow future extraction into microservices.

1. Claim Processing Module

Responsibilities:

  • Ingest text, URLs, documents, transcripts, federated input  
  • Extract claims (AKEL-assisted)  
  • Normalize structure  
  • Classify (type, domain, evaluability, safety)  
  • Deduplicate via embeddings  
  • Assign to claim clusters  

Flow:  
Ingest → Normalize → Classify → Deduplicate → Cluster

2. Scenario Engine

Responsibilities:

  • Create and validate scenarios  
  • Enforce required fields (definitions, assumptions, boundaries...)  
  • Perform safety checks (AKEL-assisted)  
  • Manage versioning and lifecycle  
  • Provide contextual evaluation settings to the Verdict Engine  

Flow:  
Create → Validate → Version → Lifecycle → Safety

3. Evidence Repository

Responsibilities:

  • Store metadata + files (object store)  
  • Classify evidence  
  • Compute preliminary reliability  
  • Maintain version history  
  • Detect retractions or disputes  
  • Provide structured metadata to the Verdict Engine  

Flow:  
Store → Classify → Score → Version → Update/Retract

4. Verdict Engine

Responsibilities:

  • Aggregate scenario-linked evidence  
  • Compute likelihood ranges per scenario
  • Generate reasoning chain  
  • Track uncertainty factors  
  • Maintain verdict version timelines  

Flow:  
Aggregate → Compute → Explain → Version → Timeline
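
The Aggregate and Compute steps can be sketched as a reliability-weighted mean of evidence support scores plus a simple uncertainty range. The actual Verdict Engine logic is more involved; the weighting scheme and range heuristic here are assumptions.

```typescript
// Minimal sketch of scenario-level likelihood aggregation.
// Scores, weights, and the spread heuristic are illustrative assumptions.

interface EvidenceItem {
  support: number;     // -1 (contradicts) .. +1 (supports)
  reliability: number; // 0..1 source reliability score
}

function likelihoodRange(evidence: EvidenceItem[]): { low: number; high: number } {
  const totalWeight = evidence.reduce((s, e) => s + e.reliability, 0);
  if (totalWeight === 0) return { low: 0, high: 1 }; // no usable evidence: maximal uncertainty
  const weighted =
    evidence.reduce((s, e) => s + e.support * e.reliability, 0) / totalWeight;
  const mid = (weighted + 1) / 2;                   // map -1..+1 onto 0..1 likelihood
  const spread = 0.5 / Math.sqrt(evidence.length);  // narrows as evidence accumulates
  return {
    low: Math.max(0, mid - spread),
    high: Math.min(1, mid + spread),
  };
}
```

Reporting a range rather than a point value is what lets the engine "track uncertainty factors": sparse or unreliable evidence widens the interval instead of faking precision.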

5. Re-evaluation Engine

Responsibilities:

  • Listen for upstream changes  
  • Trigger partial or full recomputation  
  • Update verdicts + summary views  
  • Maintain consistency across federated nodes  

Triggers include:

  • Evidence updated or retracted  
  • Scenario definition or assumption changes  
  • Claim type or evaluability changes  
  • Contradiction detection  
  • Federation sync updates  

Flow:  
Trigger → Impact Analysis → Recompute → Publish Update
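
The Impact Analysis step can be sketched as a lookup over a dependency map from evidence to the verdicts built on it: an upstream change yields exactly the set of verdicts to recompute. The data shapes are illustrative assumptions.

```typescript
// Sketch of the "Impact Analysis" step: which verdicts must be recomputed
// when evidence changes? Types and names are illustrative.

type DependencyMap = Map<string, Set<string>>; // evidenceId -> dependent verdictIds

function affectedVerdicts(
  deps: DependencyMap,
  changedEvidenceIds: string[],
): Set<string> {
  const toRecompute = new Set<string>();
  for (const evidenceId of changedEvidenceIds) {
    for (const verdictId of deps.get(evidenceId) ?? []) {
      toRecompute.add(verdictId);
    }
  }
  return toRecompute;
}
```

This is what makes partial recomputation possible: only verdicts in the returned set proceed to the Recompute and Publish steps.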

AKEL Integration Summary

AKEL is fully documented in its own chapter; this section gives only an architectural integration summary:

  • Receives raw input for claims  
  • Proposes scenario drafts  
  • Extracts and summarizes evidence  
  • Gives reliability hints  
  • Suggests draft verdicts  
  • Monitors contradictions  
  • Syncs metadata with trusted nodes  

AKEL runs in parallel to human review — never overrides it.

Information

Current Implementation - Triple-Path Pipeline Architecture. Three pipeline variants share common modules for AnalysisContext detection, aggregation, claim processing, evidence filtering, verdict corrections, and source reliability.

Updated 2026-02-08 per documentation audit report.

Triple-Path Pipeline Architecture


graph TB
    subgraph Input[User Input]
        URL[URL Input]
        TEXT[Text Input]
    end

    subgraph Shared[Shared Modules]
        CONTEXTS[analysis-contexts.ts Context Detection]
        AGG[aggregation.ts Verdict Aggregation]
        CLAIM_D[claim-decomposition.ts]
        EF[evidence-filter.ts ~330 lines]
        QG[quality-gates.ts ~410 lines]
        SR[source-reliability.ts ~620 lines]
        VC[verdict-corrections.ts ~310 lines]
        TS[truth-scale.ts ~280 lines]
        BU[budgets.ts ~250 lines]
    end

    subgraph Dispatch[Pipeline Dispatch]
        SELECT{Select Pipeline}
    end

    subgraph Pipelines[Pipeline Implementations]
        ORCH[Orchestrated Pipeline]
        CANON[Monolithic Canonical]
        DYN[Monolithic Dynamic]
    end

    subgraph LLM[LLM Layer]
        PROVIDER[AI SDK Provider]
    end

    subgraph Output[Result]
        RESULT[AnalysisResult JSON]
        REPORT[Markdown Report]
    end

    URL --> SELECT
    TEXT --> SELECT
    SELECT -->|orchestrated| ORCH
    SELECT -->|monolithic_canonical| CANON
    SELECT -->|monolithic_dynamic| DYN
    CONTEXTS --> ORCH
    CONTEXTS --> CANON
    AGG --> ORCH
    AGG --> CANON
    CLAIM_D --> ORCH
    CLAIM_D --> CANON
    EF --> ORCH
    QG --> ORCH
    SR --> ORCH
    SR --> CANON
    SR --> DYN
    VC --> ORCH
    TS --> CANON
    TS --> DYN
    BU --> ORCH
    BU --> CANON
    BU --> DYN
    ORCH --> PROVIDER
    CANON --> PROVIDER
    DYN --> PROVIDER
    ORCH --> RESULT
    CANON --> RESULT
    DYN --> RESULT
    RESULT --> REPORT

Pipeline Variants

| Variant | File | Lines | Approach | Output Schema |
|---|---|---|---|---|
| Orchestrated | orchestrated.ts | 13,300 | Multi-step workflow with explicit stages | Canonical (structured) |
| Monolithic Canonical | monolithic-canonical.ts | 1,500 | Single LLM tool-loop call | Canonical (structured) |
| Monolithic Dynamic | monolithic-dynamic.ts | 735 | Single LLM tool-loop call | Dynamic (flexible) |

Shared Modules

| Module | Lines | Used By | Purpose |
|---|---|---|---|
| analysis-contexts.ts |  | Orch, Canon | Heuristic context pre-detection before LLM |
| aggregation.ts |  | Orch, Canon | Verdict weighting, contestation validation |
| claim-decomposition.ts |  | Orch, Canon | Claim text parsing and normalization |
| evidence-filter.ts | 330 | Orch | Probative value filtering, false positive rate calculation |
| quality-gates.ts | 410 | Orch | Gate 1 (claim validation) and Gate 4 (verdict confidence) |
| source-reliability.ts | 620 | Orch, Canon, Dyn | LLM-based source reliability evaluation with cache |
| verdict-corrections.ts | 310 | Orch | Post-hoc verdict direction mismatch corrections |
| truth-scale.ts | 280 | Canon, Dyn | Percentage-to-verdict label mapping |
| budgets.ts | 250 | Orch, Canon, Dyn | Token/cost budget tracking and enforcement |
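
The percentage-to-verdict label mapping that truth-scale.ts provides for the monolithic pipelines might look like the sketch below. The label names and thresholds here are placeholders, not the project's actual scale.

```typescript
// Illustrative percentage-to-verdict mapping in the spirit of truth-scale.ts.
// Labels and cutoffs are assumptions for the sketch only.

function verdictLabel(pct: number): string {
  if (pct < 0 || pct > 100) throw new RangeError("pct must be 0..100");
  if (pct >= 80) return "Likely true";
  if (pct >= 60) return "Leaning true";
  if (pct > 40) return "Uncertain";
  if (pct > 20) return "Leaning false";
  return "Likely false";
}
```

Centralizing this mapping in one shared module is what keeps the canonical and dynamic pipelines consistent when they render the same percentage.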

Orchestrated Pipeline Steps

  1. Understand - Detect input type, extract claims, identify dependencies
  2. Research (iterative) - Generate queries, fetch sources, extract evidence
  3. Verdict Generation - Generate claim and article verdicts
  4. Summary - Build two-panel summary
  5. Report - Generate markdown report

Detailed Pipeline Diagrams

For internal implementation details of each pipeline variant:

Federated Architecture

Each FactHarbor node:

  • Has its own dataset (claims, scenarios, evidence, verdicts)  
  • Runs its own AKEL  
  • Maintains local governance and reviewer rules  
  • May partially mirror global or domain-specific data  
  • Contributes to global knowledge clusters  

Nodes synchronize via:

  • Signed version bundles  
  • Merkle-tree lineage structures  
  • Optionally IPFS for evidence  
  • Trust-weighted acceptance  

Benefits:

  • Community independence  
  • Scalability  
  • Resilience  
  • Domain specialization  
Warning

Not Implemented (v2.10.2) — Federation is planned for V2.0+. Current implementation is single-instance only.

Federation Architecture (Future)


graph LR
    FH1[FactHarbor Instance 1]
    FH2[FactHarbor Instance 2]
    FH3[FactHarbor Instance 3]
    FH1 -.->|V1.0+ Sync claims| FH2
    FH2 -.->|V1.0+ Sync claims| FH3
    FH3 -.->|V1.0+ Sync claims| FH1
    U1[Users] --> FH1
    U2[Users] --> FH2
    U3[Users] --> FH3

Federation Architecture - Future (V1.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.

Target Features

| Feature | Purpose | Status |
|---|---|---|
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |

Current Implementation

  • Single-instance deployment only
  • No inter-instance communication
  • All data stored locally in SQLite

Request → Verdict Flow

Simple end-to-end flow:

User → UI Frontend → REST API → FactHarbor Core
      → (Claim Processing → Scenario Engine → Evidence Repository → Verdict Engine)
      → Summary View → UI Frontend → User

Federation Sync Workflow

Sequence:

Detect Local Change → Build Signed Bundle → Push to Peers → Validate Signature → Merge or Fork → Trigger Re-evaluation
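
The Build Signed Bundle and Validate Signature steps can be sketched with Ed25519 from Node's `crypto` module. The bundle shape and key handling are illustrative; the real protocol (Merkle lineage, trust-weighted acceptance) is richer.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Sketch of signing and validating a sync bundle. Payload shape and key
// management are assumptions; only the sign/verify mechanics are real.

const { publicKey, privateKey } = generateKeyPairSync("ed25519");

function buildSignedBundle(payload: object): { payload: string; signature: string } {
  const body = JSON.stringify(payload);
  // Ed25519 uses a null digest algorithm in Node's one-shot sign API.
  const signature = sign(null, Buffer.from(body), privateKey).toString("base64");
  return { payload: body, signature };
}

function validateBundle(bundle: { payload: string; signature: string }): boolean {
  return verify(
    null,
    Buffer.from(bundle.payload),
    publicKey,
    Buffer.from(bundle.signature, "base64"),
  );
}
```

A peer that fails validation would reject the bundle before the Merge or Fork step, so tampered or mis-signed updates never reach re-evaluation.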

Versioning Architecture

All entities (Claim, Scenario, Evidence, Verdict) use immutable version chains:

  • VersionID  
  • ParentVersionID  
  • Timestamp  
  • AuthorType (Human, AI, ExternalNode)  
  • ChangeReason  
  • Signature (optional POC, required in 1.0)  
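
The version-chain fields above can be sketched as a TypeScript type plus an append helper. Field names follow the list; the ID scheme and helper are illustrative assumptions.

```typescript
// Sketch of an immutable version chain using the fields listed above.
// versionId generation is illustrative only.

type AuthorType = "Human" | "AI" | "ExternalNode";

interface Version {
  versionId: string;
  parentVersionId: string | null; // null only for the chain root
  timestamp: string;              // ISO-8601
  authorType: AuthorType;
  changeReason: string;
  signature?: string;             // optional in POC, required in 1.0
}

function appendVersion(
  chain: Version[],
  authorType: AuthorType,
  changeReason: string,
): Version[] {
  const parent = chain.at(-1) ?? null;
  const next: Version = {
    versionId: `v${chain.length + 1}`,
    parentVersionId: parent ? parent.versionId : null,
    timestamp: new Date().toISOString(),
    authorType,
    changeReason,
  };
  return [...chain, next]; // immutable: existing versions are never mutated
}
```

Because each version points at its parent and nothing is edited in place, lineage can always be replayed from the root, which is what makes the chains auditable.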

UCM Configuration Versioning Architecture


graph LR
    ADMIN[UCM Administrator] -->|creates| BLOB[Config Blob - immutable]
    BLOB -->|content-addressed| STORE[(config_blobs)]
    ADMIN -->|activates| ACTIVE[config_active]
    ACTIVE -->|points to| BLOB
    JOB[Analysis Job] -->|snapshots at start| USAGE[config_usage]
    USAGE -->|references| BLOB
    REPORT[Analysis Report] -->|cites| USAGE

How UCM Config Versioning Works

| Concept | Description |
|---|---|
| config_blobs | Immutable, content-addressed config versions. Each change creates a new blob; old blobs are never deleted. |
| config_active | Pointer to the currently active config blob per config type. Changing this activates a new config version. |
| config_usage | Links each analysis job to the exact config snapshot used. Enables reproducibility. |
| Immutability | Analysis outputs are never edited. To improve results, update UCM config and re-analyze. |
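
A toy model of the three tables (config_blobs, config_active, config_usage) using in-memory Maps shows how a job snapshot pins the exact config used. Table and field names mirror the text; everything else is illustrative.

```typescript
import { createHash } from "node:crypto";

// Toy in-memory model of the UCM versioning tables described above.
// Real storage is SQLite (config.db); this only demonstrates the flow.

const configBlobs = new Map<string, string>();  // hash -> blob JSON (immutable)
const configActive = new Map<string, string>(); // configType -> active blob hash
const configUsage = new Map<string, string>();  // jobId -> snapshotted blob hash

function storeAndActivate(configType: string, config: object): string {
  const body = JSON.stringify(config);
  const hash = createHash("sha256").update(body).digest("hex");
  configBlobs.set(hash, body);        // old blobs are never deleted
  configActive.set(configType, hash); // pointer moves to the new blob
  return hash;
}

function snapshotForJob(jobId: string, configType: string): void {
  const hash = configActive.get(configType);
  if (!hash) throw new Error(`no active config for ${configType}`);
  configUsage.set(jobId, hash);       // job is pinned to this exact blob
}

function reproduceConfig(jobId: string): object {
  const hash = configUsage.get(jobId);
  const body = hash ? configBlobs.get(hash) : undefined;
  if (!body) throw new Error(`no snapshot for job ${jobId}`);
  return JSON.parse(body);
}
```

Even after the active pointer moves to a newer blob, `reproduceConfig` still returns the blob a past job actually ran with, which is the reproducibility guarantee the table describes.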

Current Implementation (v2.10.2)

| Feature | Status |
|---|---|
| UCM config storage | Implemented (config.db SQLite) |
| Config hot-reload | Implemented (60s TTL) |
| Per-job config snapshots | Implemented (job_config_snapshots) |
| Content-addressed blobs | Implemented (hash-based deduplication) |
| Config activation tracking | Implemented (config_active table) |
| Admin UI for config management | Not yet implemented (CLI/direct DB) |

Design Principles

  • Every config change creates a new immutable blob — no in-place mutation
  • Every analysis job records the config snapshot used at time of execution
  • Reports can be reproduced by re-running with the same config snapshot
  • Config history is the audit trail — who changed what, when, and why
  • Analysis data is never edited — "improve the system, not the data"