Architecture

Version 1.1 by Robert Schaub on 2025/12/16 21:42

FactHarbor uses a modular-monolith architecture (POC → Beta 0) designed to evolve into a distributed, federated, multi-node system (Release 1.0+).
Modules are strongly separated, versioned, and auditable. All logic is transparent and deterministic.

High-Level System Architecture

FactHarbor is composed of the following major modules:

  • UI Frontend
  • REST API Layer
  • Core Logic Layer
    • Claim Processing  
    • Scenario Engine  
    • Evidence Repository  
    • Verdict Engine  
    • Re-evaluation Engine  
    • Roles / Identity / Reputation
  • AKEL (AI Knowledge Extraction Layer)
  • Federation Layer
  • Workers & Background Jobs
  • Storage Layer (Postgres + VectorDB + ObjectStore)
Information

Current Implementation (v2.10.2) - Two-service architecture: Next.js web app for UI and analysis, .NET API for job persistence.

High-Level Architecture


graph TB
    subgraph Client[Client Layer]
        BROWSER[Web Browser]
    end

    subgraph NextJS[Next.js Web App]
        ANALYZE[analyze page]
        JOBS[jobs page]
        JOBVIEW[jobs id page]
        ANALYZE_API[api fh analyze]
        JOBS_API[api fh jobs]
        RUN_JOB[api internal run-job]
        ORCH[orchestrated.ts]
        CANON[monolithic-canonical.ts]
        SHARED[Shared Modules]
        WEBSEARCH[web-search.ts]
        SR[source-reliability.ts]
    end

    subgraph DotNet[.NET API]
        DOTNET_API[ASP.NET Core API]
        JOBS_CTRL[JobsController]
        HEALTH_CTRL[HealthController]
        SQLITE[(SQLite factharbor.db)]
    end

    subgraph External[External Services]
        LLM[LLM Providers]
        SEARCH[Search Providers]
    end

    BROWSER --> ANALYZE
    BROWSER --> JOBS
    BROWSER --> JOBVIEW
    ANALYZE --> ANALYZE_API
    ANALYZE_API --> DOTNET_API
    DOTNET_API --> SQLITE
    RUN_JOB --> ORCH
    RUN_JOB --> CANON
    ORCH --> SHARED
    CANON --> SHARED
    ORCH --> LLM
    CANON --> LLM
    ORCH --> WEBSEARCH
    WEBSEARCH --> SEARCH
    SHARED --> SR

Component Summary

| Component | Technology | Purpose |
|---|---|---|
| Web App | Next.js 14+ | UI, API routes, AKEL pipeline |
| API | ASP.NET Core 8.0 | Job persistence, health checks |
| Database | SQLite (3 databases) | Jobs/events, UCM config, SR cache |
| LLM | AI SDK (Vercel) | Multi-provider LLM abstraction with model tiering |
| Search | Google CSE / SerpAPI | Web search for evidence |

Key Files

| File | Lines | Purpose |
|---|---|---|
| orchestrated.ts | 13,300 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1,500 | Monolithic canonical pipeline |
| analysis-contexts.ts | 600 | AnalysisContext pre-detection |
| aggregation.ts | 400 | Verdict aggregation + claim weighting |
| evidence-filter.ts | 300 | Deterministic evidence quality filtering |
| source-reliability.ts | 500 | LLM-based source reliability scoring |

Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
| FH_API_URL | localhost:5139 | .NET API endpoint |
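
A minimal sketch of how these variables might be read in the Next.js app, assuming Node's `process.env` and the defaults from the table above. The helper names (`envFlag`, `envString`) are illustrative, not the actual code.

```typescript
// Illustrative env-var parsing for the variables listed above.
// Helper names and parsing rules are assumptions, not the real implementation.

function envFlag(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  if (raw === undefined) return fallback; // absent -> documented default
  return raw.toLowerCase() === "true";
}

function envString(name: string, fallback: string): string {
  return process.env[name] ?? fallback;
}

const fhConfig = {
  searchEnabled: envFlag("FH_SEARCH_ENABLED", true),
  deterministic: envFlag("FH_DETERMINISTIC", true), // zero temperature when true
  apiUrl: envString("FH_API_URL", "localhost:5139"),
};
```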

Key ideas:

  • Core logic is deterministic, auditable, and versioned  
  • AKEL drafts structured outputs but never publishes directly  
  • Workers run long or asynchronous tasks  
  • Storage is separated for scalability and clarity  
  • Federation Layer provides optional distributed operation  

Storage Architecture

FactHarbor separates structured data, embeddings, and evidence files:

  • PostgreSQL — canonical structured entities, all versioning, lineage, signatures  
  • Vector DB (Qdrant or pgvector) — semantic search, duplication detection, cluster mapping  
  • Object Storage — PDFs, datasets, raw evidence, transcripts  
  • Optional (Release 1.0): Redis for caching, IPFS for decentralized object storage  

Storage Architecture

1. Current Implementation (v2.10.2)

1.1 Three-Database Architecture


graph TB
    subgraph NextJS[Next.js Web App]
        PIPELINE[Orchestrated Pipeline]
        CONFIG_SVC[Config Storage]
        SR_SVC[SR Cache]
    end

    subgraph DotNet[.NET API]
        CONTROLLERS[Controllers]
        EF[Entity Framework]
    end

    CONFIG_DB[(config.db)]
    SR_DB[(source-reliability.db)]
    FH_DB[(factharbor.db)]

    CONFIG_SVC --> CONFIG_DB
    SR_SVC --> SR_DB
    PIPELINE -->|via API| CONTROLLERS
    CONTROLLERS --> EF
    EF --> FH_DB

| Database | Purpose | Access Layer | Key Tables |
|---|---|---|---|
| factharbor.db | Jobs, events, analysis results | .NET API (Entity Framework) | Jobs, JobEvents, AnalysisMetrics |
| config.db | UCM configuration management | Next.js (better-sqlite3) | config_blobs, config_active, config_usage, job_config_snapshots |
| source-reliability.db | Source reliability cache | Next.js (better-sqlite3) | source_reliability |

1.2 Current Caching Mechanisms

| What | Mechanism | TTL | Status |
|---|---|---|---|
| Source reliability scores | SQLite + batch prefetch to in-memory Map | 90 days (configurable via UCM) | IMPLEMENTED |
| UCM config values | In-memory Map with TTL-based expiry | 60 seconds | IMPLEMENTED |
| URL content (fetched pages) | Not cached | N/A | NOT IMPLEMENTED |
| Claim-level analysis results | Not cached | N/A | NOT IMPLEMENTED |
| LLM responses | Not cached | N/A | NOT IMPLEMENTED |
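
The in-memory Map with TTL-based expiry used for UCM config values can be sketched as follows. The `TtlCache` class and its API are illustrative assumptions, not the actual implementation.

```typescript
// Sketch of the in-memory Map + TTL pattern described above.
// Class and method names are illustrative.

interface Entry<V> {
  value: V;
  expiresAt: number; // epoch millis
}

class TtlCache<V> {
  private store = new Map<string, Entry<V>>();
  constructor(private ttlMs: number) {}

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) { // TTL-based expiry
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// 60-second TTL, as listed for UCM config values.
const configCache = new TtlCache<string>(60_000);
```

A miss after expiry simply falls through to SQLite (or the source of truth), which is what makes the 60-second hot-reload behaviour cheap.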

1.3 Storage Patterns

  • Analysis results: JSON blob in ResultJson column (per job), stored once by .NET API
  • Config blobs: Content-addressable with SHA-256 hash as PK, history tracked
  • Job config snapshots: Pipeline + search + SR config captured per job for auditability
  • SR cache: Per-domain reliability assessment with multi-model consensus scores
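
The "content-addressable with SHA-256 hash as PK" pattern for config blobs can be sketched with Node's `crypto` module. The canonicalization step (sorted top-level keys) is an assumption; the real scheme may differ.

```typescript
import { createHash } from "node:crypto";

// Sketch: deriving a content address for a config blob, per the
// "SHA-256 hash as primary key" pattern above. Sorting keys before
// serialization is an assumed canonicalization, not the actual one.
function contentAddress(blob: Record<string, unknown>): string {
  const canonical = JSON.stringify(blob, Object.keys(blob).sort());
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the hash is derived from content, storing the same config twice deduplicates automatically, and any edit necessarily produces a new primary key.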

Current limitations:

  • No relational queries across claims, evidence, or sources from different analyses
  • No full-text search on analysis content
  • Single-writer limitation (SQLite) — fine for single-instance but blocks horizontal scaling
  • Every analysis re-fetches URL content and repeats every LLM call from scratch

2. What Is Worth Caching?

Warning

This section identifies caching opportunities. The EVALUATE items require deeper analysis during Alpha planning before committing to scope and timeline.

2.1 Caching Value Analysis

| Cacheable Item | Estimated Savings | Latency Impact | Complexity | Recommendation |
|---|---|---|---|---|
| Claim-level results | 30-50% LLM cost on duplicate claims | None (cache lookup) | MEDIUM — needs canonical claim hash + TTL + prompt-version awareness | EVALUATE in Alpha |
| URL content | $0 API cost but 5-15s latency per source | Major — eliminates re-fetch | LOW — URL hash + content + timestamp | EVALUATE in Alpha |
| LLM responses | Highest per-call savings | None | HIGH — prompt hash + input hash, invalidation on prompt change | DEFER — claim-level caching captures most benefit |
| Search query results | Marginal — search APIs are cheap | Minor | MEDIUM — results go stale quickly | NOT RECOMMENDED — volatile, low ROI |
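
The "canonical claim hash + prompt-version awareness" requirement for claim-level caching could look like the sketch below. The normalization steps and function name are assumptions for illustration.

```typescript
import { createHash } from "node:crypto";

// Illustrative claim-cache key: combines a canonicalized claim with the
// prompt version, so a prompt change invalidates all cached verdicts
// without explicit deletion. Normalization rules are assumptions.
function claimCacheKey(claimText: string, promptVersion: string): string {
  const canonical = claimText.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256")
    .update(`${promptVersion}\n${canonical}`)
    .digest("hex");
}
```

Bumping `promptVersion` is the invalidation mechanism: every key changes at once, and stale entries simply age out via TTL.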

2.2 Cost Impact Modeling

Assuming $0.10-$2.00 per analysis (depending on article complexity and model tier):

| Usage Level | Current Cost/day | With Claim Cache (-35%) | With URL Cache |
|---|---|---|---|
| 10 analyses/day | $1-$20 | $0.65-$13 | Same cost, 30-60s faster |
| 100 analyses/day | $10-$200 | $6.50-$130 | Same cost, 5-15 min faster |
| 1000 analyses/day | $100-$2,000 | $65-$1,300 | Same cost, 50-150 min faster |

Key insight: Claim caching saves money; URL caching saves time. Both follow the existing SQLite + in-memory Map pattern from source reliability.

3. Redis: Do We Still Need It?

3.1 Current Reality Assessment

| Original Redis Use Case | Current Solution | Gap? |
|---|---|---|
| Hot data caching | In-memory Map (config), SQLite (SR) | No gap at current scale |
| Session management | No user auth = no sessions | Not needed until Beta |
| Rate limiting | Not implemented | Can be in-process for single-instance |
| Pub/sub for real-time | SSE events work without Redis | No gap for single-instance |

3.2 When Redis Becomes Necessary

Redis adds value when:

  • Multiple application instances need shared cache/state (horizontal scaling)
  • Sub-millisecond cache lookups required (SQLite is 1-5ms, sufficient for current needs)
  • Distributed rate limiting needed across multiple servers

Trigger criteria (following When-to-Add-Complexity philosophy):

  • Single-instance SQLite cache latency >100ms
  • Need for >1 application instance
  • Rate limiting required across instances
Information

Decision: DEFER Redis. Not needed for current or near-term development. SQLite + in-memory Map handles all current caching needs.

4. PostgreSQL: When and Why?

4.1 Current SQLite Limitations

| Limitation | Impact | When It Hurts |
|---|---|---|
| JSON blob storage (no relational queries) | Cannot query across analyses | When browse/search is needed |
| Single-writer | No concurrent writes | When horizontal scaling is needed |
| No complex aggregation | Cannot run cross-analysis analytics | When quality dashboards need SQL |
| No full-text search | Cannot search claim text or evidence | When browse/search is needed |

4.2 What PostgreSQL Enables

  • Browse/search claims across all analyses
  • Quality metrics dashboards with SQL aggregation
  • Evidence deduplication (FR54) with relational queries
  • User accounts and permissions (Beta requirement)
  • Multi-instance deployments

4.3 Migration Path

The .NET API already has PostgreSQL support configured (appsettings.json). Switching is a configuration change, not a code rewrite.

Note: Keep SQLite for config.db (portable) and source-reliability.db (standalone). Only factharbor.db needs PostgreSQL.

Information

Decision: EVALUATE for Alpha/Beta. Add PostgreSQL when user accounts + search + evidence dedup needed. Requires deeper analysis during Alpha planning.

5. Vector Database Assessment

Information

Full assessment: Docs/WIP/Vector_DB_Assessment.md (February 2, 2026)

Conclusion: Vector search is not required for core functionality. Vectors add value only for approximate similarity (near-duplicate claim detection, edge case clustering) and should remain optional and offline to preserve pipeline performance and determinism.

When to add: Only after Shadow Mode data collection proves that near-duplicate detection needs exceed text-hash capability. Start with lightweight normalization + n-gram overlap (no vector DB needed).

Information

Decision: DEFER. Re-evaluate after Shadow Mode data collection.

6. Revised Storage Roadmap

Previous Roadmap (Superseded)

Phase 1: Add Redis for caching
Phase 2: Migrate to PostgreSQL for normalized data
Phase 3: Add S3 for archives and backups

Current Roadmap


graph LR
    subgraph Phase1[Phase 1: Alpha]
        P1A[Expand SQLite caching]
        P1B[Keep 3-DB architecture]
    end

    subgraph Phase2[Phase 2: Beta]
        P2A[Add PostgreSQL for factharbor.db]
        P2B[Add normalized claim/evidence tables]
        P2C[Keep SQLite for config + SR]
    end

    subgraph Phase3[Phase 3: V1.0]
        P3A[Add Redis IF multi-instance needed]
        P3B[PostgreSQL primary for production]
    end

    subgraph Phase4[Phase 4: V1.0+]
        P4A[Vector DB IF Shadow Mode proves value]
        P4B[S3 IF storage exceeds 50GB]
    end

    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4

Phase 1 (Alpha): Evaluate and potentially add URL content cache + claim-level cache in SQLite. Keep 3-DB architecture and in-memory Map caches.

Phase 2 (Beta): Add PostgreSQL for factharbor.db (user data, normalized claims, search). Keep SQLite for config.db (portable) and source-reliability.db (standalone).

Phase 3 (V1.0): Add Redis ONLY IF multi-instance deployment required. PostgreSQL becomes primary for all production data.

Phase 4 (V1.0+): Add vector DB ONLY IF Shadow Mode data proves value. Add S3 ONLY IF storage exceeds 50GB.

7. Decision Summary

| Technology | Decision | When | Status |
|---|---|---|---|
| SQLite URL cache | EVALUATE | Alpha planning | Needs further analysis |
| SQLite claim cache | EVALUATE | Alpha planning | Needs further analysis |
| Redis | DEFER | Multi-instance | Agreed |
| PostgreSQL | EVALUATE | Alpha/Beta | Needs further analysis |
| Vector DB | DEFER | Post-Shadow Mode | Agreed |
| S3 | DEFER | V1.0+ | Agreed |
Information

DEFER items are agreed. EVALUATE items (URL cache, claim cache, PostgreSQL) require deeper analysis during Alpha release planning — scope, dependencies, and prioritization to be determined as part of Alpha milestones.

Related Pages

Document Status: PARTIALLY APPROVED (February 2026) — DEFER decisions agreed; EVALUATE items need Alpha-phase analysis

Core Backend Module Architecture

Each module has a clear responsibility and versioned boundaries to allow future extraction into microservices.

1. Claim Processing Module

Responsibilities:

  • Ingest text, URLs, documents, transcripts, federated input  
  • Extract claims (AKEL-assisted)  
  • Normalize structure  
  • Classify (type, domain, evaluability, safety)  
  • Deduplicate via embeddings  
  • Assign to claim clusters  

Flow:  
Ingest → Normalize → Classify → Deduplicate → Cluster

2. Scenario Engine

Responsibilities:

  • Create and validate scenarios  
  • Enforce required fields (definitions, assumptions, boundaries...)  
  • Perform safety checks (AKEL-assisted)  
  • Manage versioning and lifecycle  
  • Provide contextual evaluation settings to the Verdict Engine  

Flow:  
Create → Validate → Version → Lifecycle → Safety

3. Evidence Repository

Responsibilities:

  • Store metadata + files (object store)  
  • Classify evidence  
  • Compute preliminary reliability  
  • Maintain version history  
  • Detect retractions or disputes  
  • Provide structured metadata to the Verdict Engine  

Flow:  
Store → Classify → Score → Version → Update/Retract

4. Verdict Engine

Responsibilities:

  • Aggregate scenario-linked evidence  
  • Compute likelihood ranges per scenario
  • Generate reasoning chain  
  • Track uncertainty factors  
  • Maintain verdict version timelines  

Flow:  
Aggregate → Compute → Explain → Version → Timeline
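
The Aggregate and Compute steps can be sketched as a reliability-weighted mean of evidence support scores plus a simple uncertainty range. The actual Verdict Engine logic is more involved; the weighting scheme and range heuristic here are assumptions.

```typescript
// Minimal sketch of scenario-level likelihood aggregation.
// Scores, weights, and the spread heuristic are illustrative assumptions.

interface EvidenceItem {
  support: number;     // -1 (contradicts) .. +1 (supports)
  reliability: number; // 0..1 source reliability score
}

function likelihoodRange(evidence: EvidenceItem[]): { low: number; high: number } {
  const totalWeight = evidence.reduce((s, e) => s + e.reliability, 0);
  if (totalWeight === 0) return { low: 0, high: 1 }; // no usable evidence: maximal uncertainty
  const weighted =
    evidence.reduce((s, e) => s + e.support * e.reliability, 0) / totalWeight;
  const mid = (weighted + 1) / 2;                   // map -1..+1 onto 0..1 likelihood
  const spread = 0.5 / Math.sqrt(evidence.length);  // narrows as evidence accumulates
  return {
    low: Math.max(0, mid - spread),
    high: Math.min(1, mid + spread),
  };
}
```

Reporting a range rather than a point value is what lets the engine "track uncertainty factors": sparse or unreliable evidence widens the interval instead of faking precision.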

5. Re-evaluation Engine

Responsibilities:

  • Listen for upstream changes  
  • Trigger partial or full recomputation  
  • Update verdicts + summary views  
  • Maintain consistency across federated nodes  

Triggers include:

  • Evidence updated or retracted  
  • Scenario definition or assumption changes  
  • Claim type or evaluability changes  
  • Contradiction detection  
  • Federation sync updates  

Flow:  
Trigger → Impact Analysis → Recompute → Publish Update
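
The Impact Analysis step can be sketched as a lookup over a dependency map from evidence to the verdicts built on it: an upstream change yields exactly the set of verdicts to recompute. The data shapes are illustrative assumptions.

```typescript
// Sketch of the "Impact Analysis" step: which verdicts must be recomputed
// when evidence changes? Types and names are illustrative.

type DependencyMap = Map<string, Set<string>>; // evidenceId -> dependent verdictIds

function affectedVerdicts(
  deps: DependencyMap,
  changedEvidenceIds: string[],
): Set<string> {
  const toRecompute = new Set<string>();
  for (const evidenceId of changedEvidenceIds) {
    for (const verdictId of deps.get(evidenceId) ?? []) {
      toRecompute.add(verdictId);
    }
  }
  return toRecompute;
}
```

This is what makes partial recomputation possible: only verdicts in the returned set proceed to the Recompute and Publish steps.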

AKEL Integration Summary

AKEL is fully documented in its own chapter; this section gives only an architectural integration summary:

  • Receives raw input for claims  
  • Proposes scenario drafts  
  • Extracts and summarizes evidence  
  • Gives reliability hints  
  • Suggests draft verdicts  
  • Monitors contradictions  
  • Syncs metadata with trusted nodes  

AKEL runs in parallel to human review — never overrides it.

Information

Current Implementation - Triple-Path Pipeline Architecture. Three pipeline variants share common modules for AnalysisContext detection, aggregation, claim processing, evidence filtering, verdict corrections, and source reliability.

Updated 2026-02-08 per documentation audit report.

Triple-Path Pipeline Architecture


graph TB
    subgraph Input[User Input]
        URL[URL Input]
        TEXT[Text Input]
    end

    subgraph Shared[Shared Modules]
        CONTEXTS[analysis-contexts.ts Context Detection]
        AGG[aggregation.ts Verdict Aggregation]
        CLAIM_D[claim-decomposition.ts]
        EF[evidence-filter.ts ~330 lines]
        QG[quality-gates.ts ~410 lines]
        SR[source-reliability.ts ~620 lines]
        VC[verdict-corrections.ts ~310 lines]
        TS[truth-scale.ts ~280 lines]
        BU[budgets.ts ~250 lines]
    end

    subgraph Dispatch[Pipeline Dispatch]
        SELECT{Select Pipeline}
    end

    subgraph Pipelines[Pipeline Implementations]
        ORCH[Orchestrated Pipeline]
        CANON[Monolithic Canonical]
        DYN[Monolithic Dynamic]
    end

    subgraph LLM[LLM Layer]
        PROVIDER[AI SDK Provider]
    end

    subgraph Output[Result]
        RESULT[AnalysisResult JSON]
        REPORT[Markdown Report]
    end

    URL --> SELECT
    TEXT --> SELECT
    SELECT -->|orchestrated| ORCH
    SELECT -->|monolithic_canonical| CANON
    SELECT -->|monolithic_dynamic| DYN
    CONTEXTS --> ORCH
    CONTEXTS --> CANON
    AGG --> ORCH
    AGG --> CANON
    CLAIM_D --> ORCH
    CLAIM_D --> CANON
    EF --> ORCH
    QG --> ORCH
    SR --> ORCH
    SR --> CANON
    SR --> DYN
    VC --> ORCH
    TS --> CANON
    TS --> DYN
    BU --> ORCH
    BU --> CANON
    BU --> DYN
    ORCH --> PROVIDER
    CANON --> PROVIDER
    DYN --> PROVIDER
    ORCH --> RESULT
    CANON --> RESULT
    DYN --> RESULT
    RESULT --> REPORT

Pipeline Variants

| Variant | File | Lines | Approach | Output Schema |
|---|---|---|---|---|
| Orchestrated | orchestrated.ts | 13,300 | Multi-step workflow with explicit stages | Canonical (structured) |
| Monolithic Canonical | monolithic-canonical.ts | 1,500 | Single LLM tool-loop call | Canonical (structured) |
| Monolithic Dynamic | monolithic-dynamic.ts | 735 | Single LLM tool-loop call | Dynamic (flexible) |

Shared Modules

| Module | Lines | Used By | Purpose |
|---|---|---|---|
| analysis-contexts.ts |  | Orch, Canon | Heuristic context pre-detection before LLM |
| aggregation.ts |  | Orch, Canon | Verdict weighting, contestation validation |
| claim-decomposition.ts |  | Orch, Canon | Claim text parsing and normalization |
| evidence-filter.ts | 330 | Orch | Probative value filtering, false positive rate calculation |
| quality-gates.ts | 410 | Orch | Gate 1 (claim validation) and Gate 4 (verdict confidence) |
| source-reliability.ts | 620 | Orch, Canon, Dyn | LLM-based source reliability evaluation with cache |
| verdict-corrections.ts | 310 | Orch | Post-hoc verdict direction mismatch corrections |
| truth-scale.ts | 280 | Canon, Dyn | Percentage-to-verdict label mapping |
| budgets.ts | 250 | Orch, Canon, Dyn | Token/cost budget tracking and enforcement |
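
The percentage-to-verdict label mapping that truth-scale.ts provides for the monolithic pipelines might look like the sketch below. The label names and thresholds here are placeholders, not the project's actual scale.

```typescript
// Illustrative percentage-to-verdict mapping in the spirit of truth-scale.ts.
// Labels and cutoffs are assumptions for the sketch only.

function verdictLabel(pct: number): string {
  if (pct < 0 || pct > 100) throw new RangeError("pct must be 0..100");
  if (pct >= 80) return "Likely true";
  if (pct >= 60) return "Leaning true";
  if (pct > 40) return "Uncertain";
  if (pct > 20) return "Leaning false";
  return "Likely false";
}
```

Centralizing this mapping in one shared module is what keeps the canonical and dynamic pipelines consistent when they render the same percentage.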

Orchestrated Pipeline Steps

  1. Understand - Detect input type, extract claims, identify dependencies
  2. Research (iterative) - Generate queries, fetch sources, extract evidence
  3. Verdict Generation - Generate claim and article verdicts
  4. Summary - Build two-panel summary
  5. Report - Generate markdown report

Detailed Pipeline Diagrams

For internal implementation details of each pipeline variant:

Federated Architecture

Each FactHarbor node:

  • Has its own dataset (claims, scenarios, evidence, verdicts)  
  • Runs its own AKEL  
  • Maintains local governance and reviewer rules  
  • May partially mirror global or domain-specific data  
  • Contributes to global knowledge clusters  

Nodes synchronize via:

  • Signed version bundles  
  • Merkle-tree lineage structures  
  • Optionally IPFS for evidence  
  • Trust-weighted acceptance  

Benefits:

  • Community independence  
  • Scalability  
  • Resilience  
  • Domain specialization  
Warning

Not Implemented (v2.10.2) — Federation is planned for V2.0+. Current implementation is single-instance only.

Federation Architecture (Future)


graph LR
    FH1[FactHarbor Instance 1]
    FH2[FactHarbor Instance 2]
    FH3[FactHarbor Instance 3]
    FH1 -.->|V1.0+ Sync claims| FH2
    FH2 -.->|V1.0+ Sync claims| FH3
    FH3 -.->|V1.0+ Sync claims| FH1
    U1[Users] --> FH1
    U2[Users] --> FH2
    U3[Users] --> FH3

Federation Architecture - Future (V1.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.

Target Features

| Feature | Purpose | Status |
|---|---|---|
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |

Current Implementation

  • Single-instance deployment only
  • No inter-instance communication
  • All data stored locally in SQLite

Request → Verdict Flow

Simple end-to-end flow:

User → UI Frontend → REST API → FactHarbor Core
      → (Claim Processing → Scenario Engine → Evidence Repository → Verdict Engine)
      → Summary View → UI Frontend → User

Federation Sync Workflow

Sequence:

Detect Local Change → Build Signed Bundle → Push to Peers → Validate Signature → Merge or Fork → Trigger Re-evaluation
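
The Build Signed Bundle and Validate Signature steps can be sketched with Ed25519 from Node's `crypto` module. The bundle shape and key handling are illustrative; the real protocol (Merkle lineage, trust-weighted acceptance) is richer.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Sketch of signing and validating a sync bundle. Payload shape and key
// management are assumptions; only the sign/verify mechanics are real.

const { publicKey, privateKey } = generateKeyPairSync("ed25519");

function buildSignedBundle(payload: object): { payload: string; signature: string } {
  const body = JSON.stringify(payload);
  // Ed25519 uses a null digest algorithm in Node's one-shot sign API.
  const signature = sign(null, Buffer.from(body), privateKey).toString("base64");
  return { payload: body, signature };
}

function validateBundle(bundle: { payload: string; signature: string }): boolean {
  return verify(
    null,
    Buffer.from(bundle.payload),
    publicKey,
    Buffer.from(bundle.signature, "base64"),
  );
}
```

A peer that fails validation would reject the bundle before the Merge or Fork step, so tampered or mis-signed updates never reach re-evaluation.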

Versioning Architecture

All entities (Claim, Scenario, Evidence, Verdict) use immutable version chains:

  • VersionID  
  • ParentVersionID  
  • Timestamp  
  • AuthorType (Human, AI, ExternalNode)  
  • ChangeReason  
  • Signature (optional POC, required in 1.0)  
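
The version-chain fields above can be sketched as a TypeScript type plus an append helper. Field names follow the list; the ID scheme and helper are illustrative assumptions.

```typescript
// Sketch of an immutable version chain using the fields listed above.
// versionId generation is illustrative only.

type AuthorType = "Human" | "AI" | "ExternalNode";

interface Version {
  versionId: string;
  parentVersionId: string | null; // null only for the chain root
  timestamp: string;              // ISO-8601
  authorType: AuthorType;
  changeReason: string;
  signature?: string;             // optional in POC, required in 1.0
}

function appendVersion(
  chain: Version[],
  authorType: AuthorType,
  changeReason: string,
): Version[] {
  const parent = chain.at(-1) ?? null;
  const next: Version = {
    versionId: `v${chain.length + 1}`,
    parentVersionId: parent ? parent.versionId : null,
    timestamp: new Date().toISOString(),
    authorType,
    changeReason,
  };
  return [...chain, next]; // immutable: existing versions are never mutated
}
```

Because each version points at its parent and nothing is edited in place, lineage can always be replayed from the root, which is what makes the chains auditable.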

UCM Configuration Versioning Architecture


graph LR
    ADMIN[UCM Administrator] -->|creates| BLOB[Config Blob - immutable]
    BLOB -->|content-addressed| STORE[(config_blobs)]
    ADMIN -->|activates| ACTIVE[config_active]
    ACTIVE -->|points to| BLOB
    JOB[Analysis Job] -->|snapshots at start| USAGE[config_usage]
    USAGE -->|references| BLOB
    REPORT[Analysis Report] -->|cites| USAGE

How UCM Config Versioning Works

| Concept | Description |
|---|---|
| config_blobs | Immutable, content-addressed config versions. Each change creates a new blob; old blobs are never deleted. |
| config_active | Pointer to the currently active config blob per config type. Changing this activates a new config version. |
| config_usage | Links each analysis job to the exact config snapshot used. Enables reproducibility. |
| Immutability | Analysis outputs are never edited. To improve results, update UCM config and re-analyze. |
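
A toy model of the three tables (config_blobs, config_active, config_usage) using in-memory Maps shows how a job snapshot pins the exact config used. Table and field names mirror the text; everything else is illustrative.

```typescript
import { createHash } from "node:crypto";

// Toy in-memory model of the UCM versioning tables described above.
// Real storage is SQLite (config.db); this only demonstrates the flow.

const configBlobs = new Map<string, string>();  // hash -> blob JSON (immutable)
const configActive = new Map<string, string>(); // configType -> active blob hash
const configUsage = new Map<string, string>();  // jobId -> snapshotted blob hash

function storeAndActivate(configType: string, config: object): string {
  const body = JSON.stringify(config);
  const hash = createHash("sha256").update(body).digest("hex");
  configBlobs.set(hash, body);        // old blobs are never deleted
  configActive.set(configType, hash); // pointer moves to the new blob
  return hash;
}

function snapshotForJob(jobId: string, configType: string): void {
  const hash = configActive.get(configType);
  if (!hash) throw new Error(`no active config for ${configType}`);
  configUsage.set(jobId, hash);       // job is pinned to this exact blob
}

function reproduceConfig(jobId: string): object {
  const hash = configUsage.get(jobId);
  const body = hash ? configBlobs.get(hash) : undefined;
  if (!body) throw new Error(`no snapshot for job ${jobId}`);
  return JSON.parse(body);
}
```

Even after the active pointer moves to a newer blob, `reproduceConfig` still returns the blob a past job actually ran with, which is the reproducibility guarantee the table describes.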

Current Implementation (v2.10.2)

| Feature | Status |
|---|---|
| UCM config storage | Implemented (config.db SQLite) |
| Config hot-reload | Implemented (60s TTL) |
| Per-job config snapshots | Implemented (job_config_snapshots) |
| Content-addressed blobs | Implemented (hash-based deduplication) |
| Config activation tracking | Implemented (config_active table) |
| Admin UI for config management | Not yet implemented (CLI/direct DB) |

Design Principles

  • Every config change creates a new immutable blob — no in-place mutation
  • Every analysis job records the config snapshot used at time of execution
  • Reports can be reproduced by re-running with the same config snapshot
  • Config history is the audit trail — who changed what, when, and why
  • Analysis data is never edited — "improve the system, not the data"