Architecture

Last modified by Robert Schaub on 2026/02/08 21:34

FactHarbor's architecture is designed for simplicity, automation, and continuous improvement.

1. Core Principles

  • AI-First: AKEL (AI) is the primary system, humans supplement
  • Publish by Default: No centralized approval (removed in V0.9.50), publish with confidence scores
  • System Over Data: Fix algorithms, not individual outputs
  • Measure Everything: Quality metrics drive improvements
  • Scale Through Automation: Minimal human intervention
  • Start Simple: Add complexity only when metrics prove necessary

2. High-Level Architecture

Information

Current Implementation (v2.6.33) - Two-service architecture: Next.js web app for UI and analysis, .NET API for job persistence.

High-Level Architecture


graph TB
    subgraph Client[Client Layer]
        BROWSER[Web Browser]
    end

    subgraph NextJS[Next.js Web App]
        ANALYZE[analyze page]
        JOBS[jobs page]
        JOBVIEW[jobs id page]
        ANALYZE_API[api fh analyze]
        JOBS_API[api fh jobs]
        RUN_JOB[api internal run-job]
        ORCH[orchestrated.ts]
        CANON[monolithic-canonical.ts]
        SHARED[Shared Modules]
        WEBSEARCH[web-search.ts]
        MBFC[source-reliability.ts]
    end

    subgraph DotNet[.NET API]
        DOTNET_API[ASP.NET Core API]
        JOBS_CTRL[JobsController]
        HEALTH_CTRL[HealthController]
        SQLITE[(SQLite factharbor.db)]
    end

    subgraph External[External Services]
        LLM[LLM Providers]
        SEARCH[Search Providers]
    end

    BROWSER --> ANALYZE
    BROWSER --> JOBS
    BROWSER --> JOBVIEW
    ANALYZE --> ANALYZE_API
    ANALYZE_API --> DOTNET_API
    DOTNET_API --> SQLITE
    RUN_JOB --> ORCH
    RUN_JOB --> CANON
    ORCH --> SHARED
    CANON --> SHARED
    ORCH --> LLM
    CANON --> LLM
    ORCH --> WEBSEARCH
    WEBSEARCH --> SEARCH
    SHARED --> MBFC

Component Summary

| Component | Technology | Purpose |
| --- | --- | --- |
| Web App | Next.js 14 | UI, API routes, AKEL pipeline |
| API | ASP.NET Core | Job persistence, health checks |
| Database | SQLite | Jobs, events, results (JSON blob) |
| LLM | AI SDK | Multi-provider LLM abstraction |
| Search | Google CSE / SerpAPI | Web search for evidence |

Key Files

| File | Lines | Purpose |
| --- | --- | --- |
| orchestrated.ts | 9000 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1100 | Monolithic canonical pipeline |
| scopes.ts | 600 | Scope detection |
| aggregation.ts | 300 | Verdict aggregation |

Environment Variables

| Variable | Default | Purpose |
| --- | --- | --- |
| LLM_PROVIDER | anthropic | Primary LLM provider |
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
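As an illustration, these variables could be read into a typed config object at startup. The variable names and defaults come from the table above; the shape of the config object is an assumption, not the actual implementation:

```typescript
// Sketch only: maps the documented environment variables to a typed config.
// Names and defaults are from the table; FhConfig's shape is assumed.
type FhConfig = {
  llmProvider: string;     // LLM_PROVIDER, default "anthropic"
  searchEnabled: boolean;  // FH_SEARCH_ENABLED, default true
  deterministic: boolean;  // FH_DETERMINISTIC, default true (zero temperature)
};

function loadConfig(env: Record<string, string | undefined>): FhConfig {
  return {
    llmProvider: env.LLM_PROVIDER ?? "anthropic",
    searchEnabled: (env.FH_SEARCH_ENABLED ?? "true") === "true",
    deterministic: (env.FH_DETERMINISTIC ?? "true") === "true",
  };
}
```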

2.1 Three-Layer Architecture

FactHarbor uses a clean three-layer architecture:

Interface Layer

Handles all user and system interactions:

  • Web UI: Browse claims, view evidence, submit feedback
  • REST API: Programmatic access for integrations
  • Authentication & Authorization: User identity and permissions
  • Rate Limiting: Protect against abuse

Processing Layer

Core business logic and AI processing:

  • AKEL Pipeline: AI-driven claim analysis (parallel processing)
    ◦ Parse and extract claim components
    ◦ Gather evidence from multiple sources
    ◦ Check source track records
    ◦ Extract scenarios from evidence
    ◦ Synthesize verdicts
    ◦ Calculate risk scores
  • LLM Abstraction Layer: Provider-agnostic AI access
    ◦ Multi-provider support (Anthropic, OpenAI, Google, local models)
    ◦ Automatic failover and rate limit handling
    ◦ Per-stage model configuration
    ◦ Cost optimization through provider selection
    ◦ No vendor lock-in
  • Background Jobs: Automated maintenance tasks
    ◦ Source track record updates (weekly)
    ◦ Cache warming and invalidation
    ◦ Metrics aggregation
    ◦ Data archival
  • Quality Monitoring: Automated quality checks
    ◦ Anomaly detection
    ◦ Contradiction detection
    ◦ Completeness validation
  • Moderation Detection: Automated abuse detection
    ◦ Spam identification
    ◦ Manipulation detection
    ◦ Flag suspicious activity

Data & Storage Layer

Persistent data storage and caching:

  • PostgreSQL: Primary database for all core data
    ◦ Claims, evidence, sources, users
    ◦ Scenarios, edits, audit logs
    ◦ Built-in full-text search
    ◦ Time-series capabilities for metrics
  • Redis: High-speed caching layer
    ◦ Session data
    ◦ Frequently accessed claims
    ◦ API rate limiting
  • S3 Storage: Long-term archival
    ◦ Old edit history (90+ days)
    ◦ AKEL processing logs
    ◦ Backup snapshots

Optional future additions (add only when metrics prove necessary):

  • Elasticsearch: If PostgreSQL full-text search becomes slow
  • TimescaleDB: If metrics queries become a bottleneck

2.2 LLM Abstraction Layer

LLM Abstraction Architecture

graph LR
    subgraph AKEL[AKEL Pipeline]
        S1[Stage 1 Extract Claims]
        S2[Stage 2 Analyze Claims]
        S3[Stage 3 Holistic Assessment]
    end

    subgraph LLM[LLM Abstraction Layer]
        INT[Provider Interface]
        CFG[Configuration Registry]
        FAIL[Failover Handler]
    end

    subgraph Providers[LLM Providers]
        ANT[Anthropic Claude API PRIMARY]
        OAI[OpenAI GPT API SECONDARY]
        GOO[Google Gemini API TERTIARY]
        LOC[Local Models Llama/Mistral FUTURE]
    end

    S1 --> INT
    S2 --> INT
    S3 --> INT
    INT --> CFG
    INT --> FAIL
    CFG --> ANT
    FAIL --> ANT
    FAIL --> OAI
    FAIL --> GOO
    ANT -.fallback.-> OAI
    OAI -.fallback.-> GOO

    style AKEL fill:#ffcccc
    style LLM fill:#ccffcc
    style Providers fill:#e1f5ff
    style ANT fill:#ff9999
    style OAI fill:#99ccff
    style GOO fill:#99ff99
    style LOC fill:#cccccc

LLM Abstraction Architecture - AKEL stages call through provider interface. Configuration registry selects provider per stage. Failover handler implements automatic fallback chain.

POC1 Implementation:

  • PRIMARY: Anthropic Claude API (Haiku for Stage 1, Sonnet for Stages 2 & 3)
  • Failover: Basic error handling with cache fallback

Future (POC2/Beta):

  • SECONDARY: OpenAI GPT API (automatic failover)
  • TERTIARY: Google Gemini API (tertiary fallback)
  • FUTURE: Local models (Llama/Mistral for on-premises deployments)

Architecture Benefits:

  • Prevents vendor lock-in
  • Ensures resilience through automatic failover
  • Enables cost optimization per stage
  • Supports regulatory compliance (provider selection for data residency)

Description: Shows how AKEL stages interact with multiple LLM providers through an abstraction layer. POC1 uses Anthropic Claude as primary provider (Haiku 4.5 for extraction, Sonnet 4.5 for analysis). OpenAI, Google, and local models are shown as future expansion options (POC2/Beta).

Purpose: FactHarbor uses a provider-agnostic abstraction layer for all AI interactions, avoiding vendor lock-in and enabling flexible provider selection.

Multi-Provider Support:

  • Primary: Anthropic Claude API (Haiku for extraction, Sonnet for analysis)
  • Secondary: OpenAI GPT API (automatic failover)
  • Tertiary: Google Vertex AI / Gemini
  • Future: Local models (Llama, Mistral) for on-premises deployments

Provider Interface:

  • Abstract `LLMProvider` interface with `complete()`, `stream()`, `getName()`, `getCostPer1kTokens()`, `isAvailable()` methods
  • Per-stage model configuration (Stage 1: Haiku, Stage 2 & 3: Sonnet)
  • Environment variable and database configuration
  • Adapter pattern implementation (AnthropicProvider, OpenAIProvider, GoogleProvider)
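The interface described above can be sketched in TypeScript. The method names (`complete()`, `stream()`, `getName()`, `getCostPer1kTokens()`, `isAvailable()`) come from the text; the argument shapes and model labels are illustrative assumptions, not the actual definitions:

```typescript
// Sketch of the abstract provider interface; argument shapes are assumptions.
interface LLMProvider {
  complete(prompt: string, model?: string): Promise<string>;
  stream(prompt: string, model?: string): AsyncIterable<string>;
  getName(): string;
  getCostPer1kTokens(): number;
  isAvailable(): Promise<boolean>;
}

// Per-stage model configuration (Stage 1: Haiku; Stages 2 & 3: Sonnet).
// Keys and model labels are illustrative placeholders.
const stageModels: Record<"extract" | "analyze" | "holistic", string> = {
  extract: "haiku",
  analyze: "sonnet",
  holistic: "sonnet",
};
```

Concrete adapters (AnthropicProvider, OpenAIProvider, GoogleProvider) would each implement this interface, which is what allows runtime provider switching.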

Configuration:

  • Runtime provider switching without code changes
  • Admin API for provider management (`POST /admin/v1/llm/configure`)
  • Per-stage cost optimization (use cheaper models for extraction, quality models for analysis)
  • Support for rate limit handling and cost tracking

Failover Strategy:

  • Automatic fallback: Primary → Secondary → Tertiary
  • Circuit breaker pattern for unavailable providers
  • Health checking and provider availability monitoring
  • Graceful degradation when all providers unavailable
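A minimal sketch of the Primary → Secondary → Tertiary fallback chain, assuming a simple provider shape; a real circuit breaker would additionally track failure counts and cooldowns per provider:

```typescript
// Failover sketch: try each provider in chain order, skipping unavailable
// ones and falling through on errors. Provider shape is an assumption.
type Provider = {
  name: string;
  isAvailable: () => Promise<boolean>;
  complete: (prompt: string) => Promise<string>;
};

async function completeWithFailover(
  chain: Provider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  for (const p of chain) {
    if (!(await p.isAvailable())) continue; // skip tripped providers
    try {
      return { provider: p.name, text: await p.complete(prompt) };
    } catch {
      // fall through to the next provider in the chain
    }
  }
  throw new Error("all providers unavailable"); // graceful-degradation point
}
```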

Cost Optimization:

  • Track and compare costs across providers per request
  • Enable A/B testing of different models for quality/cost tradeoffs
  • Per-stage provider selection for optimal cost-efficiency
  • Cost comparison: Anthropic ($0.114), OpenAI ($0.065), Google ($0.072) per article at 0% cache

Architecture Pattern:

AKEL Stages          LLM Abstraction       Providers
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Stage 1 Extract  ──→ Provider Interface ──→ Anthropic (PRIMARY)
Stage 2 Analyze  ──→ Configuration      ──→ OpenAI (SECONDARY)
Stage 3 Holistic ──→ Failover Handler   ──→ Google (TERTIARY)
                                        └→ Local Models (FUTURE)

Benefits:

  • No Vendor Lock-In: Switch providers based on cost, quality, or availability without code changes
  • Resilience: Automatic failover ensures service continuity during provider outages
  • Cost Efficiency: Use optimal provider per task (cheap for extraction, quality for analysis)
  • Quality Assurance: Cross-provider output verification for critical claims
  • Regulatory Compliance: Use specific providers for data residency requirements
  • Future-Proofing: Easy integration of new models as they become available

2.3 Design Philosophy

Start Simple, Evolve Based on Metrics

The architecture deliberately starts simple:

  • Single primary database (PostgreSQL handles most workloads initially)
  • Three clear layers (easy to understand and maintain)
  • Automated operations (minimal human intervention)
  • Measure before optimizing (add complexity only when proven necessary)

See Design Decisions and When to Add Complexity for detailed rationale.

3. AKEL Architecture

Information

Current Implementation (v2.6.33) - Triple-Path Pipeline Architecture. Three pipeline variants share common modules for scope detection, aggregation, and claim processing.

Triple-Path Pipeline Architecture


graph TB
    subgraph Input[User Input]
        URL[URL Input]
        TEXT[Text Input]
    end

    subgraph Shared[Shared Modules]
        SCOPES[scopes.ts Scope Detection]
        AGG[aggregation.ts Verdict Aggregation]
        CLAIM_D[claim-decomposition.ts]
    end

    subgraph Dispatch[Pipeline Dispatch]
        SELECT{Select Pipeline}
    end

    subgraph Pipelines[Pipeline Implementations]
        ORCH[Orchestrated Pipeline]
        CANON[Monolithic Canonical]
        DYN[Monolithic Dynamic]
    end

    subgraph LLM[LLM Layer]
        PROVIDER[AI SDK Provider]
    end

    subgraph Output[Result]
        RESULT[AnalysisResult JSON]
        REPORT[Markdown Report]
    end

    URL --> SELECT
    TEXT --> SELECT
    SELECT -->|orchestrated| ORCH
    SELECT -->|monolithic_canonical| CANON
    SELECT -->|monolithic_dynamic| DYN
    SCOPES --> ORCH
    SCOPES --> CANON
    AGG --> ORCH
    AGG --> CANON
    CLAIM_D --> ORCH
    CLAIM_D --> CANON
    ORCH --> PROVIDER
    CANON --> PROVIDER
    DYN --> PROVIDER
    ORCH --> RESULT
    CANON --> RESULT
    DYN --> RESULT
    RESULT --> REPORT

Pipeline Variants

| Variant | File | Approach | Output Schema |
| --- | --- | --- | --- |
| Orchestrated | orchestrated.ts | Multi-step workflow with explicit stages | Canonical (structured) |
| Monolithic Canonical | monolithic-canonical.ts | Single LLM tool-loop call | Canonical (structured) |
| Monolithic Dynamic | monolithic-dynamic.ts | Single LLM tool-loop call | Dynamic (flexible) |
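The dispatch step in the diagram (Select Pipeline) amounts to routing by variant name. A sketch, where the three entry points are placeholder stand-ins for the real functions in orchestrated.ts and the two monolithic files:

```typescript
// Pipeline dispatch sketch; the entry points are placeholders, not the
// actual exports of the pipeline modules.
type Variant = "orchestrated" | "monolithic_canonical" | "monolithic_dynamic";

type PipelineFn = (input: string) => { pipeline: Variant; input: string };

const pipelines: Record<Variant, PipelineFn> = {
  orchestrated: (input) => ({ pipeline: "orchestrated", input }),
  monolithic_canonical: (input) => ({ pipeline: "monolithic_canonical", input }),
  monolithic_dynamic: (input) => ({ pipeline: "monolithic_dynamic", input }),
};

function selectPipeline(variant: Variant): PipelineFn {
  return pipelines[variant];
}
```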

Shared Modules

| Module | Purpose |
| --- | --- |
| scopes.ts | Heuristic scope pre-detection before LLM |
| aggregation.ts | Verdict weighting, contestation validation |
| claim-decomposition.ts | Claim text parsing and normalization |

Orchestrated Pipeline Steps

  1. Understand - Detect input type, extract claims, identify dependencies
  2. Research (iterative) - Generate queries, fetch sources, extract facts
  3. Verdict Generation - Generate claim and article verdicts
  4. Summary - Build two-panel summary
  5. Report - Generate markdown report
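The five steps above form a linear async flow. A sketch, where every step function is a placeholder; the real implementations live in orchestrated.ts:

```typescript
// Orchestrated pipeline as a linear async flow. All step functions here are
// trivial placeholders standing in for the real stages.
const understand = async (input: string) => ({ claims: [input] });
const research = async (u: { claims: string[] }) => ({ ...u, facts: ["fact"] });
const generateVerdicts = async (r: { claims: string[]; facts: string[] }) =>
  r.claims.map((c) => ({ claim: c, verdict: "unverified" }));
const buildSummary = (v: { claim: string; verdict: string }[]) =>
  `${v.length} claim(s) analyzed`;

async function runOrchestrated(input: string): Promise<string> {
  const understood = await understand(input);          // 1. Understand
  const researched = await research(understood);       // 2. Research (iterative)
  const verdicts = await generateVerdicts(researched); // 3. Verdict generation
  const summary = buildSummary(verdicts);              // 4. Summary
  return `# Report\n\n${summary}`;                     // 5. Markdown report
}
```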

Detailed Pipeline Diagrams

For internal implementation details of each pipeline variant:


See AI Knowledge Extraction Layer (AKEL) for detailed information.

3.5 Claim Processing Architecture

FactHarbor's claim processing architecture is designed to handle both single-claim and multi-claim submissions efficiently.

Multi-Claim Handling

Users often submit:

  • Text with multiple claims: Articles, statements, or paragraphs containing several distinct factual claims
  • Web pages: URLs that are analyzed to extract all verifiable claims
  • Single claims: Simple, direct factual statements

The first processing step is always Claim Extraction: identifying and isolating individual verifiable claims from submitted content.

Processing Phases

POC Implementation (Two-Phase):

Phase 1 - Claim Extraction:

  • LLM analyzes submitted content
  • Extracts all distinct, verifiable claims
  • Returns structured list of claims with context

Phase 2 - Parallel Analysis:

  • Each claim processed independently by LLM
  • Single call per claim generates: Evidence, Scenarios, Sources, Verdict, Risk
  • Parallelized across all claims
  • Results aggregated for presentation
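Phase 2's fan-out can be sketched with `Promise.all`: each claim is analyzed independently and in parallel, then the results are collected. `analyzeClaim` is a stand-in for the real single LLM call per claim:

```typescript
// Phase 2 sketch: parallel, independent analysis per extracted claim.
type ClaimResult = { claim: string; verdict: string };

async function analyzeClaim(claim: string): Promise<ClaimResult> {
  // Real version: one LLM call returning evidence, scenarios, sources,
  // verdict, and risk for this claim.
  return { claim, verdict: "unverified" };
}

async function analyzeAllClaims(claims: string[]): Promise<ClaimResult[]> {
  return Promise.all(claims.map(analyzeClaim)); // parallel across claims
}
```

Because each claim runs independently, a failure in one claim's analysis can be isolated without aborting the batch.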

Production Implementation (Three-Phase):

Phase 1 - Extraction + Validation:

  • Extract claims from content
  • Validate clarity and uniqueness
  • Filter vague or duplicate claims

Phase 2 - Evidence Gathering (Parallel):

  • Independent evidence gathering per claim
  • Source validation and scenario generation
  • Quality gates prevent poor data from advancing

Phase 3 - Verdict Generation (Parallel):

  • Generate verdict from validated evidence
  • Confidence scoring and risk assessment
  • Low-confidence cases routed to human review

Architectural Benefits

Scalability:

  • Process 100 claims with roughly 3x the latency of a single claim
  • Parallel processing across independent claims
  • Linear cost scaling with claim count

Quality:

  • Validation gates between phases
  • Errors isolated to individual claims
  • Clear observability per processing step

Flexibility:

  • Each phase optimizable independently
  • Can use different model sizes per phase
  • Easy to add human review at decision points

4. Storage Architecture

Storage Architecture

Current Implementation (v2.6.33)


graph TB
    subgraph Current[Current Implementation]
        APP[Application Next.js and .NET API]
        SQLITE[(SQLite factharbor.db)]
        FILES[Local Files MBFC bundle]
    end

    APP --> SQLITE
    APP --> FILES

Current Storage Model:

  • SQLite - Single file database (factharbor.db)
  • JSON Blobs - Analysis results stored in ResultJson column
  • No Caching - No Redis or in-memory cache
  • Local Files - MBFC bundle, configuration

| Table | Purpose | Key Fields |
| --- | --- | --- |
| Jobs | Job metadata and results | JobId, Status, Progress, ResultJson, ReportMarkdown |
| JobEvents | Job execution log | JobId, TsUtc, Level, Message |

Target Architecture (Future)


graph TB
    subgraph Target[Target Architecture Future]
        APP2[Application API and AKEL]
        REDIS[(Redis Cache)]
        PG[(PostgreSQL Primary Database)]
        S3[(S3 Storage Archives)]
    end

    APP2 --> REDIS
    REDIS --> PG
    APP2 --> PG
    PG --> S3

Target Features (Not Yet Implemented):

  • Redis Caching - Hot data, sessions, rate limiting
  • PostgreSQL - Normalized tables for Claims, Evidence, Sources
  • S3 Storage - Archives, backups, old edit history

Migration Path

  1. Current to Phase 1: Add Redis for caching (optional)
  2. Phase 1 to Phase 2: Migrate to PostgreSQL for normalized data
  3. Phase 2 to Phase 3: Add S3 for archives and backups

See Storage Strategy for detailed information.

4.5 Versioning Architecture

Warning

Not Implemented (v2.6.33) - Entity versioning is not yet implemented. Current implementation stores analysis results as immutable JSON blobs. This diagram shows the target architecture.

Target Versioning Architecture


graph LR
    CLAIM[Claim] -->|edited| EDIT[Edit Record]
    EDIT -->|stores| BEFORE[Before State]
    EDIT -->|stores| AFTER[After State]
    EDIT -->|tracks| WHO[Who Changed]
    EDIT -->|tracks| WHEN[When Changed]
    EDIT -->|tracks| WHY[Why Changed]
    EDIT -->|if needed| RESTORE[Manual Restore]
    RESTORE -->|create new| CLAIM

Current vs Target

| Feature | Current (v2.6.33) | Target |
| --- | --- | --- |
| Edit tracking | No | Yes, via EDIT table |
| Before/after states | No | Yes, JSON storage |
| User attribution | No (anonymous) | Yes, with user system |
| Restore capability | No | Yes, create new edit |

Target Details

V1.0 Target: Simple edit history sufficient for accountability and basic rollback.

  • Track who, what, when, why for each change
  • Store before/after values in edits table
  • Manual restore if needed (create new edit with old values)
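The target edit record can be sketched as a type, with restore implemented as "append a new edit whose after-state is the old before-state" so history is never rewritten. Field names here are illustrative, not a final schema:

```typescript
// Target edit-record shape from the diagram above (who/what/when/why plus
// before/after states). Field names are illustrative.
type EditRecord<T> = {
  entityId: string;
  before: T;    // state prior to the change
  after: T;     // state after the change
  who: string;
  when: string; // ISO timestamp
  why: string;
};

// Manual restore: create a NEW edit that swaps before/after, rather than
// mutating or deleting existing history.
function restoreEdit<T>(edit: EditRecord<T>, who: string, why: string): EditRecord<T> {
  return {
    entityId: edit.entityId,
    before: edit.after,
    after: edit.before,
    who,
    when: new Date().toISOString(),
    why,
  };
}
```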

V2.0+ Future: Add complex versioning if users request:

  • Version history browsing
  • Restore previous version
  • Diff between versions

5. Automated Systems in Detail

FactHarbor relies heavily on automation to achieve scale and quality. Here's how each automated system works:

5.1 AKEL (AI Knowledge Evaluation Layer)

What it does: Primary AI processing engine that analyzes claims automatically

Inputs:

  • User-submitted claim text
  • Existing evidence and sources
  • Source track record database

Processing steps:

  1. Parse & Extract: Identify key components, entities, assertions
  2. Gather Evidence: Search web and database for relevant sources
  3. Check Sources: Evaluate source reliability using track records
  4. Extract Scenarios: Identify different contexts from evidence
  5. Synthesize Verdict: Compile evidence assessment per scenario
  6. Calculate Risk: Assess potential harm and controversy

Outputs:

  • Structured claim record
  • Evidence links with relevance scores
  • Scenarios with context descriptions
  • Verdict summary per scenario
  • Overall confidence score
  • Risk assessment

Timing: 10-18 seconds total (parallel processing)

5.2 Background Jobs

Source Track Record Updates (Weekly):

  • Analyze claim outcomes from past week
  • Calculate source accuracy and reliability
  • Update source_track_record table
  • Never triggered by individual claims (prevents circular dependencies)

Cache Management (Continuous):

  • Warm cache for popular claims
  • Invalidate cache on claim updates
  • Monitor cache hit rates

Metrics Aggregation (Hourly):

  • Roll up detailed metrics
  • Calculate system health indicators
  • Generate performance reports

Data Archival (Daily):

  • Move old AKEL logs to S3 (90+ days)
  • Archive old edit history
  • Compress and backup data

5.3 Quality Monitoring

Automated checks run continuously:

  • Anomaly Detection: Flag unusual patterns
    ◦ Sudden confidence score changes
    ◦ Unusual evidence distributions
    ◦ Suspicious source patterns
  • Contradiction Detection: Identify conflicts
    ◦ Evidence that contradicts other evidence
    ◦ Claims with internal contradictions
    ◦ Source track record anomalies
  • Completeness Validation: Ensure thoroughness
    ◦ Sufficient evidence gathered
    ◦ Multiple source types represented
    ◦ Key scenarios identified

5.4 Moderation Detection

Automated abuse detection:

  • Spam Identification: Pattern matching for spam claims
  • Manipulation Detection: Identify coordinated editing
  • Gaming Detection: Flag attempts to game source scores
  • Suspicious Activity: Log unusual behavior patterns

Human Review: Moderators review flagged items, system learns from decisions

6. Scalability Strategy

6.1 Horizontal Scaling

Components scale independently:

  • AKEL Workers: Add more processing workers as claim volume grows
  • Database Read Replicas: Add replicas for read-heavy workloads
  • Cache Layer: Redis cluster for distributed caching
  • API Servers: Load-balanced API instances

6.2 Vertical Scaling

Individual components can be upgraded:

  • Database Server: Increase CPU/RAM for PostgreSQL
  • Cache Memory: Expand Redis memory
  • Worker Resources: More powerful AKEL worker machines

6.3 Performance Optimization

Built-in optimizations:

  • Denormalized Data: Cache summary data in claim records (70% fewer joins)
  • Parallel Processing: AKEL pipeline processes in parallel (40% faster)
  • Intelligent Caching: Redis caches frequently accessed data
  • Background Processing: Non-urgent tasks run asynchronously

7. Monitoring & Observability

7.1 Key Metrics

System tracks:

  • Performance: AKEL processing time, API response time, cache hit rate
  • Quality: Confidence score distribution, evidence completeness, contradiction rate
  • Usage: Claims per day, active users, API requests
  • Errors: Failed AKEL runs, API errors, database issues

7.2 Alerts

Automated alerts for:

  • Processing time >30 seconds (threshold breach)
  • Error rate >1% (quality issue)
  • Cache hit rate <80% (cache problem)
  • Database connections >80% capacity (scaling needed)
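These rules can be expressed as simple threshold checks. The thresholds below are the numbers from the list; the metric names and function are illustrative assumptions:

```typescript
// Threshold checks mirroring the alert rules above; metric names assumed.
type Metrics = {
  processingSeconds: number;
  errorRate: number;    // 0..1
  cacheHitRate: number; // 0..1
  dbConnUsage: number;  // 0..1 fraction of pool capacity
};

function activeAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.processingSeconds > 30) alerts.push("processing-time"); // threshold breach
  if (m.errorRate > 0.01) alerts.push("error-rate");            // quality issue
  if (m.cacheHitRate < 0.8) alerts.push("cache-hit-rate");      // cache problem
  if (m.dbConnUsage > 0.8) alerts.push("db-connections");       // scaling needed
  return alerts;
}
```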

7.3 Dashboards

Real-time monitoring:

  • System Health: Overall status and key metrics
  • AKEL Performance: Processing time breakdown
  • Quality Metrics: Confidence scores, completeness
  • User Activity: Usage patterns, peak times

8. Security Architecture

8.1 Authentication & Authorization

  • User Authentication: Secure login with password hashing
  • Role-Based Access: Reader, Contributor, Moderator, Admin
  • API Keys: For programmatic access
  • Rate Limiting: Prevent abuse

8.2 Data Security

  • Encryption: TLS for transport, encrypted storage for sensitive data
  • Audit Logging: Track all significant changes
  • Input Validation: Sanitize all user inputs
  • SQL Injection Protection: Parameterized queries

8.3 Abuse Prevention

  • Rate Limiting: Prevent flooding and DDoS
  • Automated Detection: Flag suspicious patterns
  • Human Review: Moderators investigate flagged content
  • Ban Mechanisms: Block abusive users/IPs

9. Deployment Architecture

9.1 Production Environment

Components:

  • Load Balancer (HAProxy or cloud LB)
  • Multiple API servers (stateless)
  • AKEL worker pool (auto-scaling)
  • PostgreSQL primary + read replicas
  • Redis cluster
  • S3-compatible storage

Regions: Single region for V1.0, multi-region when needed

9.2 Development & Staging

Development: Local Docker Compose setup
Staging: Scaled-down production replica
CI/CD: Automated testing and deployment

9.3 Disaster Recovery

  • Database Backups: Daily automated backups to S3
  • Point-in-Time Recovery: Transaction log archival
  • Replication: Real-time replication to standby
  • Recovery Time Objective: <4 hours

9.5 Federation Architecture Diagram

Warning

Not Implemented (v2.6.33) - Federation is planned for V1.0+. Current implementation is single-instance only.

Federation Architecture (Future)


graph LR
    FH1[FactHarbor Instance 1]
    FH2[FactHarbor Instance 2]
    FH3[FactHarbor Instance 3]
    FH1 -.->|V1.0+ Sync claims| FH2
    FH2 -.->|V1.0+ Sync claims| FH3
    FH3 -.->|V1.0+ Sync claims| FH1
    U1[Users] --> FH1
    U2[Users] --> FH2
    U3[Users] --> FH3

Federation Architecture - Future (V1.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.

Target Features

| Feature | Purpose | Status |
| --- | --- | --- |
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |

Current Implementation

  • Single-instance deployment only
  • No inter-instance communication
  • All data stored locally in SQLite

10. Future Architecture Evolution

10.1 When to Add Complexity

See When to Add Complexity for specific triggers.

  • Elasticsearch: When PostgreSQL search consistently >500ms
  • TimescaleDB: When metrics queries consistently >1s
  • Federation: When 10,000+ users and explicit demand
  • Complex Reputation: When 100+ active contributors

10.2 Federation (V2.0+)

Deferred until:

  • Core product proven with 10,000+ users
  • User demand for decentralization
  • Single-node limits reached

See Federation & Decentralization for future plans.

11. Technology Stack Summary

Backend:

  • Python (FastAPI or Django)
  • PostgreSQL (primary database)
  • Redis (caching)

Frontend:

  • Modern JavaScript framework (React, Vue, or Svelte)
  • Server-side rendering for SEO

AI/LLM:

  • Multi-provider orchestration (Claude, GPT-4, local models)
  • Fallback and cross-checking support

Infrastructure:

  • Docker containers
  • Kubernetes or cloud platform auto-scaling
  • S3-compatible object storage

Monitoring:

  • Prometheus + Grafana
  • Structured logging (ELK or cloud logging)
  • Error tracking (Sentry)

12. Related Pages