Architecture

Version 2.1 by Robert Schaub on 2025/12/18 12:54

FactHarbor's architecture is designed for simplicity, automation, and continuous improvement.

1. Core Principles

  • AI-First: AKEL (AI) is the primary system; humans supplement it
  • Publish by Default: No centralized approval (removed in V0.9.50); content publishes with confidence scores
  • System Over Data: Fix algorithms, not individual outputs
  • Measure Everything: Quality metrics drive improvements
  • Scale Through Automation: Minimal human intervention
  • Start Simple: Add complexity only when metrics prove necessary

2. High-Level Architecture

Information

Current Implementation (v2.10.2) - Two-service architecture: Next.js web app for UI and analysis, .NET API for job persistence.

High-Level Architecture


graph TB
    subgraph Client[Client Layer]
        BROWSER[Web Browser]
    end

    subgraph NextJS[Next.js Web App]
        ANALYZE[analyze page]
        JOBS[jobs page]
        JOBVIEW[jobs id page]
        ANALYZE_API[api fh analyze]
        JOBS_API[api fh jobs]
        RUN_JOB[api internal run-job]
        ORCH[orchestrated.ts]
        CANON[monolithic-canonical.ts]
        SHARED[Shared Modules]
        WEBSEARCH[web-search.ts]
        SR[source-reliability.ts]
    end

    subgraph DotNet[.NET API]
        DOTNET_API[ASP.NET Core API]
        JOBS_CTRL[JobsController]
        HEALTH_CTRL[HealthController]
        SQLITE[(SQLite factharbor.db)]
    end

    subgraph External[External Services]
        LLM[LLM Providers]
        SEARCH[Search Providers]
    end

    BROWSER --> ANALYZE
    BROWSER --> JOBS
    BROWSER --> JOBVIEW
    ANALYZE --> ANALYZE_API
    ANALYZE_API --> DOTNET_API
    DOTNET_API --> SQLITE
    RUN_JOB --> ORCH
    RUN_JOB --> CANON
    ORCH --> SHARED
    CANON --> SHARED
    ORCH --> LLM
    CANON --> LLM
    ORCH --> WEBSEARCH
    WEBSEARCH --> SEARCH
    SHARED --> SR

Component Summary

| Component | Technology | Purpose |
|---|---|---|
| Web App | Next.js 14+ | UI, API routes, AKEL pipeline |
| API | ASP.NET Core 8.0 | Job persistence, health checks |
| Database | SQLite (3 databases) | Jobs/events, UCM config, SR cache |
| LLM | AI SDK (Vercel) | Multi-provider LLM abstraction with model tiering |
| Search | Google CSE / SerpAPI | Web search for evidence |

Key Files

| File | Lines | Purpose |
|---|---|---|
| orchestrated.ts | 13300 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1500 | Monolithic canonical pipeline |
| analysis-contexts.ts | 600 | AnalysisContext pre-detection |
| aggregation.ts | 400 | Verdict aggregation + claim weighting |
| evidence-filter.ts | 300 | Deterministic evidence quality filtering |
| source-reliability.ts | 500 | LLM-based source reliability scoring |

Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
| FH_API_URL | localhost:5139 | .NET API endpoint |
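As a sketch, these flags can be read once at startup. The variable names come from the table above; the helper function and the exact parsing rules are illustrative, not the actual implementation:

```typescript
// Illustrative sketch: read the documented flags with their defaults.
// readFlag treats any value other than "true" (case-insensitive) as false.
function readFlag(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  return raw === undefined ? fallback : raw.toLowerCase() === "true";
}

const config = {
  searchEnabled: readFlag("FH_SEARCH_ENABLED", true),        // web search on/off
  deterministic: readFlag("FH_DETERMINISTIC", true),         // zero temperature
  apiUrl: process.env.FH_API_URL ?? "http://localhost:5139", // .NET API endpoint
};
```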

2.1 Three-Layer Architecture

FactHarbor uses a clean three-layer architecture:

Interface Layer

Handles all user and system interactions:

  • Web UI: Browse claims, view evidence, submit feedback
  • REST API: Programmatic access for integrations
  • Authentication & Authorization: User identity and permissions
  • Rate Limiting: Protect against abuse

Processing Layer

Core business logic and AI processing:

  • AKEL Pipeline: AI-driven claim analysis (parallel processing)
      ◦ Parse and extract claim components
      ◦ Gather evidence from multiple sources
      ◦ Check source track records
      ◦ Extract scenarios from evidence
      ◦ Synthesize verdicts
      ◦ Calculate risk scores
  • Background Jobs: Automated maintenance tasks
      ◦ Source track record updates (weekly)
      ◦ Cache warming and invalidation
      ◦ Metrics aggregation
      ◦ Data archival
  • Quality Monitoring: Automated quality checks
      ◦ Anomaly detection
      ◦ Contradiction detection
      ◦ Completeness validation
  • Moderation Detection: Automated abuse detection
      ◦ Spam identification
      ◦ Manipulation detection
      ◦ Flag suspicious activity

Data & Storage Layer

Persistent data storage and caching:

  • PostgreSQL: Primary database for all core data
      ◦ Claims, evidence, sources, users
      ◦ Scenarios, edits, audit logs
      ◦ Built-in full-text search
      ◦ Time-series capabilities for metrics
  • Redis: High-speed caching layer
      ◦ Session data
      ◦ Frequently accessed claims
      ◦ API rate limiting
  • S3 Storage: Long-term archival
      ◦ Old edit history (90+ days)
      ◦ AKEL processing logs
      ◦ Backup snapshots

Optional future additions (add only when metrics prove necessary):

  • Elasticsearch: If PostgreSQL full-text search becomes slow
  • TimescaleDB: If metrics queries become a bottleneck

2.2 Design Philosophy

Start Simple, Evolve Based on Metrics

The architecture deliberately starts simple:

  • Single primary database (PostgreSQL handles most workloads initially)
  • Three clear layers (easy to understand and maintain)
  • Automated operations (minimal human intervention)
  • Measure before optimizing (add complexity only when proven necessary)

See Design Decisions and When to Add Complexity for detailed rationale.

3. AKEL Architecture


See AI Knowledge Extraction Layer (AKEL) for detailed information.

4. Storage Architecture

Storage Architecture

1. Current Implementation (v2.10.2)

1.1 Three-Database Architecture


graph TB
    subgraph NextJS[Next.js Web App]
        PIPELINE[Orchestrated Pipeline]
        CONFIG_SVC[Config Storage]
        SR_SVC[SR Cache]
    end

    subgraph DotNet[.NET API]
        CONTROLLERS[Controllers]
        EF[Entity Framework]
    end

    CONFIG_DB[(config.db)]
    SR_DB[(source-reliability.db)]
    FH_DB[(factharbor.db)]

    CONFIG_SVC --> CONFIG_DB
    SR_SVC --> SR_DB
    PIPELINE -->|via API| CONTROLLERS
    CONTROLLERS --> EF
    EF --> FH_DB

| Database | Purpose | Access Layer | Key Tables |
|---|---|---|---|
| factharbor.db | Jobs, events, analysis results | .NET API (Entity Framework) | Jobs, JobEvents, AnalysisMetrics |
| config.db | UCM configuration management | Next.js (better-sqlite3) | config_blobs, config_active, config_usage, job_config_snapshots |
| source-reliability.db | Source reliability cache | Next.js (better-sqlite3) | source_reliability |

1.2 Current Caching Mechanisms

| What | Mechanism | TTL | Status |
|---|---|---|---|
| Source reliability scores | SQLite + batch prefetch to in-memory Map | 90 days (configurable via UCM) | IMPLEMENTED |
| UCM config values | In-memory Map with TTL-based expiry | 60 seconds | IMPLEMENTED |
| URL content (fetched pages) | Not cached | N/A | NOT IMPLEMENTED |
| Claim-level analysis results | Not cached | N/A | NOT IMPLEMENTED |
| LLM responses | Not cached | N/A | NOT IMPLEMENTED |
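The 60-second UCM config cache follows a simple TTL-on-read pattern. A minimal sketch (class and field names are illustrative, not the actual module):

```typescript
// Minimal TTL cache: entries expire ttlMs after being set.
// An injectable clock makes the expiry behavior testable.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// 60 s TTL, matching the documented UCM config cache.
const configCache = new TtlCache<string>(60_000);
```

On a miss, the caller re-reads the value from config.db and re-populates the cache, which is what makes the hot-reload behavior possible without restarts.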

1.3 Storage Patterns

  • Analysis results: JSON blob in ResultJson column (per job), stored once by .NET API
  • Config blobs: Content-addressable with SHA-256 hash as PK, history tracked
  • Job config snapshots: Pipeline + search + SR config captured per job for auditability
  • SR cache: Per-domain reliability assessment with multi-model consensus scores
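The content-addressing of config blobs can be sketched as follows. Here a recursive canonicalization sorts object keys so semantically identical configs produce the same hash; the function names are illustrative, not the actual implementation:

```typescript
import { createHash } from "node:crypto";

// Recursively serialize with sorted keys so key order never changes the hash.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const body = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
    .join(",");
  return `{${body}}`;
}

// The hash doubles as the primary key: identical configs deduplicate,
// and any change yields a new, immutable blob.
function blobHash(config: unknown): string {
  return createHash("sha256").update(canonicalize(config)).digest("hex");
}
```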

Current limitations:

  • No relational queries across claims, evidence, or sources from different analyses
  • No full-text search on analysis content
  • Single-writer limitation (SQLite) — fine for single-instance but blocks horizontal scaling
  • Every analysis re-fetches URL content and recomputes all LLM calls from scratch

2. What Is Worth Caching?

Warning

This section identifies caching opportunities. The EVALUATE items require deeper analysis during Alpha planning before committing to scope and timeline.

2.1 Caching Value Analysis

| Cacheable Item | Estimated Savings | Latency Impact | Complexity | Recommendation |
|---|---|---|---|---|
| Claim-level results | 30-50% LLM cost on duplicate claims | None (cache lookup) | MEDIUM — needs canonical claim hash + TTL + prompt-version awareness | EVALUATE in Alpha |
| URL content | $0 API cost but 5-15s latency per source | Major — eliminates re-fetch | LOW — URL hash + content + timestamp | EVALUATE in Alpha |
| LLM responses | Highest per-call savings | None | HIGH — prompt hash + input hash, invalidation on prompt change | DEFER — claim-level caching captures most benefit |
| Search query results | Marginal — search APIs are cheap | Minor | MEDIUM — results go stale quickly | NOT RECOMMENDED — volatile, low ROI |
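The "canonical claim hash + prompt-version awareness" noted above could look like this sketch; the normalization rules and function name are illustrative:

```typescript
import { createHash } from "node:crypto";

// Cache key for claim-level results: normalize the claim text, then hash it
// together with the prompt version so a prompt change invalidates old entries.
function claimCacheKey(claim: string, promptVersion: string): string {
  const normalized = claim.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256")
    .update(`${promptVersion}\n${normalized}`)
    .digest("hex");
}
```

Trivial rewordings would still miss this cache; catching those is the near-duplicate detection problem discussed under the vector database assessment.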

2.2 Cost Impact Modeling

Assuming $0.10-$2.00 per analysis (depending on article complexity and model tier):

| Usage Level | Current Cost/day | With Claim Cache (-35%) | With URL Cache |
|---|---|---|---|
| 10 analyses/day | $1-20 | $0.65-13 | Same cost, 30-60s faster |
| 100 analyses/day | $10-200 | $6.50-130 | Same cost, 5-15 min faster |
| 1000 analyses/day | $100-2,000 | $65-1,300 | Same cost, 50-150 min faster |

Key insight: Claim caching saves money; URL caching saves time. Both follow the existing SQLite + in-memory Map pattern from source reliability.

3. Redis: Do We Still Need It?

3.1 Current Reality Assessment

| Original Redis Use Case | Current Solution | Gap? |
|---|---|---|
| Hot data caching | In-memory Map (config), SQLite (SR) | No gap at current scale |
| Session management | No user auth = no sessions | Not needed until Beta |
| Rate limiting | Not implemented | Can be in-process for single-instance |
| Pub/sub for real-time | SSE events work without Redis | No gap for single-instance |

3.2 When Redis Becomes Necessary

Redis adds value when:

  • Multiple application instances need shared cache/state (horizontal scaling)
  • Sub-millisecond cache lookups required (SQLite is 1-5ms, sufficient for current needs)
  • Distributed rate limiting needed across multiple servers

Trigger criteria (following When-to-Add-Complexity philosophy):

  • Single-instance SQLite cache latency >100ms
  • Need for >1 application instance
  • Rate limiting required across instances

Information

Decision: DEFER Redis. Not needed for current or near-term development. SQLite + in-memory Map handles all current caching needs.

4. PostgreSQL: When and Why?

4.1 Current SQLite Limitations

| Limitation | Impact | When It Hurts |
|---|---|---|
| JSON blob storage (no relational queries) | Cannot query across analyses | When browse/search is needed |
| Single-writer | No concurrent writes | When horizontal scaling is needed |
| No complex aggregation | Cannot run cross-analysis analytics | When quality dashboards need SQL |
| No full-text search | Cannot search claim text or evidence | When browse/search is needed |

4.2 What PostgreSQL Enables

  • Browse/search claims across all analyses
  • Quality metrics dashboards with SQL aggregation
  • Evidence deduplication (FR54) with relational queries
  • User accounts and permissions (Beta requirement)
  • Multi-instance deployments

4.3 Migration Path

The .NET API already has PostgreSQL support configured (appsettings.json). Switching is a configuration change, not a code rewrite.

Note: Keep SQLite for config.db (portable) and source-reliability.db (standalone). Only factharbor.db needs PostgreSQL.

Information

Decision: EVALUATE for Alpha/Beta. Add PostgreSQL when user accounts + search + evidence dedup needed. Requires deeper analysis during Alpha planning.

5. Vector Database Assessment

Information

Full assessment: Docs/WIP/Vector_DB_Assessment.md (February 2, 2026)

Conclusion: Vector search is not required for core functionality. Vectors add value only for approximate similarity (near-duplicate claim detection, edge case clustering) and should remain optional and offline to preserve pipeline performance and determinism.

When to add: Only after Shadow Mode data collection proves that near-duplicate detection needs exceed text-hash capability. Start with lightweight normalization + n-gram overlap (no vector DB needed).
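The lightweight alternative mentioned above (normalization plus n-gram overlap) can be sketched as character-trigram Jaccard similarity; the n-gram size and any flagging threshold are illustrative choices:

```typescript
// Character n-grams over normalized text.
function ngrams(text: string, n = 3): Set<string> {
  const norm = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + n <= norm.length; i++) grams.add(norm.slice(i, i + n));
  return grams;
}

// Jaccard overlap: |intersection| / |union|, in [0, 1].
function similarity(a: string, b: string): number {
  const ga = ngrams(a);
  const gb = ngrams(b);
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  const union = ga.size + gb.size - shared;
  return union === 0 ? 1 : shared / union;
}
```

A near-duplicate detector would flag claim pairs above some tuned threshold for review; unlike vector search, this stays deterministic and needs no extra infrastructure.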

Information

Decision: DEFER. Re-evaluate after Shadow Mode data collection.

6. Revised Storage Roadmap

Previous Roadmap (Superseded)

Phase 1: Add Redis for caching
Phase 2: Migrate to PostgreSQL for normalized data
Phase 3: Add S3 for archives and backups

Current Roadmap


graph LR
    subgraph Phase1[Phase 1: Alpha]
        P1A[Expand SQLite caching]
        P1B[Keep 3-DB architecture]
    end

    subgraph Phase2[Phase 2: Beta]
        P2A[Add PostgreSQL for factharbor.db]
        P2B[Add normalized claim/evidence tables]
        P2C[Keep SQLite for config + SR]
    end

    subgraph Phase3[Phase 3: V1.0]
        P3A[Add Redis IF multi-instance needed]
        P3B[PostgreSQL primary for production]
    end

    subgraph Phase4[Phase 4: V1.0+]
        P4A[Vector DB IF Shadow Mode proves value]
        P4B[S3 IF storage exceeds 50GB]
    end

    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4

Phase 1 (Alpha): Evaluate and potentially add URL content cache + claim-level cache in SQLite. Keep 3-DB architecture and in-memory Map caches.

Phase 2 (Beta): Add PostgreSQL for factharbor.db (user data, normalized claims, search). Keep SQLite for config.db (portable) and source-reliability.db (standalone).

Phase 3 (V1.0): Add Redis ONLY IF multi-instance deployment required. PostgreSQL becomes primary for all production data.

Phase 4 (V1.0+): Add vector DB ONLY IF Shadow Mode data proves value. Add S3 ONLY IF storage exceeds 50GB.

7. Decision Summary

| Technology | Decision | When | Status |
|---|---|---|---|
| SQLite URL cache | EVALUATE | Alpha planning | Needs further analysis |
| SQLite claim cache | EVALUATE | Alpha planning | Needs further analysis |
| Redis | DEFER | Multi-instance | Agreed |
| PostgreSQL | EVALUATE | Alpha/Beta | Needs further analysis |
| Vector DB | DEFER | Post-Shadow Mode | Agreed |
| S3 | DEFER | V1.0+ | Agreed |

Information

DEFER items are agreed. EVALUATE items (URL cache, claim cache, PostgreSQL) require deeper analysis during Alpha release planning — scope, dependencies, and prioritization to be determined as part of Alpha milestones.

Related Pages

Document Status: PARTIALLY APPROVED (February 2026) — DEFER decisions agreed; EVALUATE items need Alpha-phase analysis


See Storage Strategy for detailed information.

4.5 Versioning Architecture

UCM Configuration Versioning Architecture


graph LR
    ADMIN[UCM Administrator] -->|creates| BLOB[Config Blob - immutable]
    BLOB -->|content-addressed| STORE[(config_blobs)]
    ADMIN -->|activates| ACTIVE[config_active]
    ACTIVE -->|points to| BLOB
    JOB[Analysis Job] -->|snapshots at start| USAGE[config_usage]
    USAGE -->|references| BLOB
    REPORT[Analysis Report] -->|cites| USAGE

How UCM Config Versioning Works

| Concept | Description |
|---|---|
| config_blobs | Immutable, content-addressed config versions. Each change creates a new blob; old blobs are never deleted. |
| config_active | Pointer to the currently active config blob per config type. Changing this activates a new config version. |
| config_usage | Links each analysis job to the exact config snapshot used. Enables reproducibility. |
| Immutability | Analysis outputs are never edited. To improve results, update UCM config and re-analyze. |
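As a toy model of the flow in the diagram above, with in-memory maps standing in for the three tables (all names illustrative):

```typescript
// config_blobs: append-only, content-addressed; never deleted.
const configBlobs = new Map<string, string>();
// config_active: configType -> currently active blob hash.
const configActive = new Map<string, string>();
// config_usage: jobId -> blob hash snapshotted at job start.
const configUsage = new Map<string, string>();

function activate(configType: string, hash: string, json: string): void {
  configBlobs.set(hash, json);        // store the immutable blob
  configActive.set(configType, hash); // activation only moves the pointer
}

function startJob(jobId: string, configType: string): void {
  const hash = configActive.get(configType);
  if (!hash) throw new Error(`no active config for ${configType}`);
  configUsage.set(jobId, hash); // job is pinned to this exact version
}
```

Because a job records the blob hash rather than the "current" config, re-running with that snapshot reproduces the report even after later activations.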

Current Implementation (v2.10.2)

| Feature | Status |
|---|---|
| UCM config storage | Implemented (config.db SQLite) |
| Config hot-reload | Implemented (60s TTL) |
| Per-job config snapshots | Implemented (job_config_snapshots) |
| Content-addressed blobs | Implemented (hash-based deduplication) |
| Config activation tracking | Implemented (config_active table) |
| Admin UI for config management | Not yet implemented (CLI/direct DB) |

Design Principles

  • Every config change creates a new immutable blob — no in-place mutation
  • Every analysis job records the config snapshot used at time of execution
  • Reports can be reproduced by re-running with the same config snapshot
  • Config history is the audit trail — who changed what, when, and why
  • Analysis data is never edited — "improve the system, not the data"

5. Automated Systems in Detail

FactHarbor relies heavily on automation to achieve scale and quality. Here's how each automated system works:

5.1 AKEL (AI Knowledge Evaluation Layer)

What it does: Primary AI processing engine that analyzes claims automatically
Inputs:

  • User-submitted claim text
  • Existing evidence and sources
  • Source track record database

Processing steps:

  1. Parse & Extract: Identify key components, entities, assertions
  2. Gather Evidence: Search web and database for relevant sources
  3. Check Sources: Evaluate source reliability using track records
  4. Extract Scenarios: Identify different contexts from evidence
  5. Synthesize Verdict: Compile evidence assessment per scenario
  6. Calculate Risk: Assess potential harm and controversy

Outputs:

  • Structured claim record
  • Evidence links with relevance scores
  • Scenarios with context descriptions
  • Verdict summary per scenario
  • Overall confidence score
  • Risk assessment

Timing: 10-18 seconds total (parallel processing)
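The parallel shape of the pipeline might be sketched like this; every function body below is a stand-in, not the real implementation:

```typescript
// Stand-in steps: each would call LLMs or search providers in practice.
async function parseClaim(text: string) {
  return { text, entities: [] as string[] };
}
async function gatherEvidence(parsed: { text: string }) {
  return [{ url: "https://example.org", relevance: 0.9 }];
}
async function checkSources(parsed: { text: string }) {
  return new Map<string, number>([["example.org", 0.8]]);
}

// Evidence gathering and source checks are independent, so they run in
// parallel; scenario extraction and verdict synthesis then depend on both.
async function analyzeClaim(text: string) {
  const parsed = await parseClaim(text);
  const [evidence, trackRecords] = await Promise.all([
    gatherEvidence(parsed),
    checkSources(parsed),
  ]);
  return { claim: parsed, evidence, trackRecords };
}
```

Overlapping the independent steps this way is what keeps total latency in the 10-18 second range rather than the sum of all step times.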

5.2 Background Jobs

Source Track Record Updates (Weekly):

  • Analyze claim outcomes from past week
  • Calculate source accuracy and reliability
  • Update source_track_record table
  • Never triggered by individual claims (prevents circular dependencies)

Cache Management (Continuous):

  • Warm cache for popular claims
  • Invalidate cache on claim updates
  • Monitor cache hit rates

Metrics Aggregation (Hourly):

  • Roll up detailed metrics
  • Calculate system health indicators
  • Generate performance reports

Data Archival (Daily):

  • Move old AKEL logs to S3 (90+ days)
  • Archive old edit history
  • Compress and backup data

5.3 Quality Monitoring

Automated checks run continuously:

  • Anomaly Detection: Flag unusual patterns
      ◦ Sudden confidence score changes
      ◦ Unusual evidence distributions
      ◦ Suspicious source patterns
  • Contradiction Detection: Identify conflicts
      ◦ Evidence that contradicts other evidence
      ◦ Claims with internal contradictions
      ◦ Source track record anomalies
  • Completeness Validation: Ensure thoroughness
      ◦ Sufficient evidence gathered
      ◦ Multiple source types represented
      ◦ Key scenarios identified

5.4 Moderation Detection

Automated abuse detection:

  • Spam Identification: Pattern matching for spam claims
  • Manipulation Detection: Identify coordinated editing
  • Gaming Detection: Flag attempts to game source scores
  • Suspicious Activity: Log unusual behavior patterns

Human Review: Moderators review flagged items, and the system learns from their decisions.

6. Scalability Strategy

6.1 Horizontal Scaling

Components scale independently:

  • AKEL Workers: Add more processing workers as claim volume grows
  • Database Read Replicas: Add replicas for read-heavy workloads
  • Cache Layer: Redis cluster for distributed caching
  • API Servers: Load-balanced API instances

6.2 Vertical Scaling

Individual components can be upgraded:

  • Database Server: Increase CPU/RAM for PostgreSQL
  • Cache Memory: Expand Redis memory
  • Worker Resources: More powerful AKEL worker machines

6.3 Performance Optimization

Built-in optimizations:

  • Denormalized Data: Cache summary data in claim records (70% fewer joins)
  • Parallel Processing: AKEL pipeline processes in parallel (40% faster)
  • Intelligent Caching: Redis caches frequently accessed data
  • Background Processing: Non-urgent tasks run asynchronously

7. Monitoring & Observability

7.1 Key Metrics

System tracks:

  • Performance: AKEL processing time, API response time, cache hit rate
  • Quality: Confidence score distribution, evidence completeness, contradiction rate
  • Usage: Claims per day, active users, API requests
  • Errors: Failed AKEL runs, API errors, database issues

7.2 Alerts

Automated alerts for:

  • Processing time >30 seconds (threshold breach)
  • Error rate >1% (quality issue)
  • Cache hit rate <80% (cache problem)
  • Database connections >80% capacity (scaling needed)
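The thresholds above translate directly into a check like the following sketch (the metric record shape is illustrative):

```typescript
interface HealthMetrics {
  processingTimeSec: number;       // AKEL end-to-end time
  errorRate: number;               // fraction of failed requests
  cacheHitRate: number;            // fraction of cache lookups that hit
  dbConnectionUtilization: number; // fraction of connection pool in use
}

// Returns the list of fired alerts for one metrics snapshot,
// using the documented thresholds.
function evaluateAlerts(m: HealthMetrics): string[] {
  const fired: string[] = [];
  if (m.processingTimeSec > 30) fired.push("processing-time-breach");
  if (m.errorRate > 0.01) fired.push("error-rate-high");
  if (m.cacheHitRate < 0.8) fired.push("cache-hit-rate-low");
  if (m.dbConnectionUtilization > 0.8) fired.push("db-connections-near-capacity");
  return fired;
}
```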

7.3 Dashboards

Real-time monitoring:

  • System Health: Overall status and key metrics
  • AKEL Performance: Processing time breakdown
  • Quality Metrics: Confidence scores, completeness
  • User Activity: Usage patterns, peak times

8. Security Architecture

8.1 Authentication & Authorization

  • User Authentication: Secure login with password hashing
  • Role-Based Access: Reader, Contributor, Moderator, Admin
  • API Keys: For programmatic access
  • Rate Limiting: Prevent abuse

8.2 Data Security

  • Encryption: TLS for transport, encrypted storage for sensitive data
  • Audit Logging: Track all significant changes
  • Input Validation: Sanitize all user inputs
  • SQL Injection Protection: Parameterized queries

8.3 Abuse Prevention

  • Rate Limiting: Prevent flooding and DDoS
  • Automated Detection: Flag suspicious patterns
  • Human Review: Moderators investigate flagged content
  • Ban Mechanisms: Block abusive users/IPs

9. Deployment Architecture

9.1 Production Environment

Components:

  • Load Balancer (HAProxy or cloud LB)
  • Multiple API servers (stateless)
  • AKEL worker pool (auto-scaling)
  • PostgreSQL primary + read replicas
  • Redis cluster
  • S3-compatible storage

Regions: Single region for V1.0, multi-region when needed

9.2 Development & Staging

Development: Local Docker Compose setup
Staging: Scaled-down production replica
CI/CD: Automated testing and deployment

9.3 Disaster Recovery

  • Database Backups: Daily automated backups to S3
  • Point-in-Time Recovery: Transaction log archival
  • Replication: Real-time replication to standby
  • Recovery Time Objective: <4 hours

9.4 Federation Architecture Diagram

Warning

Not Implemented (v2.10.2) — Federation is planned for V2.0+. Current implementation is single-instance only.

Federation Architecture (Future)


graph LR
    FH1[FactHarbor Instance 1]
    FH2[FactHarbor Instance 2]
    FH3[FactHarbor Instance 3]
    FH1 -.->|V2.0+ Sync claims| FH2
    FH2 -.->|V2.0+ Sync claims| FH3
    FH3 -.->|V2.0+ Sync claims| FH1
    U1[Users] --> FH1
    U2[Users] --> FH2
    U3[Users] --> FH3

Federation Architecture - Future (V2.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.

Target Features

| Feature | Purpose | Status |
|---|---|---|
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |

Current Implementation

  • Single-instance deployment only
  • No inter-instance communication
  • All data stored locally in SQLite

10. Future Architecture Evolution

10.1 When to Add Complexity

See When to Add Complexity for specific triggers.
  • Elasticsearch: When PostgreSQL search consistently >500ms
  • TimescaleDB: When metrics queries consistently >1s
  • Federation: When 10,000+ users and explicit demand
  • Complex Reputation: When 100+ active contributors

10.2 Federation (V2.0+)

Deferred until:

  • Core product proven with 10,000+ users
  • User demand for decentralization
  • Single-node limits reached

See Federation & Decentralization for future plans.

11. Technology Stack Summary

Backend:

  • Python (FastAPI or Django)
  • PostgreSQL (primary database)
  • Redis (caching)

Frontend:

  • Modern JavaScript framework (React, Vue, or Svelte)
  • Server-side rendering for SEO

AI/LLM:

  • Multi-provider orchestration (Claude, GPT-4, local models)
  • Fallback and cross-checking support

Infrastructure:

  • Docker containers
  • Kubernetes or cloud platform auto-scaling
  • S3-compatible object storage

Monitoring:

  • Prometheus + Grafana
  • Structured logging (ELK or cloud logging)
  • Error tracking (Sentry)

12. Related Pages