Architecture

Version 2.1 by Robert Schaub on 2025/12/18 12:54

FactHarbor's architecture is designed for simplicity, automation, and continuous improvement.

1. Core Principles

  • AI-First: AKEL (AI) is the primary system; humans supplement it
  • Publish by Default: No centralized approval (removed in V0.9.50); content publishes with confidence scores
  • System Over Data: Fix algorithms, not individual outputs
  • Measure Everything: Quality metrics drive improvements
  • Scale Through Automation: Minimal human intervention
  • Start Simple: Add complexity only when metrics prove necessary

2. High-Level Architecture

Information

Current Implementation (v2.10.2) - Two-service architecture: Next.js web app for UI and analysis, .NET API for job persistence.

High-Level Architecture


graph TB
    subgraph Client[Client Layer]
        BROWSER[Web Browser]
    end

    subgraph NextJS[Next.js Web App]
        ANALYZE[analyze page]
        JOBS[jobs page]
        JOBVIEW[jobs id page]
        ANALYZE_API[api fh analyze]
        JOBS_API[api fh jobs]
        RUN_JOB[api internal run-job]
        ORCH[orchestrated.ts]
        CANON[monolithic-canonical.ts]
        SHARED[Shared Modules]
        WEBSEARCH[web-search.ts]
        SR[source-reliability.ts]
    end

    subgraph DotNet[.NET API]
        DOTNET_API[ASP.NET Core API]
        JOBS_CTRL[JobsController]
        HEALTH_CTRL[HealthController]
        SQLITE[(SQLite factharbor.db)]
    end

    subgraph External[External Services]
        LLM[LLM Providers]
        SEARCH[Search Providers]
    end

    BROWSER --> ANALYZE
    BROWSER --> JOBS
    BROWSER --> JOBVIEW
    ANALYZE --> ANALYZE_API
    ANALYZE_API --> DOTNET_API
    DOTNET_API --> SQLITE
    RUN_JOB --> ORCH
    RUN_JOB --> CANON
    ORCH --> SHARED
    CANON --> SHARED
    ORCH --> LLM
    CANON --> LLM
    ORCH --> WEBSEARCH
    WEBSEARCH --> SEARCH
    SHARED --> SR

Component Summary

| Component | Technology | Purpose |
|---|---|---|
| Web App | Next.js 14+ | UI, API routes, AKEL pipeline |
| API | ASP.NET Core 8.0 | Job persistence, health checks |
| Database | SQLite (3 databases) | Jobs/events, UCM config, SR cache |
| LLM | AI SDK (Vercel) | Multi-provider LLM abstraction with model tiering |
| Search | Google CSE / SerpAPI | Web search for evidence |

Key Files

| File | Lines | Purpose |
|---|---|---|
| orchestrated.ts | 13300 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1500 | Monolithic canonical pipeline |
| analysis-contexts.ts | 600 | AnalysisContext pre-detection |
| aggregation.ts | 400 | Verdict aggregation + claim weighting |
| evidence-filter.ts | 300 | Deterministic evidence quality filtering |
| source-reliability.ts | 500 | LLM-based source reliability scoring |

Environment Variables

| Variable | Default | Purpose |
|---|---|---|
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
| FH_API_URL | localhost:5139 | .NET API endpoint |
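As a sketch, these flags can be read once at startup. The variable names come from the table above; the helper function and the exact parsing rules are illustrative, not the actual implementation:

```typescript
// Illustrative sketch: read the documented flags with their defaults.
// readFlag treats any value other than "true" (case-insensitive) as false.
function readFlag(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  return raw === undefined ? fallback : raw.toLowerCase() === "true";
}

const config = {
  searchEnabled: readFlag("FH_SEARCH_ENABLED", true),        // web search on/off
  deterministic: readFlag("FH_DETERMINISTIC", true),         // zero temperature
  apiUrl: process.env.FH_API_URL ?? "http://localhost:5139", // .NET API endpoint
};
```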

2.1 Three-Layer Architecture

FactHarbor uses a clean three-layer architecture:

Interface Layer

Handles all user and system interactions:

  • Web UI: Browse claims, view evidence, submit feedback
  • REST API: Programmatic access for integrations
  • Authentication & Authorization: User identity and permissions
  • Rate Limiting: Protect against abuse

Processing Layer

Core business logic and AI processing:

  • AKEL Pipeline: AI-driven claim analysis (parallel processing)
      ◦ Parse and extract claim components
      ◦ Gather evidence from multiple sources
      ◦ Check source track records
      ◦ Extract scenarios from evidence
      ◦ Synthesize verdicts
      ◦ Calculate risk scores
  • Background Jobs: Automated maintenance tasks
      ◦ Source track record updates (weekly)
      ◦ Cache warming and invalidation
      ◦ Metrics aggregation
      ◦ Data archival
  • Quality Monitoring: Automated quality checks
      ◦ Anomaly detection
      ◦ Contradiction detection
      ◦ Completeness validation
  • Moderation Detection: Automated abuse detection
      ◦ Spam identification
      ◦ Manipulation detection
      ◦ Flag suspicious activity

Data & Storage Layer

Persistent data storage and caching:

  • PostgreSQL: Primary database for all core data
      ◦ Claims, evidence, sources, users
      ◦ Scenarios, edits, audit logs
      ◦ Built-in full-text search
      ◦ Time-series capabilities for metrics
  • Redis: High-speed caching layer
      ◦ Session data
      ◦ Frequently accessed claims
      ◦ API rate limiting
  • S3 Storage: Long-term archival
      ◦ Old edit history (90+ days)
      ◦ AKEL processing logs
      ◦ Backup snapshots

Optional future additions (add only when metrics prove necessary):

  • Elasticsearch: If PostgreSQL full-text search becomes slow
  • TimescaleDB: If metrics queries become a bottleneck

2.2 Design Philosophy

Start Simple, Evolve Based on Metrics

The architecture deliberately starts simple:

  • Single primary database (PostgreSQL handles most workloads initially)
  • Three clear layers (easy to understand and maintain)
  • Automated operations (minimal human intervention)
  • Measure before optimizing (add complexity only when proven necessary)

See Design Decisions and When to Add Complexity for detailed rationale.

3. AKEL Architecture


See AI Knowledge Extraction Layer (AKEL) for detailed information.

4. Storage Architecture

Storage Architecture

1. Current Implementation (v2.10.2)

1.1 Three-Database Architecture


graph TB
    subgraph NextJS[Next.js Web App]
        PIPELINE[Orchestrated Pipeline]
        CONFIG_SVC[Config Storage]
        SR_SVC[SR Cache]
    end

    subgraph DotNet[.NET API]
        CONTROLLERS[Controllers]
        EF[Entity Framework]
    end

    CONFIG_DB[(config.db)]
    SR_DB[(source-reliability.db)]
    FH_DB[(factharbor.db)]

    CONFIG_SVC --> CONFIG_DB
    SR_SVC --> SR_DB
    PIPELINE -->|via API| CONTROLLERS
    CONTROLLERS --> EF
    EF --> FH_DB

| Database | Purpose | Access Layer | Key Tables |
|---|---|---|---|
| factharbor.db | Jobs, events, analysis results | .NET API (Entity Framework) | Jobs, JobEvents, AnalysisMetrics |
| config.db | UCM configuration management | Next.js (better-sqlite3) | config_blobs, config_active, config_usage, job_config_snapshots |
| source-reliability.db | Source reliability cache | Next.js (better-sqlite3) | source_reliability |

1.2 Current Caching Mechanisms

| What | Mechanism | TTL | Status |
|---|---|---|---|
| Source reliability scores | SQLite + batch prefetch to in-memory Map | 90 days (configurable via UCM) | IMPLEMENTED |
| UCM config values | In-memory Map with TTL-based expiry | 60 seconds | IMPLEMENTED |
| URL content (fetched pages) | Not cached | N/A | NOT IMPLEMENTED |
| Claim-level analysis results | Not cached | N/A | NOT IMPLEMENTED |
| LLM responses | Not cached | N/A | NOT IMPLEMENTED |
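The 60-second UCM config cache follows a simple TTL-on-read pattern. A minimal sketch (class and field names are illustrative, not the actual module):

```typescript
// Minimal TTL cache: entries expire ttlMs after being set.
// An injectable clock makes the expiry behavior testable.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// 60 s TTL, matching the documented UCM config cache.
const configCache = new TtlCache<string>(60_000);
```

On a miss, the caller re-reads the value from config.db and re-populates the cache, which is what makes the hot-reload behavior possible without restarts.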

1.3 Storage Patterns

  • Analysis results: JSON blob in ResultJson column (per job), stored once by .NET API
  • Config blobs: Content-addressable with SHA-256 hash as PK, history tracked
  • Job config snapshots: Pipeline + search + SR config captured per job for auditability
  • SR cache: Per-domain reliability assessment with multi-model consensus scores
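The content-addressing of config blobs can be sketched as follows. Here a recursive canonicalization sorts object keys so semantically identical configs produce the same hash; the function names are illustrative, not the actual implementation:

```typescript
import { createHash } from "node:crypto";

// Recursively serialize with sorted keys so key order never changes the hash.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const body = Object.keys(obj)
    .sort()
    .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`)
    .join(",");
  return `{${body}}`;
}

// The hash doubles as the primary key: identical configs deduplicate,
// and any change yields a new, immutable blob.
function blobHash(config: unknown): string {
  return createHash("sha256").update(canonicalize(config)).digest("hex");
}
```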

Current limitations:

  • No relational queries across claims, evidence, or sources from different analyses
  • No full-text search on analysis content
  • Single-writer limitation (SQLite) — fine for single-instance but blocks horizontal scaling
  • Every analysis re-fetches URL content and recomputes all LLM calls from scratch

2. What Is Worth Caching?

Warning

This section identifies caching opportunities. The EVALUATE items require deeper analysis during Alpha planning before committing to scope and timeline.

2.1 Caching Value Analysis

| Cacheable Item | Estimated Savings | Latency Impact | Complexity | Recommendation |
|---|---|---|---|---|
| Claim-level results | 30-50% LLM cost on duplicate claims | None (cache lookup) | MEDIUM — needs canonical claim hash + TTL + prompt-version awareness | EVALUATE in Alpha |
| URL content | $0 API cost but 5-15s latency per source | Major — eliminates re-fetch | LOW — URL hash + content + timestamp | EVALUATE in Alpha |
| LLM responses | Highest per-call savings | None | HIGH — prompt hash + input hash, invalidation on prompt change | DEFER — claim-level caching captures most benefit |
| Search query results | Marginal — search APIs are cheap | Minor | MEDIUM — results go stale quickly | NOT RECOMMENDED — volatile, low ROI |
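The "canonical claim hash + prompt-version awareness" noted above could look like this sketch; the normalization rules and function name are illustrative:

```typescript
import { createHash } from "node:crypto";

// Cache key for claim-level results: normalize the claim text, then hash it
// together with the prompt version so a prompt change invalidates old entries.
function claimCacheKey(claim: string, promptVersion: string): string {
  const normalized = claim.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256")
    .update(`${promptVersion}\n${normalized}`)
    .digest("hex");
}
```

Trivial rewordings would still miss this cache; catching those is the near-duplicate detection problem discussed under the vector database assessment.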

2.2 Cost Impact Modeling

Assuming $0.10-$2.00 per analysis (depending on article complexity and model tier):

| Usage Level | Current Cost/day | With Claim Cache (-35%) | With URL Cache |
|---|---|---|---|
| 10 analyses/day | $1-20 | $0.65-13 | Same cost, 30-60s faster |
| 100 analyses/day | $10-200 | $6.50-130 | Same cost, 5-15 min faster |
| 1000 analyses/day | $100-2,000 | $65-1,300 | Same cost, 50-150 min faster |

Key insight: Claim caching saves money; URL caching saves time. Both follow the existing SQLite + in-memory Map pattern from source reliability.

3. Redis: Do We Still Need It?

3.1 Current Reality Assessment

| Original Redis Use Case | Current Solution | Gap? |
|---|---|---|
| Hot data caching | In-memory Map (config), SQLite (SR) | No gap at current scale |
| Session management | No user auth = no sessions | Not needed until Beta |
| Rate limiting | Not implemented | Can be in-process for single-instance |
| Pub/sub for real-time | SSE events work without Redis | No gap for single-instance |

3.2 When Redis Becomes Necessary

Redis adds value when:

  • Multiple application instances need shared cache/state (horizontal scaling)
  • Sub-millisecond cache lookups required (SQLite is 1-5ms, sufficient for current needs)
  • Distributed rate limiting needed across multiple servers

Trigger criteria (following When-to-Add-Complexity philosophy):

  • Single-instance SQLite cache latency >100ms
  • Need for >1 application instance
  • Rate limiting required across instances

Information

Decision: DEFER Redis. Not needed for current or near-term development. SQLite + in-memory Map handles all current caching needs.

4. PostgreSQL: When and Why?

4.1 Current SQLite Limitations

| Limitation | Impact | When It Hurts |
|---|---|---|
| JSON blob storage (no relational queries) | Cannot query across analyses | When browse/search is needed |
| Single-writer | No concurrent writes | When horizontal scaling is needed |
| No complex aggregation | Cannot run cross-analysis analytics | When quality dashboards need SQL |
| No full-text search | Cannot search claim text or evidence | When browse/search is needed |

4.2 What PostgreSQL Enables

  • Browse/search claims across all analyses
  • Quality metrics dashboards with SQL aggregation
  • Evidence deduplication (FR54) with relational queries
  • User accounts and permissions (Beta requirement)
  • Multi-instance deployments

4.3 Migration Path

The .NET API already has PostgreSQL support configured (appsettings.json). Switching is a configuration change, not a code rewrite.

Note: Keep SQLite for config.db (portable) and source-reliability.db (standalone). Only factharbor.db needs PostgreSQL.

Information

Decision: EVALUATE for Alpha/Beta. Add PostgreSQL when user accounts + search + evidence dedup needed. Requires deeper analysis during Alpha planning.

5. Vector Database Assessment

Information

Full assessment: Docs/WIP/Vector_DB_Assessment.md (February 2, 2026)

Conclusion: Vector search is not required for core functionality. Vectors add value only for approximate similarity (near-duplicate claim detection, edge case clustering) and should remain optional and offline to preserve pipeline performance and determinism.

When to add: Only after Shadow Mode data collection proves that near-duplicate detection needs exceed text-hash capability. Start with lightweight normalization + n-gram overlap (no vector DB needed).
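The lightweight alternative mentioned above (normalization plus n-gram overlap) can be sketched as character-trigram Jaccard similarity; the n-gram size and any flagging threshold are illustrative choices:

```typescript
// Character n-grams over normalized text.
function ngrams(text: string, n = 3): Set<string> {
  const norm = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + n <= norm.length; i++) grams.add(norm.slice(i, i + n));
  return grams;
}

// Jaccard overlap: |intersection| / |union|, in [0, 1].
function similarity(a: string, b: string): number {
  const ga = ngrams(a);
  const gb = ngrams(b);
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  const union = ga.size + gb.size - shared;
  return union === 0 ? 1 : shared / union;
}
```

A near-duplicate detector would flag claim pairs above some tuned threshold for review; unlike vector search, this stays deterministic and needs no extra infrastructure.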

Information

Decision: DEFER. Re-evaluate after Shadow Mode data collection.

6. Revised Storage Roadmap

Previous Roadmap (Superseded)

Phase 1: Add Redis for caching
Phase 2: Migrate to PostgreSQL for normalized data
Phase 3: Add S3 for archives and backups

Current Roadmap


graph LR
    subgraph Phase1[Phase 1: Alpha]
        P1A[Expand SQLite caching]
        P1B[Keep 3-DB architecture]
    end

    subgraph Phase2[Phase 2: Beta]
        P2A[Add PostgreSQL for factharbor.db]
        P2B[Add normalized claim/evidence tables]
        P2C[Keep SQLite for config + SR]
    end

    subgraph Phase3[Phase 3: V1.0]
        P3A[Add Redis IF multi-instance needed]
        P3B[PostgreSQL primary for production]
    end

    subgraph Phase4[Phase 4: V1.0+]
        P4A[Vector DB IF Shadow Mode proves value]
        P4B[S3 IF storage exceeds 50GB]
    end

    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4

Phase 1 (Alpha): Evaluate and potentially add URL content cache + claim-level cache in SQLite. Keep 3-DB architecture and in-memory Map caches.

Phase 2 (Beta): Add PostgreSQL for factharbor.db (user data, normalized claims, search). Keep SQLite for config.db (portable) and source-reliability.db (standalone).

Phase 3 (V1.0): Add Redis ONLY IF multi-instance deployment required. PostgreSQL becomes primary for all production data.

Phase 4 (V1.0+): Add vector DB ONLY IF Shadow Mode data proves value. Add S3 ONLY IF storage exceeds 50GB.

7. Decision Summary

| Technology | Decision | When | Status |
|---|---|---|---|
| SQLite URL cache | EVALUATE | Alpha planning | Needs further analysis |
| SQLite claim cache | EVALUATE | Alpha planning | Needs further analysis |
| Redis | DEFER | Multi-instance | Agreed |
| PostgreSQL | EVALUATE | Alpha/Beta | Needs further analysis |
| Vector DB | DEFER | Post-Shadow Mode | Agreed |
| S3 | DEFER | V1.0+ | Agreed |

Information

DEFER items are agreed. EVALUATE items (URL cache, claim cache, PostgreSQL) require deeper analysis during Alpha release planning — scope, dependencies, and prioritization to be determined as part of Alpha milestones.

Related Pages

Document Status: PARTIALLY APPROVED (February 2026) — DEFER decisions agreed; EVALUATE items need Alpha-phase analysis


See Storage Strategy for detailed information.

4.5 Versioning Architecture

UCM Configuration Versioning Architecture


graph LR
    ADMIN[UCM Administrator] -->|creates| BLOB[Config Blob - immutable]
    BLOB -->|content-addressed| STORE[(config_blobs)]
    ADMIN -->|activates| ACTIVE[config_active]
    ACTIVE -->|points to| BLOB
    JOB[Analysis Job] -->|snapshots at start| USAGE[config_usage]
    USAGE -->|references| BLOB
    REPORT[Analysis Report] -->|cites| USAGE

How UCM Config Versioning Works

| Concept | Description |
|---|---|
| config_blobs | Immutable, content-addressed config versions. Each change creates a new blob; old blobs are never deleted. |
| config_active | Pointer to the currently active config blob per config type. Changing this activates a new config version. |
| config_usage | Links each analysis job to the exact config snapshot used. Enables reproducibility. |
| Immutability | Analysis outputs are never edited. To improve results, update UCM config and re-analyze. |
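As a toy model of the flow in the diagram above, with in-memory maps standing in for the three tables (all names illustrative):

```typescript
// config_blobs: append-only, content-addressed; never deleted.
const configBlobs = new Map<string, string>();
// config_active: configType -> currently active blob hash.
const configActive = new Map<string, string>();
// config_usage: jobId -> blob hash snapshotted at job start.
const configUsage = new Map<string, string>();

function activate(configType: string, hash: string, json: string): void {
  configBlobs.set(hash, json);        // store the immutable blob
  configActive.set(configType, hash); // activation only moves the pointer
}

function startJob(jobId: string, configType: string): void {
  const hash = configActive.get(configType);
  if (!hash) throw new Error(`no active config for ${configType}`);
  configUsage.set(jobId, hash); // job is pinned to this exact version
}
```

Because a job records the blob hash rather than the "current" config, re-running with that snapshot reproduces the report even after later activations.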

Current Implementation (v2.10.2)

| Feature | Status |
|---|---|
| UCM config storage | Implemented (config.db SQLite) |
| Config hot-reload | Implemented (60s TTL) |
| Per-job config snapshots | Implemented (job_config_snapshots) |
| Content-addressed blobs | Implemented (hash-based deduplication) |
| Config activation tracking | Implemented (config_active table) |
| Admin UI for config management | Not yet implemented (CLI/direct DB) |

Design Principles

  • Every config change creates a new immutable blob — no in-place mutation
  • Every analysis job records the config snapshot used at time of execution
  • Reports can be reproduced by re-running with the same config snapshot
  • Config history is the audit trail — who changed what, when, and why
  • Analysis data is never edited — "improve the system, not the data"

5. Automated Systems in Detail

FactHarbor relies heavily on automation to achieve scale and quality. Here's how each automated system works:

5.1 AKEL (AI Knowledge Evaluation Layer)

What it does: Primary AI processing engine that analyzes claims automatically
Inputs:

  • User-submitted claim text
  • Existing evidence and sources
  • Source track record database

Processing steps:

  1. Parse & Extract: Identify key components, entities, assertions
  2. Gather Evidence: Search web and database for relevant sources
  3. Check Sources: Evaluate source reliability using track records
  4. Extract Scenarios: Identify different contexts from evidence
  5. Synthesize Verdict: Compile evidence assessment per scenario
  6. Calculate Risk: Assess potential harm and controversy

Outputs:

  • Structured claim record
  • Evidence links with relevance scores
  • Scenarios with context descriptions
  • Verdict summary per scenario
  • Overall confidence score
  • Risk assessment

Timing: 10-18 seconds total (parallel processing)
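The parallel shape of the pipeline might be sketched like this; every function body below is a stand-in, not the real implementation:

```typescript
// Stand-in steps: each would call LLMs or search providers in practice.
async function parseClaim(text: string) {
  return { text, entities: [] as string[] };
}
async function gatherEvidence(parsed: { text: string }) {
  return [{ url: "https://example.org", relevance: 0.9 }];
}
async function checkSources(parsed: { text: string }) {
  return new Map<string, number>([["example.org", 0.8]]);
}

// Evidence gathering and source checks are independent, so they run in
// parallel; scenario extraction and verdict synthesis then depend on both.
async function analyzeClaim(text: string) {
  const parsed = await parseClaim(text);
  const [evidence, trackRecords] = await Promise.all([
    gatherEvidence(parsed),
    checkSources(parsed),
  ]);
  return { claim: parsed, evidence, trackRecords };
}
```

Overlapping the independent steps this way is what keeps total latency in the 10-18 second range rather than the sum of all step times.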

5.2 Background Jobs

Source Track Record Updates (Weekly):

  • Analyze claim outcomes from past week
  • Calculate source accuracy and reliability
  • Update source_track_record table
  • Never triggered by individual claims (prevents circular dependencies)

Cache Management (Continuous):

  • Warm cache for popular claims
  • Invalidate cache on claim updates
  • Monitor cache hit rates

Metrics Aggregation (Hourly):

  • Roll up detailed metrics
  • Calculate system health indicators
  • Generate performance reports

Data Archival (Daily):

  • Move old AKEL logs to S3 (90+ days)
  • Archive old edit history
  • Compress and backup data

5.3 Quality Monitoring

Automated checks run continuously:

  • Anomaly Detection: Flag unusual patterns
      ◦ Sudden confidence score changes
      ◦ Unusual evidence distributions
      ◦ Suspicious source patterns
  • Contradiction Detection: Identify conflicts
      ◦ Evidence that contradicts other evidence
      ◦ Claims with internal contradictions
      ◦ Source track record anomalies
  • Completeness Validation: Ensure thoroughness
      ◦ Sufficient evidence gathered
      ◦ Multiple source types represented
      ◦ Key scenarios identified

5.4 Moderation Detection

Automated abuse detection:

  • Spam Identification: Pattern matching for spam claims
  • Manipulation Detection: Identify coordinated editing
  • Gaming Detection: Flag attempts to game source scores
  • Suspicious Activity: Log unusual behavior patterns

Human Review: Moderators review flagged items, and the system learns from their decisions.

6. Scalability Strategy

6.1 Horizontal Scaling

Components scale independently:

  • AKEL Workers: Add more processing workers as claim volume grows
  • Database Read Replicas: Add replicas for read-heavy workloads
  • Cache Layer: Redis cluster for distributed caching
  • API Servers: Load-balanced API instances

6.2 Vertical Scaling

Individual components can be upgraded:

  • Database Server: Increase CPU/RAM for PostgreSQL
  • Cache Memory: Expand Redis memory
  • Worker Resources: More powerful AKEL worker machines

6.3 Performance Optimization

Built-in optimizations:

  • Denormalized Data: Cache summary data in claim records (70% fewer joins)
  • Parallel Processing: AKEL pipeline processes in parallel (40% faster)
  • Intelligent Caching: Redis caches frequently accessed data
  • Background Processing: Non-urgent tasks run asynchronously

7. Monitoring & Observability

7.1 Key Metrics

System tracks:

  • Performance: AKEL processing time, API response time, cache hit rate
  • Quality: Confidence score distribution, evidence completeness, contradiction rate
  • Usage: Claims per day, active users, API requests
  • Errors: Failed AKEL runs, API errors, database issues

7.2 Alerts

Automated alerts for:

  • Processing time >30 seconds (threshold breach)
  • Error rate >1% (quality issue)
  • Cache hit rate <80% (cache problem)
  • Database connections >80% capacity (scaling needed)
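The thresholds above translate directly into a check like the following sketch (the metric record shape is illustrative):

```typescript
interface HealthMetrics {
  processingTimeSec: number;       // AKEL end-to-end time
  errorRate: number;               // fraction of failed requests
  cacheHitRate: number;            // fraction of cache lookups that hit
  dbConnectionUtilization: number; // fraction of connection pool in use
}

// Returns the list of fired alerts for one metrics snapshot,
// using the documented thresholds.
function evaluateAlerts(m: HealthMetrics): string[] {
  const fired: string[] = [];
  if (m.processingTimeSec > 30) fired.push("processing-time-breach");
  if (m.errorRate > 0.01) fired.push("error-rate-high");
  if (m.cacheHitRate < 0.8) fired.push("cache-hit-rate-low");
  if (m.dbConnectionUtilization > 0.8) fired.push("db-connections-near-capacity");
  return fired;
}
```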

7.3 Dashboards

Real-time monitoring:

  • System Health: Overall status and key metrics
  • AKEL Performance: Processing time breakdown
  • Quality Metrics: Confidence scores, completeness
  • User Activity: Usage patterns, peak times

8. Security Architecture

8.1 Authentication & Authorization

  • User Authentication: Secure login with password hashing
  • Role-Based Access: Reader, Contributor, Moderator, Admin
  • API Keys: For programmatic access
  • Rate Limiting: Prevent abuse

8.2 Data Security

  • Encryption: TLS for transport, encrypted storage for sensitive data
  • Audit Logging: Track all significant changes
  • Input Validation: Sanitize all user inputs
  • SQL Injection Protection: Parameterized queries

8.3 Abuse Prevention

  • Rate Limiting: Prevent flooding and DDoS
  • Automated Detection: Flag suspicious patterns
  • Human Review: Moderators investigate flagged content
  • Ban Mechanisms: Block abusive users/IPs

9. Deployment Architecture

9.1 Production Environment

Components:

  • Load Balancer (HAProxy or cloud LB)
  • Multiple API servers (stateless)
  • AKEL worker pool (auto-scaling)
  • PostgreSQL primary + read replicas
  • Redis cluster
  • S3-compatible storage

Regions: Single region for V1.0, multi-region when needed

9.2 Development & Staging

Development: Local Docker Compose setup
Staging: Scaled-down production replica
CI/CD: Automated testing and deployment

9.3 Disaster Recovery

  • Database Backups: Daily automated backups to S3
  • Point-in-Time Recovery: Transaction log archival
  • Replication: Real-time replication to standby
  • Recovery Time Objective: <4 hours

9.4 Federation Architecture Diagram

Warning

Not Implemented (v2.10.2) — Federation is planned for V2.0+. Current implementation is single-instance only.

Federation Architecture (Future)


graph LR
    FH1[FactHarbor Instance 1]
    FH2[FactHarbor Instance 2]
    FH3[FactHarbor Instance 3]
    FH1 -.->|V2.0+ Sync claims| FH2
    FH2 -.->|V2.0+ Sync claims| FH3
    FH3 -.->|V2.0+ Sync claims| FH1
    U1[Users] --> FH1
    U2[Users] --> FH2
    U3[Users] --> FH3

Federation Architecture - Future (V2.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.

Target Features

| Feature | Purpose | Status |
|---|---|---|
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |

Current Implementation

  • Single-instance deployment only
  • No inter-instance communication
  • All data stored locally in SQLite

10. Future Architecture Evolution

10.1 When to Add Complexity

See When to Add Complexity for specific triggers.
  • Elasticsearch: When PostgreSQL search consistently >500ms
  • TimescaleDB: When metrics queries consistently >1s
  • Federation: When 10,000+ users and explicit demand
  • Complex Reputation: When 100+ active contributors

10.2 Federation (V2.0+)

Deferred until:

  • Core product proven with 10,000+ users
  • User demand for decentralization
  • Single-node limits reached

See Federation & Decentralization for future plans.

11. Technology Stack Summary

Backend:

  • Python (FastAPI or Django)
  • PostgreSQL (primary database)
  • Redis (caching)

Frontend:

  • Modern JavaScript framework (React, Vue, or Svelte)
  • Server-side rendering for SEO

AI/LLM:

  • Multi-provider orchestration (Claude, GPT-4, local models)
  • Fallback and cross-checking support

Infrastructure:

  • Docker containers
  • Kubernetes or cloud platform auto-scaling
  • S3-compatible object storage

Monitoring:

  • Prometheus + Grafana
  • Structured logging (ELK or cloud logging)
  • Error tracking (Sentry)

12. Related Pages