Architecture
FactHarbor uses a modular-monolith architecture (POC → Beta 0) designed to evolve into a distributed, federated, multi-node system (Release 1.0+).
Modules are strongly separated, versioned, and auditable. All logic is transparent and deterministic.
High-Level System Architecture
FactHarbor is composed of the following major modules:
- UI Frontend
- REST API Layer
- Core Logic Layer
- Claim Processing
- Scenario Engine
- Evidence Repository
- Verdict Engine
- Re-evaluation Engine
- Roles / Identity / Reputation
- AKEL (AI Knowledge Extraction Layer)
- Federation Layer
- Workers & Background Jobs
- Storage Layer (Postgres + VectorDB + ObjectStore)
High-Level Architecture
```mermaid
graph TB
subgraph Client[Client Layer]
BROWSER[Web Browser]
end
subgraph NextJS[Next.js Web App]
ANALYZE[analyze page]
JOBS[jobs page]
JOBVIEW[jobs id page]
ANALYZE_API[api fh analyze]
JOBS_API[api fh jobs]
RUN_JOB[api internal run-job]
ORCH[orchestrated.ts]
CANON[monolithic-canonical.ts]
SHARED[Shared Modules]
WEBSEARCH[web-search.ts]
SR[source-reliability.ts]
end
subgraph DotNet[.NET API]
DOTNET_API[ASP.NET Core API]
JOBS_CTRL[JobsController]
HEALTH_CTRL[HealthController]
SQLITE[(SQLite factharbor.db)]
end
subgraph External[External Services]
LLM[LLM Providers]
SEARCH[Search Providers]
end
BROWSER --> ANALYZE
BROWSER --> JOBS
BROWSER --> JOBVIEW
ANALYZE --> ANALYZE_API
ANALYZE_API --> DOTNET_API
DOTNET_API --> SQLITE
RUN_JOB --> ORCH
RUN_JOB --> CANON
ORCH --> SHARED
CANON --> SHARED
ORCH --> LLM
CANON --> LLM
ORCH --> WEBSEARCH
WEBSEARCH --> SEARCH
SHARED --> SR
```
Component Summary
| Component | Technology | Purpose |
|---|---|---|
| Web App | Next.js 14+ | UI, API routes, AKEL pipeline |
| API | ASP.NET Core 8.0 | Job persistence, health checks |
| Database | SQLite (3 databases) | Jobs/events, UCM config, SR cache |
| LLM | AI SDK (Vercel) | Multi-provider LLM abstraction with model tiering |
| Search | Google CSE / SerpAPI | Web search for evidence |
Key Files
| File | Lines | Purpose |
|---|---|---|
| orchestrated.ts | 13,300 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1,500 | Monolithic canonical pipeline |
| analysis-contexts.ts | 600 | AnalysisContext pre-detection |
| aggregation.ts | 400 | Verdict aggregation + claim weighting |
| evidence-filter.ts | 300 | Deterministic evidence quality filtering |
| source-reliability.ts | 500 | LLM-based source reliability scoring |
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
| FH_API_URL | localhost:5139 | .NET API endpoint |
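The variables above can be read with a small helper that applies the documented defaults. This is an illustrative sketch, not FactHarbor's actual config loader; the `envFlag` helper and the `http://` prefix on the API URL are assumptions.

```typescript
// Illustrative env-var reader with the defaults from the table above.
// Not FactHarbor's actual loader; names and defaults follow the docs.
function envFlag(name: string, fallback: boolean): boolean {
  const raw = process.env[name];
  if (raw === undefined) return fallback; // unset -> documented default
  return raw.toLowerCase() === "true";
}

const searchEnabled = envFlag("FH_SEARCH_ENABLED", true);
const deterministic = envFlag("FH_DETERMINISTIC", true);
const apiUrl = process.env.FH_API_URL ?? "http://localhost:5139";
```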
Key ideas:
- Core logic is deterministic, auditable, and versioned
- AKEL drafts structured outputs but never publishes directly
- Workers run long or asynchronous tasks
- Storage is separated for scalability and clarity
- Federation Layer provides optional distributed operation
Storage Architecture
In the target design (Release 1.0+), FactHarbor separates structured data, embeddings, and evidence files:
- PostgreSQL — canonical structured entities, all versioning, lineage, signatures
- Vector DB (Qdrant or pgvector) — semantic search, duplication detection, cluster mapping
- Object Storage — PDFs, datasets, raw evidence, transcripts
- Optional (Release 1.0): Redis for caching, IPFS for decentralized object storage
Storage Architecture Assessment
1. Current Implementation (v2.10.2)
1.1 Three-Database Architecture
```mermaid
graph TB
subgraph NextJS[Next.js Web App]
PIPELINE[Orchestrated Pipeline]
CONFIG_SVC[Config Storage]
SR_SVC[SR Cache]
end
subgraph DotNet[.NET API]
CONTROLLERS[Controllers]
EF[Entity Framework]
end
CONFIG_DB[(config.db)]
SR_DB[(source-reliability.db)]
FH_DB[(factharbor.db)]
CONFIG_SVC --> CONFIG_DB
SR_SVC --> SR_DB
PIPELINE -->|via API| CONTROLLERS
CONTROLLERS --> EF
EF --> FH_DB
```
| Database | Purpose | Access Layer | Key Tables |
|---|---|---|---|
| factharbor.db | Jobs, events, analysis results | .NET API (Entity Framework) | Jobs, JobEvents, AnalysisMetrics |
| config.db | UCM configuration management | Next.js (better-sqlite3) | config_blobs, config_active, config_usage, job_config_snapshots |
| source-reliability.db | Source reliability cache | Next.js (better-sqlite3) | source_reliability |
1.2 Current Caching Mechanisms
| What | Mechanism | TTL | Status |
|---|---|---|---|
| Source reliability scores | SQLite + batch prefetch to in-memory Map | 90 days (configurable via UCM) | IMPLEMENTED |
| UCM config values | In-memory Map with TTL-based expiry | 60 seconds | IMPLEMENTED |
| URL content (fetched pages) | Not cached | N/A | NOT IMPLEMENTED |
| Claim-level analysis results | Not cached | N/A | NOT IMPLEMENTED |
| LLM responses | Not cached | N/A | NOT IMPLEMENTED |
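The two implemented mechanisms share one pattern: an in-memory Map with TTL-based expiry. A minimal sketch of that pattern (the class and method names are illustrative, not FactHarbor's actual code):

```typescript
// Minimal TTL cache in the style of the UCM config cache (60s expiry).
// Class and method names are illustrative, not FactHarbor's API.
type Entry<V> = { value: V; expiresAt: number };

class TtlCache<V> {
  private store = new Map<string, Entry<V>>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const e = this.store.get(key);
    if (!e) return undefined;
    if (Date.now() >= e.expiresAt) {
      this.store.delete(key); // lazily evict expired entries
      return undefined;
    }
    return e.value;
  }

  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// Usage: a 60-second config cache, matching the documented TTL.
const configCache = new TtlCache<string>(60_000);
configCache.set("pipeline.variant", "orchestrated");
```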
1.3 Storage Patterns
- Analysis results: JSON blob in ResultJson column (per job), stored once by .NET API
- Config blobs: Content-addressable with SHA-256 hash as PK, history tracked
- Job config snapshots: Pipeline + search + SR config captured per job for auditability
- SR cache: Per-domain reliability assessment with multi-model consensus scores
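The content-addressable pattern for config blobs can be sketched as follows: the SHA-256 hash of the canonicalized JSON is the primary key, so identical configs deduplicate and stored blobs are never mutated. Function names here are illustrative, not FactHarbor's actual code.

```typescript
import { createHash } from "node:crypto";

// Canonicalize JSON with sorted keys so semantically identical configs
// hash identically regardless of key order.
function canonicalize(v: unknown): string {
  if (v === null || typeof v !== "object") return JSON.stringify(v);
  if (Array.isArray(v)) return "[" + v.map(canonicalize).join(",") + "]";
  const obj = v as Record<string, unknown>;
  return "{" + Object.keys(obj).sort()
    .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]))
    .join(",") + "}";
}

const blobs = new Map<string, string>(); // hash -> serialized config

// Content-addressed store in the style of config_blobs: write once,
// never mutate, hash is the primary key.
function storeBlob(config: object): string {
  const hash = createHash("sha256").update(canonicalize(config)).digest("hex");
  if (!blobs.has(hash)) {
    blobs.set(hash, JSON.stringify(config));
  }
  return hash;
}
```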
Current limitations:
- No relational queries across claims, evidence, or sources from different analyses
- No full-text search on analysis content
- Single-writer limitation (SQLite) — fine for single-instance but blocks horizontal scaling
- Every analysis re-fetches URL content and recomputes all LLM calls from scratch
2. What Is Worth Caching?
2.1 Caching Value Analysis
| Cacheable Item | Estimated Savings | Latency Impact | Complexity | Recommendation |
|---|---|---|---|---|
| Claim-level results | 30-50% LLM cost on duplicate claims | None (cache lookup) | MEDIUM — needs canonical claim hash + TTL + prompt-version awareness | EVALUATE in Alpha |
| URL content | $0 API cost but 5-15s latency per source | Major — eliminates re-fetch | LOW — URL hash + content + timestamp | EVALUATE in Alpha |
| LLM responses | Highest per-call savings | None | HIGH — prompt hash + input hash, invalidation on prompt change | DEFER — claim-level caching captures most benefit |
| Search query results | Marginal — search APIs are cheap | Minor | MEDIUM — results go stale quickly | NOT RECOMMENDED — volatile, low ROI |
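The "canonical claim hash + prompt-version awareness" requirement for claim-level caching might look like the sketch below. The normalization rules and key layout are assumptions for illustration, not the planned implementation.

```typescript
import { createHash } from "node:crypto";

// Sketch of a claim-cache key: hash of normalized claim text plus the
// prompt version, so bumping the prompt version invalidates all prior
// entries automatically. Normalization rules are assumptions.
function claimCacheKey(claimText: string, promptVersion: string): string {
  const normalized = claimText
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // strip punctuation
    .replace(/\s+/g, " ")             // collapse whitespace
    .trim();
  return createHash("sha256")
    .update(promptVersion + "\n" + normalized)
    .digest("hex");
}
```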
2.2 Cost Impact Modeling
Assuming $0.10-$2.00 per analysis (depending on article complexity and model tier):
| Usage Level | Current Cost/day | With Claim Cache (-35%) | With URL Cache |
|---|---|---|---|
| 10 analyses/day | $1-20 | $0.65-13 | Same cost, 30-60s faster |
| 100 analyses/day | $10-200 | $6.50-130 | Same cost, 5-15 min faster |
| 1000 analyses/day | $100-2,000 | $65-1,300 | Same cost, 50-150 min faster |
Key insight: Claim caching saves money; URL caching saves time. Both follow the existing SQLite + in-memory Map pattern from source reliability.
3. Redis: Do We Still Need It?
3.1 Current Reality Assessment
| Original Redis Use Case | Current Solution | Gap? |
|---|---|---|
| Hot data caching | In-memory Map (config), SQLite (SR) | No gap at current scale |
| Session management | No user auth = no sessions | Not needed until Beta |
| Rate limiting | Not implemented | Can be in-process for single-instance |
| Pub/sub for real-time | SSE events work without Redis | No gap for single-instance |
3.2 When Redis Becomes Necessary
Redis adds value when:
- Multiple application instances need shared cache/state (horizontal scaling)
- Sub-millisecond cache lookups required (SQLite is 1-5ms, sufficient for current needs)
- Distributed rate limiting needed across multiple servers
Trigger criteria (following When-to-Add-Complexity philosophy):
- Single-instance SQLite cache latency >100ms
- Need for >1 application instance
- Rate limiting required across instances
4. PostgreSQL: When and Why?
4.1 Current SQLite Limitations
| Limitation | Impact | When It Hurts |
|---|---|---|
| JSON blob storage (no relational queries) | Cannot query across analyses | When browse/search is needed |
| Single-writer | No concurrent writes | When horizontal scaling is needed |
| No complex aggregation | Cannot run cross-analysis analytics | When quality dashboards need SQL |
| No full-text search | Cannot search claim text or evidence | When browse/search is needed |
4.2 What PostgreSQL Enables
- Browse/search claims across all analyses
- Quality metrics dashboards with SQL aggregation
- Evidence deduplication (FR54) with relational queries
- User accounts and permissions (Beta requirement)
- Multi-instance deployments
4.3 Migration Path
The .NET API already has PostgreSQL support configured (appsettings.json). Switching is a configuration change, not a code rewrite.
Note: Keep SQLite for config.db (portable) and source-reliability.db (standalone). Only factharbor.db needs PostgreSQL.
5. Vector Database Assessment
Conclusion: Vector search is not required for core functionality. Vectors add value only for approximate similarity (near-duplicate claim detection, edge case clustering) and should remain optional and offline to preserve pipeline performance and determinism.
When to add: Only after Shadow Mode data collection proves that near-duplicate detection needs exceed text-hash capability. Start with lightweight normalization + n-gram overlap (no vector DB needed).
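The "lightweight normalization + n-gram overlap" approach can be sketched without any vector infrastructure: Jaccard overlap of word trigrams. The n=3 window and 0.6 threshold below are illustrative assumptions, not tuned values.

```typescript
// Near-duplicate claim detection without a vector DB: normalize,
// build word trigrams, compare with Jaccard overlap.
function ngrams(text: string, n = 3): Set<string> {
  const words = text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "")
    .split(/\s+/)
    .filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    grams.add(words.slice(i, i + n).join(" "));
  }
  return grams;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let inter = 0;
  for (const g of a) if (b.has(g)) inter++;
  return inter / (a.size + b.size - inter);
}

// Threshold is an illustrative assumption, not a tuned value.
function isNearDuplicate(c1: string, c2: string, threshold = 0.6): boolean {
  return jaccard(ngrams(c1), ngrams(c2)) >= threshold;
}
```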
6. Revised Storage Roadmap
Previous Roadmap (Superseded)
Phase 2: Migrate to PostgreSQL for normalized data
Phase 3: Add S3 for archives and backups
Current Roadmap
```mermaid
graph LR
subgraph Phase1[Phase 1: Alpha]
P1A[Expand SQLite caching]
P1B[Keep 3-DB architecture]
end
subgraph Phase2[Phase 2: Beta]
P2A[Add PostgreSQL for factharbor.db]
P2B[Add normalized claim/evidence tables]
P2C[Keep SQLite for config + SR]
end
subgraph Phase3[Phase 3: V1.0]
P3A[Add Redis IF multi-instance needed]
P3B[PostgreSQL primary for production]
end
subgraph Phase4[Phase 4: V1.0+]
P4A[Vector DB IF Shadow Mode proves value]
P4B[S3 IF storage exceeds 50GB]
end
Phase1 --> Phase2
Phase2 --> Phase3
Phase3 --> Phase4
```
Phase 1 (Alpha): Evaluate and potentially add URL content cache + claim-level cache in SQLite. Keep 3-DB architecture and in-memory Map caches.
Phase 2 (Beta): Add PostgreSQL for factharbor.db (user data, normalized claims, search). Keep SQLite for config.db (portable) and source-reliability.db (standalone).
Phase 3 (V1.0): Add Redis ONLY IF multi-instance deployment required. PostgreSQL becomes primary for all production data.
Phase 4 (V1.0+): Add vector DB ONLY IF Shadow Mode data proves value. Add S3 ONLY IF storage exceeds 50GB.
7. Decision Summary
| Technology | Decision | When | Status |
|---|---|---|---|
| SQLite URL cache | EVALUATE | Alpha planning | Needs further analysis |
| SQLite claim cache | EVALUATE | Alpha planning | Needs further analysis |
| Redis | DEFER | Multi-instance | Agreed |
| PostgreSQL | EVALUATE | Alpha/Beta | Needs further analysis |
| Vector DB | DEFER | Post-Shadow Mode | Agreed |
| S3 | DEFER | V1.0+ | Agreed |
Related Pages
- POC to Alpha Transition — Phase redefinition (caching is Alpha milestone)
- When to Add Complexity — Decision philosophy
- Architecture — System architecture
- Data Model — Database schema
Document Status: PARTIALLY APPROVED (February 2026) — DEFER decisions agreed; EVALUATE items need Alpha-phase analysis
Core Backend Module Architecture
Each module has a clear responsibility and versioned boundaries to allow future extraction into microservices.
1. Claim Processing Module
Responsibilities:
- Ingest text, URLs, documents, transcripts, federated input
- Extract claims (AKEL-assisted)
- Normalize structure
- Classify (type, domain, evaluability, safety)
- Deduplicate via embeddings
- Assign to claim clusters
Flow:
Ingest → Normalize → Classify → Deduplicate → Cluster
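Embedding-based deduplication in the flow above reduces to comparing claim vectors with cosine similarity. A minimal sketch; the 0.92 threshold is an illustrative assumption, and real embeddings would come from an embedding model rather than being hand-built.

```typescript
// Cosine similarity between two claim embeddings.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Threshold is an illustrative assumption, not a tuned value.
function isDuplicate(embA: number[], embB: number[], threshold = 0.92): boolean {
  return cosineSimilarity(embA, embB) >= threshold;
}
```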
2. Scenario Engine
Responsibilities:
- Create and validate scenarios
- Enforce required fields (definitions, assumptions, boundaries...)
- Perform safety checks (AKEL-assisted)
- Manage versioning and lifecycle
- Provide contextual evaluation settings to the Verdict Engine
Flow:
Create → Validate → Version → Lifecycle → Safety
3. Evidence Repository
Responsibilities:
- Store metadata + files (object store)
- Classify evidence
- Compute preliminary reliability
- Maintain version history
- Detect retractions or disputes
- Provide structured metadata to the Verdict Engine
Flow:
Store → Classify → Score → Version → Update/Retract
4. Verdict Engine
Responsibilities:
- Aggregate scenario-linked evidence
- Compute likelihood ranges per scenario
- Generate reasoning chain
- Track uncertainty factors
- Maintain verdict version timelines
Flow:
Aggregate → Compute → Explain → Version → Timeline
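The Aggregate and Compute steps can be illustrated with a reliability-weighted average over evidence. This is a hypothetical sketch, not the actual aggregation.ts logic: the support/reliability fields and the uncertainty heuristic are assumptions.

```typescript
// Illustrative evidence aggregation: each item carries a support score
// (-1 opposes, +1 supports) and a source-reliability weight (0 to 1).
interface Evidence { support: number; reliability: number }

function aggregateVerdict(evidence: Evidence[]): { score: number; uncertainty: number } {
  const totalWeight = evidence.reduce((s, e) => s + e.reliability, 0);
  if (totalWeight === 0) return { score: 0, uncertainty: 1 }; // no usable evidence
  const score =
    evidence.reduce((s, e) => s + e.support * e.reliability, 0) / totalWeight;
  // Heuristic: less total evidence weight -> wider uncertainty.
  const uncertainty = 1 / (1 + totalWeight);
  return { score, uncertainty };
}
```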
5. Re-evaluation Engine
Responsibilities:
- Listen for upstream changes
- Trigger partial or full recomputation
- Update verdicts + summary views
- Maintain consistency across federated nodes
Triggers include:
- Evidence updated or retracted
- Scenario definition or assumption changes
- Claim type or evaluability changes
- Contradiction detection
- Federation sync updates
Flow:
Trigger → Impact Analysis → Recompute → Publish Update
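The Trigger → Impact Analysis step can be sketched as a mapping from trigger type to recomputation scope. Trigger names mirror the list above; which triggers warrant partial versus full recomputation is an assumption for illustration.

```typescript
// Illustrative trigger-to-scope mapping for the re-evaluation flow.
type Trigger =
  | "evidence_updated"
  | "evidence_retracted"
  | "scenario_changed"
  | "claim_reclassified"
  | "contradiction_detected"
  | "federation_sync";

function recomputeScope(trigger: Trigger): "partial" | "full" {
  switch (trigger) {
    case "evidence_updated":
    case "federation_sync":
      // Only verdicts citing the changed item need updating.
      return "partial";
    default:
      // Structural changes (scenario, claim type, retraction,
      // contradiction) invalidate downstream reasoning chains.
      return "full";
  }
}
```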
AKEL Integration Summary
AKEL is fully documented in its own chapter; this section summarizes only its architectural integration:
- Receives raw input for claims
- Proposes scenario drafts
- Extracts and summarizes evidence
- Gives reliability hints
- Suggests draft verdicts
- Monitors contradictions
- Syncs metadata with trusted nodes
AKEL runs in parallel to human review — never overrides it.
Triple-Path Pipeline Architecture
```mermaid
graph TB
subgraph Input[User Input]
URL[URL Input]
TEXT[Text Input]
end
subgraph Shared[Shared Modules]
CONTEXTS[analysis-contexts.ts Context Detection]
AGG[aggregation.ts Verdict Aggregation]
CLAIM_D[claim-decomposition.ts]
EF[evidence-filter.ts ~330 lines]
QG[quality-gates.ts ~410 lines]
SR[source-reliability.ts ~620 lines]
VC[verdict-corrections.ts ~310 lines]
TS[truth-scale.ts ~280 lines]
BU[budgets.ts ~250 lines]
end
subgraph Dispatch[Pipeline Dispatch]
SELECT{Select Pipeline}
end
subgraph Pipelines[Pipeline Implementations]
ORCH[Orchestrated Pipeline]
CANON[Monolithic Canonical]
DYN[Monolithic Dynamic]
end
subgraph LLM[LLM Layer]
PROVIDER[AI SDK Provider]
end
subgraph Output[Result]
RESULT[AnalysisResult JSON]
REPORT[Markdown Report]
end
URL --> SELECT
TEXT --> SELECT
SELECT -->|orchestrated| ORCH
SELECT -->|monolithic_canonical| CANON
SELECT -->|monolithic_dynamic| DYN
CONTEXTS --> ORCH
CONTEXTS --> CANON
AGG --> ORCH
AGG --> CANON
CLAIM_D --> ORCH
CLAIM_D --> CANON
EF --> ORCH
QG --> ORCH
SR --> ORCH
SR --> CANON
SR --> DYN
VC --> ORCH
TS --> CANON
TS --> DYN
BU --> ORCH
BU --> CANON
BU --> DYN
ORCH --> PROVIDER
CANON --> PROVIDER
DYN --> PROVIDER
ORCH --> RESULT
CANON --> RESULT
DYN --> RESULT
RESULT --> REPORT
```
Pipeline Variants
| Variant | File | Lines | Approach | Output Schema |
|---|---|---|---|---|
| Orchestrated | orchestrated.ts | 13,300 | Multi-step workflow with explicit stages | Canonical (structured) |
| Monolithic Canonical | monolithic-canonical.ts | 1,500 | Single LLM tool-loop call | Canonical (structured) |
| Monolithic Dynamic | monolithic-dynamic.ts | 735 | Single LLM tool-loop call | Dynamic (flexible) |
Shared Modules
| Module | Lines | Used By | Purpose |
|---|---|---|---|
| analysis-contexts.ts | 600 | Orch, Canon | Heuristic context pre-detection before LLM |
| aggregation.ts | 400 | Orch, Canon | Verdict weighting, contestation validation |
| claim-decomposition.ts | n/a | Orch, Canon | Claim text parsing and normalization |
| evidence-filter.ts | 330 | Orch | Probative value filtering, false positive rate calculation |
| quality-gates.ts | 410 | Orch | Gate 1 (claim validation) and Gate 4 (verdict confidence) |
| source-reliability.ts | 620 | Orch, Canon, Dyn | LLM-based source reliability evaluation with cache |
| verdict-corrections.ts | 310 | Orch | Post-hoc verdict direction mismatch corrections |
| truth-scale.ts | 280 | Canon, Dyn | Percentage-to-verdict label mapping |
| budgets.ts | 250 | Orch, Canon, Dyn | Token/cost budget tracking and enforcement |
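The percentage-to-verdict mapping provided by truth-scale.ts can be illustrated with a simple banding function. The band boundaries and labels below are assumptions for illustration, not the actual scale.

```typescript
// Hypothetical percentage-to-verdict mapping in the spirit of
// truth-scale.ts; bands and labels are illustrative assumptions.
function verdictLabel(pct: number): string {
  if (pct >= 80) return "True";
  if (pct >= 60) return "Mostly True";
  if (pct >= 40) return "Mixed";
  if (pct >= 20) return "Mostly False";
  return "False";
}
```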
Orchestrated Pipeline Steps
1. Understand - Detect input type, extract claims, identify dependencies
2. Research (iterative) - Generate queries, fetch sources, extract evidence
3. Verdict Generation - Generate claim and article verdicts
4. Summary - Build two-panel summary
5. Report - Generate markdown report
Detailed Pipeline Diagrams
For internal implementation details of each pipeline variant:
- Orchestrated Pipeline Internal - 7-step staged workflow (13,300 lines)
- Monolithic Canonical Internal - Single-context canonical output (1,500 lines)
- Monolithic Dynamic Internal - Flexible experimental output (735 lines)
Federated Architecture
Each FactHarbor node:
- Has its own dataset (claims, scenarios, evidence, verdicts)
- Runs its own AKEL
- Maintains local governance and reviewer rules
- May partially mirror global or domain-specific data
- Contributes to global knowledge clusters
Nodes synchronize via:
- Signed version bundles
- Merkle-tree lineage structures
- Optionally IPFS for evidence
- Trust-weighted acceptance
Benefits:
- Community independence
- Scalability
- Resilience
- Domain specialization
Federation Architecture (Future)
```mermaid
graph LR
FH1[FactHarbor Instance 1]
FH2[FactHarbor Instance 2]
FH3[FactHarbor Instance 3]
FH1 -.->|V1.0+ Sync claims| FH2
FH2 -.->|V1.0+ Sync claims| FH3
FH3 -.->|V1.0+ Sync claims| FH1
U1[Users] --> FH1
U2[Users] --> FH2
U3[Users] --> FH3
```
Federation Architecture - Future (V1.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.
Target Features
| Feature | Purpose | Status |
|---|---|---|
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |
Current Implementation
- Single-instance deployment only
- No inter-instance communication
- All data stored locally in SQLite
Request → Verdict Flow
Simple end-to-end flow:
User → UI Frontend → REST API → FactHarbor Core
→ (Claim Processing → Scenario Engine → Evidence Repository → Verdict Engine)
→ Summary View → UI Frontend → User
Federation Sync Workflow
Sequence:
Detect Local Change → Build Signed Bundle → Push to Peers → Validate Signature → Merge or Fork → Trigger Re-evaluation
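The Build Signed Bundle and Validate Signature steps can be sketched with standard asymmetric signing. Ed25519 is an illustrative choice here; the document does not specify the signature scheme, and the function names are assumptions.

```typescript
import { generateKeyPairSync, sign, verify } from "node:crypto";

// A node signs a version bundle before pushing it to peers; peers verify
// the signature before merging. Scheme and names are illustrative.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

function signBundle(bundle: object): { payload: string; signature: string } {
  const payload = JSON.stringify(bundle);
  // Ed25519 one-shot signing: algorithm argument is null.
  const signature = sign(null, Buffer.from(payload), privateKey).toString("base64");
  return { payload, signature };
}

function verifyBundle(b: { payload: string; signature: string }): boolean {
  return verify(null, Buffer.from(b.payload), publicKey, Buffer.from(b.signature, "base64"));
}
```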
Versioning Architecture
All entities (Claim, Scenario, Evidence, Verdict) use immutable version chains:
- VersionID
- ParentVersionID
- Timestamp
- AuthorType (Human, AI, ExternalNode)
- ChangeReason
- Signature (optional POC, required in 1.0)
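The fields above can be expressed as a TypeScript shape with an append-only helper. This is a sketch: the field names follow the list, while the id-generation strategy is an assumption for illustration.

```typescript
// Immutable version chain: every change appends a new version that
// points at its parent; existing versions are never mutated.
type AuthorType = "Human" | "AI" | "ExternalNode";

interface Version {
  versionId: string;
  parentVersionId: string | null; // null for the first version
  timestamp: string;
  authorType: AuthorType;
  changeReason: string;
  signature?: string; // optional in POC, required in 1.0
}

function appendVersion(
  chain: Version[],
  authorType: AuthorType,
  changeReason: string,
): Version[] {
  const parent = chain.length > 0 ? chain[chain.length - 1].versionId : null;
  const next: Version = {
    versionId: `v${chain.length + 1}`, // illustrative id scheme
    parentVersionId: parent,
    timestamp: new Date().toISOString(),
    authorType,
    changeReason,
  };
  return [...chain, next]; // append-only: the old chain is untouched
}
```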
UCM Configuration Versioning Architecture
```mermaid
graph LR
ADMIN[UCM Administrator] -->|creates| BLOB[Config Blob - immutable]
BLOB -->|content-addressed| STORE[(config_blobs)]
ADMIN -->|activates| ACTIVE[config_active]
ACTIVE -->|points to| BLOB
JOB[Analysis Job] -->|snapshots at start| USAGE[config_usage]
USAGE -->|references| BLOB
REPORT[Analysis Report] -->|cites| USAGE
```
How UCM Config Versioning Works
| Concept | Description |
|---|---|
| config_blobs | Immutable, content-addressed config versions. Each change creates a new blob; old blobs are never deleted. |
| config_active | Pointer to the currently active config blob per config type. Changing this activates a new config version. |
| config_usage | Links each analysis job to the exact config snapshot used. Enables reproducibility. |
| Immutability | Analysis outputs are never edited. To improve results, update UCM config and re-analyse. |
Current Implementation (v2.10.2)
| Feature | Status |
|---|---|
| UCM config storage | Implemented (config.db SQLite) |
| Config hot-reload | Implemented (60s TTL) |
| Per-job config snapshots | Implemented (job_config_snapshots) |
| Content-addressed blobs | Implemented (hash-based deduplication) |
| Config activation tracking | Implemented (config_active table) |
| Admin UI for config management | Not yet implemented (CLI/direct DB) |
Design Principles
- Every config change creates a new immutable blob — no in-place mutation
- Every analysis job records the config snapshot used at time of execution
- Reports can be reproduced by re-running with the same config snapshot
- Config history is the audit trail — who changed what, when, and why
- Analysis data is never edited — "improve the system, not the data"