Architecture
FactHarbor's architecture is designed for simplicity, automation, and continuous improvement.
1. Core Principles
- AI-First: AKEL (the AI layer) is the primary system; humans supplement it
- Publish by Default: No centralized approval (removed in V0.9.50), publish with confidence scores
- System Over Data: Fix algorithms, not individual outputs
- Measure Everything: Quality metrics drive improvements
- Scale Through Automation: Minimal human intervention
- Start Simple: Add complexity only when metrics prove necessary
2. High-Level Architecture
```mermaid
graph TB
    subgraph Client[Client Layer]
        BROWSER[Web Browser]
    end
    subgraph NextJS[Next.js Web App]
        ANALYZE[analyze page]
        JOBS[jobs page]
        JOBVIEW[jobs id page]
        ANALYZE_API[api fh analyze]
        JOBS_API[api fh jobs]
        RUN_JOB[api internal run-job]
        ORCH[orchestrated.ts]
        CANON[monolithic-canonical.ts]
        SHARED[Shared Modules]
        WEBSEARCH[web-search.ts]
        SR[source-reliability.ts]
    end
    subgraph DotNet[.NET API]
        DOTNET_API[ASP.NET Core API]
        JOBS_CTRL[JobsController]
        HEALTH_CTRL[HealthController]
        SQLITE[(SQLite factharbor.db)]
    end
    subgraph External[External Services]
        LLM[LLM Providers]
        SEARCH[Search Providers]
    end
    BROWSER --> ANALYZE
    BROWSER --> JOBS
    BROWSER --> JOBVIEW
    ANALYZE --> ANALYZE_API
    ANALYZE_API --> DOTNET_API
    DOTNET_API --> SQLITE
    RUN_JOB --> ORCH
    RUN_JOB --> CANON
    ORCH --> SHARED
    CANON --> SHARED
    ORCH --> LLM
    CANON --> LLM
    ORCH --> WEBSEARCH
    WEBSEARCH --> SEARCH
    SHARED --> SR
```
Component Summary
| Component | Technology | Purpose |
|---|---|---|
| Web App | Next.js 14+ | UI, API routes, AKEL pipeline |
| API | ASP.NET Core 8.0 | Job persistence, health checks |
| Database | SQLite (3 databases) | Jobs/events, UCM config, SR cache |
| LLM | AI SDK (Vercel) | Multi-provider LLM abstraction with model tiering |
| Search | Google CSE / SerpAPI | Web search for evidence |
Key Files
| File | Lines | Purpose |
|---|---|---|
| orchestrated.ts | 13300 | Main orchestrated pipeline |
| monolithic-canonical.ts | 1500 | Monolithic canonical pipeline |
| analysis-contexts.ts | 600 | AnalysisContext pre-detection |
| aggregation.ts | 400 | Verdict aggregation + claim weighting |
| evidence-filter.ts | 300 | Deterministic evidence quality filtering |
| source-reliability.ts | 500 | LLM-based source reliability scoring |
Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| FH_SEARCH_ENABLED | true | Enable web search |
| FH_DETERMINISTIC | true | Zero temperature |
| FH_API_URL | localhost:5139 | .NET API endpoint |
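These flags arrive as plain strings in the environment. A hedged sketch of how they might be parsed follows; the `envFlag` helper and its truthiness rules are assumptions for illustration, not FactHarbor's actual code:

```typescript
// Hypothetical helper: treat anything except "false"/"0" as enabled.
function envFlag(name: string, defaultValue: boolean): boolean {
  const raw = process.env[name];
  if (raw === undefined || raw === "") return defaultValue;
  return raw.toLowerCase() !== "false" && raw !== "0";
}

// Defaults mirror the table above.
const searchEnabled = envFlag("FH_SEARCH_ENABLED", true);
const deterministic = envFlag("FH_DETERMINISTIC", true);
const apiUrl = process.env.FH_API_URL ?? "http://localhost:5139";
```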
2.1 Three-Layer Architecture
FactHarbor uses a clean three-layer architecture:
Interface Layer
Handles all user and system interactions:
- Web UI: Browse claims, view evidence, submit feedback
- REST API: Programmatic access for integrations
- Authentication & Authorization: User identity and permissions
- Rate Limiting: Protect against abuse
Processing Layer
Core business logic and AI processing:
- AKEL Pipeline: AI-driven claim analysis (parallel processing)
- Parse and extract claim components
- Gather evidence from multiple sources
- Check source track records
- Extract scenarios from evidence
- Synthesize verdicts
- Calculate risk scores
- Background Jobs: Automated maintenance tasks
- Source track record updates (weekly)
- Cache warming and invalidation
- Metrics aggregation
- Data archival
- Quality Monitoring: Automated quality checks
- Anomaly detection
- Contradiction detection
- Completeness validation
- Moderation Detection: Automated abuse detection
- Spam identification
- Manipulation detection
- Flag suspicious activity
Data & Storage Layer
Persistent data storage and caching:
- PostgreSQL: Primary database for all core data
- Claims, evidence, sources, users
- Scenarios, edits, audit logs
- Built-in full-text search
- Time-series capabilities for metrics
- Redis: High-speed caching layer
- Session data
- Frequently accessed claims
- API rate limiting
- S3 Storage: Long-term archival
- Old edit history (90+ days)
- AKEL processing logs
- Backup snapshots
Optional future additions (add only when metrics prove necessary):
- Elasticsearch: If PostgreSQL full-text search becomes slow
- TimescaleDB: If metrics queries become a bottleneck
2.2 Design Philosophy
Start Simple, Evolve Based on Metrics
The architecture deliberately starts simple:
- Single primary database (PostgreSQL handles most workloads initially)
- Three clear layers (easy to understand and maintain)
- Automated operations (minimal human intervention)
- Measure before optimizing (add complexity only when proven necessary)
See Design Decisions and When to Add Complexity for detailed rationale.
3. AKEL Architecture
See AI Knowledge Extraction Layer (AKEL) for detailed information.
3.5 Claim Processing Architecture
FactHarbor's claim processing architecture is designed to handle both single-claim and multi-claim submissions efficiently.
Multi-Claim Handling
Users often submit:
- Text with multiple claims: Articles, statements, or paragraphs containing several distinct factual claims
- Web pages: URLs that are analyzed to extract all verifiable claims
- Single claims: Simple, direct factual statements
The first processing step is always Claim Extraction: identifying and isolating individual verifiable claims from submitted content.
Processing Phases
POC Implementation (Two-Phase):
Phase 1 - Claim Extraction:
- LLM analyzes submitted content
- Extracts all distinct, verifiable claims
- Returns structured list of claims with context
Phase 2 - Parallel Analysis:
- Each claim processed independently by LLM
- Single call per claim generates: Evidence, Scenarios, Sources, Verdict, Risk
- Parallelized across all claims
- Results aggregated for presentation
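The Phase 2 fan-out can be sketched as below; `analyzeClaim` is a stand-in for the real single LLM call per claim, so the names and result shape are illustrative only:

```typescript
type ClaimAnalysis = { claim: string; verdict: string; confidence: number };

// Stand-in for the real per-claim LLM call that generates evidence,
// scenarios, sources, verdict, and risk in one shot.
async function analyzeClaim(claim: string): Promise<ClaimAnalysis> {
  return { claim, verdict: "unverified", confidence: 0.5 };
}

// Claims are independent, so they fan out in parallel and results
// are aggregated in submission order for presentation.
async function analyzeAll(claims: string[]): Promise<ClaimAnalysis[]> {
  return Promise.all(claims.map((c) => analyzeClaim(c)));
}
```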
Production Implementation (Three-Phase):
Phase 1 - Extraction + Validation:
- Extract claims from content
- Validate clarity and uniqueness
- Filter vague or duplicate claims
Phase 2 - Evidence Gathering (Parallel):
- Independent evidence gathering per claim
- Source validation and scenario generation
- Quality gates prevent poor data from advancing
Phase 3 - Verdict Generation (Parallel):
- Generate verdict from validated evidence
- Confidence scoring and risk assessment
- Low-confidence cases routed to human review
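The routing step at the end of Phase 3 amounts to a threshold check. A minimal sketch, assuming a 0.6 cut-off and illustrative type names (neither is the real configuration):

```typescript
type Route = "publish" | "human-review";

// Publish by default; only low-confidence verdicts go to a human.
function routeVerdict(confidence: number, threshold = 0.6): Route {
  return confidence >= threshold ? "publish" : "human-review";
}
```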
Architectural Benefits
Scalability:
- Process 100 claims in roughly 3x the latency of a single claim
- Parallel processing across independent claims
- Linear cost scaling with claim count
Quality:
- Validation gates between phases
- Errors isolated to individual claims
- Clear observability per processing step
Flexibility:
- Each phase optimizable independently
- Can use different model sizes per phase
- Easy to add human review at decision points
4. Storage Architecture
1. Current Implementation (v2.10.2)
1.1 Three-Database Architecture
```mermaid
graph TB
    subgraph NextJS[Next.js Web App]
        PIPELINE[Orchestrated Pipeline]
        CONFIG_SVC[Config Storage]
        SR_SVC[SR Cache]
    end
    subgraph DotNet[.NET API]
        CONTROLLERS[Controllers]
        EF[Entity Framework]
    end
    CONFIG_DB[(config.db)]
    SR_DB[(source-reliability.db)]
    FH_DB[(factharbor.db)]
    CONFIG_SVC --> CONFIG_DB
    SR_SVC --> SR_DB
    PIPELINE -->|via API| CONTROLLERS
    CONTROLLERS --> EF
    EF --> FH_DB
```
| Database | Purpose | Access Layer | Key Tables |
|---|---|---|---|
| factharbor.db | Jobs, events, analysis results | .NET API (Entity Framework) | Jobs, JobEvents, AnalysisMetrics |
| config.db | UCM configuration management | Next.js (better-sqlite3) | config_blobs, config_active, config_usage, job_config_snapshots |
| source-reliability.db | Source reliability cache | Next.js (better-sqlite3) | source_reliability |
1.2 Current Caching Mechanisms
| What | Mechanism | TTL | Status |
|---|---|---|---|
| Source reliability scores | SQLite + batch prefetch to in-memory Map | 90 days (configurable via UCM) | IMPLEMENTED |
| UCM config values | In-memory Map with TTL-based expiry | 60 seconds | IMPLEMENTED |
| URL content (fetched pages) | Not cached | N/A | NOT IMPLEMENTED |
| Claim-level analysis results | Not cached | N/A | NOT IMPLEMENTED |
| LLM responses | Not cached | N/A | NOT IMPLEMENTED |
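The "in-memory Map with TTL-based expiry" row can be sketched as a small class. This is a minimal model of the pattern, not the actual UCM code:

```typescript
// Minimal TTL cache: entries expire ttlMs after being set.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  constructor(private ttlMs: number) {}

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

// UCM config values would use the 60-second TTL from the table above.
const configCache = new TtlCache<string>(60_000);
```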
1.3 Storage Patterns
- Analysis results: JSON blob in ResultJson column (per job), stored once by .NET API
- Config blobs: Content-addressable with SHA-256 hash as PK, history tracked
- Job config snapshots: Pipeline + search + SR config captured per job for auditability
- SR cache: Per-domain reliability assessment with multi-model consensus scores
Current limitations:
- No relational queries across claims, evidence, or sources from different analyses
- No full-text search on analysis content
- Single-writer limitation (SQLite) — fine for single-instance but blocks horizontal scaling
- Every analysis re-fetches URL content and recomputes all LLM calls from scratch
2. What Is Worth Caching?
2.1 Caching Value Analysis
| Cacheable Item | Estimated Savings | Latency Impact | Complexity | Recommendation |
|---|---|---|---|---|
| Claim-level results | 30-50% LLM cost on duplicate claims | None (cache lookup) | MEDIUM — needs canonical claim hash + TTL + prompt-version awareness | EVALUATE in Alpha |
| URL content | $0 API cost but 5-15s latency per source | Major — eliminates re-fetch | LOW — URL hash + content + timestamp | EVALUATE in Alpha |
| LLM responses | Highest per-call savings | None | HIGH — prompt hash + input hash, invalidation on prompt change | DEFER — claim-level caching captures most benefit |
| Search query results | Marginal — search APIs are cheap | Minor | MEDIUM — results go stale quickly | NOT RECOMMENDED — volatile, low ROI |
2.2 Cost Impact Modeling
Assuming $0.10-$2.00 per analysis (depending on article complexity and model tier):
| Usage Level | Current Cost/day | With Claim Cache (-35%) | With URL Cache |
|---|---|---|---|
| 10 analyses/day | $1-20 | $0.65-13 | Same cost, 30-60s faster |
| 100 analyses/day | $10-200 | $6.50-130 | Same cost, 5-15 min faster |
| 1000 analyses/day | $100-2,000 | $65-1,300 | Same cost, 50-150 min faster |
Key insight: Claim caching saves money; URL caching saves time. Both follow the existing SQLite + in-memory Map pattern from source reliability.
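The "canonical claim hash + prompt-version awareness" requirement from 2.1 might be derived like this; the normalization rules and the version tag are assumptions, not the implemented scheme:

```typescript
import { createHash } from "node:crypto";

// Cache key: SHA-256 over a normalized claim plus the prompt version,
// so whitespace/case variants hit the same entry and any prompt change
// invalidates the whole cache by shifting every key.
function claimCacheKey(claimText: string, promptVersion: string): string {
  const canonical = claimText.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(`${promptVersion}\n${canonical}`).digest("hex");
}
```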
3. Redis: Do We Still Need It?
3.1 Current Reality Assessment
| Original Redis Use Case | Current Solution | Gap? |
|---|---|---|
| Hot data caching | In-memory Map (config), SQLite (SR) | No gap at current scale |
| Session management | No user auth = no sessions | Not needed until Beta |
| Rate limiting | Not implemented | Can be in-process for single-instance |
| Pub/sub for real-time | SSE events work without Redis | No gap for single-instance |
3.2 When Redis Becomes Necessary
Redis adds value when:
- Multiple application instances need shared cache/state (horizontal scaling)
- Sub-millisecond cache lookups required (SQLite is 1-5ms, sufficient for current needs)
- Distributed rate limiting needed across multiple servers
Trigger criteria (following When-to-Add-Complexity philosophy):
- Single-instance SQLite cache latency >100ms
- Need for >1 application instance
- Rate limiting required across instances
4. PostgreSQL: When and Why?
4.1 Current SQLite Limitations
| Limitation | Impact | When It Hurts |
|---|---|---|
| JSON blob storage (no relational queries) | Cannot query across analyses | When browse/search is needed |
| Single-writer | No concurrent writes | When horizontal scaling is needed |
| No complex aggregation | Cannot run cross-analysis analytics | When quality dashboards need SQL |
| No full-text search | Cannot search claim text or evidence | When browse/search is needed |
4.2 What PostgreSQL Enables
- Browse/search claims across all analyses
- Quality metrics dashboards with SQL aggregation
- Evidence deduplication (FR54) with relational queries
- User accounts and permissions (Beta requirement)
- Multi-instance deployments
4.3 Migration Path
The .NET API already has PostgreSQL support configured (appsettings.json). Switching is a configuration change, not a code rewrite.
Note: Keep SQLite for config.db (portable) and source-reliability.db (standalone). Only factharbor.db needs PostgreSQL.
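The shape of such a switch might look like the fragment below. The key names are hypothetical; the actual appsettings.json layout in the .NET API may differ:

```json
{
  "Database": {
    "Provider": "Postgres"
  },
  "ConnectionStrings": {
    "Sqlite": "Data Source=factharbor.db",
    "Postgres": "Host=localhost;Database=factharbor;Username=factharbor"
  }
}
```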
5. Vector Database Assessment
Conclusion: Vector search is not required for core functionality. Vectors add value only for approximate similarity (near-duplicate claim detection, edge case clustering) and should remain optional and offline to preserve pipeline performance and determinism.
When to add: Only after Shadow Mode data collection proves that near-duplicate detection needs exceed text-hash capability. Start with lightweight normalization + n-gram overlap (no vector DB needed).
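The suggested normalization + n-gram overlap baseline can be sketched as word-trigram Jaccard similarity. The 0.5 threshold and the normalization rules here are assumptions:

```typescript
// Break normalized text into word trigrams.
function ngrams(text: string, n = 3): Set<string> {
  const words = text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, "") // strip punctuation, keep letters/digits
    .split(/\s+/)
    .filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    grams.add(words.slice(i, i + n).join(" "));
  }
  return grams;
}

// Jaccard overlap of two n-gram sets (claims shorter than n words
// both yield empty sets and are treated as identical here).
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const g of a) if (b.has(g)) shared++;
  return shared / (a.size + b.size - shared);
}

function nearDuplicate(a: string, b: string, threshold = 0.5): boolean {
  return jaccard(ngrams(a), ngrams(b)) >= threshold;
}
```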
6. Revised Storage Roadmap
Previous Roadmap (Superseded)
Phase 2: Migrate to PostgreSQL for normalized data
Phase 3: Add S3 for archives and backups
Current Roadmap
```mermaid
graph LR
    subgraph Phase1[Phase 1: Alpha]
        P1A[Expand SQLite caching]
        P1B[Keep 3-DB architecture]
    end
    subgraph Phase2[Phase 2: Beta]
        P2A[Add PostgreSQL for factharbor.db]
        P2B[Add normalized claim/evidence tables]
        P2C[Keep SQLite for config + SR]
    end
    subgraph Phase3[Phase 3: V1.0]
        P3A[Add Redis IF multi-instance needed]
        P3B[PostgreSQL primary for production]
    end
    subgraph Phase4[Phase 4: V1.0+]
        P4A[Vector DB IF Shadow Mode proves value]
        P4B[S3 IF storage exceeds 50GB]
    end
    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4
```
Phase 1 (Alpha): Evaluate and potentially add URL content cache + claim-level cache in SQLite. Keep 3-DB architecture and in-memory Map caches.
Phase 2 (Beta): Add PostgreSQL for factharbor.db (user data, normalized claims, search). Keep SQLite for config.db (portable) and source-reliability.db (standalone).
Phase 3 (V1.0): Add Redis ONLY IF multi-instance deployment required. PostgreSQL becomes primary for all production data.
Phase 4 (V1.0+): Add vector DB ONLY IF Shadow Mode data proves value. Add S3 ONLY IF storage exceeds 50GB.
7. Decision Summary
| Technology | Decision | When | Status |
|---|---|---|---|
| SQLite URL cache | EVALUATE | Alpha planning | Needs further analysis |
| SQLite claim cache | EVALUATE | Alpha planning | Needs further analysis |
| Redis | DEFER | Multi-instance | Agreed |
| PostgreSQL | EVALUATE | Alpha/Beta | Needs further analysis |
| Vector DB | DEFER | Post-Shadow Mode | Agreed |
| S3 | DEFER | V1.0+ | Agreed |
Related Pages
- POC to Alpha Transition — Phase redefinition (caching is Alpha milestone)
- When to Add Complexity — Decision philosophy
- Architecture — System architecture
- Data Model — Database schema
Document Status: PARTIALLY APPROVED (February 2026) — DEFER decisions agreed; EVALUATE items need Alpha-phase analysis
See Storage Strategy for detailed information.
4.5 Versioning Architecture
UCM Configuration Versioning Architecture
```mermaid
graph LR
    ADMIN[UCM Administrator] -->|creates| BLOB[Config Blob - immutable]
    BLOB -->|content-addressed| STORE[(config_blobs)]
    ADMIN -->|activates| ACTIVE[config_active]
    ACTIVE -->|points to| BLOB
    JOB[Analysis Job] -->|snapshots at start| USAGE[config_usage]
    USAGE -->|references| BLOB
    REPORT[Analysis Report] -->|cites| USAGE
```
How UCM Config Versioning Works
| Concept | Description |
|---|---|
| config_blobs | Immutable, content-addressed config versions. Each change creates a new blob; old blobs are never deleted. |
| config_active | Pointer to the currently active config blob per config type. Changing this activates a new config version. |
| config_usage | Links each analysis job to the exact config snapshot used. Enables reproducibility. |
| Immutability | Analysis outputs are never edited. To improve results, update UCM config and re-analyze. |
Current Implementation (v2.10.2)
| Feature | Status |
|---|---|
| UCM config storage | Implemented (config.db SQLite) |
| Config hot-reload | Implemented (60s TTL) |
| Per-job config snapshots | Implemented (job_config_snapshots) |
| Content-addressed blobs | Implemented (hash-based deduplication) |
| Config activation tracking | Implemented (config_active table) |
| Admin UI for config management | Not yet implemented (CLI/direct DB) |
Design Principles
- Every config change creates a new immutable blob — no in-place mutation
- Every analysis job records the config snapshot used at time of execution
- Reports can be reproduced by re-running with the same config snapshot
- Config history is the audit trail — who changed what, when, and why
- Analysis data is never edited — "improve the system, not the data"
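A minimal in-memory model of these content-addressed blobs follows; the serialization scheme and the Map standing in for the config_blobs table are assumptions, not the real implementation:

```typescript
import { createHash } from "node:crypto";

// Deterministic serialization: sort object keys at every depth so
// logically equal configs produce byte-identical JSON.
function canonicalize(value: unknown): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const obj = value as Record<string, unknown>;
  const keys = Object.keys(obj).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`).join(",")}}`;
}

// hash -> serialized blob; stand-in for the config_blobs table.
const blobs = new Map<string, string>();

// Write-once: identical configs deduplicate, and no blob is ever mutated.
function storeConfigBlob(config: Record<string, unknown>): string {
  const canonical = canonicalize(config);
  const hash = createHash("sha256").update(canonical).digest("hex");
  if (!blobs.has(hash)) blobs.set(hash, canonical);
  return hash;
}
```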
5. Automated Systems in Detail
FactHarbor relies heavily on automation to achieve scale and quality. Here's how each automated system works:
5.1 AKEL (AI Knowledge Extraction Layer)
What it does: Primary AI processing engine that analyzes claims automatically
Inputs:
- User-submitted claim text
- Existing evidence and sources
- Source track record database
Processing steps:
1. Parse & Extract: Identify key components, entities, assertions
2. Gather Evidence: Search web and database for relevant sources
3. Check Sources: Evaluate source reliability using track records
4. Extract Scenarios: Identify different contexts from evidence
5. Synthesize Verdict: Compile evidence assessment per scenario
6. Calculate Risk: Assess potential harm and controversy
Outputs:
- Structured claim record
- Evidence links with relevance scores
- Scenarios with context descriptions
- Verdict summary per scenario
- Overall confidence score
- Risk assessment
Timing: 10-18 seconds total (parallel processing)
5.2 Background Jobs
Source Track Record Updates (Weekly):
- Analyze claim outcomes from past week
- Calculate source accuracy and reliability
- Update source_track_record table
- Never triggered by individual claims (prevents circular dependencies)
Cache Management (Continuous):
- Warm cache for popular claims
- Invalidate cache on claim updates
- Monitor cache hit rates
Metrics Aggregation (Hourly):
- Roll up detailed metrics
- Calculate system health indicators
- Generate performance reports
Data Archival (Daily):
- Move old AKEL logs to S3 (90+ days)
- Archive old edit history
- Compress and backup data
5.3 Quality Monitoring
Automated checks run continuously:
- Anomaly Detection: Flag unusual patterns
- Sudden confidence score changes
- Unusual evidence distributions
- Suspicious source patterns
- Contradiction Detection: Identify conflicts
- Evidence that contradicts other evidence
- Claims with internal contradictions
- Source track record anomalies
- Completeness Validation: Ensure thoroughness
- Sufficient evidence gathered
- Multiple source types represented
- Key scenarios identified
5.4 Moderation Detection
Automated abuse detection:
- Spam Identification: Pattern matching for spam claims
- Manipulation Detection: Identify coordinated editing
- Gaming Detection: Flag attempts to game source scores
- Suspicious Activity: Log unusual behavior patterns
Human Review: Moderators review flagged items, and the system learns from their decisions
6. Scalability Strategy
6.1 Horizontal Scaling
Components scale independently:
- AKEL Workers: Add more processing workers as claim volume grows
- Database Read Replicas: Add replicas for read-heavy workloads
- Cache Layer: Redis cluster for distributed caching
- API Servers: Load-balanced API instances
6.2 Vertical Scaling
Individual components can be upgraded:
- Database Server: Increase CPU/RAM for PostgreSQL
- Cache Memory: Expand Redis memory
- Worker Resources: More powerful AKEL worker machines
6.3 Performance Optimization
Built-in optimizations:
- Denormalized Data: Cache summary data in claim records (70% fewer joins)
- Parallel Processing: AKEL pipeline processes in parallel (40% faster)
- Intelligent Caching: Redis caches frequently accessed data
- Background Processing: Non-urgent tasks run asynchronously
7. Monitoring & Observability
7.1 Key Metrics
System tracks:
- Performance: AKEL processing time, API response time, cache hit rate
- Quality: Confidence score distribution, evidence completeness, contradiction rate
- Usage: Claims per day, active users, API requests
- Errors: Failed AKEL runs, API errors, database issues
7.2 Alerts
Automated alerts for:
- Processing time >30 seconds (threshold breach)
- Error rate >1% (quality issue)
- Cache hit rate <80% (cache problem)
- Database connections >80% capacity (scaling needed)
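Those four rules reduce to a pure threshold check; the metric field names below are assumptions for illustration:

```typescript
type Metrics = {
  processingSeconds: number;
  errorRate: number;          // 0..1
  cacheHitRate: number;       // 0..1
  dbConnUtilization: number;  // 0..1
};

// Thresholds mirror the alert list above.
function activeAlerts(m: Metrics): string[] {
  const alerts: string[] = [];
  if (m.processingSeconds > 30) alerts.push("processing-time");
  if (m.errorRate > 0.01) alerts.push("error-rate");
  if (m.cacheHitRate < 0.8) alerts.push("cache-hit-rate");
  if (m.dbConnUtilization > 0.8) alerts.push("db-connections");
  return alerts;
}
```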
7.3 Dashboards
Real-time monitoring:
- System Health: Overall status and key metrics
- AKEL Performance: Processing time breakdown
- Quality Metrics: Confidence scores, completeness
- User Activity: Usage patterns, peak times
8. Security Architecture
8.1 Authentication & Authorization
- User Authentication: Secure login with password hashing
- Role-Based Access: Reader, Contributor, Moderator, Admin
- API Keys: For programmatic access
- Rate Limiting: Prevent abuse
8.2 Data Security
- Encryption: TLS for transport, encrypted storage for sensitive data
- Audit Logging: Track all significant changes
- Input Validation: Sanitize all user inputs
- SQL Injection Protection: Parameterized queries
8.3 Abuse Prevention
- Rate Limiting: Prevent flooding and DDoS
- Automated Detection: Flag suspicious patterns
- Human Review: Moderators investigate flagged content
- Ban Mechanisms: Block abusive users/IPs
9. Deployment Architecture
9.1 Production Environment
Components:
- Load Balancer (HAProxy or cloud LB)
- Multiple API servers (stateless)
- AKEL worker pool (auto-scaling)
- PostgreSQL primary + read replicas
- Redis cluster
- S3-compatible storage
Regions: Single region for V1.0, multi-region when needed
9.2 Development & Staging
Development: Local Docker Compose setup
Staging: Scaled-down production replica
CI/CD: Automated testing and deployment
9.3 Disaster Recovery
- Database Backups: Daily automated backups to S3
- Point-in-Time Recovery: Transaction log archival
- Replication: Real-time replication to standby
- Recovery Time Objective: <4 hours
9.5 Federation Architecture Diagram
Federation Architecture (Future)
```mermaid
graph LR
    FH1[FactHarbor Instance 1]
    FH2[FactHarbor Instance 2]
    FH3[FactHarbor Instance 3]
    FH1 -.->|V1.0+ Sync claims| FH2
    FH2 -.->|V1.0+ Sync claims| FH3
    FH3 -.->|V1.0+ Sync claims| FH1
    U1[Users] --> FH1
    U2[Users] --> FH2
    U3[Users] --> FH3
```
Federation Architecture - Future (V1.0+): Independent FactHarbor instances can sync claims for broader reach while maintaining local control.
Target Features
| Feature | Purpose | Status |
|---|---|---|
| Claim synchronization | Share verified claims across instances | Not implemented |
| Cross-node audits | Distributed quality assurance | Not implemented |
| Local control | Each instance maintains autonomy | N/A |
| Contradiction detection | Cross-instance contradiction checking | Not implemented |
Current Implementation
- Single-instance deployment only
- No inter-instance communication
- All data stored locally in SQLite
10. Future Architecture Evolution
10.1 When to Add Complexity
See When to Add Complexity for specific triggers.
Elasticsearch: When PostgreSQL search consistently >500ms
TimescaleDB: When metrics queries consistently >1s
Federation: When 10,000+ users and explicit demand
Complex Reputation: When 100+ active contributors
10.2 Federation (V2.0+)
Deferred until:
- Core product proven with 10,000+ users
- User demand for decentralization
- Single-node limits reached
See Federation & Decentralization for future plans.
11. Technology Stack Summary
Backend:
- Python (FastAPI or Django)
- PostgreSQL (primary database)
- Redis (caching)
Frontend:
- Modern JavaScript framework (React, Vue, or Svelte)
- Server-side rendering for SEO
AI/LLM:
- Multi-provider orchestration (Claude, GPT-4, local models)
- Fallback and cross-checking support
Infrastructure:
- Docker containers
- Kubernetes or cloud platform auto-scaling
- S3-compatible object storage
Monitoring:
- Prometheus + Grafana
- Structured logging (ELK or cloud logging)
- Error tracking (Sentry)