Version 1.1 by Robert Schaub on 2026/01/02 09:59

= FactHarbor POC1 Architecture Analysis =

**Version:** 2.6.17
**Analysis Date:** January 2026
**Document Purpose:** Technical diagrams, gap analysis, and optimization recommendations

---

== 1. AKEL Flow Diagram (with LLM and WebSearch Interactions) ==

{{mermaid}}
flowchart TB
    subgraph Input["📥 Input Layer"]
        URL[URL Input]
        TEXT[Text Input]
    end

    subgraph Retrieval["🔍 Content Retrieval"]
        FETCH[extractTextFromUrl]
        PDF[PDF Parser<br/>pdf-parse v1]
        HTML[HTML Parser<br/>cheerio]
    end

    subgraph AKEL["🧠 AKEL Pipeline"]
        direction TB

        subgraph Step1["Step 1: Understand"]
            UNDERSTAND[understandClaim<br/>━━━━━━━━━━━━━<br/>• Detect input type<br/>• Extract claims<br/>• Identify dependencies<br/>• Assign risk tiers]
            LLM1[("🤖 LLM Call #1<br/>Claude/GPT/Gemini")]
        end

        subgraph Step2["Step 2: Research (Iterative)"]
            DECIDE[decideNextResearch<br/>━━━━━━━━━━━━━<br/>• Generate queries<br/>• Focus areas]

            SEARCH[("🌐 Web Search<br/>Google CSE / SerpAPI")]

            FETCHSRC[fetchSourceContent<br/>━━━━━━━━━━━━━<br/>• Parallel fetching<br/>• Timeout handling]

            EXTRACT[extractFacts<br/>━━━━━━━━━━━━━<br/>• Parse sources<br/>• Extract facts]
            LLM2[("🤖 LLM Call #2-N<br/>Per source")]
        end

        subgraph Step3["Step 3: Verdict Generation"]
            VERDICT[generateVerdicts<br/>━━━━━━━━━━━━━<br/>• Claim verdicts<br/>• Article verdict<br/>• Dependency propagation]
            LLM3[("🤖 LLM Call #N+1<br/>Final synthesis")]
        end

        subgraph Step4["Step 4: Report"]
            REPORT[buildTwoPanelSummary<br/>━━━━━━━━━━━━━<br/>• Format results<br/>• Generate markdown]
        end
    end

    subgraph Output["📤 Output"]
        RESULT[AnalysisResult JSON]
        MARKDOWN[Report Markdown]
    end

    %% Flow connections
    URL --> FETCH
    TEXT --> UNDERSTAND
    FETCH --> PDF
    FETCH --> HTML
    PDF --> UNDERSTAND
    HTML --> UNDERSTAND

    UNDERSTAND --> LLM1
    LLM1 --> DECIDE

    DECIDE --> SEARCH
    SEARCH --> FETCHSRC
    FETCHSRC --> EXTRACT
    EXTRACT --> LLM2
    LLM2 --> DECIDE

    DECIDE -->|"Research Complete"| VERDICT
    VERDICT --> LLM3
    LLM3 --> REPORT

    REPORT --> RESULT
    REPORT --> MARKDOWN

    %% Styling
    classDef llm fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef search fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef step fill:#f3e5f5,stroke:#4a148c,stroke-width:2px

    class LLM1,LLM2,LLM3 llm
    class SEARCH search
    class UNDERSTAND,DECIDE,FETCHSRC,EXTRACT,VERDICT,REPORT step
{{/mermaid}}
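The control flow in the diagram can be read as a loop around the research step. The following is a minimal sketch under stated assumptions: the step names come from the diagram, but their signatures, the `PipelineSteps` interface, the `search` member, and the `maxIterations` cap are hypothetical and not taken from `analyzer.ts`.

```typescript
// Hypothetical shapes: the real analyzer.ts signatures are not shown in this document.
interface ResearchPlan { complete: boolean; queries: string[] }

interface PipelineSteps<C, F, V> {
  understandClaim(input: string): Promise<C[]>;                 // LLM call #1
  decideNextResearch(claims: C[], facts: F[]): Promise<ResearchPlan>;
  search(query: string): Promise<string[]>;                     // returns source URLs (assumed)
  fetchSourceContent(urls: string[]): Promise<string[]>;        // returns page texts (assumed)
  extractFacts(sourceText: string, claims: C[]): Promise<F[]>;  // one LLM call per source
  generateVerdicts(claims: C[], facts: F[]): Promise<V[]>;      // final LLM call
}

async function runPipeline<C, F, V>(
  input: string,
  steps: PipelineSteps<C, F, V>,
  maxIterations = 3, // assumed cap on research rounds
): Promise<{ verdicts: V[]; facts: F[]; iterations: number }> {
  const claims = await steps.understandClaim(input);
  const facts: F[] = [];
  let iterations = 0;

  while (iterations < maxIterations) {
    const plan = await steps.decideNextResearch(claims, facts);
    if (plan.complete) break; // the "Research Complete" edge in the diagram
    iterations++;
    for (const query of plan.queries) {
      const urls = await steps.search(query);
      const texts = await steps.fetchSourceContent(urls);
      for (const text of texts) {
        const extracted = await steps.extractFacts(text, claims);
        facts.push(...extracted);
      }
    }
  }

  const verdicts = await steps.generateVerdicts(claims, facts);
  return { verdicts, facts, iterations };
}
```

Passing the steps in as an interface keeps the loop testable with stubs. In this sketch the number of LLM calls grows with the number of fetched sources per round, which matches the "LLM Call #2-N, per source" node in the diagram.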

---

== 2. ERD Data Model (Current POC1 Implementation) ==

{{mermaid}}
erDiagram
    JOB ||--o{ JOB_EVENT : "has"
    JOB ||--|| ANALYSIS_RESULT : "produces"
    ANALYSIS_RESULT ||--o{ CLAIM_VERDICT : "contains"
    ANALYSIS_RESULT ||--o{ FETCHED_SOURCE : "references"
    ANALYSIS_RESULT ||--o{ EXTRACTED_FACT : "contains"
    CLAIM_VERDICT }o--o{ EXTRACTED_FACT : "supported by"
    FETCHED_SOURCE ||--o{ EXTRACTED_FACT : "provides"
    CLAIM_VERDICT ||--o{ CLAIM_VERDICT : "depends on"

    JOB {
        string JobId PK "GUID"
        string Status "QUEUED|RUNNING|COMPLETE|FAILED"
        int Progress "0-100"
        datetime CreatedUtc
        datetime UpdatedUtc
        string InputType "text|url"
        string InputValue "URL or text content"
        string InputPreview "First 100 chars"
        json ResultJson "Full analysis result"
        string ReportMarkdown "Formatted report"
    }

    JOB_EVENT {
        long Id PK
        string JobId FK
        datetime TsUtc
        string Level "info|warn|error"
        string Message
    }

    ANALYSIS_RESULT {
        string schemaVersion "2.6.17"
        string inputType "question|claim|article"
        boolean isQuestion
        string articleThesis
        int articleTruthPercentage "0-100"
        string articleVerdict "7-point scale"
        json claimPattern "total/supported/uncertain/refuted"
        boolean isPseudoscience
        int llmCalls "Total LLM invocations"
        json searchQueries "All search queries"
    }

    CLAIM_VERDICT {
        string claimId PK "SC1, SC2, etc."
        string claimText
        boolean isCentral
        string claimRole "attribution|source|timing|core"
        string_array dependsOn "Prerequisite claim IDs"
        boolean dependencyFailed
        string llmVerdict "WELL-SUPPORTED|PARTIALLY-SUPPORTED|UNCERTAIN|REFUTED"
        string verdict "7-point: True to False"
        int confidence "0-100"
        int truthPercentage "0-100"
        string riskTier "A|B|C"
        string reasoning
        string_array supportingFactIds
        string highlightColor "green to dark-red"
    }

    FETCHED_SOURCE {
        string id PK "S1, S2, etc."
        string url
        string title
        int trackRecordScore "0-100 or null"
        string fullText "Extracted content"
        datetime fetchedAt
        string category "legal|news|academic"
        boolean fetchSuccess
        string searchQuery "Which query found this"
    }

    EXTRACTED_FACT {
        string id PK "S1-F1, S1-F2, etc."
        string fact "The factual statement"
        string category "legal_provision|evidence|expert_quote|statistic|event|criticism"
        string specificity "high|medium"
        string sourceId FK
        string sourceUrl
        string sourceTitle
        string sourceExcerpt
        string relatedProceedingId
        boolean isContestedClaim
        string claimSource
    }
{{/mermaid}}
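For readers who prefer types to diagrams, part of the ERD can be written as TypeScript interfaces. This is a sketch only: the field names follow the ERD, but the typing, the two helper functions, and the rule that only a `REFUTED` prerequisite counts as a failed dependency are assumptions, not the actual POC1 code.

```typescript
// Field names follow the ERD above; the TypeScript typing itself is an assumption.
type RiskTier = "A" | "B" | "C";
type LlmVerdict = "WELL-SUPPORTED" | "PARTIALLY-SUPPORTED" | "UNCERTAIN" | "REFUTED";

interface ExtractedFact {
  id: string;       // "S1-F1", "S1-F2", ...
  fact: string;
  sourceId: string; // FK to FETCHED_SOURCE.id
  isContestedClaim: boolean;
}

interface ClaimVerdict {
  claimId: string;  // "SC1", "SC2", ...
  claimText: string;
  dependsOn: string[];
  dependencyFailed: boolean;
  llmVerdict: LlmVerdict;
  truthPercentage: number; // 0-100
  riskTier: RiskTier;
  supportingFactIds: string[];
}

// Resolve supportingFactIds against a fact list (the "supported by" relation).
function factsFor(verdict: ClaimVerdict, facts: ExtractedFact[]): ExtractedFact[] {
  const byId = new Map<string, ExtractedFact>(
    facts.map((f): [string, ExtractedFact] => [f.id, f]),
  );
  return verdict.supportingFactIds
    .map((id) => byId.get(id))
    .filter((f): f is ExtractedFact => f !== undefined); // skip dangling IDs
}

// Flag claims whose prerequisites were refuted (the "depends on" relation).
// Treating only REFUTED as a failed dependency is an assumed rule.
function markFailedDependencies(verdicts: ClaimVerdict[]): ClaimVerdict[] {
  const refuted = new Set(
    verdicts.filter((v) => v.llmVerdict === "REFUTED").map((v) => v.claimId),
  );
  return verdicts.map((v) => ({
    ...v,
    dependencyFailed: v.dependsOn.some((id) => refuted.has(id)),
  }));
}
```

Because claims and facts live only inside `ResultJson` in POC1, ID references such as `supportingFactIds` are resolved in memory rather than through database joins.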

---

== 3. Overall Architecture with Interactions ==

{{mermaid}}
flowchart TB
    subgraph Client["🖥️ Client Layer"]
        BROWSER[Web Browser]
        ANALYZE_PAGE["/analyze page<br/>React + TailwindCSS"]
        JOBS_PAGE["/jobs page<br/>Job history & status"]
    end

    subgraph NextJS["⚡ Next.js Web App (apps/web)"]
        direction TB

        subgraph API_Routes["API Routes"]
            ANALYZE_API["/api/fh/analyze<br/>━━━━━━━━━━━━━<br/>POST: Create job"]
            JOBS_API["/api/fh/jobs<br/>━━━━━━━━━━━━━<br/>GET: List jobs<br/>POST: Create job"]
            JOB_API["/api/fh/jobs/[id]<br/>━━━━━━━━━━━━━<br/>GET: Job status"]
            EVENTS_API["/api/fh/jobs/[id]/events<br/>━━━━━━━━━━━━━<br/>GET: Job events (SSE)"]
            RUN_JOB["/api/internal/run-job<br/>━━━━━━━━━━━━━<br/>POST: Execute analysis"]
        end

        subgraph Lib["Core Libraries"]
            ANALYZER["analyzer.ts<br/>━━━━━━━━━━━━━<br/>AKEL Pipeline<br/>2918 lines"]
            RETRIEVAL["retrieval.ts<br/>━━━━━━━━━━━━━<br/>URL content extraction"]
            WEBSEARCH["web-search.ts<br/>━━━━━━━━━━━━━<br/>Search abstraction"]
            MBFC["mbfc-loader.ts<br/>━━━━━━━━━━━━━<br/>Source reliability"]
        end
    end

    subgraph DotNet["🔧 .NET API (apps/api)"]
        DOTNET_API["FactHarbor.Api<br/>ASP.NET Core"]

        subgraph Controllers["Controllers"]
            ANALYZE_CTRL["AnalyzeController"]
            JOBS_CTRL["JobsController"]
            INTERNAL_CTRL["InternalJobsController"]
        end

        subgraph Services["Services"]
            JOB_SVC["JobService<br/>━━━━━━━━━━━━━<br/>Job CRUD operations"]
            RUNNER_CLIENT["RunnerClient<br/>━━━━━━━━━━━━━<br/>Calls Next.js runner"]
        end

        DB[(SQLite Database<br/>━━━━━━━━━━━━━<br/>JobEntity<br/>JobEventEntity)]
    end

    subgraph External["🌐 External Services"]
        LLM_PROVIDERS["LLM Providers<br/>━━━━━━━━━━━━━<br/>• Anthropic Claude<br/>• OpenAI GPT<br/>• Google Gemini<br/>• Mistral"]
        SEARCH_PROVIDERS["Search Providers<br/>━━━━━━━━━━━━━<br/>• Google CSE<br/>• SerpAPI<br/>• Brave<br/>• Tavily"]
        WEB["Web Content<br/>━━━━━━━━━━━━━<br/>• News sites<br/>• PDFs<br/>• Academic sources"]
    end

    %% Client interactions
    BROWSER --> ANALYZE_PAGE
    BROWSER --> JOBS_PAGE
    ANALYZE_PAGE --> ANALYZE_API
    JOBS_PAGE --> JOBS_API

    %% Next.js internal
    ANALYZE_API --> JOBS_API
    JOBS_API -->|"Proxy"| DOTNET_API
    JOB_API -->|"Proxy"| DOTNET_API
    EVENTS_API -->|"Proxy"| DOTNET_API

    %% .NET flow
    DOTNET_API --> ANALYZE_CTRL
    DOTNET_API --> JOBS_CTRL
    DOTNET_API --> INTERNAL_CTRL
    ANALYZE_CTRL --> JOB_SVC
    JOBS_CTRL --> JOB_SVC
    JOB_SVC --> DB
    JOB_SVC --> RUNNER_CLIENT
    RUNNER_CLIENT -->|"HTTP POST"| RUN_JOB

    %% Analysis execution
    RUN_JOB --> ANALYZER
    ANALYZER --> RETRIEVAL
    ANALYZER --> WEBSEARCH
    ANALYZER --> MBFC

    %% External calls
    ANALYZER -->|"AI SDK"| LLM_PROVIDERS
    WEBSEARCH --> SEARCH_PROVIDERS
    RETRIEVAL --> WEB

    %% Styling
    classDef external fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef core fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    classDef api fill:#e3f2fd,stroke:#1565c0,stroke-width:2px

    class LLM_PROVIDERS,SEARCH_PROVIDERS,WEB external
    class ANALYZER,RETRIEVAL,WEBSEARCH,MBFC core
    class ANALYZE_API,JOBS_API,JOB_API,EVENTS_API,RUN_JOB api
{{/mermaid}}

---

== 4. Specification vs Implementation Gap Analysis ==

=== 4.1 Data Model Gaps ===

| Specification Entity | POC1 Status | Gap Description |
|---------------------|-------------|-----------------|
| **Claim** | ⚠️ Partial | No persistent storage; claims exist only in JSON result. Missing: `status`, `confidence_score`, `risk_score`, `completeness_score`, `version`, `views`, `edit_count` |
| **Evidence** | ⚠️ Partial | Implemented as `ExtractedFact` but lacks: `supports` enum, proper `relevance_score` |
| **Source** | ⚠️ Partial | `FetchedSource` exists but missing: `type` enum, `accuracy_history`, `correction_frequency`, weekly update scheduler |
| **Scenario** | ❌ Missing | Not implemented. Claims are evaluated directly without scenario contexts |
| **Verdict** | ⚠️ Partial | `ClaimVerdict` exists but missing: `likelihood_range`, `uncertainty_factors` array, proper `explanation_summary` |
| **User** | ❌ Missing | No user authentication or role system |
| **Edit** | ❌ Missing | No audit trail for changes |

=== 4.2 AKEL Component Gaps ===

| Spec Component | POC1 Status | Gap Description |
|----------------|-------------|-----------------|
| **AKEL Orchestrator** | ✅ Implemented | `runAnalysis()` function serves this role |
| **Claim Extractor** | ✅ Implemented | `understandClaim()` with claim role/dependency tracking |
| **Claim Classifier** | ⚠️ Partial | Risk tier (A/B/C) assigned, but no domain classification |
| **Scenario Generator** | ❌ Missing | Claims evaluated without scenario extraction |
| **Evidence Summarizer** | ✅ Implemented | `extractFacts()` function |
| **Contradiction Detector** | ⚠️ Partial | `isContestedClaim` flag exists but no active contradiction search |
| **Quality Gate Validator** | ❌ Missing | No source quality gates, no mandatory checks |
| **Audit Sampling Scheduler** | ❌ Missing | No audit system |
| **Embedding Handler** | ❌ Missing | Not needed for POC |
| **Federation Sync** | ❌ Missing | Not needed for POC |

=== 4.3 Architecture Gaps ===

| Spec Requirement | POC1 Status | Gap Description |
|------------------|-------------|-----------------|
| **Three-Layer Architecture** | ✅ Implemented | Interface (Next.js) → Processing (AKEL) → Data (SQLite) |
| **LLM Abstraction Layer** | ✅ Implemented | AI SDK supports multiple providers with failover |
| **PostgreSQL Primary DB** | ⚠️ Different | Using SQLite for simplicity (acceptable for POC) |
| **Redis Caching** | ❌ Missing | No caching layer |
| **S3 Archival** | ❌ Missing | No long-term storage |
| **Background Jobs** | ❌ Missing | No scheduler for source updates, cache warming |
| **Quality Monitoring** | ⚠️ Partial | LLM call counting exists, but no anomaly detection |

=== 4.4 Publication & Review Gaps ===

| Spec Feature | POC1 Status | Gap Description |
|--------------|-------------|-----------------|
| **Risk Tier Publication Rules** | ❌ Missing | All results published immediately regardless of tier |
| **Human Review Queue** | ❌ Missing | No review workflow |
| **AI-Generated Labeling** | ⚠️ Partial | Results show "AI analysis" but no formal labeling system |
| **Audit Rate Sampling** | ❌ Missing | No sampling audits |

---

== 5. Optimization Recommendations ==

=== 5.1 Cost Optimizations ===

{{mermaid}}
pie title Current LLM Cost Distribution (Estimated per Analysis)
    "Step 1: Understand" : 15
    "Step 2: Research (per source)" : 60
    "Step 3: Verdicts" : 25
{{/mermaid}}

| Optimization | Estimated Savings | Implementation Effort |
|--------------|-------------------|----------------------|
| **Cache claim understanding** | 30-50% on repeated claims | Medium |
| **Use Haiku for fact extraction** | 40% on Step 2 costs | Low (config change) |
| **Batch fact extraction** | 20% fewer API calls | Medium |
| **Skip search for known claims** | 50%+ for cached claims | High (needs claim DB) |
| **Reduce max iterations** | Linear reduction | Low (config change) |
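Several savings in the table apply to a single step rather than to the whole analysis. A small calculation, assuming the estimated cost shares from the pie chart (15/60/25), shows how a per-step saving translates into an overall one. The function name `overallSavingPct` is illustrative.

```typescript
// Cost shares from the pie chart above (estimates, in percent of one analysis).
const costShare = { understand: 15, research: 60, verdicts: 25 };

type Step = keyof typeof costShare;

// Overall saving (in percent of total cost) when `stepSavingPct` percent
// of a single step's cost is saved.
function overallSavingPct(step: Step, stepSavingPct: number): number {
  return (costShare[step] * stepSavingPct) / 100;
}

// "Use Haiku for fact extraction": 40% of Step 2, which is 60% of the total.
console.log(overallSavingPct("research", 40)); // 24
```

Under these estimates, a 40% saving on Step 2 is roughly a 24% saving per analysis. The rows in the table are not additive, since several of them target the same Step 2 cost.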

=== 5.2 Timing Optimizations ===

{{mermaid}}
gantt
    title Current Analysis Timeline (Typical)
    dateFormat ss
    axisFormat %S sec

    section Current Flow
    URL Fetch :a1, 00, 2s
    Step 1 Understand :a2, after a1, 15s
    Search Iteration 1 :a3, after a2, 8s
    Fetch Sources 1 :a4, after a3, 10s
    Extract Facts 1 :a5, after a4, 12s
    Search Iteration 2 :a6, after a5, 8s
    Fetch Sources 2 :a7, after a6, 10s
    Extract Facts 2 :a8, after a7, 12s
    Generate Verdicts :a9, after a8, 15s

    section Optimized Flow
    URL Fetch :b1, 00, 2s
    Step 1 Understand :b2, after b1, 10s
    Search + Fetch (parallel) :b3, after b2, 12s
    Extract Facts (batched) :b4, after b3, 8s
    Generate Verdicts :b5, after b4, 10s
{{/mermaid}}

| Optimization | Time Savings | Notes |
|--------------|--------------|-------|
| **Parallel source fetching** | Already implemented | Currently fetches 3 sources in parallel |
| **Streaming LLM responses** | 20-30% perceived | User sees progress faster |
| **Search query batching** | 10-15% | Send multiple queries to search API |
| **Reduce prompt size** | 5-10% per call | Optimize system prompts |
| **Use faster models for extraction** | 30-40% on Step 2 | Claude Haiku vs Sonnet |
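The table mentions parallel fetching of three sources, and the flow diagram lists timeout handling. A generic sketch of that pattern follows. It is not the existing `fetchSourceContent` implementation: the `Fetcher` type, the `fetchAll` name, and the default limits are assumptions chosen for illustration.

```typescript
// Hypothetical fetcher signature; the real retrieval code is not shown in this document.
type Fetcher = (url: string) => Promise<string>;

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout after " + ms + " ms")), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Fetch all URLs with at most `concurrency` requests in flight.
// Failed or timed-out fetches yield null instead of aborting the batch.
async function fetchAll(
  urls: string[],
  fetcher: Fetcher,
  concurrency = 3,
  timeoutMs = 10000,
): Promise<(string | null)[]> {
  const results: (string | null)[] = urls.map(() => null);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const index = next++; // safe: no await between the check and the increment
      try {
        results[index] = await withTimeout(fetcher(urls[index]), timeoutMs);
      } catch {
        results[index] = null;
      }
    }
  }

  const workerCount = Math.min(concurrency, urls.length);
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}
```

Results keep the order of the input URLs, so a caller can pair each text with its source. The timeout only stops waiting; cancelling the underlying request would need an abort mechanism in the fetcher itself.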

=== 5.3 Priority Recommendations ===

1. **HIGH PRIORITY - Implement Claim Caching**
   - Cache claim verdicts by content hash
   - Reduces costs for repeated/similar claims
   - Enables the separated verdict architecture (see Section 6)

2. **MEDIUM PRIORITY - Use Tiered Models**
   - Step 1 (Understand): Sonnet (needs reasoning)
   - Step 2 (Extract): Haiku (simple extraction)
   - Step 3 (Verdicts): Sonnet (needs synthesis)

3. **LOW PRIORITY - Add Redis Cache**
   - Cache source content (24h TTL)
   - Cache search results (1h TTL)
   - Reduces external API calls

---

== 6. Separated Verdict Architecture Proposal ==

=== 6.1 Current Architecture ===

{{mermaid}}
flowchart LR
    subgraph Current["Current: Monolithic Analysis"]
        INPUT[Article Input] --> ANALYZE[Full Analysis Pipeline]
        ANALYZE --> CLAIMS[Claim Verdicts]
        ANALYZE --> ARTICLE[Article Verdict]
        CLAIMS -.->|"Aggregated"| ARTICLE
    end
{{/mermaid}}

**Issues:**
- Every analysis re-processes all claims
- No caching of individual claim verdicts
- Article verdict tightly coupled to claim extraction

=== 6.2 Proposed Separated Architecture ===

{{mermaid}}
flowchart TB
    subgraph Input["Input Processing"]
        ARTICLE[Article/Text Input]
        EXTRACT[Claim Extraction]
    end

    subgraph ClaimLayer["Claim Verdict Layer (Cacheable)"]
        CACHE[(Claim Cache<br/>━━━━━━━━━━━━━<br/>Key: claim_hash<br/>TTL: 7 days)]

        CLAIM1["Claim 1 Analysis"]
        CLAIM2["Claim 2 Analysis"]
        CLAIM3["Claim N Analysis"]

        VERDICT1[Claim 1 Verdict]
        VERDICT2[Claim 2 Verdict]
        VERDICT3[Claim N Verdict]
    end

    subgraph ArticleLayer["Article Verdict Layer (Dynamic)"]
        AGGREGATE[Aggregate Claim Verdicts]
        CONTEXT[Apply Article Context<br/>━━━━━━━━━━━━━<br/>• Claim relationships<br/>• Logical structure<br/>• Author intent]
        ARTICLE_VERDICT[Article Verdict]
    end

    %% Flow
    ARTICLE --> EXTRACT
    EXTRACT --> CLAIM1
    EXTRACT --> CLAIM2
    EXTRACT --> CLAIM3

    CLAIM1 -->|"Cache Miss"| VERDICT1
    CLAIM2 -->|"Cache Hit"| VERDICT2
    CLAIM3 -->|"Cache Miss"| VERDICT3

    CLAIM1 <-.-> CACHE
    CLAIM2 <-.-> CACHE
    CLAIM3 <-.-> CACHE

    VERDICT1 --> AGGREGATE
    VERDICT2 --> AGGREGATE
    VERDICT3 --> AGGREGATE

    AGGREGATE --> CONTEXT
    CONTEXT --> ARTICLE_VERDICT

    classDef cache fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef dynamic fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
    class CACHE cache
    class CONTEXT,ARTICLE_VERDICT dynamic
{{/mermaid}}
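The two layers in the proposed diagram can be sketched as two functions: a cacheable per-claim step and an article step that is recomputed every time. This is a simplified, synchronous illustration. `CachedVerdict`, `verdictsWithCache`, `analyzeClaim`, and the plain-mean aggregation are hypothetical stand-ins, and a real analysis step would be asynchronous.

```typescript
interface CachedVerdict { claimText: string; truthPercentage: number } // 0-100, as in the ERD

// Claim layer: look up each claim in the cache and analyze only on a miss.
// `analyzeClaim` stands in for the full research + verdict step (hypothetical signature).
function verdictsWithCache(
  claims: string[],
  cache: Map<string, CachedVerdict>,
  keyOf: (claim: string) => string,
  analyzeClaim: (claim: string) => CachedVerdict,
): { verdicts: CachedVerdict[]; hits: number } {
  let hits = 0;
  const verdicts = claims.map((claim) => {
    const key = keyOf(claim);
    const cached = cache.get(key);
    if (cached) {
      hits++; // cache hit: skip research and LLM calls
      return cached;
    }
    const fresh = analyzeClaim(claim); // cache miss: run the expensive path
    cache.set(key, fresh);
    return fresh;
  });
  return { verdicts, hits };
}

// Article layer: always recomputed. A plain mean is only a placeholder for the
// context-aware aggregation shown in the diagram.
function articleTruthPercentage(verdicts: CachedVerdict[]): number {
  if (verdicts.length === 0) return 0;
  const sum = verdicts.reduce((acc, v) => acc + v.truthPercentage, 0);
  return Math.round(sum / verdicts.length);
}
```

Keeping the article step free of cache access makes the boundary explicit: anything that depends on how claims are combined in one article stays outside the cache.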

=== 6.3 Benefits Analysis ===

| Benefit | Impact | Rationale |
|---------|--------|-----------|
| **Cost Reduction** | 40-70% for repeated claims | Many articles share common claims (e.g., "COVID vaccines are safe") |
| **Faster Analysis** | 50%+ for cached claims | Skip research + LLM calls for known claims |
| **Consistency** | High | Same claim always gets same verdict (until cache expires) |
| **Freshness Control** | Configurable TTL | Balance consistency vs. new evidence |
| **Scalability** | Linear improvement | More users = higher cache hit rate |


=== 6.4 Implementation Considerations ===

**Claim Hashing Strategy:**
{{code language="typescript"}}
import { createHash } from "node:crypto";

// Normalize: lowercase, strip punctuation, collapse whitespace (stemming omitted here)
const normalize = (claim: string): string =>
  claim.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, "").replace(/\s+/g, " ").trim();

function getClaimHash(claim: string): string {
  const normalized = normalize(claim);
  // Truncated SHA-256 of the normalized text serves as the cache key
  return createHash("sha256").update(normalized).digest("hex").slice(0, 16);
}
{{/code}}

**Cache Invalidation Triggers:**
- TTL expiration (default 7 days)
- Major news event related to the claim topic
- Significant change in a source's track record
- Manual invalidation by a moderator
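Three of these triggers (TTL expiration, topic-level events, and manual invalidation) map onto a small in-memory cache. The sketch below is illustrative only: the `TtlCache` class, its `topic` tag, and the injectable clock are assumptions, and a production setup would more likely sit on a shared store.

```typescript
// Minimal in-memory TTL cache with manual and topic-based invalidation (illustrative).
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number; topic: string | undefined }>();

  // `now` is injectable so expiry can be tested without waiting.
  constructor(private ttlMs: number, private now: () => number = () => Date.now()) {}

  set(key: string, value: V, topic?: string): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs, topic });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) { // TTL expiration
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  // Manual invalidation, e.g. by a moderator. Returns true if an entry was removed.
  invalidate(key: string): boolean {
    return this.entries.delete(key);
  }

  // Topic-wide invalidation, e.g. after a major news event. Returns the number removed.
  invalidateTopic(topic: string): number {
    let removed = 0;
    for (const [key, entry] of Array.from(this.entries)) {
      if (entry.topic === topic) {
        this.entries.delete(key);
        removed++;
      }
    }
    return removed;
  }
}

// Default TTL from the list above: 7 days in milliseconds.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;
const verdictCache = new TtlCache<string>(SEVEN_DAYS_MS);
```

Invalidation on a change in a source's track record is not covered here. It would need an index from source IDs to the cached claims that relied on them.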

**Article Verdict Considerations:**
- Article verdict should ALWAYS be dynamic (never cached)
- Same claims in different article contexts may yield different article verdicts
- Example: "Vaccines are safe" + "Vaccines cause autism" → article may be misleading even if first claim is true

=== 6.5 Recommendation ===

**YES, separating is beneficial** with the following caveats:

1. **Claim verdicts should be cached** with semantic similarity matching (not just exact match)
2. **Article verdicts should always be dynamic** to account for:
   - Claim relationships and logical structure
   - Author's argumentative strategy
   - Context and framing
   - Selective use of true claims to support false conclusions

3. **Implementation phases:**
   - Phase 1: Exact-match claim caching (simple hash)
   - Phase 2: Semantic similarity caching (embedding-based)
   - Phase 3: Federated claim sharing across instances
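Phase 2 replaces exact hash lookup with a nearest-neighbour search over claim embeddings. A minimal sketch of that lookup, assuming embeddings are already available as number arrays: the `findSimilar` helper, the `CachedClaim` shape, and the 0.9 threshold are illustrative assumptions, and no particular embedding provider is implied.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must have equal, non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vector: treat as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CachedClaim<V> { embedding: number[]; verdict: V }

// Return the verdict of the most similar cached claim, or undefined if no
// entry reaches the threshold (0.9 is an assumed starting point).
function findSimilar<V>(
  query: number[],
  cached: CachedClaim<V>[],
  threshold = 0.9,
): V | undefined {
  let best: CachedClaim<V> | undefined;
  let bestScore = threshold;
  for (const entry of cached) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best ? best.verdict : undefined;
}
```

A linear scan is adequate for a small cache. The threshold matters more than the search method, because a value that is too low would let claims with opposite meaning share a verdict.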

---

== 7. Summary ==

=== Current State ===

- POC1 successfully implements the core AKEL pipeline
- Claim dependency tracking is implemented
- Multiple LLM providers are supported
- No persistent claim storage or caching

=== Key Gaps from Specification ===

- No scenario extraction
- No user/role system
- No audit trail
- No source track record updates
- No review queue

=== Recommended Next Steps ===

1. Implement claim caching layer
2. Separate claim vs article verdict generation
3. Add Redis for source/search caching
4. Implement tiered model selection
5. Add basic audit logging