Version 3.1 by Robert Schaub on 2026/01/02 10:02

= FactHarbor POC1 Architecture Analysis =

**Version:** 2.6.17
**Analysis Date:** January 2026
**Document Purpose:** Technical diagrams, gap analysis, and optimization recommendations

----

== 1. AKEL Flow Diagram (with LLM and WebSearch Interactions) ==

{{mermaid}}
flowchart TB
  subgraph Input["📥 Input Layer"]
    URL[URL Input]
    TEXT[Text Input]
  end

  subgraph Retrieval["🔍 Content Retrieval"]
    FETCH[extractTextFromUrl]
    PDF[PDF Parser<br/>pdf-parse v1]
    HTML[HTML Parser<br/>cheerio]
  end

  subgraph AKEL["🧠 AKEL Pipeline"]
    direction TB

    subgraph Step1["Step 1: Understand"]
      UNDERSTAND[understandClaim<br/>━━━━━━━━━━━━━<br/>• Detect input type<br/>• Extract claims<br/>• Identify dependencies<br/>• Assign risk tiers]
      LLM1[("🤖 LLM Call #1<br/>Claude/GPT/Gemini")]
    end

    subgraph Step2["Step 2: Research (Iterative)"]
      DECIDE[decideNextResearch<br/>━━━━━━━━━━━━━<br/>• Generate queries<br/>• Focus areas]

      SEARCH[("🌐 Web Search<br/>Google CSE / SerpAPI")]

      FETCHSRC[fetchSourceContent<br/>━━━━━━━━━━━━━<br/>• Parallel fetching<br/>• Timeout handling]

      EXTRACT[extractFacts<br/>━━━━━━━━━━━━━<br/>• Parse sources<br/>• Extract facts]
      LLM2[("🤖 LLM Call #2-N<br/>Per source")]
    end

    subgraph Step3["Step 3: Verdict Generation"]
      VERDICT[generateVerdicts<br/>━━━━━━━━━━━━━<br/>• Claim verdicts<br/>• Article verdict<br/>• Dependency propagation]
      LLM3[("🤖 LLM Call #N+1<br/>Final synthesis")]
    end

    subgraph Step4["Step 4: Report"]
      REPORT[buildTwoPanelSummary<br/>━━━━━━━━━━━━━<br/>• Format results<br/>• Generate markdown]
    end
  end

  subgraph Output["📤 Output"]
    RESULT[AnalysisResult JSON]
    MARKDOWN[Report Markdown]
  end

  %% Flow connections
  URL --> FETCH
  TEXT --> UNDERSTAND
  FETCH --> PDF
  FETCH --> HTML
  PDF --> UNDERSTAND
  HTML --> UNDERSTAND

  UNDERSTAND --> LLM1
  LLM1 --> DECIDE

  DECIDE --> SEARCH
  SEARCH --> FETCHSRC
  FETCHSRC --> EXTRACT
  EXTRACT --> LLM2
  LLM2 --> DECIDE

  DECIDE -->|"Research Complete"| VERDICT
  VERDICT --> LLM3
  LLM3 --> REPORT

  REPORT --> RESULT
  REPORT --> MARKDOWN

  %% Styling
  classDef llm fill:#e1f5fe,stroke:#01579b,stroke-width:2px
  classDef search fill:#fff3e0,stroke:#e65100,stroke-width:2px
  classDef step fill:#f3e5f5,stroke:#4a148c,stroke-width:2px

  class LLM1,LLM2,LLM3 llm
  class SEARCH search
  class UNDERSTAND,DECIDE,FETCHSRC,EXTRACT,VERDICT,REPORT step
{{/mermaid}}
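
The control flow in the diagram can be sketched as a bounded research loop. This is a minimal sketch, not the actual `analyzer.ts` code: the function names come from the diagram, while every signature, the `PipelineSteps` interface, the `search` step name, and the `maxIterations` parameter are assumptions made for illustration.

```typescript
// Hypothetical shapes; the real analyzer.ts types are not shown in this document.
interface Understanding { claims: string[] }
interface Source { url: string; text: string }
interface Fact { sourceUrl: string; fact: string }

interface PipelineSteps {
  understandClaim(input: string): Promise<Understanding>;                  // LLM call #1
  decideNextResearch(u: Understanding, facts: Fact[]): Promise<string[]>;  // queries; empty = research complete
  search(query: string): Promise<string[]>;                                // returns result URLs
  fetchSourceContent(urls: string[]): Promise<Source[]>;
  extractFacts(source: Source): Promise<Fact[]>;                           // one LLM call per source
  generateVerdicts(u: Understanding, facts: Fact[]): Promise<string>;      // final synthesis call
}

async function runAnalysisSketch(input: string, steps: PipelineSteps, maxIterations = 3) {
  let llmCalls = 0;
  const understanding = await steps.understandClaim(input);
  llmCalls++;

  const facts: Fact[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const queries = await steps.decideNextResearch(understanding, facts);
    if (queries.length === 0) break; // the "Research Complete" edge

    const urlLists = await Promise.all(queries.map((q) => steps.search(q)));
    const urls = ([] as string[]).concat(...urlLists);
    const sources = await steps.fetchSourceContent(urls);

    for (const source of sources) {
      const extracted = await steps.extractFacts(source);
      facts.push(...extracted);
      llmCalls++;
    }
  }

  const verdict = await steps.generateVerdicts(understanding, facts);
  llmCalls++;
  return { verdict, facts, llmCalls };
}
```

Under these assumptions, the number of LLM calls is 2 plus the number of fetched sources, which is why Step 2 dominates cost as the source count grows.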

----

== 2. ERD Data Model (Current POC1 Implementation) ==

{{mermaid}}
erDiagram
  JOB ||--o{ JOB_EVENT : "has"
  JOB ||--|| ANALYSIS_RESULT : "produces"
  ANALYSIS_RESULT ||--o{ CLAIM_VERDICT : "contains"
  ANALYSIS_RESULT ||--o{ FETCHED_SOURCE : "references"
  ANALYSIS_RESULT ||--o{ EXTRACTED_FACT : "contains"
  CLAIM_VERDICT }o--o{ EXTRACTED_FACT : "supported by"
  FETCHED_SOURCE ||--o{ EXTRACTED_FACT : "provides"
  CLAIM_VERDICT ||--o{ CLAIM_VERDICT : "depends on"

  JOB {
    string JobId PK "GUID"
    string Status "QUEUED|RUNNING|COMPLETE|FAILED"
    int Progress "0-100"
    datetime CreatedUtc
    datetime UpdatedUtc
    string InputType "text|url"
    string InputValue "URL or text content"
    string InputPreview "First 100 chars"
    json ResultJson "Full analysis result"
    string ReportMarkdown "Formatted report"
  }

  JOB_EVENT {
    long Id PK
    string JobId FK
    datetime TsUtc
    string Level "info|warn|error"
    string Message
  }

  ANALYSIS_RESULT {
    string schemaVersion "2.6.17"
    string inputType "question|claim|article"
    boolean isQuestion
    string articleThesis
    int articleTruthPercentage "0-100"
    string articleVerdict "7-point scale"
    json claimPattern "total/supported/uncertain/refuted"
    boolean isPseudoscience
    int llmCalls "Total LLM invocations"
    json searchQueries "All search queries"
  }

  CLAIM_VERDICT {
    string claimId PK "SC1, SC2, etc."
    string claimText
    boolean isCentral
    string claimRole "attribution|source|timing|core"
    string_array dependsOn "Prerequisite claim IDs"
    boolean dependencyFailed
    string llmVerdict "WELL-SUPPORTED|PARTIALLY-SUPPORTED|UNCERTAIN|REFUTED"
    string verdict "7-point: True to False"
    int confidence "0-100"
    int truthPercentage "0-100"
    string riskTier "A|B|C"
    string reasoning
    string_array supportingFactIds
    string highlightColor "green to dark-red"
  }

  FETCHED_SOURCE {
    string id PK "S1, S2, etc."
    string url
    string title
    int trackRecordScore "0-100 or null"
    string fullText "Extracted content"
    datetime fetchedAt
    string category "legal|news|academic"
    boolean fetchSuccess
    string searchQuery "Which query found this"
  }

  EXTRACTED_FACT {
    string id PK "S1-F1, S1-F2, etc."
    string fact "The factual statement"
    string category "legal_provision|evidence|expert_quote|statistic|event|criticism"
    string specificity "high|medium"
    string sourceId FK
    string sourceUrl
    string sourceTitle
    string sourceExcerpt
    string relatedProceedingId
    boolean isContestedClaim
    string claimSource
  }
{{/mermaid}}
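
The self-referencing "depends on" relation and the `dependencyFailed` flag suggest a propagation pass over the claim verdicts. The sketch below shows one way this could work. The field names come from the ERD, but the propagation rule itself is an assumption (a claim's dependency fails when a prerequisite is `REFUTED` or has itself failed); the document does not state the actual rule.

```typescript
type LlmVerdict = "WELL-SUPPORTED" | "PARTIALLY-SUPPORTED" | "UNCERTAIN" | "REFUTED";

// Subset of CLAIM_VERDICT fields needed for propagation.
interface ClaimVerdict {
  claimId: string;      // "SC1", "SC2", ...
  dependsOn: string[];  // prerequisite claim IDs
  llmVerdict: LlmVerdict;
  dependencyFailed: boolean;
}

// Assumed rule (not confirmed by the source): a dependency fails when any
// prerequisite is REFUTED or is itself marked as failed.
function propagateDependencies(verdicts: ClaimVerdict[]): ClaimVerdict[] {
  const byId = new Map<string, ClaimVerdict>();
  verdicts.forEach((v) => byId.set(v.claimId, v));

  // Fixpoint iteration: repeat until no new failures appear.
  // This also terminates on cyclic dependencies.
  const failed = new Set<string>();
  let changed = true;
  while (changed) {
    changed = false;
    for (const v of verdicts) {
      if (failed.has(v.claimId)) continue;
      const broken = v.dependsOn.some((dep) => {
        const prerequisite = byId.get(dep);
        return (prerequisite !== undefined && prerequisite.llmVerdict === "REFUTED") || failed.has(dep);
      });
      if (broken) {
        failed.add(v.claimId);
        changed = true;
      }
    }
  }
  return verdicts.map((v) => ({ ...v, dependencyFailed: failed.has(v.claimId) }));
}
```

A refuted claim is not flagged itself; only the claims that build on it are, directly or transitively.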

----

== 3. Overall Architecture with Interactions ==

{{mermaid}}
flowchart TB
  subgraph Client["🖥️ Client Layer"]
    BROWSER[Web Browser]
    ANALYZE_PAGE["/analyze page<br/>React + TailwindCSS"]
    JOBS_PAGE["/jobs page<br/>Job history & status"]
  end

  subgraph NextJS["⚡ Next.js Web App (apps/web)"]
    direction TB

    subgraph API_Routes["API Routes"]
      ANALYZE_API["/api/fh/analyze<br/>━━━━━━━━━━━━━<br/>POST: Create job"]
      JOBS_API["/api/fh/jobs<br/>━━━━━━━━━━━━━<br/>GET: List jobs<br/>POST: Create job"]
      JOB_API["/api/fh/jobs/[id]<br/>━━━━━━━━━━━━━<br/>GET: Job status"]
      EVENTS_API["/api/fh/jobs/[id]/events<br/>━━━━━━━━━━━━━<br/>GET: Job events (SSE)"]
      RUN_JOB["/api/internal/run-job<br/>━━━━━━━━━━━━━<br/>POST: Execute analysis"]
    end

    subgraph Lib["Core Libraries"]
      ANALYZER["analyzer.ts<br/>━━━━━━━━━━━━━<br/>AKEL Pipeline<br/>2918 lines"]
      RETRIEVAL["retrieval.ts<br/>━━━━━━━━━━━━━<br/>URL content extraction"]
      WEBSEARCH["web-search.ts<br/>━━━━━━━━━━━━━<br/>Search abstraction"]
      MBFC["mbfc-loader.ts<br/>━━━━━━━━━━━━━<br/>Source reliability"]
    end
  end

  subgraph DotNet["🔧 .NET API (apps/api)"]
    DOTNET_API["FactHarbor.Api<br/>ASP.NET Core"]

    subgraph Controllers["Controllers"]
      ANALYZE_CTRL["AnalyzeController"]
      JOBS_CTRL["JobsController"]
      INTERNAL_CTRL["InternalJobsController"]
    end

    subgraph Services["Services"]
      JOB_SVC["JobService<br/>━━━━━━━━━━━━━<br/>Job CRUD operations"]
      RUNNER_CLIENT["RunnerClient<br/>━━━━━━━━━━━━━<br/>Calls Next.js runner"]
    end

    DB[(SQLite Database<br/>━━━━━━━━━━━━━<br/>JobEntity<br/>JobEventEntity)]
  end

  subgraph External["🌐 External Services"]
    LLM_PROVIDERS["LLM Providers<br/>━━━━━━━━━━━━━<br/>• Anthropic Claude<br/>• OpenAI GPT<br/>• Google Gemini<br/>• Mistral"]
    SEARCH_PROVIDERS["Search Providers<br/>━━━━━━━━━━━━━<br/>• Google CSE<br/>• SerpAPI<br/>• Brave<br/>• Tavily"]
    WEB["Web Content<br/>━━━━━━━━━━━━━<br/>• News sites<br/>• PDFs<br/>• Academic sources"]
  end

  %% Client interactions
  BROWSER --> ANALYZE_PAGE
  BROWSER --> JOBS_PAGE
  ANALYZE_PAGE --> ANALYZE_API
  JOBS_PAGE --> JOBS_API

  %% Next.js internal
  ANALYZE_API --> JOBS_API
  JOBS_API -->|"Proxy"| DOTNET_API
  JOB_API -->|"Proxy"| DOTNET_API
  EVENTS_API -->|"Proxy"| DOTNET_API

  %% .NET flow
  DOTNET_API --> ANALYZE_CTRL
  DOTNET_API --> JOBS_CTRL
  DOTNET_API --> INTERNAL_CTRL
  ANALYZE_CTRL --> JOB_SVC
  JOBS_CTRL --> JOB_SVC
  JOB_SVC --> DB
  JOB_SVC --> RUNNER_CLIENT
  RUNNER_CLIENT -->|"HTTP POST"| RUN_JOB

  %% Analysis execution
  RUN_JOB --> ANALYZER
  ANALYZER --> RETRIEVAL
  ANALYZER --> WEBSEARCH
  ANALYZER --> MBFC

  %% External calls
  ANALYZER -->|"AI SDK"| LLM_PROVIDERS
  WEBSEARCH --> SEARCH_PROVIDERS
  RETRIEVAL --> WEB

  %% Styling
  classDef external fill:#fff3e0,stroke:#e65100,stroke-width:2px
  classDef core fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
  classDef api fill:#e3f2fd,stroke:#1565c0,stroke-width:2px

  class LLM_PROVIDERS,SEARCH_PROVIDERS,WEB external
  class ANALYZER,RETRIEVAL,WEBSEARCH,MBFC core
  class ANALYZE_API,JOBS_API,JOB_API,EVENTS_API,RUN_JOB api
{{/mermaid}}
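
A job moves through the statuses listed in the ERD (`QUEUED|RUNNING|COMPLETE|FAILED`) while the runner appends events. The following sketch illustrates that lifecycle in isolation. The transition table is an assumption, since the diagrams list the statuses but not which moves are allowed, and the in-memory `Job` shape only mirrors a subset of the `JOB` and `JOB_EVENT` fields.

```typescript
type JobStatus = "QUEUED" | "RUNNING" | "COMPLETE" | "FAILED";

interface JobEvent {
  tsUtc: string;
  level: "info" | "warn" | "error";
  message: string;
}

interface Job {
  jobId: string;
  status: JobStatus;
  progress: number; // 0-100
  events: JobEvent[];
}

// Assumed transition table; terminal states allow no further moves.
const allowedTransitions: Record<JobStatus, JobStatus[]> = {
  QUEUED: ["RUNNING", "FAILED"],
  RUNNING: ["COMPLETE", "FAILED"],
  COMPLETE: [],
  FAILED: [],
};

// Returns an updated copy of the job with one event appended.
function transition(job: Job, next: JobStatus, message: string): Job {
  if (allowedTransitions[job.status].indexOf(next) === -1) {
    throw new Error(`Illegal transition ${job.status} -> ${next}`);
  }
  const event: JobEvent = {
    tsUtc: new Date().toISOString(),
    level: next === "FAILED" ? "error" : "info",
    message,
  };
  return {
    ...job,
    status: next,
    progress: next === "COMPLETE" ? 100 : job.progress,
    events: [...job.events, event],
  };
}
```

Keeping the transition check in one place would let both the job service and the internal runner endpoint reject out-of-order status updates in the same way.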

----

== 4. Specification vs Implementation Gap Analysis ==

=== 4.1 Data Model Gaps ===

| Specification Entity | POC1 Status | Gap Description |
|-|-|-|
| **Claim** | ⚠️ Partial | No persistent storage; claims exist only in JSON result. Missing: `status`, `confidence_score`, `risk_score`, `completeness_score`, `version`, `views`, `edit_count` |
| **Evidence** | ⚠️ Partial | Implemented as `ExtractedFact` but lacks: `supports` enum, proper `relevance_score` |
| **Source** | ⚠️ Partial | `FetchedSource` exists but missing: `type` enum, `accuracy_history`, `correction_frequency`, weekly update scheduler |
| **Scenario** | ❌ Missing | Not implemented. Claims are evaluated directly without scenario contexts |
| **Verdict** | ⚠️ Partial | `ClaimVerdict` exists but missing: `likelihood_range`, `uncertainty_factors` array, proper `explanation_summary` |
| **User** | ❌ Missing | No user authentication or role system |
| **Edit** | ❌ Missing | No audit trail for changes |

=== 4.2 AKEL Component Gaps ===

| Spec Component | POC1 Status | Gap Description |
|-|-|-|
| **AKEL Orchestrator** | ✅ Implemented | `runAnalysis()` function serves this role |
| **Claim Extractor** | ✅ Implemented | `understandClaim()` with claim role/dependency tracking |
| **Claim Classifier** | ⚠️ Partial | Risk tier (A/B/C) assigned, but no domain classification |
| **Scenario Generator** | ❌ Missing | Claims evaluated without scenario extraction |
| **Evidence Summarizer** | ✅ Implemented | `extractFacts()` function |
| **Contradiction Detector** | ⚠️ Partial | `isContestedClaim` flag exists but no active contradiction search |
| **Quality Gate Validator** | ❌ Missing | No source quality gates, no mandatory checks |
| **Audit Sampling Scheduler** | ❌ Missing | No audit system |
| **Embedding Handler** | ❌ Missing | Not needed for POC |
| **Federation Sync** | ❌ Missing | Not needed for POC |

=== 4.3 Architecture Gaps ===

| Spec Requirement | POC1 Status | Gap Description |
|-|-|-|
| **Three-Layer Architecture** | ✅ Implemented | Interface (Next.js) → Processing (AKEL) → Data (SQLite) |
| **LLM Abstraction Layer** | ✅ Implemented | AI SDK supports multiple providers with failover |
| **PostgreSQL Primary DB** | ⚠️ Different | Using SQLite for simplicity (acceptable for POC) |
| **Redis Caching** | ❌ Missing | No caching layer |
| **S3 Archival** | ❌ Missing | No long-term storage |
| **Background Jobs** | ❌ Missing | No scheduler for source updates, cache warming |
| **Quality Monitoring** | ⚠️ Partial | LLM call counting exists, but no anomaly detection |

=== 4.4 Publication & Review Gaps ===

| Spec Feature | POC1 Status | Gap Description |
|-|-|-|
| **Risk Tier Publication Rules** | ❌ Missing | All results published immediately regardless of tier |
| **Human Review Queue** | ❌ Missing | No review workflow |
| **AI-Generated Labeling** | ⚠️ Partial | Results show "AI analysis" but no formal labeling system |
| **Audit Rate Sampling** | ❌ Missing | No sampling audits |

----

== 5. Optimization Recommendations ==

=== 5.1 Cost Optimizations ===

{{mermaid}}
pie title Current LLM Cost Distribution (Estimated per Analysis)
  "Step 1: Understand" : 15
  "Step 2: Research (per source)" : 60
  "Step 3: Verdicts" : 25
{{/mermaid}}

| Optimization | Estimated Savings | Implementation Effort |
|-|-|-|
| **Cache claim understanding** | 30-50% on repeated claims | Medium |
| **Use Haiku for fact extraction** | 40% on Step 2 costs | Low (config change) |
| **Batch fact extraction** | 20% fewer API calls | Medium |
| **Skip search for known claims** | 50%+ for cached claims | High (needs claim DB) |
| **Reduce max iterations** | Linear reduction | Low (config change) |
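
Per-step savings in the table translate into overall savings only in proportion to that step's share of total cost. The short calculation below makes this explicit, using the estimated shares from the pie chart; the helper name `overallSavingPct` is illustrative.

```typescript
// Estimated cost share per step, in percent of one analysis (from the pie chart).
const costSharePct = { understand: 15, research: 60, verdicts: 25 };

// Overall saving, in percent of total cost, when a single step becomes
// cheaper by stepSavingPct percent.
function overallSavingPct(stepSharePct: number, stepSavingPct: number): number {
  return (stepSharePct * stepSavingPct) / 100;
}

// "Use Haiku for fact extraction": 40% saving on Step 2, which is 60% of cost.
const haikuOverallSaving = overallSavingPct(costSharePct.research, 40);
console.log(`Overall saving: ${haikuOverallSaving}%`); // 24% of the total cost
```

On these estimates, a 40% saving on Step 2 reduces the total LLM cost per analysis by 24%, not by 40%.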

=== 5.2 Timing Optimizations ===

{{mermaid}}
gantt
  title Current Analysis Timeline (Typical)
  dateFormat ss
  axisFormat %S sec

  section Current Flow
  URL Fetch :a1, 00, 2s
  Step 1 Understand :a2, after a1, 15s
  Search Iteration 1 :a3, after a2, 8s
  Fetch Sources 1 :a4, after a3, 10s
  Extract Facts 1 :a5, after a4, 12s
  Search Iteration 2 :a6, after a5, 8s
  Fetch Sources 2 :a7, after a6, 10s
  Extract Facts 2 :a8, after a7, 12s
  Generate Verdicts :a9, after a8, 15s

  section Optimized Flow
  URL Fetch :b1, 00, 2s
  Step 1 Understand :b2, after b1, 10s
  Search + Fetch (parallel) :b3, after b2, 12s
  Extract Facts (batched) :b4, after b3, 8s
  Generate Verdicts :b5, after b4, 10s
{{/mermaid}}
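
Summing the task durations in the chart gives the end-to-end totals for both flows. The snippet below does that arithmetic; the durations are copied from the chart and are typical estimates, not measurements.

```typescript
// Task durations in seconds, in chart order.
const currentFlowSec = [2, 15, 8, 10, 12, 8, 10, 12, 15];
const optimizedFlowSec = [2, 10, 12, 8, 10];

const sum = (xs: number[]): number => xs.reduce((total, x) => total + x, 0);

const currentTotalSec = sum(currentFlowSec);     // 92
const optimizedTotalSec = sum(optimizedFlowSec); // 42
const savedPct = Math.round(((currentTotalSec - optimizedTotalSec) / currentTotalSec) * 100); // 54

console.log(`${currentTotalSec}s -> ${optimizedTotalSec}s (${savedPct}% faster)`);
```

The optimized flow is therefore estimated at 42 seconds against 92 seconds today, a reduction of roughly 54%.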

| Optimization | Time Savings | Notes |
|-|-|-|
| **Parallel source fetching** | Already implemented | Currently fetches 3 sources in parallel |
| **Streaming LLM responses** | 20-30% perceived | User sees progress faster |
| **Search query batching** | 10-15% | Send multiple queries to search API |
| **Reduce prompt size** | 5-10% per call | Optimize system prompts |
| **Use faster models for extraction** | 30-40% on Step 2 | Claude Haiku vs Sonnet |
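
Parallel fetching with a fixed limit and per-request timeouts can be sketched as below. This is not the existing `fetchSourceContent` implementation, which is not shown in this document; `mapWithConcurrency` and `withTimeout` are hypothetical helper names, and the limit of 3 simply mirrors the figure in the table.

```typescript
// Runs worker over items with at most `limit` calls in flight; preserves input order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function run(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }

  const runnerCount = Math.max(0, Math.min(limit, items.length));
  const runners: Promise<void>[] = [];
  for (let i = 0; i < runnerCount; i++) runners.push(run());
  await Promise.all(runners);
  return results;
}

// Rejects if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}
```

A call such as `mapWithConcurrency(urls, 3, (url) => withTimeout(fetchOne(url), 10000))` would combine both helpers, where `fetchOne` stands for whatever single-URL fetch function is in use. Note that `withTimeout` only stops waiting; it does not cancel the underlying request.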

=== 5.3 Priority Recommendations ===

1. **HIGH PRIORITY - Implement Claim Caching**
- Cache claim verdicts by content hash
- Reduces costs for repeated/similar claims
- Enables the separated verdict architecture (see Section 6)

2. **MEDIUM PRIORITY - Use Tiered Models**
- Step 1 (Understand): Sonnet (needs reasoning)
- Step 2 (Extract): Haiku (simple extraction)
- Step 3 (Verdicts): Sonnet (needs synthesis)

3. **LOW PRIORITY - Add Redis Cache**
- Cache source content (24h TTL)
- Cache search results (1h TTL)
- Reduces external API calls
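
The tiered-model recommendation amounts to a small lookup from pipeline step to model tier. A minimal sketch follows, assuming a per-step configuration map. The step and tier names follow the list above, but the model ID strings are placeholders, not real provider identifiers, and `modelFor` is a hypothetical helper.

```typescript
type PipelineStep = "understand" | "extract" | "verdict";
type ModelTier = "sonnet" | "haiku";

// Default tier per step, following the recommendation above.
const defaultTierByStep: Record<PipelineStep, ModelTier> = {
  understand: "sonnet", // needs reasoning
  extract: "haiku",     // simple extraction
  verdict: "sonnet",    // needs synthesis
};

// Placeholder IDs: substitute the identifiers the configured provider actually exposes.
const modelIdByTier: Record<ModelTier, string> = {
  sonnet: "example-sonnet-model-id",
  haiku: "example-haiku-model-id",
};

// Resolves the model ID for a step; overrides allow per-deployment tuning.
function modelFor(step: PipelineStep, overrides: Partial<Record<PipelineStep, ModelTier>> = {}): string {
  const tier = overrides[step] ?? defaultTierByStep[step];
  return modelIdByTier[tier];
}
```

Keeping the mapping in configuration would make the switch a config change, consistent with the "Low" effort estimate in Section 5.1.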

----

== 6. Separated Verdict Architecture Proposal ==

=== 6.1 Current Architecture ===

{{mermaid}}
flowchart LR
  subgraph Current["Current: Monolithic Analysis"]
    INPUT[Article Input] --> ANALYZE[Full Analysis Pipeline]
    ANALYZE --> CLAIMS[Claim Verdicts]
    ANALYZE --> ARTICLE[Article Verdict]
    CLAIMS -.->|"Aggregated"| ARTICLE
  end
{{/mermaid}}

**Issues:**
- Every analysis re-processes all claims
- No caching of individual claim verdicts
- Article verdict tightly coupled to claim extraction

=== 6.2 Proposed Separated Architecture ===

{{mermaid}}
flowchart TB
  subgraph Input["Input Processing"]
    ARTICLE[Article/Text Input]
    EXTRACT[Claim Extraction]
  end

  subgraph ClaimLayer["Claim Verdict Layer (Cacheable)"]
    CACHE[(Claim Cache<br/>━━━━━━━━━━━━━<br/>Key: claim_hash<br/>TTL: 7 days)]

    CLAIM1["Claim 1 Analysis"]
    CLAIM2["Claim 2 Analysis"]
    CLAIM3["Claim N Analysis"]

    VERDICT1[Claim 1 Verdict]
    VERDICT2[Claim 2 Verdict]
    VERDICT3[Claim N Verdict]
  end

  subgraph ArticleLayer["Article Verdict Layer (Dynamic)"]
    AGGREGATE[Aggregate Claim Verdicts]
    CONTEXT[Apply Article Context<br/>━━━━━━━━━━━━━<br/>• Claim relationships<br/>• Logical structure<br/>• Author intent]
    ARTICLE_VERDICT[Article Verdict]
  end

  %% Flow
  ARTICLE --> EXTRACT
  EXTRACT --> CLAIM1
  EXTRACT --> CLAIM2
  EXTRACT --> CLAIM3

  CLAIM1 -->|"Cache Miss"| VERDICT1
  CLAIM2 -->|"Cache Hit"| VERDICT2
  CLAIM3 -->|"Cache Miss"| VERDICT3

  CLAIM1 <-.-> CACHE
  CLAIM2 <-.-> CACHE
  CLAIM3 <-.-> CACHE

  VERDICT1 --> AGGREGATE
  VERDICT2 --> AGGREGATE
  VERDICT3 --> AGGREGATE

  AGGREGATE --> CONTEXT
  CONTEXT --> ARTICLE_VERDICT

  classDef cache fill:#fff9c4,stroke:#f57f17,stroke-width:2px
  classDef dynamic fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px
  class CACHE cache
  class CONTEXT,ARTICLE_VERDICT dynamic
{{/mermaid}}
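
The claim layer in the diagram can be sketched as a cache-first lookup per extracted claim. This is a minimal sketch under stated assumptions: `resolveClaimVerdicts`, `ClaimVerdictEntry`, and the `analyzeClaim` callback are hypothetical names, a plain `Map` stands in for the claim cache, and TTL handling is left out.

```typescript
interface ClaimVerdictEntry {
  claimHash: string;
  truthPercentage: number; // 0-100, as in CLAIM_VERDICT
}

interface ClaimLayerResult {
  verdicts: ClaimVerdictEntry[];
  hits: number;
  misses: number;
}

// Claim layer: reuse cached verdicts and analyze only on a cache miss.
async function resolveClaimVerdicts(
  claims: string[],
  hashOf: (claim: string) => string,
  cache: Map<string, ClaimVerdictEntry>,
  analyzeClaim: (claim: string) => Promise<number>, // hypothetical: research + LLM calls
): Promise<ClaimLayerResult> {
  const result: ClaimLayerResult = { verdicts: [], hits: 0, misses: 0 };
  for (const claim of claims) {
    const claimHash = hashOf(claim);
    let entry = cache.get(claimHash);
    if (entry) {
      result.hits++;
    } else {
      entry = { claimHash, truthPercentage: await analyzeClaim(claim) };
      cache.set(claimHash, entry);
      result.misses++;
    }
    result.verdicts.push(entry);
  }
  return result;
}
```

The article layer would then take `verdicts` as input and compute the article verdict fresh on every run, so only the claim layer benefits from the cache.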

=== 6.3 Benefits Analysis ===

| Benefit | Impact | Rationale |
|-|-|-|
| **Cost Reduction** | 40-70% for repeated claims | Many articles share common claims (e.g., "COVID vaccines are safe") |
| **Faster Analysis** | 50%+ for cached claims | Skip research + LLM calls for known claims |
| **Consistency** | High | The same claim always gets the same verdict (until the cache entry expires) |
| **Freshness Control** | Configurable TTL | Balance consistency vs. new evidence |
| **Scalability** | Linear improvement | More users = higher cache hit rate |

=== 6.4 Implementation Considerations ===

**Claim Hashing Strategy:**
{{code language="typescript"}}import { createHash } from "node:crypto";

// Normalize: lowercase, strip punctuation, collapse whitespace (stemming not shown).
function normalize(claim: string): string {
  return claim.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, "").replace(/\s+/g, " ").trim();
}

function getClaimHash(claim: string): string {
  // A truncated SHA-256 of the normalized claim serves as the cache key
  return createHash("sha256").update(normalize(claim)).digest("hex").slice(0, 16);
}{{/code}}

**Cache Invalidation Triggers:**
- TTL expiration (default 7 days)
- Major news event related to claim topic
- Significant change in a source's track record
- Manual invalidation by moderator
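
Three of these triggers can be illustrated with a small in-memory cache. This is a sketch only: `ClaimVerdictCache` is a hypothetical class, the per-entry `topic` label is an assumption introduced to support topic-wide invalidation, and a production version would more likely sit on Redis, as recommended in Section 5.3. The source-track-record trigger would need an index from sources to claims and is not shown.

```typescript
interface CacheEntry<V> {
  value: V;
  topic: string;     // assumed label used for topic-wide invalidation
  expiresAt: number; // epoch milliseconds
}

const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

class ClaimVerdictCache<V> {
  private entries = new Map<string, CacheEntry<V>>();
  private ttlMs: number;
  private now: () => number;

  // The clock is injectable so that expiry can be tested deterministically.
  constructor(ttlMs: number = SEVEN_DAYS_MS, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(claimHash: string, value: V, topic: string): void {
    this.entries.set(claimHash, { value, topic, expiresAt: this.now() + this.ttlMs });
  }

  // Returns undefined on a miss or when the entry has outlived its TTL.
  get(claimHash: string): V | undefined {
    const entry = this.entries.get(claimHash);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(claimHash);
      return undefined;
    }
    return entry.value;
  }

  // Manual invalidation, e.g. by a moderator. Returns true if an entry was removed.
  invalidate(claimHash: string): boolean {
    return this.entries.delete(claimHash);
  }

  // Topic-wide invalidation, e.g. after a major news event. Returns the number removed.
  invalidateTopic(topic: string): number {
    let removed = 0;
    this.entries.forEach((entry, key) => {
      if (entry.topic === topic) {
        this.entries.delete(key);
        removed++;
      }
    });
    return removed;
  }
}
```

Expired entries are evicted lazily on read here; a shared store with native key expiry would handle that step itself.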

**Article Verdict Considerations:**
- Article verdict should ALWAYS be dynamic (never cached)
- Same claims in different article contexts may yield different article verdicts
- Example: "Vaccines are safe" + "Vaccines cause autism" → article may be misleading even if first claim is true

=== 6.5 Recommendation ===

**Yes, separating claim verdicts from article verdicts is beneficial**, with the following caveats:

1. **Claim verdicts should be cached** with semantic similarity matching (not just exact match)
2. **Article verdicts should always be dynamic** to account for:
- Claim relationships and logical structure
- Author's argumentative strategy
- Context and framing
- Selective use of true claims to support false conclusions

3. **Implementation phases:**
- Phase 1: Exact-match claim caching (simple hash)
- Phase 2: Semantic similarity caching (embedding-based)
- Phase 3: Federated claim sharing across instances
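
Phase 2 replaces the exact hash lookup with a nearest-neighbour search over claim embeddings. The sketch below shows the core comparison, assuming embeddings are already available as plain number arrays from some embedding model. The 0.9 threshold and the names `cosineSimilarity`, `EmbeddedClaim`, and `findSimilarClaim` are illustrative assumptions; a real threshold would need tuning against labelled claim pairs.

```typescript
// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must have the same non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedClaim {
  claimHash: string;
  embedding: number[];
}

// Linear scan for the most similar known claim at or above the threshold.
function findSimilarClaim(
  query: number[],
  known: EmbeddedClaim[],
  threshold = 0.9,
): EmbeddedClaim | undefined {
  let best: EmbeddedClaim | undefined;
  let bestScore = threshold;
  for (const candidate of known) {
    const score = cosineSimilarity(query, candidate.embedding);
    if (score >= bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

A linear scan is adequate for a small claim set; a larger deployment would typically move this lookup into a vector index. A threshold that is too low risks reusing a verdict for a claim that only looks similar, which would undermine the consistency benefit listed in Section 6.3.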

----

== 7. Summary ==

=== Current State ===

- POC1 successfully implements the core AKEL pipeline
- Claim dependency tracking is implemented
- Multiple LLM providers are supported
- No persistent claim storage or caching

=== Key Gaps from Specification ===

- No scenario extraction
- No user/role system
- No audit trail
- No source track record updates
- No review queue

=== Recommended Next Steps ===

1. Implement claim caching layer
2. Separate claim vs article verdict generation
3. Add Redis for source/search caching
4. Implement tiered model selection
5. Add basic audit logging