Changes for page Data Model

Last modified by Robert Schaub on 2025/12/22 14:16

From version 1.1
edited by Robert Schaub
on 2025/12/22 14:10
Change comment: Imported from XAR
To version 1.2
edited by Robert Schaub
on 2025/12/22 14:16
Change comment: Renamed back-links.

Summary

Details

Page properties
Content
... ... @@ -1,18 +1,11 @@
1 1  = Data Model =
2 -
3 3  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 -
5 5  == 1. Core Entities ==
6 -
7 7  === 1.1 Claim ===
8 -
9 9  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 -
11 11  ==== Performance Optimization: Denormalized Fields ====
12 -
13 13  **Rationale**: The claims system is ~95% reads, 5% writes. Denormalizing commonly read data reduces joins and improves query performance by ~70%.
14 14  **Additional cached fields in the claims table**:
15 -
16 16  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
** Avoids joining evidence table for listing/preview
** Updated when evidence is added/removed
** Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 17  * **source_names** (TEXT[]): Array of source names for quick display
** Avoids joining through evidence to sources
** Updated when sources change
** Format: `["New York Times", "Nature Journal", ...]`
18 18  * **scenario_count** (INTEGER): Number of scenarios for this claim
** Quick metric without counting rows
** Updated when scenarios added/removed
... ... @@ -25,18 +25,13 @@
25 25  * ✅ 70% fewer joins on common queries
26 26  * ✅ Much faster claim list/search pages
27 27  * ✅ Better user experience
28 -* ⚠️ Small storage increase (10%)
21 +* ⚠️ Small storage increase (~10%)
29 29  * ⚠️ Need to keep caches in sync (see the refresh sketch below)
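The sync concern above is typically handled by a small refresh hook on the write path. A minimal sketch, assuming hypothetical helpers (`Claim.get`, `top_evidence_for`, `source_names_for`, `count_scenarios`) rather than FactHarbor's actual API:

{{code language="python"}}
# Sketch only: all helper names are illustrative assumptions.
def refresh_claim_caches(claim_id):
    claim = Claim.get(claim_id)
    # Rebuild evidence_summary from the top 5 most relevant snippets
    top5 = top_evidence_for(claim_id, limit=5)
    claim.evidence_summary = [
        {"text": e.excerpt, "source": e.source_name, "relevance": e.relevance_score}
        for e in top5
    ]
    # Rebuild the denormalized source list and scenario count
    claim.source_names = source_names_for(claim_id)
    claim.scenario_count = count_scenarios(claim_id)
    claim.save()

# Called whenever evidence, sources, or scenarios change;
# cheap because writes are only ~5% of traffic.
{{/code}}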
30 -
31 31  === 1.2 Evidence ===
32 -
33 33  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
34 -
35 35  === 1.3 Source ===
36 -
37 37  **Purpose**: Track reliability of information sources over time
38 38  **Fields**:
39 -
40 40  * **id** (UUID): Unique identifier
41 41  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
42 42  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -54,30 +54,24 @@
54 54  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
55 55  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
56 56  **Key**: Automated source reliability tracking
57 -
58 58  ==== Source Scoring Process (Separation of Concerns) ====
59 -
60 60  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
61 61  **The Problem**:
* Source scores should influence claim verdicts
62 -
63 63  * Claim verdicts should update source scores
64 64  * But: Direct feedback creates circular dependency and potential feedback loops
65 65  **The Solution**: Temporal separation
66 -
67 67  ==== Weekly Background Job (Source Scoring) ====
68 -
69 69  Runs independently of claim analysis:
70 -{{code language="python"}}def update_source_scores_weekly(): """ Background job: Calculate source reliability Never triggered by individual claim analysis """ # Analyze all claims from past week claims = get_claims_from_past_week() for source in get_all_sources(): # Calculate accuracy metrics correct_verdicts = count_correct_verdicts_citing(source, claims) total_citations = count_total_citations(source, claims) accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5 # Weight by claim importance weighted_score = calculate_weighted_score(source, claims) # Update source record source.track_record_score = weighted_score source.total_citations = total_citations source.last_updated = now() source.save() # Job runs: Sunday 2 AM UTC # Never during claim processing{{/code}}
71 -
53 +{{code language="python"}}
54 +def update_source_scores_weekly():
    """
    Background job: Calculate source reliability.
    Never triggered by individual claim analysis.
    """
    # Analyze all claims from past week
    claims = get_claims_from_past_week()
    for source in get_all_sources():
        # Calculate accuracy metrics
        correct_verdicts = count_correct_verdicts_citing(source, claims)
        total_citations = count_total_citations(source, claims)
        accuracy = correct_verdicts / total_citations if total_citations > 0 else 0.5
        # Weight by claim importance
        weighted_score = calculate_weighted_score(source, claims)
        # Update source record
        source.track_record_score = weighted_score
        source.total_citations = total_citations
        source.last_updated = now()
        source.save()

# Job runs: Sunday 2 AM UTC
# Never during claim processing
55 +{{/code}}
72 72  ==== Real-Time Claim Analysis (AKEL) ====
73 -
74 74  Uses source scores but never updates them:
75 -{{code language="python"}}def analyze_claim(claim_text): """ Real-time: Analyze claim using current source scores READ source scores, never UPDATE them """ # Gather evidence evidence_list = gather_evidence(claim_text) for evidence in evidence_list: # READ source score (snapshot from last weekly update) source = get_source(evidence.source_id) source_score = source.track_record_score # Use score to weight evidence evidence.weighted_relevance = evidence.relevance * source_score # Generate verdict using weighted evidence verdict = synthesize_verdict(evidence_list) # NEVER update source scores here # That happens in weekly background job return verdict{{/code}}
76 -
58 +{{code language="python"}}
59 +def analyze_claim(claim_text):
    """
    Real-time: Analyze claim using current source scores.
    READ source scores, never UPDATE them.
    """
    # Gather evidence
    evidence_list = gather_evidence(claim_text)
    for evidence in evidence_list:
        # READ source score (snapshot from last weekly update)
        source = get_source(evidence.source_id)
        source_score = source.track_record_score
        # Use score to weight evidence
        evidence.weighted_relevance = evidence.relevance * source_score
    # Generate verdict using weighted evidence
    verdict = synthesize_verdict(evidence_list)
    # NEVER update source scores here
    # That happens in weekly background job
    return verdict
60 +{{/code}}
77 77  ==== Monthly Audit (Quality Assurance) ====
78 -
79 79  Moderator review of flagged source scores:
80 -
81 81  * Verify scores make sense
82 82  * Detect gaming attempts
83 83  * Identify systematic biases
... ... @@ -111,14 +111,11 @@
111 111  Monday-Saturday: Claims processed using these scores → All claims this week use NYT=0.87 → All claims this week use Blog X=0.52
112 112  Next Sunday 2 AM: Recalculate scores including this week's claims → NYT score: 0.89 (trending up) → Blog X score: 0.48 (trending down)
113 113  ```
114 -
115 115  === 1.4 Scenario ===
116 -
117 117  **Purpose**: Different interpretations or contexts for evaluating claims
118 118  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
119 119  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to single claim)
120 120  **Fields**:
121 -
122 122  * **id** (UUID): Unique identifier
123 123  * **claim_id** (UUID): Foreign key to claim (one-to-many)
124 124  * **description** (text): Human-readable description of the scenario
... ... @@ -145,16 +145,11 @@
145 145  * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
146 146  * Edit entity records the complete before/after change with timestamp and reason
**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
=== 1.6 User ===
147 147  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
148 -
149 -=== User Reputation System ===
150 -
127 +=== User Reputation System ===
151 151  **V1.0 Approach**: Simple manual role assignment
152 152  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
153 -
154 154  === Roles (Manual Assignment) ===
155 -
156 156  **reader** (default):
157 -
158 158  * View published claims and evidence
159 159  * Browse and search content
160 160  * No editing permissions
... ... @@ -173,11 +173,8 @@
173 173  * System configuration
174 174  * Access to all features
175 175  * Founder-appointed initially
176 -
177 177  === Contribution Tracking (Simple) ===
178 -
179 179  **Basic metrics only**:
180 -
181 181  * `contributions_count`: Total number of contributions
182 182  * `created_at`: Account age
183 183  * `last_active`: Recent activity
... ... @@ -186,26 +186,19 @@
186 186  * No automated privilege escalation
187 187  * No reputation decay
188 188  * No threshold-based promotions
189 -
190 190  === Promotion Process ===
191 -
192 192  **Manual review by moderators/admins**:
193 -
194 194  1. User demonstrates value through contributions
195 195  2. Moderator reviews user's contribution history
196 196  3. Moderator promotes user to contributor role
197 197  4. Admin promotes trusted contributors to moderator
198 198  **Criteria** (guidelines, not automated):
199 -
200 200  * Quality of contributions
201 201  * Consistency over time
202 202  * Collaborative behavior
203 203  * Understanding of project goals
204 -
205 205  === V2.0+ Evolution ===
206 -
207 207  **Add complex reputation when**:
208 -
209 209  * 100+ active contributors
210 210  * Manual role management becomes bottleneck
211 211  * Clear patterns of abuse emerge requiring automation
... ... @@ -215,16 +215,11 @@
215 215  * Reputation decay for inactive users
216 216  * Track record scoring for contributors
217 217  See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
218 -
219 219  === 1.7 Edit ===
220 -
221 221  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
222 222  **Purpose**: Complete audit trail for all content changes
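For illustration, an edit record could be written as in the following sketch; `Edit.create`, `to_json`, and `now` are assumed helpers, not confirmed API:

{{code language="python"}}
# Sketch only: Edit.create, to_json and now() are assumptions.
def record_edit(entity, user_id, edit_type, reason, apply_change):
    before = to_json(entity)    # snapshot before the change
    apply_change(entity)        # perform the mutation
    after = to_json(entity)     # snapshot after the change
    Edit.create(
        entity_type=type(entity).__name__,  # "Claim", "Evidence", ...
        entity_id=entity.id,
        user_id=user_id,
        before_state=before,
        after_state=after,
        edit_type=edit_type,                # e.g. "USER_EDIT", "REVERT"
        reason=reason,
        created_at=now(),
    )
{{/code}}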
223 -
224 224  === Edit History Details ===
225 -
226 226  **What Gets Edited**:
227 -
228 228  * **Claims** (20% edited): assertion, domain, status, scores, analysis
229 229  * **Evidence** (10% edited): excerpt, relevance_score, supports
230 230  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -241,14 +241,12 @@
241 241  * `MODERATION_ACTION`: Hide/unhide for abuse
242 242  * `REVERT`: Rollback to previous version
243 243  **Retention Policy** (5 years total):
244 -
245 245  1. **Hot storage** (3 months): PostgreSQL, instant access
246 246  2. **Warm storage** (2 years): Partitioned, slower queries
247 247  3. **Cold storage** (3 years): S3 compressed, download required
248 248  4. **Deletion**: After 5 years (except legal holds)
249 -**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
207 +**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
250 250  **Use Cases**:
251 -
252 252  * View claim history timeline
253 253  * Detect vandalism patterns
254 254  * Learn from user corrections (system improvement)
... ... @@ -255,17 +255,12 @@
255 255  * Legal compliance (audit trail)
256 256  * Rollback capability
257 257  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
258 -
259 259  === 1.8 Flag ===
260 -
261 261  Fields: entity_id, reported_by, issue_type, status, resolution_note
262 -
263 263  === 1.9 QualityMetric ===
264 -
265 265  **Fields**: metric_type, category, value, target, timestamp
266 266  **Purpose**: Time-series quality tracking
267 267  **Usage**:
268 -
269 269  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
270 270  * **Quality dashboard**: Real-time display with trend charts
271 271  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -272,13 +272,10 @@
272 272  * **A/B testing**: Compare control vs treatment metrics
273 273  * **Improvement validation**: Measure before/after changes
274 274  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
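A threshold check against such a record could look like this sketch; `send_alert` is a hypothetical helper, and the comparison assumes a lower-is-better metric such as ErrorRate:

{{code language="python"}}
# Sketch only: send_alert is hypothetical; for metrics where
# lower is better (e.g. ErrorRate), exceeding target triggers an alert.
def check_metric(metric):
    if metric.value > metric.target:
        send_alert(
            f"{metric.metric_type}/{metric.category}: "
            f"value {metric.value:.2f} exceeds target {metric.target:.2f}"
        )
{{/code}}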
275 -
276 276  === 1.10 ErrorPattern ===
277 -
278 278  **Fields**: error_category, claim_id, description, root_cause, frequency, status
279 279  **Purpose**: Capture errors to trigger system improvements
280 280  **Usage**:
281 -
282 282  * **Error capture**: When users flag issues or system detects problems
283 283  * **Pattern analysis**: Weekly grouping by category and frequency
284 284  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -285,11 +285,8 @@
285 285  * **Metrics**: Track error rate reduction over time
286 286  **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
== 1.4 Core Data Model ERD ==
{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
== 1.5 User Class Diagram ==
287 287  {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
288 -
289 289  == 2. Versioning Strategy ==
290 -
291 291  **All Content Entities Are Versioned**:
292 -
293 293  * **Claim**: Every edit creates new version (V1→V2→V3...)
294 294  * **Evidence**: Changes tracked in edit history
295 295  * **Scenario**: Modifications versioned
... ... @@ -307,91 +307,68 @@
307 307  ```
308 308  Claim V1: "The sky is blue" → User edits → Claim V2: "The sky is blue during daytime" → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
309 309  ```
310 -
311 311  == 2.5. Storage vs Computation Strategy ==
312 -
313 313  **Critical architectural decision**: What to persist in databases vs compute dynamically?
314 314  **Trade-off**:
315 -
316 316  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
317 317  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
318 -
319 319  === Recommendation: Hybrid Approach ===
320 -
321 321  **STORE (in PostgreSQL):**
322 -
323 323  ==== Claims (Current State + History) ====
324 -
325 325  * **What**: assertion, domain, status, created_at, updated_at, version
326 326  * **Why**: Core entity, must be persistent
327 327  * **Also store**: confidence_score (computed once, then cached)
328 -* **Size**: 1 KB per claim
267 +* **Size**: ~1 KB per claim
329 329  * **Growth**: Linear with claims
330 330  * **Decision**: ✅ STORE - Essential
331 -
332 332  ==== Evidence (All Records) ====
333 -
334 334  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
335 335  * **Why**: Hard to re-gather, user contributions, reproducibility
336 -* **Size**: 2 KB per evidence (with excerpt)
273 +* **Size**: ~2 KB per evidence (with excerpt)
337 337  * **Growth**: 3-10 evidence per claim
338 338  * **Decision**: ✅ STORE - Essential for reproducibility
339 -
340 340  ==== Sources (Track Records) ====
341 -
342 342  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
343 343  * **Why**: Continuously updated, expensive to recompute
344 -* **Size**: 500 bytes per source
279 +* **Size**: ~500 bytes per source
345 345  * **Growth**: Slow (limited number of sources)
346 346  * **Decision**: ✅ STORE - Essential for quality
347 -
348 348  ==== Edit History (All Versions) ====
349 -
350 350  * **What**: before_state, after_state, user_id, reason, timestamp
351 351  * **Why**: Audit trail, legal requirement, reproducibility
352 -* **Size**: 2 KB per edit
353 -* **Growth**: Linear with edits (A portion of claims get edited)
285 +* **Size**: ~2 KB per edit
286 +* **Growth**: Linear with edits (~20% of claims get edited)
354 354  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
355 355  * **Decision**: ✅ STORE - Essential for accountability
356 -
357 357  ==== Flags (User Reports) ====
358 -
359 359  * **What**: entity_id, reported_by, issue_type, description, status
360 360  * **Why**: Error detection, system improvement triggers
361 -* **Size**: 500 bytes per flag
292 +* **Size**: ~500 bytes per flag
362 362  * **Growth**: 5-10% of claims get flagged
363 363  * **Decision**: ✅ STORE - Essential for improvement
364 -
365 365  ==== ErrorPatterns (System Improvement) ====
366 -
367 367  * **What**: error_category, claim_id, description, root_cause, frequency, status
368 368  * **Why**: Learning loop, prevent recurring errors
369 -* **Size**: 1 KB per pattern
298 +* **Size**: ~1 KB per pattern
370 370  * **Growth**: Slow (limited patterns, many fixed)
371 371  * **Decision**: ✅ STORE - Essential for learning
372 -
373 373  ==== QualityMetrics (Time Series) ====
374 -
375 375  * **What**: metric_type, category, value, target, timestamp
376 376  * **Why**: Trend analysis, cannot recreate historical metrics
377 -* **Size**: 200 bytes per metric
304 +* **Size**: ~200 bytes per metric
378 378  * **Growth**: Hourly = 8,760 per year per metric type
379 379  * **Retention**: 2 years hot, then aggregate and archive
380 380  * **Decision**: ✅ STORE - Essential for monitoring
381 381  **STORE (Computed Once, Then Cached):**
382 -
383 383  ==== Analysis Summary ====
384 -
385 385  * **What**: Neutral text summary of claim analysis (200-500 words)
386 386  * **Computed**: Once by AKEL when claim first analyzed
387 387  * **Stored in**: Claim table (text field)
388 388  * **Recomputed**: Only when system significantly improves OR claim edited
389 389  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
390 -* **Size**: 2 KB per claim
315 +* **Size**: ~2 KB per claim
391 391  * **Decision**: ✅ STORE (cached) - Cost-effective
392 -
393 393  ==== Confidence Score ====
394 -
395 395  * **What**: 0-100 score of analysis confidence
396 396  * **Computed**: Once by AKEL
397 397  * **Stored in**: Claim table (integer field)
... ... @@ -399,9 +399,7 @@
399 399  * **Why store**: Cheap to store, expensive to compute, users need it fast
400 400  * **Size**: 4 bytes per claim
401 401  * **Decision**: ✅ STORE (cached) - Performance critical
402 -
403 403  ==== Risk Score ====
404 -
405 405  * **What**: 0-100 score of claim risk level
406 406  * **Computed**: Once by AKEL
407 407  * **Stored in**: Claim table (integer field)
... ... @@ -410,33 +410,24 @@
410 410  * **Size**: 4 bytes per claim
411 411  * **Decision**: ✅ STORE (cached) - Performance critical
412 412  **COMPUTE DYNAMICALLY (Never Store):**
413 -
414 -==== Scenarios ====
415 -
416 - ⚠️ CRITICAL DECISION
417 -
334 +==== Scenarios ====
⚠️ CRITICAL DECISION
418 418  * **What**: 2-5 possible interpretations of claim with assumptions
419 419  * **Current design**: Stored in Scenario table
420 420  * **Alternative**: Compute on-demand when user views claim details
421 -* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
338 +* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
422 422  * **Compute cost**: $0.005-0.01 per request (LLM API call)
423 -* **Frequency**: Viewed in detail by 20% of users
340 +* **Frequency**: Viewed in detail by ~20% of users
424 424  * **Trade-off analysis**:
** IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
** IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
425 425  * **Reproducibility**: Scenarios may improve as AI improves (good to recompute)
426 426  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
427 427  * **Decision**: ✅ STORE (hybrid approach below)
428 428  **Scenario Strategy** (APPROVED):
429 -
430 430  1. **Store scenarios** initially when claim analyzed
431 431  2. **Mark as stale** when system improves significantly
432 432  3. **Recompute on next view** if marked stale
433 433  4. **Cache for 30 days** if frequently accessed
434 434  5. **Result**: Best of both worlds - speed + freshness (sketched below)
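A minimal sketch of this hybrid read path, assuming hypothetical load/recompute/store/cache helpers:

{{code language="python"}}
# Sketch only: helper names are assumptions.
def get_scenarios(claim_id):
    scenarios = load_scenarios(claim_id)           # stored at first analysis
    if any(s.is_stale for s in scenarios):
        scenarios = recompute_scenarios(claim_id)  # LLM call on next view
        store_scenarios(claim_id, scenarios)       # refresh the stored copy
    cache_for_30_days(claim_id, scenarios)         # keep hot if accessed often
    return scenarios
{{/code}}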
435 -
436 -==== Verdict Synthesis ====
437 -
438 - ~* **What**: Final conclusion text synthesizing all scenarios
439 -
351 +==== Verdict Synthesis ====
* **What**: Final conclusion text synthesizing all scenarios
440 440  * **Compute cost**: $0.002-0.005 per request
441 441  * **Frequency**: Every time claim viewed
442 442  * **Why not store**: Changes as evidence/scenarios change, users want fresh analysis
... ... @@ -443,23 +443,17 @@
443 443  * **Speed**: 2-3 seconds (acceptable)
444 444  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
445 445  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
446 -
447 447  ==== Search Results ====
448 -
449 449  * **What**: Lists of claims matching search query
450 450  * **Compute from**: Elasticsearch index
451 451  * **Cache**: 15 minutes in Redis for popular queries
452 452  * **Why not store permanently**: Constantly changing, infinite possible queries
453 -
454 454  ==== Aggregated Statistics ====
455 -
456 456  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
457 457  * **Compute from**: Database queries
458 458  * **Cache**: 1 hour in Redis
459 459  * **Why not store**: Can be derived, relatively cheap to compute
460 -
461 461  ==== User Reputation ====
462 -
463 463  * **What**: Score based on contributions
464 464  * **Current design**: Stored in User table
465 465  * **Alternative**: Compute from Edit table
... ... @@ -467,42 +467,36 @@
467 467  * **Frequency**: Read on every user action
468 468  * **Compute cost**: Simple COUNT query, milliseconds
469 469  * **Decision**: ✅ STORE - Performance critical, read-heavy
470 -
471 471  === Summary Table ===
472 -
473 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\
474 -|-----|-|-||----|-----|\\
475 -| Claim core | ✅ | - | 1 KB | STORE | Essential |\\
476 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\
477 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\
478 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\
479 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\
480 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
481 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
482 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\
483 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\
484 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\
485 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\
486 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\
487 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\
377 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
378 +|-----------|---------|---------|----------------|----------|-----------|
379 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
380 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
381 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
382 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
383 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
384 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
385 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
386 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
387 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
388 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
389 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
390 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
391 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
488 488  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
489 -**Total storage per claim**: 18 KB (without edits and flags)
393 +**Total storage per claim**: ~18 KB (without edits and flags)
490 490  **For 1 million claims**:
491 -
492 -* **Storage**: 18 GB (manageable)
493 -* **PostgreSQL**: $50/month (standard instance)
494 -* **Redis cache**: $20/month (1 GB instance)
495 -* **S3 archives**: $5/month (old edits)
496 -* **Total**: $75/month infrastructure
395 +* **Storage**: ~18 GB (manageable)
396 +* **PostgreSQL**: ~$50/month (standard instance)
397 +* **Redis cache**: ~$20/month (1 GB instance)
398 +* **S3 archives**: ~$5/month (old edits)
399 +* **Total**: ~$75/month infrastructure
497 497  **LLM cost savings by caching**:
498 498  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
499 499  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
* Verdict stored: Save $0.003 per claim = $3K per 1M claims
500 -* **Total savings**: $35K per 1M claims vs recomputing every time
501 -
403 +* **Total savings**: ~$35K per 1M claims vs recomputing every time
502 502  === Recomputation Triggers ===
503 -
504 504  **When to mark cached data as stale and recompute:**
505 -
506 506  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
507 507  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
508 508  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -509,15 +509,11 @@
509 509  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
510 510  5. **Controversy detected** (high flag rate) → Recompute: risk score
511 511  **Recomputation strategy**:
512 -
513 513  * **Eager**: Immediately recompute (for user edits)
514 514  * **Lazy**: Recompute on next view (for system improvements)
515 515  * **Batch**: Nightly re-evaluation of stale claims (if <1000); all three modes are sketched below
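The three modes could be wired up roughly as follows; all function names are assumptions:

{{code language="python"}}
# Sketch only: illustrative names, not confirmed API.
def on_claim_edited(claim):
    recompute_all(claim)                 # eager: the editing user is waiting

def on_system_improvement(affected_claims):
    for claim in affected_claims:
        claim.mark_stale()               # lazy: recompute on next view

def nightly_batch():
    stale_claims = get_stale_claims(limit=1000)  # batch only if <1000 stale
    for claim in stale_claims:
        recompute_all(claim)
{{/code}}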
516 -
517 517  === Database Size Projection ===
518 -
519 519  **Year 1**: 10K claims
520 -
521 521  * Storage: 180 MB
522 522  * Cost: $10/month
523 523  **Year 3**: 100K claims
* Storage: 1.8 GB
... ... @@ -529,21 +529,15 @@
529 529  * Cost: $300/month
530 530  * Optimization: Archive old claims to S3 ($5/TB/month)
531 531  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
532 -
533 533  == 3. Key Simplifications ==
534 -
535 535  * **Two content states only**: Published, Hidden
536 536  * **Three user roles only**: Reader, Contributor, Moderator
537 537  * **No complex versioning**: Linear edit history
538 538  * **Reputation-based permissions**: Not role hierarchy
539 539  * **Source track records**: Continuous evaluation
540 -
541 541  == 4. What Gets Stored in the Database ==
542 -
543 543  === 4.1 Primary Storage (PostgreSQL) ===
544 -
545 545  **Claims Table**:
546 -
547 547  * Current state only (latest version)
548 548  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
549 549  **Evidence Table**:
... ... @@ -570,44 +570,31 @@
570 570  **QualityMetric Table**:
571 571  * Time-series quality data
572 572  * Fields: id, metric_type, metric_category, value, target, timestamp
573 -
574 574  === 4.2 What's NOT Stored (Computed on-the-fly) ===
575 -
576 576  * **Verdicts**: Synthesized from evidence + scenarios when the cached copy is stale (see 2.5)
577 577  * **Risk scores**: Recalculated when recomputation triggers fire (see 2.5)
578 578  * **Aggregated statistics**: Computed from base data
579 579  * **Search results**: Generated from Elasticsearch index
580 -
581 581  === 4.3 Cache Layer (Redis) ===
582 -
583 583  **Cached for performance** (see the sketch after this list):
584 -
585 585  * Frequently accessed claims (TTL: 1 hour)
586 586  * Search results (TTL: 15 minutes)
587 587  * User sessions (TTL: 24 hours)
588 588  * Source track records (TTL: 1 hour)
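A minimal sketch of this read-through pattern with redis-py; `load_claim_from_postgres` is an assumed helper and the key scheme is illustrative:

{{code language="python"}}
import json
import redis

r = redis.Redis()

def get_claim_cached(claim_id):
    key = f"claim:{claim_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit
    claim = load_claim_from_postgres(claim_id)  # assumed helper
    r.setex(key, 3600, json.dumps(claim))       # TTL: 1 hour, as above
    return claim
{{/code}}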
589 -
590 590  === 4.4 File Storage (S3) ===
591 -
592 592  **Archived content**:
593 -
594 594  * Old edit history (>3 months)
595 595  * Evidence documents (archived copies)
596 596  * Database backups
597 597  * Export files
598 -
599 599  === 4.5 Search Index (Elasticsearch) ===
600 -
601 601  **Indexed for search**:
602 -
603 603  * Claim assertions (full-text)
604 604  * Evidence excerpts (full-text)
605 605  * Scenario descriptions (full-text)
606 606  * Source names (autocomplete)
607 607  Synchronized from PostgreSQL via change data capture or periodic sync.
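For the periodic-sync option, a rough sketch using the elasticsearch-py bulk helper; the index name, field selection, and `fetch_claims_updated_since` are assumptions:

{{code language="python"}}
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def sync_claims_since(last_sync_time):
    rows = fetch_claims_updated_since(last_sync_time)  # assumed helper
    actions = (
        {
            "_index": "claims",  # assumed index name
            "_id": row["id"],
            "_source": {
                "assertion": row["assertion"],
                "domain": row["domain"],
            },
        }
        for row in rows
    )
    helpers.bulk(es, actions)  # bulk-index the changed claims
{{/code}}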
608 -
609 609  == 5. Related Pages ==
610 -
611 -* [[Architecture>>Test.FactHarbor pre11 V0\.9\.70.Specification.Architecture.WebHome]]
488 +* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
612 612  * [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
613 613  * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]