Changes for page Data Model

Last modified by Robert Schaub on 2025/12/24 11:48

From version 1.2
edited by Robert Schaub
on 2025/12/24 11:47
Change comment: Renamed back-links.
To version 1.1
edited by Robert Schaub
on 2025/12/24 10:21
Change comment: Imported from XAR

Summary

Details

Page properties
Content
... ... @@ -1,32 +1,25 @@
1 1  = Data Model =
2 -
3 3  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 -
5 5  == 1. Core Entities ==
6 -
7 7  === 1.1 Claim ===
8 -
9 9  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 -
11 11  ==== Performance Optimization: Denormalized Fields ====
12 -
13 13  **Rationale**: The claims system is ~95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by ~70%.
14 14  **Additional cached fields in claims table**:
15 -
16 16  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
17 -* Avoids joining evidence table for listing/preview
18 -* Updated when evidence is added/removed
19 -* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
10 + * Avoids joining evidence table for listing/preview
11 + * Updated when evidence is added/removed
12 + * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
20 20  * **source_names** (TEXT[]): Array of source names for quick display
21 -* Avoids joining through evidence to sources
22 -* Updated when sources change
23 -* Format: `["New York Times", "Nature Journal", ...]`
14 + * Avoids joining through evidence to sources
15 + * Updated when sources change
16 + * Format: `["New York Times", "Nature Journal", ...]`
24 24  * **scenario_count** (INTEGER): Number of scenarios for this claim
25 -* Quick metric without counting rows
26 -* Updated when scenarios added/removed
18 + * Quick metric without counting rows
19 + * Updated when scenarios added/removed
27 27  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
28 -* Helps invalidate stale caches
29 -* Triggers background refresh if too old
21 + * Helps invalidate stale caches
22 + * Triggers background refresh if too old
30 30  **Update Strategy**:
31 31  * **Immediate**: Update on claim edit (user-facing)
32 32  * **Deferred**: Update via background job every hour (non-critical; see the sketch below)
... ... @@ -35,18 +35,13 @@
35 35  * ✅ 70% fewer joins on common queries
36 36  * ✅ Much faster claim list/search pages
37 37  * ✅ Better user experience
38 -* ⚠️ Small storage increase (10%)
31 +* ⚠️ Small storage increase (~10%)
39 39  * ⚠️ Need to keep caches in sync
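To make the refresh concrete, here is a minimal sketch of the deferred update path, assuming plain dicts stand in for claim/evidence rows; helper names and signatures are illustrative, not the actual implementation:
{{code language="python"}}
# Sketch only: dicts stand in for claim/evidence rows; the denormalized
# field names follow the spec above, everything else is an assumption.
from datetime import datetime, timezone

def refresh_claim_cache(claim: dict, evidence: list[dict], scenario_count: int) -> dict:
    """Rebuild the denormalized fields for one claim row."""
    # Top 5 most relevant snippets, in the documented JSONB format
    top = sorted(evidence, key=lambda e: e["relevance_score"], reverse=True)[:5]
    claim["evidence_summary"] = [
        {"text": e["excerpt"], "source": e["source_name"], "relevance": e["relevance_score"]}
        for e in top
    ]
    # Distinct source names for quick display (avoids the evidence -> source join)
    claim["source_names"] = sorted({e["source_name"] for e in evidence})
    # Cheap counter instead of COUNT(*) at read time
    claim["scenario_count"] = scenario_count
    # Lets readers detect stale caches and trigger a background refresh
    claim["cache_updated_at"] = datetime.now(timezone.utc).isoformat()
    return claim
{{/code}}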
40 -
41 41  === 1.2 Evidence ===
42 -
43 43  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
44 -
45 45  === 1.3 Source ===
46 -
47 47  **Purpose**: Track reliability of information sources over time
48 48  **Fields**:
49 -
50 50  * **id** (UUID): Unique identifier
51 51  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
52 52  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -64,21 +64,17 @@
64 64  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
65 65  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
66 66  **Key**: Automated source reliability tracking
67 -
68 68  ==== Source Scoring Process (Separation of Concerns) ====
69 -
70 70  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
71 71  **The Problem**:
72 -
73 73  * Source scores should influence claim verdicts
74 74  * Claim verdicts should update source scores
75 75  * But: Direct feedback creates circular dependency and potential feedback loops
76 76  **The Solution**: Temporal separation
77 -
78 78  ==== Weekly Background Job (Source Scoring) ====
79 -
80 80  Runs independently of claim analysis:
81 -{{code language="python"}}def update_source_scores_weekly():
64 +{{code language="python"}}
65 +def update_source_scores_weekly():
82 82   """
83 83   Background job: Calculate source reliability
84 84   Never triggered by individual claim analysis
... ... @@ -98,12 +98,12 @@
98 98   source.last_updated = now()
99 99   source.save()
100 100   # Job runs: Sunday 2 AM UTC
101 - # Never during claim processing{{/code}}
102 -
85 + # Never during claim processing
86 +{{/code}}
103 103  ==== Real-Time Claim Analysis (AKEL) ====
104 -
105 105  Uses source scores but never updates them:
106 -{{code language="python"}}def analyze_claim(claim_text):
89 +{{code language="python"}}
90 +def analyze_claim(claim_text):
107 107   """
108 108   Real-time: Analyze claim using current source scores
109 109   READ source scores, never UPDATE them
... ... @@ -120,12 +120,10 @@
120 120   verdict = synthesize_verdict(evidence_list)
121 121   # NEVER update source scores here
122 122   # That happens in weekly background job
123 - return verdict{{/code}}
124 -
107 + return verdict
108 +{{/code}}
125 125  ==== Monthly Audit (Quality Assurance) ====
126 -
127 127  Moderator review of flagged source scores:
128 -
129 129  * Verify scores make sense
130 130  * Detect gaming attempts
131 131  * Identify systematic biases
... ... @@ -165,14 +165,11 @@
165 165   → NYT score: 0.89 (trending up)
166 166   → Blog X score: 0.48 (trending down)
167 167  ```
168 -
169 169  === 1.4 Scenario ===
170 -
171 171  **Purpose**: Different interpretations or contexts for evaluating claims
172 172  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
173 173  **Relationship**: One-to-many from Claim to Scenario (**simplified for V1.0**: each scenario belongs to a single claim)
174 174  **Fields**:
175 -
176 176  * **id** (UUID): Unique identifier
177 177  * **claim_id** (UUID): Foreign key to claim (one-to-many)
178 178  * **description** (text): Human-readable description of the scenario
... ... @@ -193,7 +193,6 @@
193 193  **Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
194 194  
195 195  **Core Fields**:
196 -
197 197  * **id** (UUID): Primary key
198 198  * **scenario_id** (UUID FK): The scenario being assessed
199 199  * **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
... ... @@ -209,7 +209,6 @@
209 209  
210 210  **Example**:
211 211  For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
212 -
213 213  * Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
214 214  * After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
215 215  * Edit entity records the complete before/after change with timestamp and reason
... ... @@ -217,18 +217,12 @@
217 217  **Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
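As a sketch of that design, the following shows how a verdict change could flow through the Edit entity; dicts stand in for rows, and the `edit_type` value here is illustrative:
{{code language="python"}}
# Sketch: record a verdict change via the centralized Edit entity.
# Edit fields follow Section 1.7; the edit_type value is illustrative.
import copy
from datetime import datetime, timezone

def apply_verdict_update(verdict: dict, changes: dict, user_id: str, reason: str) -> dict:
    """Mutate a verdict and return the Edit record capturing before/after."""
    before = copy.deepcopy(verdict)
    verdict.update(changes)
    return {
        "entity_type": "Verdict",
        "entity_id": verdict["id"],
        "user_id": user_id,
        "before_state": before,
        "after_state": copy.deepcopy(verdict),
        "edit_type": "CONTENT_UPDATE",  # illustrative edit type
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

# Mirrors the exercise/mental-health example above:
edit = apply_verdict_update(
    {"id": "v1", "likelihood_range": "0.40-0.65 (uncertain)",
     "uncertainty_factors": ["Small sample sizes", "Short-term studies only"]},
    {"likelihood_range": "0.70-0.85 (likely true)",
     "uncertainty_factors": ["Lifestyle confounds remain"]},
    user_id="u42", reason="New RCT evidence added",
)
{{/code}}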
218 218  
219 219  === 1.6 User ===
220 -
221 221  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
222 -
223 -=== User Reputation System ===
224 -
198 +=== User Reputation System ===
225 225  **V1.0 Approach**: Simple manual role assignment
226 226  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
227 -
228 228  === Roles (Manual Assignment) ===
229 -
230 230  **reader** (default):
231 -
232 232  * View published claims and evidence
233 233  * Browse and search content
234 234  * No editing permissions
... ... @@ -247,11 +247,8 @@
247 247  * System configuration
248 248  * Access to all features
249 249  * Founder-appointed initially
250 -
251 251  === Contribution Tracking (Simple) ===
252 -
253 253  **Basic metrics only**:
254 -
255 255  * `contributions_count`: Total number of contributions
256 256  * `created_at`: Account age
257 257  * `last_active`: Recent activity
... ... @@ -260,26 +260,19 @@
260 260  * No automated privilege escalation
261 261  * No reputation decay
262 262  * No threshold-based promotions
263 -
264 264  === Promotion Process ===
265 -
266 266  **Manual review by moderators/admins**:
267 -
268 268  1. User demonstrates value through contributions
269 269  2. Moderator reviews user's contribution history
270 270  3. Moderator promotes user to contributor role
271 271  4. Admin promotes trusted contributors to moderator
272 272  **Criteria** (guidelines, not automated):
273 -
274 274  * Quality of contributions
275 275  * Consistency over time
276 276  * Collaborative behavior
277 277  * Understanding of project goals
278 -
279 279  === V2.0+ Evolution ===
280 -
281 281  **Add complex reputation when**:
282 -
283 283  * 100+ active contributors
284 284  * Manual role management becomes bottleneck
285 285  * Clear patterns of abuse emerge requiring automation
... ... @@ -289,16 +289,11 @@
289 289  * Reputation decay for inactive users
290 290  * Track record scoring for contributors
291 291  See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
292 -
293 293  === 1.7 Edit ===
294 -
295 295  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
296 296  **Purpose**: Complete audit trail for all content changes
297 -
298 298  === Edit History Details ===
299 -
300 300  **What Gets Edited**:
301 -
302 302  * **Claims** (20% edited): assertion, domain, status, scores, analysis
303 303  * **Evidence** (10% edited): excerpt, relevance_score, supports
304 304  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -315,14 +315,12 @@
315 315  * `MODERATION_ACTION`: Hide/unhide for abuse
316 316  * `REVERT`: Rollback to previous version
317 317  **Retention Policy** (5 years total):
318 -
319 319  1. **Hot storage** (3 months): PostgreSQL, instant access
320 320  2. **Warm storage** (2 years): Partitioned, slower queries
321 321  3. **Cold storage** (3 years): S3 compressed, download required
322 322  4. **Deletion**: After 5 years (except legal holds)
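A minimal sketch of these tier boundaries; the tier names and the 5-year cap come from the policy above, while the day arithmetic is our own approximation:
{{code language="python"}}
# Sketch: map an edit's age onto the retention tiers above.
# Boundaries approximate the 3-month / 2-year / 5-year policy in days.
def retention_tier(age_days: int, legal_hold: bool = False) -> str:
    if legal_hold:
        return "hold"        # exempt from deletion
    if age_days <= 90:
        return "hot"         # PostgreSQL, instant access
    if age_days <= 90 + 2 * 365:
        return "warm"        # partitioned, slower queries
    if age_days <= 5 * 365:
        return "cold"        # S3 compressed, download required
    return "delete"          # past the 5-year retention window

assert retention_tier(30) == "hot"
assert retention_tier(400) == "warm"
assert retention_tier(4 * 365) == "cold"
assert retention_tier(6 * 365) == "delete"
{{/code}}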
323 -**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
278 +**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
324 324  **Use Cases**:
325 -
326 326  * View claim history timeline
327 327  * Detect vandalism patterns
328 328  * Learn from user corrections (system improvement)
... ... @@ -329,17 +329,12 @@
329 329  * Legal compliance (audit trail)
330 330  * Rollback capability
331 331  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
332 -
333 333  === 1.8 Flag ===
334 -
335 335  Fields: entity_id, reported_by, issue_type, status, resolution_note
336 -
337 -=== 1.9 QualityMetric ===
338 -
288 +=== 1.9 QualityMetric ===
339 339  **Fields**: metric_type, category, value, target, timestamp
340 340  **Purpose**: Time-series quality tracking
341 341  **Usage**:
342 -
343 343  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
344 344  * **Quality dashboard**: Real-time display with trend charts
345 345  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -346,13 +346,10 @@
346 346  * **A/B testing**: Compare control vs treatment metrics
347 347  * **Improvement validation**: Measure before/after changes
348 348  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
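A small sketch of the alerting check against that record format; whether a metric alerts above or below its target is our assumption, keyed on the metric type:
{{code language="python"}}
# Sketch: threshold check over QualityMetric rows (format as in the example).
# Direction handling is an assumption: error rates and processing times alert
# when above target, confidence-style metrics when below.
def breaches_target(metric: dict) -> bool:
    higher_is_worse = metric["type"] in {"ErrorRate", "ProcessingTime"}
    if higher_is_worse:
        return metric["value"] > metric["target"]
    return metric["value"] < metric["target"]

m = {"type": "ErrorRate", "category": "Politics",
     "value": 0.12, "target": 0.10, "timestamp": "2025-12-17"}
assert breaches_target(m)  # 12% error rate exceeds the 10% target -> alert
{{/code}}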
349 -
350 -=== 1.10 ErrorPattern ===
351 -
298 +=== 1.10 ErrorPattern ===
352 352  **Fields**: error_category, claim_id, description, root_cause, frequency, status
353 353  **Purpose**: Capture errors to trigger system improvements
354 354  **Usage**:
355 -
356 356  * **Error capture**: When users flag issues or system detects problems
357 357  * **Pattern analysis**: Weekly grouping by category and frequency
358 358  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
... ... @@ -364,13 +364,9 @@
364 364  {{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
365 365  
366 366  == 1.5 User Class Diagram ==
367 -
368 368  {{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
369 -
370 370  == 2. Versioning Strategy ==
371 -
372 372  **All Content Entities Are Versioned**:
373 -
374 374  * **Claim**: Every edit creates new version (V1→V2→V3...)
375 375  * **Evidence**: Changes tracked in edit history
376 376  * **Scenario**: Modifications versioned
... ... @@ -391,91 +391,68 @@
391 391  Claim V2: "The sky is blue during daytime"
392 392   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
393 393  ```
394 -
395 395  == 2.5. Storage vs Computation Strategy ==
396 -
397 397  **Critical architectural decision**: What to persist in databases vs compute dynamically?
398 398  **Trade-off**:
399 -
400 400  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
401 401  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
402 -
403 403  === Recommendation: Hybrid Approach ===
404 -
405 405  **STORE (in PostgreSQL):**
406 -
407 407  ==== Claims (Current State + History) ====
408 -
409 409  * **What**: assertion, domain, status, created_at, updated_at, version
410 410  * **Why**: Core entity, must be persistent
411 411  * **Also store**: confidence_score (computed once, then cached)
412 -* **Size**: 1 KB per claim
347 +* **Size**: ~1 KB per claim
413 413  * **Growth**: Linear with claims
414 414  * **Decision**: ✅ STORE - Essential
415 -
416 416  ==== Evidence (All Records) ====
417 -
418 418  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
419 419  * **Why**: Hard to re-gather, user contributions, reproducibility
420 -* **Size**: 2 KB per evidence (with excerpt)
353 +* **Size**: ~2 KB per evidence (with excerpt)
421 421  * **Growth**: 3-10 evidence records per claim
422 422  * **Decision**: ✅ STORE - Essential for reproducibility
423 -
424 424  ==== Sources (Track Records) ====
425 -
426 426  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
427 427  * **Why**: Continuously updated, expensive to recompute
428 -* **Size**: 500 bytes per source
359 +* **Size**: ~500 bytes per source
429 429  * **Growth**: Slow (limited number of sources)
430 430  * **Decision**: ✅ STORE - Essential for quality
431 -
432 432  ==== Edit History (All Versions) ====
433 -
434 434  * **What**: before_state, after_state, user_id, reason, timestamp
435 435  * **Why**: Audit trail, legal requirement, reproducibility
436 -* **Size**: 2 KB per edit
437 -* **Growth**: Linear with edits (A portion of claims get edited)
365 +* **Size**: ~2 KB per edit
366 +* **Growth**: Linear with edits (~20% of claims get edited)
438 438  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
439 439  * **Decision**: ✅ STORE - Essential for accountability
440 -
441 441  ==== Flags (User Reports) ====
442 -
443 443  * **What**: entity_id, reported_by, issue_type, description, status
444 444  * **Why**: Error detection, system improvement triggers
445 -* **Size**: 500 bytes per flag
372 +* **Size**: ~500 bytes per flag
446 446  * **Growth**: ~5-10% of claims get flagged
447 447  * **Decision**: ✅ STORE - Essential for improvement
448 -
449 449  ==== ErrorPatterns (System Improvement) ====
450 -
451 451  * **What**: error_category, claim_id, description, root_cause, frequency, status
452 452  * **Why**: Learning loop, prevent recurring errors
453 -* **Size**: 1 KB per pattern
378 +* **Size**: ~1 KB per pattern
454 454  * **Growth**: Slow (limited patterns, many fixed)
455 455  * **Decision**: ✅ STORE - Essential for learning
456 -
457 457  ==== QualityMetrics (Time Series) ====
458 -
459 459  * **What**: metric_type, category, value, target, timestamp
460 460  * **Why**: Trend analysis, cannot recreate historical metrics
461 -* **Size**: 200 bytes per metric
384 +* **Size**: ~200 bytes per metric
462 462  * **Growth**: Hourly = 8,760 per year per metric type
463 463  * **Retention**: 2 years hot, then aggregate and archive
464 464  * **Decision**: ✅ STORE - Essential for monitoring
465 465  **STORE (Computed Once, Then Cached):**
466 -
467 467  ==== Analysis Summary ====
468 -
469 469  * **What**: Neutral text summary of claim analysis (200-500 words)
470 470  * **Computed**: Once by AKEL when claim first analyzed
471 471  * **Stored in**: Claim table (text field)
472 472  * **Recomputed**: Only when system significantly improves OR claim edited
473 473  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
474 -* **Size**: 2 KB per claim
395 +* **Size**: ~2 KB per claim
475 475  * **Decision**: ✅ STORE (cached) - Cost-effective
476 -
477 477  ==== Confidence Score ====
478 -
479 479  * **What**: 0-100 score of analysis confidence
480 480  * **Computed**: Once by AKEL
481 481  * **Stored in**: Claim table (integer field)
... ... @@ -483,9 +483,7 @@
483 483  * **Why store**: Cheap to store, expensive to compute, users need it fast
484 484  * **Size**: 4 bytes per claim
485 485  * **Decision**: ✅ STORE (cached) - Performance critical
486 -
487 487  ==== Risk Score ====
488 -
489 489  * **What**: 0-100 score of claim risk level
490 490  * **Computed**: Once by AKEL
491 491  * **Stored in**: Claim table (integer field)
... ... @@ -494,17 +494,13 @@
494 494  * **Size**: 4 bytes per claim
495 495  * **Decision**: ✅ STORE (cached) - Performance critical
496 496  **COMPUTE DYNAMICALLY (Never Store):**
497 -
498 -==== Scenarios ====
499 -
500 - ⚠️ CRITICAL DECISION
501 -
414 +==== Scenarios (⚠️ CRITICAL DECISION) ====
502 502  * **What**: 2-5 possible interpretations of claim with assumptions
503 503  * **Current design**: Stored in Scenario table
504 504  * **Alternative**: Compute on-demand when user views claim details
505 -* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
418 +* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
506 506  * **Compute cost**: $0.005-0.01 per request (LLM API call)
507 -* **Frequency**: Viewed in detail by 20% of users
420 +* **Frequency**: Viewed in detail by ~20% of users
508 508  * **Trade-off analysis**:
509 509   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
510 510   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -512,17 +512,12 @@
512 512  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
513 513  * **Decision**: ✅ STORE (hybrid approach below)
514 514  **Scenario Strategy** (APPROVED):
515 -
516 516  1. **Store scenarios** initially when claim analyzed
517 517  2. **Mark as stale** when system improves significantly
518 518  3. **Recompute on next view** if marked stale
519 519  4. **Cache for 30 days** if frequently accessed
520 520  5. **Result**: Best of both worlds - speed + freshness
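A minimal sketch of this read path, assuming a dict-backed store and a `generate_scenarios` callback standing in for the LLM call (~$0.005-0.01 per request, as noted above):
{{code language="python"}}
# Sketch of the approved hybrid strategy: serve stored scenarios,
# recompute lazily when stale or past the 30-day cache window.
from datetime import datetime, timedelta, timezone

CACHE_WINDOW = timedelta(days=30)

def get_scenarios(claim: dict, store: dict, generate_scenarios):
    """Serve stored scenarios; recompute lazily when stale or expired."""
    row = store.get(claim["id"])
    fresh = (
        row is not None
        and not row["stale"]
        and datetime.now(timezone.utc) - row["cached_at"] < CACHE_WINDOW
    )
    if fresh:
        return row["scenarios"]            # fast path: instant, no LLM cost
    scenarios = generate_scenarios(claim)  # slow path: 5-8 s LLM call
    store[claim["id"]] = {
        "scenarios": scenarios,
        "stale": False,
        "cached_at": datetime.now(timezone.utc),
    }
    return scenarios
{{/code}}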
521 -
522 -==== Verdict Synthesis ====
523 -
524 -
525 -
433 +==== Verdict Synthesis ====
526 526  * **What**: Final conclusion text synthesizing all scenarios
527 527  * **Compute cost**: $0.002-0.005 per request
528 528  * **Frequency**: Every time claim viewed
... ... @@ -530,23 +530,17 @@
530 530  * **Speed**: 2-3 seconds (acceptable)
531 531  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
532 532  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
533 -
534 534  ==== Search Results ====
535 -
536 536  * **What**: Lists of claims matching search query
537 537  * **Compute from**: Elasticsearch index
538 538  * **Cache**: 15 minutes in Redis for popular queries
539 539  * **Why not store permanently**: Constantly changing, infinite possible queries
540 -
541 541  ==== Aggregated Statistics ====
542 -
543 543  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
544 544  * **Compute from**: Database queries
545 545  * **Cache**: 1 hour in Redis
546 546  * **Why not store**: Can be derived, relatively cheap to compute
547 -
548 548  ==== User Reputation ====
549 -
550 550  * **What**: Score based on contributions
551 551  * **Current design**: Stored in User table
552 552  * **Alternative**: Compute from Edit table
... ... @@ -556,43 +556,37 @@
556 556  * **Frequency**: Read on every user action
557 557  * **Compute cost**: Simple COUNT query, milliseconds
558 558  * **Decision**: ✅ STORE - Performance critical, read-heavy
559 -
560 560  === Summary Table ===
561 -
562 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |\\
563 -|-----|-|-||----|-----|\\
564 -| Claim core | ✅ | - | 1 KB | STORE | Essential |\\
565 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |\\
566 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |\\
567 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |\\
568 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |\\
569 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
570 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |\\
571 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |\\
572 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |\\
573 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |\\
574 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |\\
575 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |\\
576 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |\\
462 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
463 +|-----------|---------|---------|----------------|----------|-----------|
464 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
465 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
466 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
467 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
468 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
469 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
470 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
471 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
472 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
473 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
474 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
475 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
476 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
577 577  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
578 -**Total storage per claim**: 18 KB (without edits and flags)
478 +**Total storage per claim**: ~18 KB (without edits and flags)
579 579  **For 1 million claims**:
580 -
581 -* **Storage**: 18 GB (manageable)
582 -* **PostgreSQL**: $50/month (standard instance)
583 -* **Redis cache**: $20/month (1 GB instance)
584 -* **S3 archives**: $5/month (old edits)
585 -* **Total**: $75/month infrastructure
480 +* **Storage**: ~18 GB (manageable)
481 +* **PostgreSQL**: ~$50/month (standard instance)
482 +* **Redis cache**: ~$20/month (1 GB instance)
483 +* **S3 archives**: ~$5/month (old edits)
484 +* **Total**: ~$75/month infrastructure
586 586  **LLM cost savings by caching**:
587 587  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
588 588  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
589 589  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
590 -* **Total savings**: $35K per 1M claims vs recomputing every time
591 -
489 +* **Total savings**: ~$35K per 1M claims vs recomputing every time
592 592  === Recomputation Triggers ===
593 -
594 594  **When to mark cached data as stale and recompute:**
595 -
596 596  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
597 597  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
598 598  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -599,15 +599,11 @@
599 599  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
600 600  5. **Controversy detected** (high flag rate) → Recompute: risk score
601 601  **Recomputation strategy**:
602 -
603 603  * **Eager**: Immediately recompute (for user edits)
604 604  * **Lazy**: Recompute on next view (for system improvements)
605 605  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
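A compact sketch of these triggers as a lookup table; the trigger names are ours, while the artifact sets follow items 1-5 above:
{{code language="python"}}
# Sketch: trigger -> artifacts to mark stale, per the list above.
STALE_MAP = {
    "claim_edited":       {"analysis", "scenarios", "verdict", "scores"},
    "evidence_added":     {"scenarios", "verdict", "confidence_score"},
    "source_score_shift": {"confidence_score", "verdict"},   # >10-point change
    "system_improvement": {"analysis", "scenarios", "verdict", "scores"},
    "controversy":        {"risk_score"},                    # high flag rate
}

def mark_stale(claim: dict, trigger: str) -> None:
    claim.setdefault("stale", set()).update(STALE_MAP[trigger])

claim = {"id": "c1"}
mark_stale(claim, "evidence_added")
assert "verdict" in claim["stale"]
{{/code}}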
606 -
607 607  === Database Size Projection ===
608 -
609 609  **Year 1**: 10K claims
610 -
611 611  * Storage: 180 MB
612 612  * Cost: $10/month
613 613  **Year 3**: 100K claims
... ... @@ -621,21 +621,15 @@
621 621  * Cost: $300/month
622 622  * Optimization: Archive old claims to S3 ($5/TB/month)
623 623  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
624 -
625 625  == 3. Key Simplifications ==
626 -
627 627  * **Two content states only**: Published, Hidden
628 628  * **Three user roles only**: Reader, Contributor, Moderator
629 629  * **No complex versioning**: Linear edit history
630 630  * **Reputation-based permissions**: Not role hierarchy
631 631  * **Source track records**: Continuous evaluation
632 -
633 633  == 4. What Gets Stored in the Database ==
634 -
635 635  === 4.1 Primary Storage (PostgreSQL) ===
636 -
637 637  **Claims Table**:
638 -
639 639  * Current state only (latest version)
640 640  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
641 641  **Evidence Table**:
... ... @@ -662,44 +662,31 @@
662 662  **QualityMetric Table**:
663 663  * Time-series quality data
664 664  * Fields: id, metric_type, metric_category, value, target, timestamp
665 -
666 666  === 4.2 What's NOT Stored (Computed on-the-fly) ===
667 -
668 668  * **Verdicts**: Re-synthesized from evidence + scenarios when marked stale (cached otherwise, per Section 2.5)
669 669  * **Risk scores**: Recalculated when recomputation triggers fire (cached otherwise, per Section 2.5)
670 670  * **Aggregated statistics**: Computed from base data
671 671  * **Search results**: Generated from Elasticsearch index
672 -
673 673  === 4.3 Cache Layer (Redis) ===
674 -
675 675  **Cached for performance**:
676 -
677 677  * Frequently accessed claims (TTL: 1 hour)
678 678  * Search results (TTL: 15 minutes)
679 679  * User sessions (TTL: 24 hours)
680 680  * Source track records (TTL: 1 hour)
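A minimal read-through sketch with these TTLs, assuming the redis-py client; the key naming and loader callback are our own conventions:
{{code language="python"}}
# Sketch: read-through cache with the TTLs listed above (seconds).
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

TTL = {"claim": 3600, "search": 900, "session": 86400, "source": 3600}

def cached(kind: str, key: str, load):
    """Return cached JSON if present, else load and cache with the kind's TTL."""
    full_key = f"{kind}:{key}"
    hit = r.get(full_key)
    if hit is not None:
        return json.loads(hit)
    value = load()                         # e.g. PostgreSQL query
    r.setex(full_key, TTL[kind], json.dumps(value))
    return value
{{/code}}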
681 -
682 682  === 4.4 File Storage (S3) ===
683 -
684 684  **Archived content**:
685 -
686 686  * Old edit history (>3 months)
687 687  * Evidence documents (archived copies)
688 688  * Database backups
689 689  * Export files
690 -
691 691  === 4.5 Search Index (Elasticsearch) ===
692 -
693 693  **Indexed for search**:
694 -
695 695  * Claim assertions (full-text)
696 696  * Evidence excerpts (full-text)
697 697  * Scenario descriptions (full-text)
698 698  * Source names (autocomplete)
699 699  Synchronized from PostgreSQL via change data capture or periodic sync.
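For the periodic-sync option, a minimal sketch assuming the elasticsearch-py client; the index name and row format are illustrative:
{{code language="python"}}
# Sketch: periodic sync of changed claims into the search index.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def sync_claims(rows):
    """Index claim assertions changed since the last sync run."""
    for row in rows:  # e.g. SELECT ... WHERE updated_at > :last_sync
        es.index(
            index="claims",
            id=row["id"],
            document={"assertion": row["assertion"], "domain": row["domain"]},
        )
{{/code}}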
700 -
701 701  == 5. Related Pages ==
702 -
703 -* [[Architecture>>Test.FactHarbor V0\.9\.100.Specification.Architecture.WebHome]]
576 +* [[Architecture>>Test.FactHarbor.Specification.Architecture.WebHome]]
704 704  * [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
705 705  * [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]