Changes for page Data Model

Last modified by Robert Schaub on 2025/12/24 21:46

From version 1.1
edited by Robert Schaub
on 2025/12/18 12:03
Change comment: Imported from XAR
To version 4.2
edited by Robert Schaub
on 2025/12/21 13:38
Change comment: Renamed back-links.

Summary

Details

Page properties
Content
... ... @@ -1,25 +1,32 @@
1 1  = Data Model =
2 +
2 2  FactHarbor's data model is **simple, focused, and designed for automated processing**.
4 +
3 3  == 1. Core Entities ==
6 +
4 4  === 1.1 Claim ===
8 +
5 5  Fields: id, assertion, domain, **status** (Published/Hidden only), **confidence_score**, **risk_score**, completeness_score, version, views, edit_count
10 +
6 6  ==== Performance Optimization: Denormalized Fields ====
12 +
7 7  **Rationale**: Claims system is 95% reads, 5% writes. Denormalizing common data reduces joins and improves query performance by 70%.
8 8  **Additional cached fields in claims table**:
15 +
9 9  * **evidence_summary** (JSONB): Top 5 most relevant evidence snippets with scores
10 - * Avoids joining evidence table for listing/preview
11 - * Updated when evidence is added/removed
12 - * Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
17 +* Avoids joining evidence table for listing/preview
18 +* Updated when evidence is added/removed
19 +* Format: `[{"text": "...", "source": "...", "relevance": 0.95}, ...]`
13 13  * **source_names** (TEXT[]): Array of source names for quick display
14 - * Avoids joining through evidence to sources
15 - * Updated when sources change
16 - * Format: `["New York Times", "Nature Journal", ...]`
21 +* Avoids joining through evidence to sources
22 +* Updated when sources change
23 +* Format: `["New York Times", "Nature Journal", ...]`
17 17  * **scenario_count** (INTEGER): Number of scenarios for this claim
18 - * Quick metric without counting rows
19 - * Updated when scenarios added/removed
25 +* Quick metric without counting rows
26 +* Updated when scenarios added/removed
20 20  * **cache_updated_at** (TIMESTAMP): When denormalized data was last refreshed
21 - * Helps invalidate stale caches
22 - * Triggers background refresh if too old
28 +* Helps invalidate stale caches
29 +* Triggers background refresh if too old
23 23  **Update Strategy**:
24 24  * **Immediate**: Update on claim edit (user-facing)
25 25  * **Deferred**: Update via background job every hour (non-critical)
... ... @@ -28,13 +28,18 @@
28 28  * ✅ 70% fewer joins on common queries
29 29  * ✅ Much faster claim list/search pages
30 30  * ✅ Better user experience
31 -* ⚠️ Small storage increase (~10%)
38 +* ⚠️ Small storage increase (10%)
32 32  * ⚠️ Need to keep caches in sync
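The cache-maintenance trade-off above can be sketched as a small helper that rebuilds the denormalized columns whenever evidence or scenarios change. This is an illustrative sketch only; the function name and dict shapes are assumptions, not the actual schema:

{{code language="python"}}
import json
from datetime import datetime, timezone

def build_claim_cache(evidence_rows, scenario_count):
    """Rebuild the denormalized claim columns described above.

    evidence_rows: list of dicts with "text", "source", "relevance".
    In production the result would feed an UPDATE on the claims table.
    """
    # evidence_summary: top 5 snippets by relevance, stored as JSONB
    top5 = sorted(evidence_rows, key=lambda e: e["relevance"], reverse=True)[:5]
    return {
        "evidence_summary": json.dumps(
            [{"text": e["text"], "source": e["source"], "relevance": e["relevance"]}
             for e in top5]
        ),
        # source_names: deduplicated names for quick display
        "source_names": sorted({e["source"] for e in evidence_rows}),
        "scenario_count": scenario_count,
        "cache_updated_at": datetime.now(timezone.utc).isoformat(),
    }
{{/code}}

The same helper serves both update paths: called synchronously on claim edits (immediate) and from the hourly background job (deferred).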
40 +
33 33  === 1.2 Evidence ===
42 +
34 34  Fields: claim_id, source_id, excerpt, url, relevance_score, supports
44 +
35 35  === 1.3 Source ===
46 +
36 36  **Purpose**: Track reliability of information sources over time
37 37  **Fields**:
49 +
38 38  * **id** (UUID): Unique identifier
39 39  * **name** (text): Source name (e.g., "New York Times", "Nature Journal")
40 40  * **domain** (text): Website domain (e.g., "nytimes.com")
... ... @@ -52,17 +52,21 @@
52 52  **See**: SOURCE Track Record System documentation for complete details on calculation, updates, and usage
53 53  Fields: id, name, domain, **track_record_score**, **accuracy_history**, **correction_frequency**
54 54  **Key**: Automated source reliability tracking
67 +
55 55  ==== Source Scoring Process (Separation of Concerns) ====
69 +
56 56  **Critical design principle**: Prevent circular dependencies between source scoring and claim analysis.
57 57  **The Problem**:
72 +
58 58  * Source scores should influence claim verdicts
59 59  * Claim verdicts should update source scores
60 60  * But: Direct feedback creates circular dependency and potential feedback loops
61 61  **The Solution**: Temporal separation
77 +
62 62  ==== Weekly Background Job (Source Scoring) ====
79 +
63 63  Runs independently of claim analysis:
64 -{{code language="python"}}
65 -def update_source_scores_weekly():
81 +{{code language="python"}}def update_source_scores_weekly():
66 66   """
67 67   Background job: Calculate source reliability
68 68   Never triggered by individual claim analysis
... ... @@ -82,12 +82,12 @@
82 82   source.last_updated = now()
83 83   source.save()
84 84   # Job runs: Sunday 2 AM UTC
85 - # Never during claim processing
86 -{{/code}}
101 + # Never during claim processing{{/code}}
102 +
87 87  ==== Real-Time Claim Analysis (AKEL) ====
104 +
88 88  Uses source scores but never updates them:
89 -{{code language="python"}}
90 -def analyze_claim(claim_text):
106 +{{code language="python"}}def analyze_claim(claim_text):
91 91   """
92 92   Real-time: Analyze claim using current source scores
93 93   READ source scores, never UPDATE them
... ... @@ -104,10 +104,12 @@
104 104   verdict = synthesize_verdict(evidence_list)
105 105   # NEVER update source scores here
106 106   # That happens in weekly background job
107 - return verdict
108 -{{/code}}
123 + return verdict{{/code}}
124 +
109 109  ==== Monthly Audit (Quality Assurance) ====
126 +
110 110  Moderator review of flagged source scores:
128 +
111 111  * Verify scores make sense
112 112  * Detect gaming attempts
113 113  * Identify systematic biases
... ... @@ -147,18 +147,19 @@
147 147   → NYT score: 0.89 (trending up)
148 148   → Blog X score: 0.48 (trending down)
149 149  ```
168 +
150 150  === 1.4 Scenario ===
170 +
151 151  **Purpose**: Different interpretations or contexts for evaluating claims
152 152  **Key Concept**: Scenarios are extracted from evidence, not generated arbitrarily. Each scenario represents a specific context, assumption set, or condition under which a claim should be evaluated.
153 153  **Relationship**: One-to-many with Claims (**simplified for V1.0**: scenario belongs to a single claim)
154 154  **Fields**:
175 +
155 155  * **id** (UUID): Unique identifier
156 156  * **claim_id** (UUID): Foreign key to claim (one-to-many)
157 157  * **description** (text): Human-readable description of the scenario
158 158  * **assumptions** (JSONB): Key assumptions that define this scenario context
159 159  * **extracted_from** (UUID): Reference to evidence that this scenario was extracted from
160 -* **verdict_summary** (text): Compiled verdict for this scenario
161 -* **confidence** (decimal 0-1): Confidence level for verdict in this scenario
162 162  * **created_at** (timestamp): When scenario was created
163 163  * **updated_at** (timestamp): Last modification
164 164  **How Found**: Evidence search → Extract context → Create scenario → Link to claim
... ... @@ -168,13 +168,48 @@
168 168  * Scenario 2: "Real-world data (diverse population, Omicron variant)" from hospital data
169 169  * Scenario 3: "Immunocompromised patients" from specialist study
170 170  **V2.0 Evolution**: Many-to-many relationship can be added if users request cross-claim scenario sharing. For V1.0, keeping scenarios tied to single claims simplifies queries and reduces complexity without limiting functionality.
171 -=== 1.5 User ===
190 +
191 +=== 1.5 Verdict ===
192 +
193 +**Purpose**: Assessment of a claim within a specific scenario context. Each verdict provides a conclusion about whether the claim is supported, refuted, or uncertain given the scenario's assumptions and available evidence.
194 +
195 +**Core Fields**:
196 +
197 +* **id** (UUID): Primary key
198 +* **scenario_id** (UUID FK): The scenario being assessed
199 +* **likelihood_range** (text): Probabilistic assessment (e.g., "0.40-0.65 (uncertain)", "0.75-0.85 (likely true)")
200 +* **confidence** (decimal 0-1): How confident we are in this assessment
201 +* **explanation_summary** (text): Human-readable reasoning explaining the verdict
202 +* **uncertainty_factors** (text array): Specific factors limiting confidence (e.g., "Small sample sizes", "Lifestyle confounds", "Long-term effects unknown")
203 +* **created_at** (timestamp): When verdict was created
204 +* **updated_at** (timestamp): Last modification
205 +
206 +**Change Tracking**: Like all entities, verdict changes are tracked through the Edit entity (section 1.7), not through separate version tables. Each edit records before/after states.
207 +
208 +**Relationship**: Each Scenario has one Verdict. When understanding evolves, the verdict is updated and the change is logged in the Edit entity.
209 +
210 +**Example**:
211 +For claim "Exercise improves mental health" in scenario "Clinical trials (healthy adults, structured programs)":
212 +
213 +* Initial state: likelihood_range="0.40-0.65 (uncertain)", uncertainty_factors=["Small sample sizes", "Short-term studies only"]
214 +* After new evidence: likelihood_range="0.70-0.85 (likely true)", uncertainty_factors=["Lifestyle confounds remain"]
215 +* Edit entity records the complete before/after change with timestamp and reason
216 +
217 +**Key Design**: Verdicts are mutable entities tracked through the centralized Edit entity, consistent with Claims, Evidence, and Scenarios.
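A minimal sketch of the Verdict entity, with a revision helper that yields the before/after snapshots the Edit entity records. Field names follow the spec above; the Python types and the `revise` helper are assumptions for illustration:

{{code language="python"}}
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

def _now():
    return datetime.now(timezone.utc)

@dataclass
class Verdict:
    # Core fields from section 1.5; Python types are assumptions
    scenario_id: str
    likelihood_range: str            # e.g. "0.40-0.65 (uncertain)"
    confidence: float                # 0-1
    explanation_summary: str
    uncertainty_factors: list = field(default_factory=list)
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)

    def revise(self, **changes):
        """Mutate the verdict in place and return (before, after)
        snapshots for the Edit entity (section 1.7)."""
        before = {k: getattr(self, k) for k in changes}
        for k, v in changes.items():
            setattr(self, k, v)
        self.updated_at = _now()
        return before, dict(changes)
{{/code}}

This mirrors the example above: revising the likelihood range after new evidence produces exactly the before/after pair the Edit entity logs.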
218 +
219 +=== 1.6 User ===
220 +
172 172  Fields: username, email, **role** (Reader/Contributor/Moderator), **reputation**, contributions_count
173 -=== User Reputation System ==
222 +
223 +=== User Reputation System ===
224 +
174 174  **V1.0 Approach**: Simple manual role assignment
175 175  **Rationale**: Complex reputation systems aren't needed until 100+ active contributors demonstrate the need for automated reputation management. Start simple, add complexity when metrics prove necessary.
227 +
176 176  === Roles (Manual Assignment) ===
229 +
177 177  **reader** (default):
231 +
178 178  * View published claims and evidence
179 179  * Browse and search content
180 180  * No editing permissions
... ... @@ -193,8 +193,11 @@
193 193  * System configuration
194 194  * Access to all features
195 195  * Founder-appointed initially
250 +
196 196  === Contribution Tracking (Simple) ===
252 +
197 197  **Basic metrics only**:
254 +
198 198  * `contributions_count`: Total number of contributions
199 199  * `created_at`: Account age
200 200  * `last_active`: Recent activity
... ... @@ -203,19 +203,26 @@
203 203  * No automated privilege escalation
204 204  * No reputation decay
205 205  * No threshold-based promotions
263 +
206 206  === Promotion Process ===
265 +
207 207  **Manual review by moderators/admins**:
267 +
208 208  1. User demonstrates value through contributions
209 209  2. Moderator reviews user's contribution history
210 210  3. Moderator promotes user to contributor role
211 211  4. Admin promotes trusted contributors to moderator
212 212  **Criteria** (guidelines, not automated):
273 +
213 213  * Quality of contributions
214 214  * Consistency over time
215 215  * Collaborative behavior
216 216  * Understanding of project goals
278 +
217 217  === V2.0+ Evolution ===
280 +
218 218  **Add complex reputation when**:
282 +
219 219  * 100+ active contributors
220 220  * Manual role management becomes bottleneck
221 221  * Clear patterns of abuse emerge requiring automation
... ... @@ -224,12 +224,17 @@
224 224  * Threshold-based promotions
225 225  * Reputation decay for inactive users
226 226  * Track record scoring for contributors
227 -See [[When to Add Complexity>>FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
228 -=== 1.6 Edit ===
291 +See [[When to Add Complexity>>Test.FactHarbor.Specification.When-to-Add-Complexity]] for triggers.
292 +
293 +=== 1.7 Edit ===
294 +
229 229  **Fields**: entity_type, entity_id, user_id, before_state (JSON), after_state (JSON), edit_type, reason, created_at
230 230  **Purpose**: Complete audit trail for all content changes
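Building an Edit row from before/after states might look like the sketch below. Field names follow the spec; JSON serialization and the `CONTENT_UPDATE` default are assumptions for illustration (the spec lists `MODERATION_ACTION` and `REVERT` among the edit types):

{{code language="python"}}
import json
from datetime import datetime, timezone

def record_edit(entity_type, entity_id, user_id, before_state, after_state,
                edit_type="CONTENT_UPDATE", reason=""):
    """Build an Edit row capturing full before/after snapshots.

    before_state/after_state are plain dicts; they are serialized to
    JSON as per the Edit entity's field definitions.
    """
    return {
        "entity_type": entity_type,
        "entity_id": entity_id,
        "user_id": user_id,
        "before_state": json.dumps(before_state),
        "after_state": json.dumps(after_state),
        "edit_type": edit_type,
        "reason": reason,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
{{/code}}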
297 +
231 231  === Edit History Details ===
299 +
232 232  **What Gets Edited**:
301 +
233 233  * **Claims** (20% edited): assertion, domain, status, scores, analysis
234 234  * **Evidence** (10% edited): excerpt, relevance_score, supports
235 235  * **Scenarios** (5% edited): description, assumptions, confidence
... ... @@ -246,12 +246,14 @@
246 246  * `MODERATION_ACTION`: Hide/unhide for abuse
247 247  * `REVERT`: Rollback to previous version
248 248  **Retention Policy** (5 years total):
318 +
249 249  1. **Hot storage** (3 months): PostgreSQL, instant access
250 250  2. **Warm storage** (2 years): Partitioned, slower queries
251 251  3. **Cold storage** (3 years): S3 compressed, download required
252 252  4. **Deletion**: After 5 years (except legal holds)
253 -**Storage per 1M claims**: ~400 MB (20% edited × 2 KB per edit)
323 +**Storage per 1M claims**: 400 MB (20% edited × 2 KB per edit)
254 254  **Use Cases**:
325 +
255 255  * View claim history timeline
256 256  * Detect vandalism patterns
257 257  * Learn from user corrections (system improvement)
... ... @@ -258,12 +258,17 @@
258 258  * Legal compliance (audit trail)
259 259  * Rollback capability
260 260  See **Edit History Documentation** for complete details on what gets edited by whom, retention policy, and use cases
261 -=== 1.7 Flag ===
332 +
333 +=== 1.8 Flag ===
334 +
262 262  Fields: entity_id, reported_by, issue_type, status, resolution_note
263 -=== 1.8 QualityMetric ===
336 +
337 +=== 1.9 QualityMetric ===
338 +
264 264  **Fields**: metric_type, category, value, target, timestamp
265 265  **Purpose**: Time-series quality tracking
266 266  **Usage**:
342 +
267 267  * **Continuous monitoring**: Hourly calculation of error rates, confidence scores, processing times
268 268  * **Quality dashboard**: Real-time display with trend charts
269 269  * **Alerting**: Automatic alerts when metrics exceed thresholds
... ... @@ -270,19 +270,31 @@
270 270  * **A/B testing**: Compare control vs treatment metrics
271 271  * **Improvement validation**: Measure before/after changes
272 272  **Example**: `{type: "ErrorRate", category: "Politics", value: 0.12, target: 0.10, timestamp: "2025-12-17"}`
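The alerting rule reduces to a threshold check per metric record. This sketch assumes higher-is-worse metrics such as ErrorRate; a higher-is-better metric would invert the comparison:

{{code language="python"}}
def check_metric(metric):
    """Return True when a QualityMetric breaches its target and
    should trigger an alert (assumes higher value = worse)."""
    return metric["value"] > metric["target"]

sample = {"type": "ErrorRate", "category": "Politics", "value": 0.12,
          "target": 0.10, "timestamp": "2025-12-17"}
{{/code}}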
273 -=== 1.9 ErrorPattern ===
349 +
350 +=== 1.10 ErrorPattern ===
351 +
274 274  **Fields**: error_category, claim_id, description, root_cause, frequency, status
275 275  **Purpose**: Capture errors to trigger system improvements
276 276  **Usage**:
355 +
277 277  * **Error capture**: When users flag issues or system detects problems
278 278  * **Pattern analysis**: Weekly grouping by category and frequency
279 279  * **Improvement workflow**: Analyze → Fix → Test → Deploy → Re-process → Monitor
280 280  * **Metrics**: Track error rate reduction over time
281 281  **Example**: `{category: "WrongSource", description: "Unreliable tabloid cited", root_cause: "No quality check", frequency: 23, status: "Fixed"}`
361 +
362 +== 1.11 Core Data Model ERD ==
363 +
364 +{{include reference="Test.FactHarbor.Specification.Diagrams.Core Data Model ERD.WebHome"/}}
365 +
282 282  == 1.12 User Class Diagram ==
283 -{{include reference="FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
367 +
368 +{{include reference="Test.FactHarbor.Specification.Diagrams.User Class Diagram.WebHome"/}}
369 +
284 284  == 2. Versioning Strategy ==
371 +
285 285  **All Content Entities Are Versioned**:
373 +
286 286  * **Claim**: Every edit creates new version (V1→V2→V3...)
287 287  * **Evidence**: Changes tracked in edit history
288 288  * **Scenario**: Modifications versioned
... ... @@ -303,68 +303,91 @@
303 303  Claim V2: "The sky is blue during daytime"
304 304   → EDIT table stores: {before: "The sky is blue", after: "The sky is blue during daytime"}
305 305  ```
394 +
306 306  == 2.5 Storage vs Computation Strategy ==
396 +
307 307  **Critical architectural decision**: What to persist in databases vs compute dynamically?
308 308  **Trade-off**:
399 +
309 309  * **Store more**: Better reproducibility, faster, lower LLM costs | Higher storage/maintenance costs
310 310  * **Compute more**: Lower storage/maintenance costs | Slower, higher LLM costs, less reproducible
402 +
311 311  === Recommendation: Hybrid Approach ===
404 +
312 312  **STORE (in PostgreSQL):**
406 +
313 313  ==== Claims (Current State + History) ====
408 +
314 314  * **What**: assertion, domain, status, created_at, updated_at, version
315 315  * **Why**: Core entity, must be persistent
316 316  * **Also store**: confidence_score (computed once, then cached)
317 -* **Size**: ~1 KB per claim
412 +* **Size**: 1 KB per claim
318 318  * **Growth**: Linear with claims
319 319  * **Decision**: ✅ STORE - Essential
415 +
320 320  ==== Evidence (All Records) ====
417 +
321 321  * **What**: claim_id, source_id, excerpt, url, relevance_score, supports, extracted_at
322 322  * **Why**: Hard to re-gather, user contributions, reproducibility
323 -* **Size**: ~2 KB per evidence (with excerpt)
420 +* **Size**: 2 KB per evidence (with excerpt)
324 324  * **Growth**: 3-10 evidence per claim
325 325  * **Decision**: ✅ STORE - Essential for reproducibility
423 +
326 326  ==== Sources (Track Records) ====
425 +
327 327  * **What**: name, domain, track_record_score, accuracy_history, correction_frequency
328 328  * **Why**: Continuously updated, expensive to recompute
329 -* **Size**: ~500 bytes per source
428 +* **Size**: 500 bytes per source
330 330  * **Growth**: Slow (limited number of sources)
331 331  * **Decision**: ✅ STORE - Essential for quality
431 +
332 332  ==== Edit History (All Versions) ====
433 +
333 333  * **What**: before_state, after_state, user_id, reason, timestamp
334 334  * **Why**: Audit trail, legal requirement, reproducibility
335 -* **Size**: ~2 KB per edit
336 -* **Growth**: Linear with edits (~A portion of claims get edited)
436 +* **Size**: 2 KB per edit
437 +* **Growth**: Linear with edits (a portion of claims get edited)
337 337  * **Retention**: Hot storage 3 months → Warm storage 2 years → Archive to S3 3 years → Delete after 5 years total
338 338  * **Decision**: ✅ STORE - Essential for accountability
440 +
339 339  ==== Flags (User Reports) ====
442 +
340 340  * **What**: entity_id, reported_by, issue_type, description, status
341 341  * **Why**: Error detection, system improvement triggers
342 -* **Size**: ~500 bytes per flag
445 +* **Size**: 500 bytes per flag
343 343  * **Growth**: 5% or more of claims get flagged
344 344  * **Decision**: ✅ STORE - Essential for improvement
448 +
345 345  ==== ErrorPatterns (System Improvement) ====
450 +
346 346  * **What**: error_category, claim_id, description, root_cause, frequency, status
347 347  * **Why**: Learning loop, prevent recurring errors
348 -* **Size**: ~1 KB per pattern
453 +* **Size**: 1 KB per pattern
349 349  * **Growth**: Slow (limited patterns, many fixed)
350 350  * **Decision**: ✅ STORE - Essential for learning
456 +
351 351  ==== QualityMetrics (Time Series) ====
458 +
352 352  * **What**: metric_type, category, value, target, timestamp
353 353  * **Why**: Trend analysis, cannot recreate historical metrics
354 -* **Size**: ~200 bytes per metric
461 +* **Size**: 200 bytes per metric
355 355  * **Growth**: Hourly = 8,760 per year per metric type
356 356  * **Retention**: 2 years hot, then aggregate and archive
357 357  * **Decision**: ✅ STORE - Essential for monitoring
358 358  **STORE (Computed Once, Then Cached):**
466 +
359 359  ==== Analysis Summary ====
468 +
360 360  * **What**: Neutral text summary of claim analysis (200-500 words)
361 361  * **Computed**: Once by AKEL when claim first analyzed
362 362  * **Stored in**: Claim table (text field)
363 363  * **Recomputed**: Only when system significantly improves OR claim edited
364 364  * **Why store**: Expensive to regenerate ($0.01-0.05 per analysis), doesn't change often
365 -* **Size**: ~2 KB per claim
474 +* **Size**: 2 KB per claim
366 366  * **Decision**: ✅ STORE (cached) - Cost-effective
476 +
367 367  ==== Confidence Score ====
478 +
368 368  * **What**: 0-100 score of analysis confidence
369 369  * **Computed**: Once by AKEL
370 370  * **Stored in**: Claim table (integer field)
... ... @@ -372,7 +372,9 @@
372 372  * **Why store**: Cheap to store, expensive to compute, users need it fast
373 373  * **Size**: 4 bytes per claim
374 374  * **Decision**: ✅ STORE (cached) - Performance critical
486 +
375 375  ==== Risk Score ====
488 +
376 376  * **What**: 0-100 score of claim risk level
377 377  * **Computed**: Once by AKEL
378 378  * **Stored in**: Claim table (integer field)
... ... @@ -381,13 +381,17 @@
381 381  * **Size**: 4 bytes per claim
382 382  * **Decision**: ✅ STORE (cached) - Performance critical
383 383  **COMPUTE DYNAMICALLY (Never Store):**
384 -==== Scenarios ==== ⚠️ CRITICAL DECISION
497 +
498 +==== Scenarios ====
499 +
500 +**⚠️ CRITICAL DECISION**
501 +
385 385  * **What**: 2-5 possible interpretations of claim with assumptions
386 386  * **Current design**: Stored in Scenario table
387 387  * **Alternative**: Compute on-demand when user views claim details
388 -* **Storage cost**: ~1 KB per scenario × 3 scenarios average = ~3 KB per claim
505 +* **Storage cost**: 1 KB per scenario × 3 scenarios average = 3 KB per claim
389 389  * **Compute cost**: $0.005-0.01 per request (LLM API call)
390 -* **Frequency**: Viewed in detail by ~20% of users
507 +* **Frequency**: Viewed in detail by 20% of users
391 391  * **Trade-off analysis**:
392 392   - IF STORED: 1M claims × 3 KB = 3 GB storage, $0.05/month, fast access
393 393   - IF COMPUTED: 1M claims × 20% views × $0.01 = $2,000/month in LLM costs
... ... @@ -395,12 +395,17 @@
395 395  * **Speed**: Computed = 5-8 seconds delay, Stored = instant
396 396  * **Decision**: ✅ STORE (hybrid approach below)
397 397  **Scenario Strategy** (APPROVED):
515 +
398 398  1. **Store scenarios** initially when claim analyzed
399 399  2. **Mark as stale** when system improves significantly
400 400  3. **Recompute on next view** if marked stale
401 401  4. **Cache for 30 days** if frequently accessed
402 402  5. **Result**: Best of both worlds - speed + freshness
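The approved strategy (store, mark stale, recompute on next view, cache 30 days) can be sketched as a lazy read path. The dict shape and names are illustrative, and `recompute` stands in for the expensive LLM call:

{{code language="python"}}
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # step 4: cache for 30 days

def get_scenarios(claim, recompute):
    """Serve stored scenarios, recomputing lazily when stale.

    `claim` is an illustrative dict with "scenarios",
    "scenarios_stale" and "scenarios_cached_at" keys; these names
    are assumptions, not the actual schema.
    """
    now = datetime.now(timezone.utc)
    expired = now - claim["scenarios_cached_at"] > CACHE_TTL
    if claim["scenarios_stale"] or expired:
        claim["scenarios"] = recompute(claim)   # step 3: recompute on next view
        claim["scenarios_stale"] = False
        claim["scenarios_cached_at"] = now
    return claim["scenarios"]
{{/code}}

Readers pay the LLM cost only on the first view after a system improvement; every other view is a plain table read.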
403 -==== Verdict Synthesis ====
521 +
522 +==== Verdict Synthesis ====
523 +
524 +
404 404  * **What**: Final conclusion text synthesizing all scenarios
405 405  * **Compute cost**: $0.002-0.005 per request
406 406  * **Frequency**: Every time claim viewed
... ... @@ -408,17 +408,23 @@
408 408  * **Speed**: 2-3 seconds (acceptable)
409 409  **Alternative**: Store "last verdict" as cached field, recompute only if claim edited or marked stale
410 410  * **Recommendation**: ✅ STORE cached version, mark stale when changes occur
533 +
411 411  ==== Search Results ====
535 +
412 412  * **What**: Lists of claims matching search query
413 413  * **Compute from**: Elasticsearch index
414 414  * **Cache**: 15 minutes in Redis for popular queries
415 415  * **Why not store permanently**: Constantly changing, infinite possible queries
540 +
416 416  ==== Aggregated Statistics ====
542 +
417 417  * **What**: "Total claims: 1,234,567", "Average confidence: 78%", etc.
418 418  * **Compute from**: Database queries
419 419  * **Cache**: 1 hour in Redis
420 420  * **Why not store**: Can be derived, relatively cheap to compute
547 +
421 421  ==== User Reputation ====
549 +
422 422  * **What**: Score based on contributions
423 423  * **Current design**: Stored in User table
424 424  * **Alternative**: Compute from Edit table
... ... @@ -428,37 +428,43 @@
428 428  * **Frequency**: Read on every user action
429 429  * **Compute cost**: Simple COUNT query, milliseconds
430 430  * **Decision**: ✅ STORE - Performance critical, read-heavy
559 +
431 431  === Summary Table ===
432 -| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
433 -|-----------|---------|---------|----------------|----------|-----------|
434 -| Claim core | ✅ | - | 1 KB | STORE | Essential |
435 -| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
436 -| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
437 -| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
438 -| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
439 -| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
440 -| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
441 -| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
442 -| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
443 -| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
444 -| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
445 -| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
446 -| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
561 +
562 +| Data Type | Storage | Compute | Size per Claim | Decision | Rationale |
563 +|-----------|---------|---------|----------------|----------|-----------|
564 +| Claim core | ✅ | - | 1 KB | STORE | Essential |
565 +| Evidence | ✅ | - | 2 KB × 5 = 10 KB | STORE | Reproducibility |
566 +| Sources | ✅ | - | 500 B (shared) | STORE | Track record |
567 +| Edit history | ✅ | - | 2 KB × 20% = 400 B avg | STORE | Audit |
568 +| Analysis summary | ✅ | Once | 2 KB | STORE (cached) | Cost-effective |
569 +| Confidence score | ✅ | Once | 4 B | STORE (cached) | Fast access |
570 +| Risk score | ✅ | Once | 4 B | STORE (cached) | Fast access |
571 +| Scenarios | ✅ | When stale | 3 KB | STORE (hybrid) | Balance cost/speed |
572 +| Verdict | ✅ | When stale | 1 KB | STORE (cached) | Fast access |
573 +| Flags | ✅ | - | 500 B × 10% = 50 B avg | STORE | Improvement |
574 +| ErrorPatterns | ✅ | - | 1 KB (global) | STORE | Learning |
575 +| QualityMetrics | ✅ | - | 200 B (time series) | STORE | Trending |
576 +| Search results | - | ✅ | - | COMPUTE + 15min cache | Dynamic |
447 447  | Aggregations | - | ✅ | - | COMPUTE + 1hr cache | Derivable |
448 -**Total storage per claim**: ~18 KB (without edits and flags)
578 +**Total storage per claim**: 18 KB (without edits and flags)
449 449  **For 1 million claims**:
450 -* **Storage**: ~18 GB (manageable)
451 -* **PostgreSQL**: ~$50/month (standard instance)
452 -* **Redis cache**: ~$20/month (1 GB instance)
453 -* **S3 archives**: ~$5/month (old edits)
454 -* **Total**: ~$75/month infrastructure
580 +
581 +* **Storage**: 18 GB (manageable)
582 +* **PostgreSQL**: $50/month (standard instance)
583 +* **Redis cache**: $20/month (1 GB instance)
584 +* **S3 archives**: $5/month (old edits)
585 +* **Total**: $75/month infrastructure
455 455  **LLM cost savings by caching**:
456 456  * Analysis summary stored: Save $0.03 per claim = $30K per 1M claims
457 457  * Scenarios stored: Save $0.01 per claim × 20% views = $2K per 1M claims
458 458  * Verdict stored: Save $0.003 per claim = $3K per 1M claims
459 -* **Total savings**: ~$35K per 1M claims vs recomputing every time
590 +* **Total savings**: $35K per 1M claims vs recomputing every time
591 +
460 460  === Recomputation Triggers ===
593 +
461 461  **When to mark cached data as stale and recompute:**
595 +
462 462  1. **User edits claim** → Recompute: all (analysis, scenarios, verdict, scores)
463 463  2. **Evidence added** → Recompute: scenarios, verdict, confidence score
464 464  3. **Source track record changes >10 points** → Recompute: confidence score, verdict
... ... @@ -465,11 +465,15 @@
465 465  4. **System improvement deployed** → Mark affected claims stale, recompute on next view
466 466  5. **Controversy detected** (high flag rate) → Recompute: risk score
467 467  **Recomputation strategy**:
602 +
468 468  * **Eager**: Immediately recompute (for user edits)
469 469  * **Lazy**: Recompute on next view (for system improvements)
470 470  * **Batch**: Nightly re-evaluation of stale claims (if <1000)
606 +
471 471  === Database Size Projection ===
608 +
472 472  **Year 1**: 10K claims
610 +
473 473  * Storage: 180 MB
474 474  * Cost: $10/month
475 475  **Year 3**: 100K claims
... ... @@ -483,15 +483,21 @@
483 483  * Cost: $300/month
484 484  * Optimization: Archive old claims to S3 ($5/TB/month)
485 485  **Conclusion**: Storage costs are manageable, LLM cost savings are substantial.
624 +
486 486  == 2.6 Key Simplifications ==
626 +
487 487  * **Two content states only**: Published, Hidden
488 488  * **Three user roles only**: Reader, Contributor, Moderator
489 489  * **No complex versioning**: Linear edit history
490 490  * **Reputation-based permissions**: Not role hierarchy
491 491  * **Source track records**: Continuous evaluation
632 +
492 492  == 3. What Gets Stored in the Database ==
634 +
493 493  === 3.1 Primary Storage (PostgreSQL) ===
636 +
494 494  **Claims Table**:
638 +
495 495  * Current state only (latest version)
496 496  * Fields: id, assertion, domain, status, confidence_score, risk_score, completeness_score, version, created_at, updated_at
497 497  **Evidence Table**:
... ... @@ -518,31 +518,44 @@
518 518  **QualityMetric Table**:
519 519  * Time-series quality data
520 520  * Fields: id, metric_type, metric_category, value, target, timestamp
665 +
521 521  === 3.2 What's NOT Stored (Computed on-the-fly) ===
667 +
522 522  * **Verdicts**: Full synthesis recomputed from evidence + scenarios only when the cached verdict is marked stale
523 523  * **Risk scores**: Recalculated from current factors when a recomputation trigger fires (cached otherwise)
524 524  * **Aggregated statistics**: Computed from base data
525 525  * **Search results**: Generated from Elasticsearch index
672 +
526 526  === 3.3 Cache Layer (Redis) ===
674 +
527 527  **Cached for performance**:
676 +
528 528  * Frequently accessed claims (TTL: 1 hour)
529 529  * Search results (TTL: 15 minutes)
530 530  * User sessions (TTL: 24 hours)
531 531  * Source track records (TTL: 1 hour)
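The cache layer follows a standard cache-aside pattern. The sketch below uses only `get`/`setex` (both provided by the redis-py client), so a dict-backed stub works for testing; `db_fetch` and the key format are illustrative assumptions:

{{code language="python"}}
# TTLs from the list above, in seconds
TTLS = {"claim": 3600, "search": 900, "session": 86400, "track_record": 3600}

def get_claim(cache, db_fetch, claim_id):
    """Cache-aside read: try the cache first, fall back to the database.

    `cache` needs only get/setex; `db_fetch` stands in for the
    PostgreSQL query.
    """
    key = f"claim:{claim_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = db_fetch(claim_id)
    cache.setex(key, TTLS["claim"], value)  # expire after 1 hour
    return value
{{/code}}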
681 +
532 532  === 3.4 File Storage (S3) ===
683 +
533 533  **Archived content**:
685 +
534 534  * Old edit history (>3 months)
535 535  * Evidence documents (archived copies)
536 536  * Database backups
537 537  * Export files
690 +
538 538  === 3.5 Search Index (Elasticsearch) ===
692 +
539 539  **Indexed for search**:
694 +
540 540  * Claim assertions (full-text)
541 541  * Evidence excerpts (full-text)
542 542  * Scenario descriptions (full-text)
543 543  * Source names (autocomplete)
544 544  Synchronized from PostgreSQL via change data capture or periodic sync.
700 +
545 545  == 4. Related Pages ==
546 -* [[Architecture>>FactHarbor.Specification.Architecture.WebHome]]
547 -* [[Requirements>>FactHarbor.Specification.Requirements.WebHome]]
548 -* [[Workflows>>FactHarbor.Specification.Workflows.WebHome]]
702 +
703 +* [[Architecture>>FactHarbor.Archive.FactHarbor delta for V0\.9\.70.Specification.Architecture.WebHome]]
704 +* [[Requirements>>Test.FactHarbor.Specification.Requirements.WebHome]]
705 +* [[Workflows>>Test.FactHarbor.Specification.Workflows.WebHome]]