---
name: data-modeling
description: "Knowledge contract for defining database schemas, partition keys, indexes, query patterns, denormalization strategy, TTL/caching, and data ownership. Referenced by design-architecture when designing data models."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing database schemas and data models.

## Core Principles

- Data models must be driven by query and write patterns, not theoretical purity
- Each table or collection must serve a clear purpose traced to a PRD requirement
- Indexes must be justified by identified query patterns
- Data ownership must be unambiguous: each data item belongs to exactly one service

## Table Definitions

For each table or collection, define:

- Table name and purpose (traced to a PRD requirement)
- Column definitions:
  - Name
  - Data type
  - Nullable or not null
  - Default value (if any)
  - Constraints (unique, check, etc.)
- Primary key
- Foreign keys and relationships
- Data volume estimates (when relevant for storage selection)

## Index Design

Indexes must be justified by query patterns:

- Identify the queries the table must support
- Design indexes to cover those queries
- Avoid speculative indexes added "just in case"
- Consider write amplification: every index slows writes

Index justification format:

- Index name
- Columns (with sort direction)
- Type (unique, non-unique, partial, composite)
- Query pattern it serves
- Estimated selectivity

## Partition Keys

When designing distributed data stores:

- The partition key must distribute data evenly across nodes
- The partition key should align with the most common access pattern
- Consider hot-partition risk
- Define the partition strategy (hash, range, composite)

## Relationships

Define relationships explicitly:

- One-to-one
- One-to-many (with foreign key placement)
- Many-to-many (with a junction table)

For each relationship, define:

- Direction of access (which side queries the other)
- Cardinality (exactly N, at most N, unbounded)
- Nullability (is the relationship optional?)
- Cascade behavior (what happens on delete?)

## Denormalization Strategy

Denormalize when:

- A query needs data from multiple entities and joins are expensive or unavailable
- Read frequency significantly exceeds write frequency
- The denormalized data has a clear source of truth that can be kept in sync

Do not denormalize when:

- The data changes frequently and consistency is critical
- Joins are cheap and the data store supports them well
- The denormalization creates complex synchronization logic
- There is no clear source of truth

For each denormalized field:

- Identify the source of truth
- Define the synchronization mechanism (eventual consistency, sync on read, sync on write)
- Define the staleness tolerance

## TTL and Caching

### TTL (Time-To-Live)

Define a TTL for:

- Ephemeral data (sessions, temporary tokens, idempotency keys)
- Time-bounded data (logs, analytics, expired records)
- Data that must be purged after a regulatory retention period

For each TTL, define:

- Duration and basis (absolute time, sliding window, last access)
- Action on expiration (delete, archive, revoke)

### Caching

Define caching for:

- Frequently read, rarely written data
- Computed aggregates that are expensive to recalculate
- Data that is accessed across service boundaries

For each cache, define:

- Cache type (in-process, distributed, CDN)
- Invalidation strategy (TTL-based, event-based, write-through)
- Staleness tolerance
- Cache-miss behavior (stale-while-recompute, block-and-fetch)

## Data Ownership

Each piece of data must have exactly one owner:

- The owning service is the single source of truth
- Other services access that data via the owner's API or events
- No service reads directly from another service's data store
- If data is needed in multiple places, replicate it via events with a clear source of truth

Data ownership format:

| Data Item | Owning Service | Access Pattern | Replication Strategy |
|-----------|----------------|----------------|----------------------|
| ...       | ...            | ...            | ...                  |

## Query Pattern Analysis

For each table, document:

- Primary query patterns (which columns or keys the data is accessed by)
- Write patterns (insert-heavy, update-heavy, or mixed)
- Read-to-write ratio (when relevant)
- Consistency requirements (strong, eventual, or tunable)
- Scale expectations (rows per day, total rows, growth rate)

This analysis drives:

- Index selection
- Partition key selection
- Storage engine selection
- Denormalization decisions

## Anti-Patterns

- Tables without a clear PRD requirement
- Indexes without a documented query pattern
- Shared tables across service boundaries
- Premature denormalization without a read/write justification
- Missing foreign key constraints where referential integrity is required
- Data models that assume a specific storage engine without justification
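## Worked Example

The table-definition, index, and relationship rules above can be sketched concretely. The following is a minimal, illustrative example using SQLite (via Python's standard `sqlite3` module); the table names, columns, and the "list a user's open orders" query pattern are hypothetical, not prescribed by this contract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Purpose: registered users (would trace to a PRD requirement).
conn.execute("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL UNIQUE,                    -- constraint: unique
        created_at TEXT NOT NULL DEFAULT (datetime('now'))  -- default value
    )
""")

# Purpose: orders placed by users. One-to-many: the foreign key lives on
# the "many" side; ON DELETE CASCADE is the documented cascade behavior.
conn.execute("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        status  TEXT NOT NULL CHECK (status IN ('open', 'shipped')),
        total   REAL NOT NULL DEFAULT 0.0
    )
""")

# Index justification:
#   name:          idx_orders_user_status
#   columns:       (user_id, status), both ascending
#   type:          non-unique composite
#   query pattern: "list a user's open orders" (executed below)
conn.execute("CREATE INDEX idx_orders_user_status ON orders (user_id, status)")

# Many-to-many (orders <-> tags) via a junction table.
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute("""
    CREATE TABLE order_tags (
        order_id INTEGER NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
        tag_id   INTEGER NOT NULL REFERENCES tags(id)   ON DELETE CASCADE,
        PRIMARY KEY (order_id, tag_id)
    )
""")

conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.execute("INSERT INTO orders (user_id, status) VALUES (1, 'open'), (1, 'shipped')")

# The query pattern that justified the index:
open_orders = conn.execute(
    "SELECT id FROM orders WHERE user_id = ? AND status = 'open'", (1,)
).fetchall()
print(open_orders)  # -> [(1,)]
```

Note how every index maps back to a query the code actually runs, and every relationship states its cardinality and cascade behavior in the DDL itself.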
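The caching rules above (cache type, invalidation strategy, staleness tolerance, miss behavior) can also be sketched. This is an illustrative in-process cache with TTL-based invalidation and block-and-fetch miss behavior; the names `TTLCache` and `load_profile` are hypothetical and not taken from any library.

```python
import time

class TTLCache:
    """In-process cache: TTL-based invalidation, block-and-fetch on miss."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds   # staleness tolerance
        self.clock = clock       # injectable clock, for deterministic tests
        self._store = {}         # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]      # fresh hit: serve from cache
        # Miss or expired: block and fetch from the source of truth.
        value = loader(key)
        self._store[key] = (value, now + self.ttl)
        return value

# Fake clock so the example is deterministic.
t = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: t[0])

calls = []
def load_profile(user_id):       # stands in for a read from the owning store
    calls.append(user_id)
    return {"id": user_id}

cache.get("u1", load_profile)    # miss: loads from the source of truth
cache.get("u1", load_profile)    # hit: served from cache, no load
t[0] = 31.0                      # advance past the TTL
cache.get("u1", load_profile)    # expired: reloads
print(len(calls))                # -> 2
```

The same shape applies to a distributed cache; what changes is where `_store` lives and how invalidation events reach it, which is exactly what the "cache type" and "invalidation strategy" fields capture.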