142 lines
4.8 KiB
Markdown
142 lines
4.8 KiB
Markdown
---
|
|
name: data-modeling
|
|
description: "Knowledge contract for defining database schemas, partition keys, indexes, query patterns, denormalization strategy, TTL/caching, and data ownership. Referenced by design-architecture when designing data models."
|
|
---
|
|
|
|
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing database schemas and data models.
|
|
|
|
## Core Principles
|
|
|
|
- Data models must be driven by query and write patterns, not theoretical purity
|
|
- Each table or collection must serve a clear purpose traced to PRD requirements
|
|
- Indexes must be justified by identified query patterns
|
|
- Data ownership must be unambiguous: each data item belongs to exactly one service
|
|
|
|
## Table Definitions
|
|
|
|
For each table or collection, define:
|
|
- Table name and purpose (traced to PRD requirement)
|
|
- Column definitions:
|
|
- Name
|
|
- Data type
|
|
- Nullable or not null
|
|
- Default value (if any)
|
|
- Constraints (unique, check, etc.)
|
|
- Primary key
|
|
- Foreign keys and relationships
|
|
- Data volume estimates (when relevant for storage selection)
|
|
|
|
## Index Design
|
|
|
|
Indexes must be justified by query patterns:
|
|
- Identify the queries this table must support
|
|
- Design indexes to cover those queries
|
|
- Avoid speculative indexes "just in case"
|
|
- Consider write amplification: every index slows writes
|
|
|
|
Index justification format:
|
|
- Index name
|
|
- Columns (with sort direction)
|
|
- Type (unique, non-unique, partial, composite)
|
|
- Query pattern it serves
|
|
- Estimated selectivity
|
|
|
|
## Partition Keys
|
|
|
|
When designing distributed data stores:
|
|
- Partition key must distribute data evenly across nodes
|
|
- Partition key should align with the most common access pattern
|
|
- Consider hot partition risks
|
|
- Define partition strategy (hash, range, composite)
|
|
|
|
## Relationships
|
|
|
|
Define relationships explicitly:
|
|
- One-to-one
|
|
- One-to-many (with foreign key placement)
|
|
- Many-to-many (with junction table)
|
|
|
|
For each relationship:
|
|
- Direction of access (which side queries the other)
|
|
- Cardinality (exactly N, at most N, unbounded)
|
|
- Nullability (is the relationship optional?)
|
|
- Cascade behavior (what happens on delete?)
|
|
|
|
## Denormalization Strategy
|
|
|
|
Denormalize when:
|
|
- A query needs data from multiple entities and joins are expensive or unavailable
|
|
- Read frequency significantly exceeds write frequency
|
|
- The denormalized data has a clear source of truth that can be kept in sync
|
|
|
|
Do not denormalize when:
|
|
- The data changes frequently and consistency is critical
|
|
- Joins are cheap and the data store supports them well
|
|
- The denormalization creates complex synchronization logic
|
|
- There is no clear source of truth
|
|
|
|
For each denormalized field:
|
|
- Identify the source of truth
|
|
- Define the synchronization mechanism (eventual consistency, sync on read, sync on write)
|
|
- Define the staleness tolerance
|
|
|
|
## TTL and Caching
|
|
|
|
### TTL (Time-To-Live)
|
|
Define TTL for:
|
|
- Ephemeral data (sessions, temporary tokens, idempotency keys)
|
|
- Time-bounded data (logs, analytics, expired records)
|
|
- Data that must be purged after a regulatory period
|
|
|
|
For each TTL:
|
|
- Duration and basis (absolute time, sliding window, last access)
|
|
- Action on expiration (delete, archive, revoke)
|
|
|
|
### Caching
|
|
Define caching for:
|
|
- Frequently read, rarely written data
|
|
- Computed aggregates that are expensive to recalculate
|
|
- Data that is accessed across service boundaries
|
|
|
|
For each cache:
|
|
- Cache type (in-process, distributed, CDN)
|
|
- Invalidation strategy (TTL-based, event-based, write-through)
|
|
- Staleness tolerance
|
|
- Cache miss behavior (stale-while-recompute, block-and-fetch)
|
|
|
|
## Data Ownership
|
|
|
|
Each piece of data must have exactly one owner:
|
|
- The owning service is the single source of truth
|
|
- Other services access that data via the owner's API or events
|
|
- No service reads directly from another service's data store
|
|
- If data is needed in multiple places, replicate via events with a clear source of truth
|
|
|
|
Data ownership format:
|
|
| Data Item | Owning Service | Access Pattern | Replication Strategy |
|
|
|----------|---------------|----------------|---------------------|
|
|
| ... | ... | ... | ... |
|
|
|
|
## Query Pattern Analysis
|
|
|
|
For each table, document:
|
|
- Primary query patterns (by which columns/keys is data accessed)
|
|
- Write patterns (insert-heavy, update-heavy, or mixed)
|
|
- Read-to-write ratio (when relevant)
|
|
- Consistency requirements (strong, eventual, or tunable)
|
|
- Scale expectations (rows per day, rows total, growth rate)
|
|
|
|
This analysis drives:
|
|
- Index selection
|
|
- Partition key selection
|
|
- Storage engine selection
|
|
- Denormalization decisions
|
|
|
|
## Anti-Patterns
|
|
|
|
- Tables without a clear PRD requirement
|
|
- Indexes without a documented query pattern
|
|
- Shared tables across service boundaries
|
|
- Premature denormalization without a read/write justification
|
|
- Missing foreign key constraints where referential integrity is required
|
|
- Data models that assume a specific storage engine without justification |