---
name: data-modeling
description: "Knowledge contract for defining database schemas, partition keys, indexes, query patterns, denormalization strategy, TTL/caching, and data ownership. Referenced by design-architecture when designing data models."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing database schemas and data models.
## Core Principles
- Data models must be driven by query and write patterns, not theoretical purity
- Each table or collection must serve a clear purpose traced to PRD requirements
- Indexes must be justified by identified query patterns
- Data ownership must be unambiguous: each data item belongs to exactly one service
## Table Definitions
For each table or collection, define:
- Table name and purpose (traced to PRD requirement)
- Column definitions:
  - Name
  - Data type
  - Nullable or not null
  - Default value (if any)
  - Constraints (unique, check, etc.)
- Primary key
- Foreign keys and relationships
- Data volume estimates (when relevant for storage selection)
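A table definition covering each of these elements can be sketched as DDL. The schema below is purely illustrative (a hypothetical `orders` table, not one from any PRD), run through SQLite only so the example is self-contained:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
-- Purpose: one row per placed order (traced to a hypothetical PRD
-- requirement "customers can place orders").
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,                        -- primary key
    customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
    status      TEXT    NOT NULL DEFAULT 'pending'          -- default value
                CHECK (status IN ('pending', 'shipped', 'cancelled')),
    note        TEXT,                                       -- nullable by design
    created_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")
conn.execute("INSERT INTO customers (id) VALUES (1)")
conn.execute("INSERT INTO orders (customer_id) VALUES (1)")
row = conn.execute("SELECT status, note FROM orders").fetchone()
print(row)  # -> ('pending', None): defaults and nullability in action
```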
## Index Design
Indexes must be justified by query patterns:
- Identify the queries this table must support
- Design indexes to cover those queries
- Avoid speculative indexes "just in case"
- Consider write amplification: every index slows writes
Index justification format:
- Index name
- Columns (with sort direction)
- Type (unique, non-unique, partial, composite)
- Query pattern it serves
- Estimated selectivity
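A minimal sketch of justifying an index by its query pattern, using SQLite's `EXPLAIN QUERY PLAN` to confirm the index actually covers the query (table, columns, and index name are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, created_at TEXT, payload TEXT)")
# Query pattern served: "fetch a user's most recent events" -> composite
# index on (user_id, created_at DESC) matching both the filter and the sort.
conn.execute("CREATE INDEX idx_events_user_recent ON events (user_id, created_at DESC)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT payload FROM events WHERE user_id = ? ORDER BY created_at DESC",
    (42,),
).fetchall()
# The plan should reference the index rather than a full table scan.
print(plan)
```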
## Partition Keys
When designing distributed data stores:
- Partition key must distribute data evenly across nodes
- Partition key should align with the most common access pattern
- Consider hot partition risks
- Define partition strategy (hash, range, composite)
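The hash strategy can be sketched in a few lines. This is an in-process illustration of even distribution and hot-partition checking, not how any particular data store implements it:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key to a partition via a stable hash (hash strategy).

    Python's built-in hash() is randomized per process, so a deterministic
    digest is used instead: the same key must always land on the same node.
    """
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Key aligned with the common access pattern (per-user lookups): all of a
# user's reads hit one partition, while distinct users spread across nodes.
counts = [0] * 16
for i in range(10_000):
    counts[partition_for(f"user:{i}", 16)] += 1
# Roughly even distribution -> no hot partition for uniformly drawn keys.
print(min(counts), max(counts))
```

Note that a low-cardinality or skewed key (e.g. `country` when most traffic is from one country) defeats this regardless of the hash.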
## Relationships
Define relationships explicitly:
- One-to-one
- One-to-many (with foreign key placement)
- Many-to-many (with junction table)
For each relationship:
- Direction of access (which side queries the other)
- Cardinality (exactly N, at most N, unbounded)
- Nullability (is the relationship optional?)
- Cascade behavior (what happens on delete?)
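A many-to-many relationship with a junction table and explicit cascade behavior can be sketched as follows (hypothetical `posts`/`tags` schema; SQLite enforces foreign keys only when the pragma is enabled):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY);
CREATE TABLE tags  (id INTEGER PRIMARY KEY);
-- Junction table for the many-to-many relationship. Cascade behavior:
-- deleting either side removes the link rows, leaving the other side intact.
CREATE TABLE post_tags (
    post_id INTEGER NOT NULL REFERENCES posts(id) ON DELETE CASCADE,
    tag_id  INTEGER NOT NULL REFERENCES tags(id)  ON DELETE CASCADE,
    PRIMARY KEY (post_id, tag_id)
);
INSERT INTO posts VALUES (1);
INSERT INTO tags  VALUES (10);
INSERT INTO post_tags VALUES (1, 10);
""")
conn.execute("DELETE FROM posts WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM post_tags").fetchone()[0]
tags_left = conn.execute("SELECT COUNT(*) FROM tags").fetchone()[0]
print(remaining, tags_left)  # -> 0 1: link cascaded away, tag survives
```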
## Denormalization Strategy
Denormalize when:
- A query needs data from multiple entities and joins are expensive or unavailable
- Read frequency significantly exceeds write frequency
- The denormalized data has a clear source of truth that can be kept in sync
Do not denormalize when:
- The data changes frequently and consistency is critical
- Joins are cheap and the data store supports them well
- The denormalization creates complex synchronization logic
- There is no clear source of truth
For each denormalized field:
- Identify the source of truth
- Define the synchronization mechanism (eventual consistency, sync on read, sync on write)
- Define the staleness tolerance
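A sync-on-write mechanism can be sketched as a single transaction that updates the source of truth and its copies together. The schema is hypothetical (orders carrying a denormalized copy of the customer's name for cheap reads; staleness tolerance: zero):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    customer_name TEXT NOT NULL   -- denormalized; source of truth: customers.name
);
INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO orders VALUES (100, 1, 'Ada');
""")

def rename_customer(conn, customer_id, new_name):
    """Sync on write: update the source of truth and all copies atomically."""
    with conn:  # one transaction, so readers never see the copies diverge
        conn.execute("UPDATE customers SET name = ? WHERE id = ?",
                     (new_name, customer_id))
        conn.execute("UPDATE orders SET customer_name = ? WHERE customer_id = ?",
                     (new_name, customer_id))

rename_customer(conn, 1, "Ada Lovelace")
copy = conn.execute("SELECT customer_name FROM orders WHERE id = 100").fetchone()[0]
print(copy)  # -> Ada Lovelace
```

With event-based (eventually consistent) replication, the second `UPDATE` would instead be driven by a consumer of the owner's change events, and the staleness tolerance would be the event-propagation lag.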
## TTL and Caching
### TTL (Time-To-Live)
Define TTL for:
- Ephemeral data (sessions, temporary tokens, idempotency keys)
- Time-bounded data (logs, analytics, expired records)
- Data that must be purged after a regulatory period
For each TTL:
- Duration and basis (absolute time, sliding window, last access)
- Action on expiration (delete, archive, revoke)
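A minimal sketch of an absolute-time TTL with delete-on-expiration, using an in-memory store for illustration (real systems usually delegate TTL to the data store itself):

```python
import time

class TTLStore:
    """Toy key-value store with absolute-time TTL; expired entries are deleted."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, expires_at)

    def put(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (value, now + self.ttl)  # basis: absolute time

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._data[key]  # action on expiration: delete
            return None
        return value

store = TTLStore(ttl_seconds=30)
store.put("session:abc", "user-1", now=0)
live = store.get("session:abc", now=10)    # -> "user-1" (still live)
expired = store.get("session:abc", now=31)  # -> None (expired and purged)
print(live, expired)
```

A sliding-window TTL would additionally push `expires_at` forward on each successful `get`.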
### Caching
Define caching for:
- Frequently read, rarely written data
- Computed aggregates that are expensive to recalculate
- Data that is accessed across service boundaries
For each cache:
- Cache type (in-process, distributed, CDN)
- Invalidation strategy (TTL-based, event-based, write-through)
- Staleness tolerance
- Cache miss behavior (stale-while-recompute, block-and-fetch)
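These choices compose into a read-through cache. The sketch below assumes TTL-based invalidation and block-and-fetch miss behavior; `load_user` stands in for a hypothetical expensive backing query:

```python
import time

calls = 0

def load_user(user_id):
    """Hypothetical expensive backing query (counts invocations for the demo)."""
    global calls
    calls += 1
    return {"id": user_id, "name": f"user-{user_id}"}

_cache = {}   # user_id -> (value, expires_at); in-process cache type
TTL = 60.0    # staleness tolerance: up to 60 s

def get_user(user_id, now=None):
    now = time.monotonic() if now is None else now
    hit = _cache.get(user_id)
    if hit is not None and now < hit[1]:
        return hit[0]                      # fresh hit: no backing query
    value = load_user(user_id)             # miss behavior: block and fetch
    _cache[user_id] = (value, now + TTL)   # invalidation: TTL-based
    return value

get_user(1, now=0)
get_user(1, now=30)   # within TTL -> served from cache
get_user(1, now=61)   # past TTL -> refetched
print(calls)  # -> 2
```

Event-based invalidation would instead evict `_cache[user_id]` when the owning service emits a change event, trading TTL simplicity for lower staleness.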
## Data Ownership
Each piece of data must have exactly one owner:
- The owning service is the single source of truth
- Other services access that data via the owner's API or events
- No service reads directly from another service's data store
- If data is needed in multiple places, replicate via events with a clear source of truth
Data ownership format:
| Data Item | Owning Service | Access Pattern | Replication Strategy |
|-----------|----------------|----------------|----------------------|
| ...       | ...            | ...            | ...                  |
## Query Pattern Analysis
For each table, document:
- Primary query patterns (by which columns/keys is data accessed)
- Write patterns (insert-heavy, update-heavy, or mixed)
- Read-to-write ratio (when relevant)
- Consistency requirements (strong, eventual, or tunable)
- Scale expectations (rows per day, rows total, growth rate)
This analysis drives:
- Index selection
- Partition key selection
- Storage engine selection
- Denormalization decisions
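One way to capture this analysis is as structured data rather than free-form prose, so it can be reviewed and mechanically checked against the decisions it drives. Field names mirror the checklist above; the values are purely illustrative:

```python
from dataclasses import dataclass

@dataclass
class QueryPatternAnalysis:
    table: str
    primary_queries: list[str]   # which columns/keys access the data
    write_pattern: str           # "insert-heavy" | "update-heavy" | "mixed"
    read_write_ratio: float      # reads per write, when relevant
    consistency: str             # "strong" | "eventual" | "tunable"
    rows_per_day: int

events = QueryPatternAnalysis(
    table="events",
    primary_queries=["by (user_id, created_at DESC)"],
    write_pattern="insert-heavy",
    read_write_ratio=0.2,
    consistency="eventual",
    rows_per_day=5_000_000,
)

# Example of the analysis driving a decision: a read/write ratio below 1
# argues against denormalizing this table for read speed.
print(events.read_write_ratio < 1)  # -> True
```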
## Anti-Patterns
- Tables without a clear PRD requirement
- Indexes without a documented query pattern
- Shared tables across service boundaries
- Premature denormalization without a read/write justification
- Missing foreign key constraints where referential integrity is required
- Data models that assume a specific storage engine without justification