---
name: data-modeling
description: "Knowledge contract for defining database schemas, partition keys, indexes, query patterns, denormalization strategy, TTL/caching, and data ownership. Referenced by design-architecture when designing data models."
---

This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing database schemas and data models.

## Core Principles

- Data models must be driven by query and write patterns, not theoretical purity
- Each table or collection must serve a clear purpose traced to a PRD requirement
- Indexes must be justified by identified query patterns
- Data ownership must be unambiguous: each data item belongs to exactly one service

## Table Definitions

For each table or collection, define:

- Table name and purpose (traced to a PRD requirement)
- Column definitions:
  - Name
  - Data type
  - Nullable or not null
  - Default value (if any)
  - Constraints (unique, check, etc.)
- Primary key
- Foreign keys and relationships
- Data volume estimates (when relevant for storage selection)

## Index Design

Indexes must be justified by query patterns:

- Identify the queries the table must support
- Design indexes to cover those queries
- Avoid speculative indexes added "just in case"
- Consider write amplification: every index slows writes

Index justification format:

- Index name
- Columns (with sort direction)
- Type (unique, non-unique, partial, composite)
- Query pattern it serves
- Estimated selectivity

## Partition Keys

When designing distributed data stores:

- The partition key must distribute data evenly across nodes
- The partition key should align with the most common access pattern
- Consider hot-partition risk
- Define the partition strategy (hash, range, composite)

## Relationships

Define relationships explicitly:

- One-to-one
- One-to-many (with foreign key placement)
- Many-to-many (with a junction table)

For each relationship, define:

- Direction of access (which side queries the other)
- Cardinality (exactly N, at most N, unbounded)
- Nullability (is the relationship optional?)
- Cascade behavior (what happens on delete?)

## Denormalization Strategy

Denormalize when:

- A query needs data from multiple entities and joins are expensive or unavailable
- Read frequency significantly exceeds write frequency
- The denormalized data has a clear source of truth that can be kept in sync

Do not denormalize when:

- The data changes frequently and consistency is critical
- Joins are cheap and the data store supports them well
- The denormalization creates complex synchronization logic
- There is no clear source of truth

For each denormalized field:

- Identify the source of truth
- Define the synchronization mechanism (eventual consistency, sync on read, sync on write)
- Define the staleness tolerance

## TTL and Caching

### TTL (Time-To-Live)

Define a TTL for:

- Ephemeral data (sessions, temporary tokens, idempotency keys)
- Time-bounded data (logs, analytics, expired records)
- Data that must be purged after a regulatory retention period

For each TTL, define:

- Duration and basis (absolute time, sliding window, last access)
- Action on expiration (delete, archive, revoke)

### Caching

Define caching for:

- Frequently read, rarely written data
- Computed aggregates that are expensive to recalculate
- Data that is accessed across service boundaries

For each cache, define:

- Cache type (in-process, distributed, CDN)
- Invalidation strategy (TTL-based, event-based, write-through)
- Staleness tolerance
- Cache-miss behavior (stale-while-recompute, block-and-fetch)

## Data Ownership

Each piece of data must have exactly one owner:

- The owning service is the single source of truth
- Other services access that data via the owner's API or events
- No service reads directly from another service's data store
- If data is needed in multiple places, replicate it via events with a clear source of truth

Data ownership format:

| Data Item | Owning Service | Access Pattern | Replication Strategy |
|-----------|----------------|----------------|----------------------|
| ...       | ...            | ...            | ...                  |

## Query Pattern Analysis

For each table, document:

- Primary query patterns (which columns or keys the data is accessed by)
- Write patterns (insert-heavy, update-heavy, or mixed)
- Read-to-write ratio (when relevant)
- Consistency requirements (strong, eventual, or tunable)
- Scale expectations (rows per day, total rows, growth rate)

This analysis drives:

- Index selection
- Partition key selection
- Storage engine selection
- Denormalization decisions

## Anti-Patterns

- Tables without a clear PRD requirement
- Indexes without a documented query pattern
- Shared tables across service boundaries
- Premature denormalization without a read/write justification
- Missing foreign key constraints where referential integrity is required
- Data models that assume a specific storage engine without justification
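## Worked Example

The table-definition, index, and relationship rules above can be sketched concretely. The following is a minimal, illustrative example using SQLite (via Python's standard `sqlite3` module); the table names, columns, and the "list a user's open orders" query pattern are hypothetical, not prescribed by this contract.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Purpose: registered users (would trace to a PRD requirement).
conn.execute("""
    CREATE TABLE users (
        id         INTEGER PRIMARY KEY,
        email      TEXT NOT NULL UNIQUE,                    -- constraint: unique
        created_at TEXT NOT NULL DEFAULT (datetime('now'))  -- default value
    )
""")

# Purpose: orders placed by users. One-to-many: the foreign key lives on
# the "many" side; ON DELETE CASCADE is the documented cascade behavior.
conn.execute("""
    CREATE TABLE orders (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
        status  TEXT NOT NULL CHECK (status IN ('open', 'shipped')),
        total   REAL NOT NULL DEFAULT 0.0
    )
""")

# Index justification:
#   name:          idx_orders_user_status
#   columns:       (user_id, status), both ascending
#   type:          non-unique composite
#   query pattern: "list a user's open orders" (executed below)
conn.execute("CREATE INDEX idx_orders_user_status ON orders (user_id, status)")

# Many-to-many (orders <-> tags) via a junction table.
conn.execute("CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute("""
    CREATE TABLE order_tags (
        order_id INTEGER NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
        tag_id   INTEGER NOT NULL REFERENCES tags(id)   ON DELETE CASCADE,
        PRIMARY KEY (order_id, tag_id)
    )
""")

conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
conn.execute("INSERT INTO orders (user_id, status) VALUES (1, 'open'), (1, 'shipped')")

# The query pattern that justified the index:
open_orders = conn.execute(
    "SELECT id FROM orders WHERE user_id = ? AND status = 'open'", (1,)
).fetchall()
print(open_orders)  # -> [(1,)]
```

Note how every index maps back to a query the code actually runs, and every relationship states its cardinality and cascade behavior in the DDL itself.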
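The caching rules above (cache type, invalidation strategy, staleness tolerance, miss behavior) can also be sketched. This is an illustrative in-process cache with TTL-based invalidation and block-and-fetch miss behavior; the names `TTLCache` and `load_profile` are hypothetical and not taken from any library.

```python
import time

class TTLCache:
    """In-process cache: TTL-based invalidation, block-and-fetch on miss."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds   # staleness tolerance
        self.clock = clock       # injectable clock, for deterministic tests
        self._store = {}         # key -> (value, expires_at)

    def get(self, key, loader):
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and entry[1] > now:
            return entry[0]      # fresh hit: serve from cache
        # Miss or expired: block and fetch from the source of truth.
        value = loader(key)
        self._store[key] = (value, now + self.ttl)
        return value

# Fake clock so the example is deterministic.
t = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: t[0])

calls = []
def load_profile(user_id):       # stands in for a read from the owning store
    calls.append(user_id)
    return {"id": user_id}

cache.get("u1", load_profile)    # miss: loads from the source of truth
cache.get("u1", load_profile)    # hit: served from cache, no load
t[0] = 31.0                      # advance past the TTL
cache.get("u1", load_profile)    # expired: reloads
print(len(calls))                # -> 2
```

The same shape applies to a distributed cache; what changes is where `_store` lives and how invalidation events reach it, which is exactly what the "cache type" and "invalidation strategy" fields capture.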