149 lines
6.5 KiB
Markdown
149 lines
6.5 KiB
Markdown
---
|
|
name: storage-knowledge
|
|
description: "Knowledge contract for selecting storage technologies based on data patterns. Covers relational, wide-column, document, and key-value stores with use-when and avoid-when criteria. Referenced by design-architecture when making storage decisions."
|
|
---
|
|
|
|
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is making storage technology decisions.
|
|
|
|
## Core Principle
|
|
|
|
Storage selection must be driven by query patterns, write patterns, consistency requirements, and scale expectations identified in the PRD. Do not choose a storage technology because it is familiar, fashionable, or might be needed someday.
|
|
|
|
Every storage choice must be justified. If a simpler option meets the requirements, use it.
|
|
|
|
## Storage Selection Criteria
|
|
|
|
For each data entity, answer these questions before selecting storage:
|
|
|
|
1. What are the primary query patterns? (by key, by range, by complex filter, by full-text search)
|
|
2. What are the write patterns? (insert-heavy, update-heavy, append-only)
|
|
3. What consistency is required? (strong, eventual, tunable)
|
|
4. What scale is expected? (rows per day, total rows, growth rate)
|
|
5. What are the access latency requirements? (ms, seconds, eventual)
|
|
6. What relationships exist with other entities? (foreign keys, nested documents, graph traversals)
|
|
|
|
## Relational Database (PostgreSQL, MySQL, etc.)
|
|
|
|
Use when:
|
|
- Strong consistency is required (ACID transactions)
|
|
- Complex joins are needed for queries
|
|
- Transactional integrity across multiple entities is required
|
|
- Data has well-defined structure with relationships
|
|
- Referential integrity constraints are important
|
|
- Ad-hoc querying on multiple dimensions is common
|
|
|
|
Avoid when:
|
|
- Write throughput exceeds what a single relational node can handle and sharding adds unacceptable complexity
|
|
- Data is deeply nested and rarely queried across relationships
|
|
- Schema evolves rapidly and migrations are costly
|
|
- Full-text search is a primary access pattern (use a search engine instead)
|
|
|
|
Trade-offs: +strong consistency, +relationships, +ad-hoc queries, +maturity, -scaling complexity, -schema rigidity
|
|
|
|
### Schema Design for Relational
|
|
- Normalize to 3NF by default
|
|
- Denormalize selectively based on query patterns (see `data-modeling`)
|
|
- Define foreign keys with appropriate ON DELETE behavior
|
|
- Define indexes for identified query patterns only
|
|
- Consider partitioning for large tables
|
|
|
|
## Wide-Column / Cassandra
|
|
|
|
Use when:
|
|
- High write throughput is required (append-heavy workloads)
|
|
- Query-first modeling (you know all query patterns upfront)
|
|
- Large-scale time-series data
|
|
- Geographic distribution with local writes
|
|
- Linear horizontal scaling is required
|
|
- Availability is prioritized over strong consistency (tunable consistency)
|
|
|
|
Avoid when:
|
|
- Ad-hoc queries on arbitrary columns are needed
|
|
- Relational joins across tables are common
|
|
- Strong consistency is required for all operations
|
|
- The data model requires many secondary indexes
|
|
- The team lacks Cassandra modeling experience (data modeling mistakes are costly to fix)
|
|
|
|
Trade-offs: +write throughput, +horizontal scaling, +availability, -no joins, -query-first modeling required, -modeling mistakes are expensive
|
|
|
|
### Schema Design for Wide-Column
|
|
- Model around query patterns: each table serves a specific query
|
|
- Partition key must distribute data evenly
|
|
- Clustering columns define sort order within a partition
|
|
- Denormalize aggressively: one table per query pattern
|
|
- Avoid secondary indexes; model queries into the primary key instead
|
|
|
|
## Document / MongoDB
|
|
|
|
Use when:
|
|
- Data is document-centric with nested structures
|
|
- Schema flexibility is required (rapidly evolving data)
|
|
- Aggregate boundaries align with document boundaries
|
|
- Single-document atomicity is sufficient
|
|
- Read-heavy workloads with rich query capabilities
|
|
|
|
Avoid when:
|
|
- Strong relational constraints between entities are required
|
|
- Multi-document transactions are frequent (MongoDB supports them but they are slower)
|
|
- Data requires complex joins across many collections
|
|
- Strict schema validation is critical
|
|
|
|
Trade-offs: +schema flexibility, +nested structures, +rich queries, +easy to start, -relationship handling, -larger storage for indexes, -multi-document transaction overhead
|
|
|
|
### Schema Design for Document
|
|
- Design documents around access patterns
|
|
- Embed data that is always accessed together
|
|
- Reference data that is accessed independently
|
|
- Use indexes for fields that are frequently filtered
|
|
- Consider document size limits (16MB in MongoDB)
|
|
- Use change streams for event-driven patterns
|
|
|
|
## Key-Value / Redis
|
|
|
|
Use for:
|
|
- Caching frequently accessed data
|
|
- Rate limiting (counters with TTL)
|
|
- Idempotency keys (set with TTL, check existence)
|
|
- Ephemeral state (sessions, temporary tokens)
|
|
- Distributed locking
|
|
- Sorted sets for leaderboards or priority queues
|
|
- Pub/sub for lightweight messaging
|
|
|
|
Avoid when:
|
|
- You need complex queries (no query language)
|
|
- You need durability for primary data (Redis persistence is not ACID)
|
|
- Data size exceeds available memory and eviction is unacceptable
|
|
- You need relationships between entities
|
|
|
|
Trade-offs: +speed, +simplicity, +data structures, -memory cost, -durability (with caveats), -no complex queries
|
|
|
|
### Using Redis as Primary Storage
|
|
Only when:
|
|
- Data is inherently ephemeral (sessions, rate limits, idempotency keys)
|
|
- Data loss is acceptable or can be reconstructed
|
|
- The team understands persistence limitations (RDB snapshots, AOF)
|
|
|
|
Never use Redis as the primary persistent store for business-critical data unless:
|
|
- Durability requirements are clearly defined
|
|
- Persistence configuration (RDB + AOF) meets those requirements
|
|
- Recovery procedures are tested and documented
|
|
|
|
## Storage Selection Decision Framework
|
|
|
|
1. Start with the simplest option that meets requirements
|
|
2. Only add complexity when the PRD justifies it
|
|
3. Prefer one storage technology when it meets all requirements
|
|
4. Add a second storage technology only when a specific PRD requirement demands it
|
|
5. Document every storage choice as an ADR with:
|
|
- The requirement that drives it
|
|
- The alternatives considered
|
|
- Why the chosen option is the simplest that works
|
|
|
|
## Anti-Patterns
|
|
|
|
- Using Cassandra for a 10,000-row table with ad-hoc queries
|
|
- Using MongoDB for highly relational data requiring joins
|
|
- Using Redis as a primary persistent store without understanding durability
|
|
- Using multiple storage technologies when one suffices
|
|
- Choosing storage based on familiarity rather than query/write patterns
|
|
- Premature optimization: selecting distributed storage before single-node is proven insufficient |