opencode-workflow/skills/storage-knowledge/SKILL.md

149 lines
6.5 KiB
Markdown
Raw Permalink Normal View History

---
name: storage-knowledge
description: "Knowledge contract for selecting storage technologies based on data patterns. Covers relational, wide-column, document, and key-value stores with use-when and avoid-when criteria. Referenced by design-architecture when making storage decisions."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is making storage technology decisions.
## Core Principle
Storage selection must be driven by query patterns, write patterns, consistency requirements, and scale expectations identified in the PRD. Do not choose a storage technology because it is familiar, fashionable, or might be needed someday.
Every storage choice must be justified. If a simpler option meets the requirements, use it.
## Storage Selection Criteria
For each data entity, answer these questions before selecting storage:
1. What are the primary query patterns? (by key, by range, by complex filter, by full-text search)
2. What are the write patterns? (insert-heavy, update-heavy, append-only)
3. What consistency is required? (strong, eventual, tunable)
4. What scale is expected? (rows per day, total rows, growth rate)
5. What are the access latency requirements? (ms, seconds, eventual)
6. What relationships exist with other entities? (foreign keys, nested documents, graph traversals)
## Relational Database (PostgreSQL, MySQL, etc.)
Use when:
- Strong consistency is required (ACID transactions)
- Complex joins are needed for queries
- Transactional integrity across multiple entities is required
- Data has well-defined structure with relationships
- Referential integrity constraints are important
- Ad-hoc querying on multiple dimensions is common
Avoid when:
- Write throughput exceeds what a single relational node can handle and sharding adds unacceptable complexity
- Data is deeply nested and rarely queried across relationships
- Schema evolves rapidly and migrations are costly
- Full-text search is a primary access pattern (use a search engine instead)
Trade-offs: +strong consistency, +relationships, +ad-hoc queries, +maturity, -scaling complexity, -schema rigidity
### Schema Design for Relational
- Normalize to 3NF by default
- Denormalize selectively based on query patterns (see `data-modeling`)
- Define foreign keys with appropriate ON DELETE behavior
- Define indexes for identified query patterns only
- Consider partitioning for large tables
## Wide-Column / Cassandra
Use when:
- High write throughput is required (append-heavy workloads)
- Query-first modeling (you know all query patterns upfront)
- Large-scale time-series data
- Geographic distribution with local writes
- Linear horizontal scaling is required
- Availability is prioritized over strong consistency (tunable consistency)
Avoid when:
- Ad-hoc queries on arbitrary columns are needed
- Relational joins across tables are common
- Strong consistency is required for all operations
- The data model requires many secondary indexes
- The team lacks Cassandra modeling experience (data modeling mistakes are costly to fix)
Trade-offs: +write throughput, +horizontal scaling, +availability, -no joins, -query-first modeling required, -modeling mistakes are expensive
### Schema Design for Wide-Column
- Model around query patterns: each table serves a specific query
- Partition key must distribute data evenly
- Clustering columns define sort order within a partition
- Denormalize aggressively: one table per query pattern
- Avoid secondary indexes; model queries into the primary key instead
## Document / MongoDB
Use when:
- Data is document-centric with nested structures
- Schema flexibility is required (rapidly evolving data)
- Aggregate boundaries align with document boundaries
- Single-document atomicity is sufficient
- Read-heavy workloads with rich query capabilities
Avoid when:
- Strong relational constraints between entities are required
- Multi-document transactions are frequent (MongoDB supports them but they are slower)
- Data requires complex joins across many collections
- Strict schema validation is critical
Trade-offs: +schema flexibility, +nested structures, +rich queries, +easy to start, -relationship handling, -larger storage for indexes, -multi-document transaction overhead
### Schema Design for Document
- Design documents around access patterns
- Embed data that is always accessed together
- Reference data that is accessed independently
- Use indexes for fields that are frequently filtered
- Consider document size limits (16MB in MongoDB)
- Use change streams for event-driven patterns
## Key-Value / Redis
Use for:
- Caching frequently accessed data
- Rate limiting (counters with TTL)
- Idempotency keys (set with TTL, check existence)
- Ephemeral state (sessions, temporary tokens)
- Distributed locking
- Sorted sets for leaderboards or priority queues
- Pub/sub for lightweight messaging
Avoid when:
- You need complex queries (no query language)
- You need durability for primary data (Redis persistence is not ACID)
- Data size exceeds available memory and eviction is unacceptable
- You need relationships between entities
Trade-offs: +speed, +simplicity, +data structures, -memory cost, -durability (with caveats), -no complex queries
### Using Redis as Primary Storage
Only when:
- Data is inherently ephemeral (sessions, rate limits, idempotency keys)
- Data loss is acceptable or can be reconstructed
- The team understands persistence limitations (RDB snapshots, AOF)
Never use Redis as the primary persistent store for business-critical data unless:
- Durability requirements are clearly defined
- Persistence configuration (RDB + AOF) meets those requirements
- Recovery procedures are tested and documented
## Storage Selection Decision Framework
1. Start with the simplest option that meets requirements
2. Only add complexity when the PRD justifies it
3. Prefer one storage technology when it meets all requirements
4. Add a second storage technology only when a specific PRD requirement demands it
5. Document every storage choice as an ADR with:
- The requirement that drives it
- The alternatives considered
- Why the chosen option is the simplest that works
## Anti-Patterns
- Using Cassandra for a 10,000-row table with ad-hoc queries
- Using MongoDB for highly relational data requiring joins
- Using Redis as a primary persistent store without understanding durability
- Using multiple storage technologies when one suffices
- Choosing storage based on familiarity rather than query/write patterns
- Premature optimization: selecting distributed storage before single-node is proven insufficient