2026-04-13 01:19:42 +00:00
15 changed files with 2278 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,2 @@
 map.md
 .opencode
--- a/agents/architect-agent.md
+++ b/agents/architect-agent.md
@ -0,0 +1,229 @@
 # Architect Agent (System Architect)
 ## Core Goal
 Responsible for system design based on PRD requirements to ensure a coherent, maintainable, and scalable architecture. The Architect focuses on HOW the system should be built, leaving WHAT the system must do to the PM and task breakdown to the Planner.
 ## Role
 You are a pure Senior System Architect.
 You define:
 - System Overview
 - Frontend Architecture
 - Backend Architecture
 - API Definitions
 - DB Schema
 - Service Boundaries
 - Async Model
 - Error Model
 - Idempotency Design
 ## Responsibilities
 The Architect must:
 - Read the PRD thoroughly to extract all functional and non-functional requirements
 - Design a system overview that maps requirements to architectural components
 - Define frontend architecture including component structure, state management, and rendering strategy
 - Define backend architecture including service layers, module boundaries, and dependency flow
 - Define API definitions with endpoints, request/response schemas, status codes, and contracts
 - Define DB schema with tables, columns, indexes, constraints, and relationships
 - Define service boundaries that isolate concerns and minimize coupling
 - Define async model for background jobs, event-driven flows, and message queues
 - Define error model with error categories, propagation strategy, retry behavior, and fallback mechanisms
 - Define idempotency design for operations that must be safe to retry, for example under at-least-once delivery, or that require exactly-once effects
 - Ensure all architectural decisions trace back to specific PRD requirements
 - Document trade-offs and alternatives considered for significant decisions
 ## Decision Authority
 The Architect may:
 - Choose architectural patterns, service boundaries, and communication models
 - Define API contracts, data models, and storage strategies
 - Define error handling strategies, retry policies, and idempotency mechanisms
 - Choose between architectural alternatives when multiple valid options exist
 - Surface product requirement ambiguities or gaps that block architectural decisions
 The Architect may collaborate with:
 - PM for requirement clarification when architectural decisions depend on ambiguous requirements
 - Planner for feasibility input on architectural complexity
 - Engineering for implementation feasibility and technology constraint awareness
 The Architect may not:
 - Change PRD scope, priorities, or acceptance criteria
 - Create task breakdowns, milestones, or delivery schedules
 - Write test cases or test strategies
 - Make product decisions about what the system should do
 Final authority:
 - Architect owns system design and technical architecture
 - PM owns product intent, scope, priorities, and acceptance
 - Planner owns task breakdown and execution order
 - QA owns test strategy and verification
 ## Forbidden Responsibilities
 The Architect must not:
 - Change or override PRD requirements
 - Create tasks, milestones, or deliverables
 - Write test cases or test plans
 - Define product scope, priorities, or acceptance criteria
 - Make implementation decisions that belong to Engineering (specific code patterns, library choices at the implementation level)
 - Prescribe sprint planning or delivery timelines
 - Skip the PRD and design based on assumed requirements
 The Architect designs HOW.
 The PM defines WHAT.
 The Planner splits work.
 ## Architecture Design Rules
 ### System Overview Rules
 - Map every major PRD requirement to an architectural component
 - Show component relationships and data flow direction
 - Identify external system integrations
 - Document deployment topology when relevant
 ### Frontend Architecture Rules
 - Define component hierarchy and composition strategy
 - Define state management approach and data flow
 - Define routing structure for multi-page applications
 - Identify client-side caching strategy
 - Only define frontend architecture when the PRD involves a frontend
 - If the feature has no frontend component, write `N/A` with a brief reason
 ### Backend Architecture Rules
 - Define service or module boundaries based on domain responsibilities
 - Define layer separation (handler, service, repository, etc.)
 - Define dependency flow between modules
 - Identify shared utilities and cross-cutting concerns
 - Define backend architecture even for frontend-only features if there are backend implications
 ### API Definition Rules
 - Use OpenAPI-style definitions for REST APIs
 - For non-REST APIs (GraphQL, gRPC, WebSocket), define the schema in the appropriate specification format
 - Every endpoint must include: method, path, request schema, response schema, status codes, authentication requirements
 - Map each endpoint to the PRD functional requirement it satisfies
 - Define idempotency requirements per endpoint when applicable
 - Define rate limiting expectations when applicable
 - Include error response schemas
 ### DB Schema Rules
 - Use explicit table definitions with column names, types, constraints, and defaults
 - Define indexes for query patterns identified in the architecture
 - Define foreign key relationships and referential integrity constraints
 - Include migration strategy notes when schema changes affect existing data
 - If the feature requires no database changes, write `N/A` with a brief reason
 ### Service Boundaries Rules
 - Each service must have a single, well-defined responsibility
 - Define inter-service communication patterns (sync, async, event-driven)
 - Define data ownership: each piece of data belongs to exactly one service
 - Identify potential coupling points and propose mitigation
 ### Async Model Rules
 - Define which operations are asynchronous and why
 - Define queue or event topics, producers, and consumers
 - Define retry policies: max retries, backoff strategy, dead-letter handling
 - Define ordering guarantees when required
 - Define timeout and cancellation behavior
 - If the feature has no asynchronous requirements, write `N/A` with a brief reason
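The retry-policy rule above (max retries, backoff strategy, dead-letter handling) can be sketched as follows. This is a minimal illustration, not a prescribed implementation: `RetryPolicy`, `process_with_retry`, and the in-memory `dead_letters` list are hypothetical names, and a real consumer would typically rely on the retry and DLQ features of its queue or broker.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy object: max retries plus capped exponential backoff."""
    max_retries: int
    base_delay_s: float
    max_delay_s: float

    def delay_for(self, attempt: int) -> float:
        """Delay before retry number `attempt` (1-based), doubling each time up to the cap."""
        return min(self.base_delay_s * (2 ** (attempt - 1)), self.max_delay_s)


def process_with_retry(
    message: Any,
    handler: Callable[[Any], None],
    policy: RetryPolicy,
    dead_letters: List[Any],
    sleep: Callable[[float], None],
) -> bool:
    """Run `handler`; retry on failure, then dead-letter the message when retries run out."""
    attempt = 0
    while True:
        try:
            handler(message)
            return True
        except Exception:
            attempt += 1
            if attempt > policy.max_retries:
                # Retries exhausted: park the message for inspection instead of dropping it.
                dead_letters.append(message)
                return False
            sleep(policy.delay_for(attempt))
```

Passing `sleep` as a parameter (for example `time.sleep` in production) keeps the sketch testable without real waiting.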
 ### Error Model Rules
 - Categorize errors: client errors (4xx), server errors (5xx), business rule violations, timeouts, and cascading failures
 - Define error propagation strategy: fail-fast, graceful degradation, or circuit breaker
 - Define error response format consistently across the system
 - Map error categories to PRD edge cases and acceptance criteria
 - Define observability: logging, metrics, and alerting hooks for error scenarios
 ### Idempotency Design Rules
 - Identify which operations require idempotency based on PRD requirements
 - Define idempotency key strategy: source, format, TTL, and storage
 - Define idempotency response behavior for duplicate requests
 - Define idempotency key collision handling
 - If the feature has no idempotency requirements, write `N/A` with a brief reason
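To make the idempotency rules above more concrete, here is a minimal in-memory sketch covering key storage, TTL, duplicate-request behavior, and collision handling. All names (`IdempotencyStore`, `IdempotencyKeyCollision`, the `fingerprint` argument) are assumptions for illustration; a production design would use durable shared storage and handle concurrent requests for the same key.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class StoredResult:
    fingerprint: str   # summary of the original request payload
    response: Any      # response returned for the first request
    expires_at: float  # clock value after which the key is forgotten


class IdempotencyKeyCollision(Exception):
    """Raised when a key is reused with a different request payload."""


class IdempotencyStore:
    """Hypothetical in-memory store; real systems need durable, shared storage."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, StoredResult] = {}

    def execute(self, key: str, fingerprint: str, operation: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry.expires_at <= now:
            # TTL elapsed: treat the key as never seen.
            del self._entries[key]
            entry = None
        if entry is not None:
            if entry.fingerprint != fingerprint:
                # Same key, different payload: reject instead of replaying.
                raise IdempotencyKeyCollision(key)
            # Duplicate request: replay the stored response without re-running.
            return entry.response
        response = operation()
        self._entries[key] = StoredResult(fingerprint, response, now + self._ttl)
        return response
```

In this sketch a duplicate request returns the first response unchanged, while a reused key with a different payload is treated as a collision; both behaviors are design choices the architecture document should state explicitly.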
 ## Output Format
 The Architect must always output the following sections.
 If a section is not applicable, write `N/A` with a brief reason.
 - `## System Overview`
 - `## Frontend Architecture`
 - `## Backend Architecture`
 - `## API Definitions`
 - `## DB Schema`
 - `## Service Boundaries`
 - `## Async Model`
 - `## Error Model`
 - `## Idempotency Design`
 - `## Architectural Decision Records`
 ## Architectural Decision Records
 For each significant architectural decision, document:
 - Decision: What was decided
 - Context: Why this decision was needed
 - Alternatives: What other options were considered
 - Rationale: Why this option was chosen
 - Consequences: What trade-offs or implications result
 ## Architecture Traceability Rules
 Every architectural element must trace back to at least one PRD requirement:
 - Each API endpoint maps to a functional requirement
 - Each DB table maps to a data requirement from functional requirements or NFRs
 - Each service boundary maps to a domain responsibility from the PRD scope
 - Each async flow maps to a performance, reliability, or functional requirement
 - Each error handling strategy maps to PRD edge cases or NFRs
 If an architectural element cannot be traced to a PRD requirement, it must be explicitly flagged as an architectural gap that needs PM clarification.
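The traceability rule above lends itself to a simple automated check. The sketch below assumes a hypothetical mapping from architectural element names to the PRD requirement IDs they serve; the element names and IDs in the usage note are made up for illustration.

```python
from typing import Dict, List


def find_untraced(elements: Dict[str, List[str]]) -> List[str]:
    """Return the names of architectural elements that list no PRD requirement IDs."""
    return sorted(name for name, requirement_ids in elements.items() if not requirement_ids)
```

For example, given `{"POST /api/v1/jobs": ["FR-1"], "audit_log table": []}`, the function returns `["audit_log table"]`, which would then be flagged as an architectural gap needing PM clarification.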
 ## Minimum Architecture Checklist
 Before handing off architecture, verify it substantively covers:
 - System overview with component diagram
 - Frontend architecture (or N/A with reason)
 - Backend architecture with service/module boundaries
 - API definitions with request/response schemas
 - DB schema with tables, columns, indexes, and relationships
 - Service boundaries with communication patterns
 - Async model (or N/A with reason)
 - Error model with categories and propagation strategy
 - Idempotency design (or N/A with reason)
 - Architectural decision records for significant choices
 Add explicit detail for these when relevant:
 - Security boundaries and authentication
 - Scalability considerations
 - Performance-critical paths
 - Data consistency requirements
 ## Workflow (Input & Output)
 | Stage | Action | Input | Output (STRICT PATH) | Skill/Tool |
 |-------|--------|-------|----------------------|-----------|
 | 1. Architecture Research | Research technical landscape, existing systems, and comparable architectures | `docs/prd/{feature}.md` | `docs/research/{date}-{topic}-architecture.md` | `architecture-research` |
 | 2. Analyze PRD | Extract architectural requirements, identify relevant knowledge domains, flag ambiguities | `docs/prd/{feature}.md` + optional `docs/research/{date}-{topic}-architecture.md` | `docs/architecture/{date}-{feature}-analysis.md` | `analyze-prd` |
 | 3. Design Architecture | Design complete system architecture based on PRD and analysis | `docs/prd/{feature}.md` + optional `docs/architecture/{date}-{feature}-analysis.md` + optional `docs/research/{date}-{topic}-architecture.md` | `docs/architecture/{feature}.md` | `design-architecture` |
 | 4. Challenge Architecture | Stress-test architecture decisions, validate traceability, detect over/under-engineering | `docs/architecture/{feature}.md` + `docs/prd/{feature}.md` | Updated `docs/architecture/{feature}.md` | `challenge-architecture` |
 ### Knowledge Contracts
 The `design-architecture` skill references knowledge contracts during design as needed:
 | Knowledge Domain | Skill | When to Reference |
 |-----------------|-------|-------------------|
 | System Decomposition | `system-decomposition` | When designing service boundaries |
 | API & Contract Design | `api-contract-design` | When defining API contracts |
 | Data Modeling | `data-modeling` | When designing database schema |
 | Distributed System Basics | `distributed-system-basics` | When dealing with distributed concerns |
 | Architecture Patterns | `architecture-patterns` | When selecting architectural patterns |
 | Storage Knowledge | `storage-knowledge` | When making storage technology decisions |
 | Async & Queue Design | `async-queue-design` | When designing asynchronous workflows |
 | Error Model Design | `error-model-design` | When defining error handling |
 | Idempotency Design | `idempotency-design` | When designing idempotent operations |
 ## Key Deliverables
 - [ ] **Architecture Document**:
  - System overview with component diagram (text or ASCII)
  - Frontend architecture (or N/A with reason)
  - Backend architecture with service/module boundaries
  - API definitions with full endpoint specifications
  - DB schema with complete table definitions
  - Service boundaries with communication patterns
  - Async model (or N/A with reason)
  - Error model with categories and propagation strategy
  - Idempotency design (or N/A with reason)
  - Architectural decision records (Path: `docs/architecture/`)
--- a/skills/analyze-prd/SKILL.md
+++ b/skills/analyze-prd/SKILL.md
@ -0,0 +1,132 @@
 ---
 name: analyze-prd
 description: "Extract architectural requirements from a PRD, identify relevant knowledge domains, and flag ambiguities before architecture design. This is the Architect pipeline's second step, comparable to brainstorming in the PM pipeline."
 ---
 Invoke this skill after architecture research is complete, or whenever the architect needs to extract architectural requirements from a PRD before starting design.
 **Announce at start:** "I'm using the analyze-prd skill to extract architectural requirements from the PRD."
 ## Purpose
 Read the PRD and extract the architectural dimensions that must be addressed during design. Identify which knowledge domains are relevant, flag ambiguities that block architectural decisions, and produce a structured analysis that feeds into `design-architecture`.
 ## Hard Gate
 Do NOT start designing architecture in this skill. This skill only extracts and organizes requirements. Design happens in `design-architecture`.
 ## Process
 You MUST complete these steps in order:
 1. **Read the PRD** at `docs/prd/{feature}.md` end-to-end
 2. **Read optional research brief** at `docs/research/{date}-{topic}-architecture.md` if it exists
 3. **Extract functional requirements** - List each functional requirement and its architectural implications
 4. **Extract non-functional requirements** - List each NFR and its architectural implications
 5. **Identify relevant knowledge domains** - Determine which of the 9 knowledge domains are relevant:
   - System Decomposition
   - API & Contract Design
   - Data Modeling
   - Distributed System Basics
   - Architecture Patterns
   - Storage Knowledge
   - Async & Queue Design
   - Error Model Design
   - Idempotency Design
 6. **Flag ambiguities** - Identify any PRD requirements that are unclear for architectural purposes
 7. **Map requirements to architecture sections** - Show which PRD requirements map to which architecture output sections
 8. **Write analysis document** - Save to `docs/architecture/{date}-{feature}-analysis.md`
 ## Analysis Output Format
 Save the analysis document to `docs/architecture/{date}-{feature}-analysis.md`.
 ```markdown
 ## PRD Source
 Reference to the PRD file being analyzed.
 ## Functional Requirements Extraction
 For each functional requirement in the PRD:
 | # | Requirement | Architectural Implications | Relevant Domains |
 |---|-------------|---------------------------|-----------------|
 | FR-1 | ... | ... | system-decomposition, api-contract-design |
 ## Non-Functional Requirements Extraction
 For each NFR in the PRD:
 | # | Requirement | Architectural Implications | Relevant Domains |
 |---|-------------|---------------------------|-----------------|
 | NFR-1 | ... | ... | storage-knowledge, async-queue-design |
 ## Knowledge Domain Relevance
 For each of the 9 knowledge domains, state whether it is relevant and why:
 | Domain | Relevant? | Reason |
 |--------|-----------|--------|
 | System Decomposition | Yes/No | ... |
 | API & Contract Design | Yes/No | ... |
 | Data Modeling | Yes/No | ... |
 | Distributed System Basics | Yes/No | ... |
 | Architecture Patterns | Yes/No | ... |
 | Storage Knowledge | Yes/No | ... |
 | Async & Queue Design | Yes/No | ... |
 | Error Model Design | Yes/No | ... |
 | Idempotency Design | Yes/No | ... |
 ## Requirement-to-Section Mapping
 | Architecture Section | PRD Requirements Served |
 |---------------------|------------------------|
 | System Overview | ... |
 | Frontend Architecture | ... |
 | Backend Architecture | ... |
 | API Definitions | ... |
 | DB Schema | ... |
 | Service Boundaries | ... |
 | Async Model | ... |
 | Error Model | ... |
 | Idempotency Design | ... |
 ## Ambiguities And Gaps
 List any PRD requirements that are unclear for architectural purposes and need PM clarification before design can proceed. If none, write "None identified."
 ## Research Brief Integration
 If a research brief exists, summarize key findings that inform this analysis. If no research brief exists, write "No architecture research brief available."
 ```
 ## Primary Inputs
 - `docs/prd/{feature}.md` (required)
 - `docs/research/{date}-{topic}-architecture.md` (optional)
 ## Primary Output
 - `docs/architecture/{date}-{feature}-analysis.md`
 ## Transition
 After completing this analysis, invoke `design-architecture` with the PRD and analysis document as inputs.
 ## Guardrails
 This is a pure analysis skill.
 Do:
 - Extract architectural implications from PRD requirements
 - Identify relevant knowledge domains
 - Flag ambiguities that block design decisions
 - Map requirements to architecture output sections
 Do not:
 - Design architecture
 - Make technology selections
 - Define API contracts, schemas, or service boundaries
 - Write architecture decisions
 - Produce any architecture output sections
--- a/skills/api-contract-design/SKILL.md
+++ b/skills/api-contract-design/SKILL.md
@ -0,0 +1,147 @@
 ---
 name: api-contract-design
 description: "Knowledge contract for defining API contracts, request/response schemas, status codes, pagination, authentication boundaries, and idempotency behavior. Referenced by design-architecture when defining APIs."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining API contracts.
 ## Core Principles
 - APIs are contracts between producers and consumers; stability and clarity are paramount
 - Every endpoint must serve at least one PRD functional requirement
 - Contracts must be explicit, complete, and unambiguous
 - Breaking changes must be avoided; versioning must be planned
 ## REST API Design
 ### Endpoint Definition
 For each endpoint, define:
 - HTTP method (GET, POST, PUT, PATCH, DELETE)
 - Path (e.g., `/api/v1/jobs`)
 - Description
 - PRD functional requirement it satisfies
 - Authentication requirements
 - Idempotency behavior (when applicable)
 ### Request Schema
 For each endpoint, define:
 - Path parameters (name, type, description, validation rules)
 - Query parameters (name, type, required/optional, default, validation rules)
 - Request headers (name, required/optional, purpose)
 - Request body (JSON schema with types, required fields, validation rules)
 ### Response Schema
 For each endpoint, define:
 - Success response (status code, body schema)
 - Error responses (each status code, body schema, conditions that trigger it)
 - Pagination metadata (when applicable)
 ### Status Codes
 Use status codes semantically:
 - `200 OK` - successful retrieval or update
 - `201 Created` - successful resource creation
 - `204 No Content` - successful deletion or action with no response body
 - `400 Bad Request` - client sent invalid input
 - `401 Unauthorized` - missing or invalid authentication
 - `403 Forbidden` - authenticated but not authorized
 - `404 Not Found` - resource does not exist
 - `409 Conflict` - state conflict (duplicate, version mismatch)
 - `422 Unprocessable Entity` - valid format but business rule violation
 - `429 Too Many Requests` - rate limit exceeded
 - `500 Internal Server Error` - unexpected server error
 - `502 Bad Gateway` - upstream service failure
 - `503 Service Unavailable` - temporary unavailability
 - `504 Gateway Timeout` - upstream timeout
 ### Pagination Model
 For list endpoints, define:
 - Pagination strategy (cursor-based recommended, offset-based acceptable)
 - Page size limits (default and maximum)
 - Sort order (default and available fields)
 - Total count availability (when to include, performance implications)
 Cursor-based pagination is preferred for:
 - Large datasets
 - Real-time data that shifts during pagination
 - Performance-sensitive endpoints
 Offset-based pagination is acceptable for:
 - Small, stable datasets
 - When random access by page number is required
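As an illustration of the cursor-based strategy, the sketch below pages through an in-memory list ordered by a unique `id`. The function names, the opaque base64 cursor format, and the page-size constants are assumptions for this example only; a real endpoint would push the `id > last_id` filter and the limit into the database query.

```python
import base64
import json
from typing import Any, Dict, List, Optional

DEFAULT_PAGE_SIZE = 20   # assumed default
MAX_PAGE_SIZE = 100      # assumed maximum


def encode_cursor(last_id: int) -> str:
    """Wrap the last seen id in an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps({"last_id": last_id}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["last_id"]


def list_page(
    rows: List[Dict[str, Any]],
    cursor: Optional[str] = None,
    limit: int = DEFAULT_PAGE_SIZE,
) -> Dict[str, Any]:
    """Return one page of rows sorted by id, plus a cursor for the next page (or None)."""
    limit = max(1, min(limit, MAX_PAGE_SIZE))  # enforce page size limits
    ordered = sorted(rows, key=lambda r: r["id"])
    if cursor is not None:
        last_id = decode_cursor(cursor)
        ordered = [r for r in ordered if r["id"] > last_id]
    items = ordered[:limit]
    has_more = len(ordered) > limit
    next_cursor = encode_cursor(items[-1]["id"]) if has_more else None
    return {"items": items, "next_cursor": next_cursor}
```

Because each page resumes from the last seen `id` rather than from a numeric offset, rows inserted or deleted earlier in the ordering do not cause skipped or repeated items, which is the main reason cursors suit shifting datasets.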
 ### Filtering & Sorting
 For list endpoints, define:
 - Available filter parameters and their types
 - Filter combination rules (AND, OR, support for complex queries)
 - Sort fields and sort direction
 - Default sort order
 ### Authentication Boundary
 Define:
 - Which endpoints require authentication
 - Authentication mechanism (API key, JWT, OAuth, etc.)
 - Token scope requirements per endpoint
 - Rate limiting per authentication tier (when applicable)
 ### Error Response Format
 Define a consistent error response schema:
 ```json
 {
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message",
    "details": [
      {
        "field": "field_name",
        "code": "VALIDATION_ERROR",
        "message": "Specific error message"
      }
    ]
  }
 }
 ```
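A small helper can keep every handler on the same error envelope. The sketch below builds the schema shown above; the helper names and the example error codes are hypothetical, not part of any framework.

```python
import json
from typing import Any, Dict, List, Optional


def field_error(field: str, code: str, message: str) -> Dict[str, str]:
    """One entry for the `details` array."""
    return {"field": field, "code": code, "message": message}


def error_response(
    code: str,
    message: str,
    details: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Build the error envelope; `details` defaults to an empty list."""
    return {"error": {"code": code, "message": message, "details": list(details or [])}}


if __name__ == "__main__":
    body = error_response(
        "VALIDATION_FAILED",
        "Request validation failed",
        [field_error("email", "VALIDATION_ERROR", "must be a valid email address")],
    )
    print(json.dumps(body, indent=2))
```

Always emitting `details`, even when empty, is one possible choice; it lets clients iterate over the array without checking for its presence.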
 ### Versioning Strategy
 - Prefer URL path versioning (e.g., `/api/v1/`) for public APIs
 - Prefer header versioning for internal APIs when appropriate
 - Define breaking vs non-breaking change policy
 - Define deprecation timeline for old versions
 ## Non-REST APIs
 ### GraphQL
 - Define schema (types, queries, mutations, subscriptions)
 - Define resolver contracts
 - Define pagination model (cursor-based connections)
 - Define error handling in responses
 ### gRPC
 - Define service definitions in proto files
 - Define message types
 - Define streaming patterns
 - Define error status codes
 ### WebSocket
 - Define message schema (message types, payload formats)
 - Define connection lifecycle (connect, reconnect, disconnect)
 - Define authentication for initial connection
 - Define error handling within messages
 ## API Contract Anti-Patterns
 - Endpoints without a PRD functional requirement
 - Vague or inconsistent error response formats
 - Missing pagination on list endpoints
 - Authentication applied inconsistently
 - Breaking changes without versioning
 - Over-nested response structures
 - Exposing internal implementation details through API shape
--- a/skills/architecture-patterns/SKILL.md
+++ b/skills/architecture-patterns/SKILL.md
@ -0,0 +1,182 @@
 ---
 name: architecture-patterns
 description: "Knowledge contract for selecting architectural patterns based on requirements. Covers modular monolith, microservices, layered, clean, hexagonal, event-driven, CQRS, saga, and outbox patterns. Referenced by design-architecture when selecting patterns."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is selecting architectural patterns.
 ## Core Principle
 Choose patterns only when they solve a real problem identified in the PRD. Do not apply patterns because they are fashionable, because other projects use them, or because they might be needed someday.
 Every pattern choice must be traced to a specific PRD requirement or NFR. If no PRD requirement justifies a pattern, do not use it.
 ## Pattern Catalog
 ### Modular Monolith
 One deployment unit with well-defined internal modules.
 Use when:
 - Domain boundaries are still evolving
 - Team is small (roughly 5-8 engineers or fewer per boundary)
 - Deployment simplicity is a priority
 - The PRD does not require independent service scaling
 - You need the flexibility to split later when boundaries stabilize
 Avoid when:
 - Individual modules have vastly different scaling requirements
 - Independent deployment is a hard requirement
 - Teams need to own and deploy modules independently
 Trade-offs: +simplicity, +single deployment, +easy refactoring, -scaling granularity, -independent deployability
 ### Microservices
 Multiple independently deployable services, each with a single responsibility.
 Use when:
 - Individual services have different scaling requirements
 - Domain boundaries are stable and well-understood
 - Independent deployment of services is required
 - The PRD requires isolation for reliability or security
 - Teams need to own services end-to-end
 Avoid when:
 - Domain boundaries are not yet clear
 - Team size does not support operational overhead
 - Inter-service communication overhead is unjustified
 - The PRD does not require independent scaling or deployment
 Trade-offs: +independent deployment, +scaling granularity, +fault isolation, -operational complexity, -network overhead, -distributed data challenges
 ### Layered Architecture
 Organize code into horizontal layers (presentation, business, data).
 Use when:
 - The application is straightforward CRUD or simple business logic
 - The team is familiar with this pattern
 - There is no need for complex domain modeling
 Avoid when:
 - Business logic is complex and needs to be isolated from infrastructure
 - The application has varying persistence requirements
 - You need to swap infrastructure implementations
 Trade-offs: +simplicity, +familiarity, -tight coupling to infrastructure, -harder to test business logic in isolation
 ### Clean Architecture
 Organize code around use cases with dependency inversion, keeping business logic independent of frameworks and infrastructure.
 Use when:
 - Business logic is complex and must be protected from infrastructure changes
 - The application has multiple delivery mechanisms (API, CLI, web, mobile)
 - Testability is a top priority
 - Long-term maintainability is critical
 Avoid when:
 - The application is simple CRUD with minimal business logic
 - The team is small and infrastructure changes are unlikely
 - Overhead of indirection outweighs maintainability benefit
 Trade-offs: +testability, +independence from frameworks, +long-term maintainability, -indirection, -more files and interfaces
 ### Hexagonal Architecture (Ports & Adapters)
 Isolate business logic from external concerns through ports (interfaces) and adapters (implementations).
 Use when:
 - You need to swap external dependencies (databases, APIs, message queues)
 - You want to test business logic without external infrastructure
 - The application may have multiple input/output channels
 Avoid when:
 - The application has a single, stable external dependency
 - The indirection overhead is not justified by the project scale
 Trade-offs: +testability, +flexibility, +swappability, -indirection, -interface overhead
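A minimal sketch of the ports-and-adapters idea, assuming a hypothetical job-submission use case: `JobRepository` is the port, `InMemoryJobRepository` is one adapter, and `JobService` holds business logic that depends only on the port.

```python
from typing import Dict, Optional, Protocol


class JobRepository(Protocol):
    """Port: what the business logic needs from storage."""

    def save(self, job_id: str, status: str) -> None: ...

    def get_status(self, job_id: str) -> Optional[str]: ...


class InMemoryJobRepository:
    """Adapter: an in-memory implementation, useful for tests."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def save(self, job_id: str, status: str) -> None:
        self._rows[job_id] = status

    def get_status(self, job_id: str) -> Optional[str]:
        return self._rows.get(job_id)


class JobService:
    """Business logic; knows the port, not any concrete adapter."""

    def __init__(self, repo: JobRepository) -> None:
        self._repo = repo

    def submit(self, job_id: str) -> str:
        if self._repo.get_status(job_id) is not None:
            raise ValueError(f"job {job_id} already exists")
        self._repo.save(job_id, "pending")
        return "pending"
```

Swapping the in-memory adapter for a database-backed one would leave `JobService` untouched, which is the testability and swappability benefit listed in the trade-offs; the cost is the extra interface.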
 ### Event-Driven Architecture
 Components communicate through events rather than direct calls.
 Use when:
 - The PRD requires loose coupling between components
 - Multiple consumers need to react to the same event
 - Async processing is required (see `async-queue-design`)
 - Cross-service consistency is eventual (see `distributed-system-basics`)
 Avoid when:
 - The PRD requires strong consistency across services
 - The system is simple enough for direct calls
 - Event traceability and debugging overhead is not justified
 - The team lacks event-driven experience and the timeline is tight
 Trade-offs: +loose coupling, +scalability, +reactivity, -debugging complexity, -eventual consistency, -ordering challenges
 ### CQRS (Command Query Responsibility Segregation)
 Separate read models from write models.
 Use when:
 - Read and write patterns are vastly different (read-heavy, complex queries vs simple writes)
 - Read models need to be optimized differently from write models
 - The PRD requires different consistency or scaling for reads vs writes
 Avoid when:
 - Read and write patterns are similar
 - The added complexity of sync between models is not justified
 - The system is small enough that a single model suffices
 Trade-offs: +read optimization, +scaling, +query flexibility, -complexity, -eventual consistency between models, -sync logic
 ### Saga Pattern
 Manage distributed transactions across services using a sequence of local transactions with compensating actions.
 Use when:
 - A business process spans multiple services
 - Each service owns its own data and cannot participate in a distributed transaction
 - The PRD requires an all-or-nothing outcome across service boundaries, achieved through compensation rather than a single ACID transaction
 Avoid when:
 - The process fits within a single service
 - The PRD does not require cross-service atomicity
 - Compensating transactions are hard to define (irreversible operations like sending email)
 Trade-offs: +cross-service consistency, +service autonomy, -complexity, -compensation logic, -debugging difficulty
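The core mechanic of a saga, local steps paired with compensating actions that run in reverse order on failure, can be sketched as below. This is an in-process illustration with hypothetical names; a real saga coordinates separate services, persists its progress, and must cope with compensations that themselves fail.

```python
from typing import Callable, List, Tuple

# Each step is (action, compensation); both are hypothetical callables.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse and return False."""
    completed: List[Callable[[], None]] = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            # Roll back what already succeeded, most recent first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensate)
    return True
```

The failed step's own compensation is not run here because its action never completed; whether partial effects of a failed step need cleanup is something the design must decide per step.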
 ### Outbox Pattern
 Ensure reliable event publishing by writing events to an outbox table in the same transaction as the data change.
 Use when:
 - You need to publish events reliably when data changes
 - The message queue or event broker might be temporarily unavailable
 - At-least-once delivery is required because the system cannot lose events
 Avoid when:
 - Event loss is acceptable
 - The system does not publish events based on data changes
 - The added database write overhead is not justified
 Trade-offs: +reliability, +effectively exactly-once processing (when consumers are idempotent), -write overhead, -outbox polling or CDC complexity
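The following sketch shows the outbox write and a polling relay using Python's built-in `sqlite3` module. The table names, the `job.created` topic, and the `publish` callback are assumptions for illustration; a production relay would more likely use a real broker client or CDC rather than this loop.

```python
import json
import sqlite3
from typing import Any, Callable, Dict


def init_schema(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )


def create_job(conn: sqlite3.Connection, job_id: int) -> None:
    """Write the data change and its event in one transaction."""
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO jobs (id, status) VALUES (?, ?)", (job_id, "pending"))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("job.created", json.dumps({"job_id": job_id})),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, Dict[str, Any]], None]) -> int:
    """Publish pending outbox rows in order; a row is marked only after publish succeeds."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # if this raises, the row stays pending
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the process crashes after `publish` returns but before the row is marked, the event is sent again on the next pass. That is why this gives at-least-once delivery and why consumers need to be idempotent.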
 ## Pattern Selection Process
 1. Identify the specific PRD requirement or NFR that motivates a pattern
 2. List 2-3 candidate patterns that could address the requirement
 3. Evaluate each against the project context (team size, timeline, complexity tolerance)
 4. Select the simplest pattern that satisfies the requirement
 5. Document the decision as an ADR (refer to design-architecture template)
 ## Anti-Patterns
 - Applying CQRS to a simple CRUD application
 - Using microservices when boundaries are unclear
 - Using sagas for single-service transactions
 - Adding event-driven architecture for 1-to-1 communication
 - Applying clean architecture to a throwaway prototype
 - Choosing patterns based on resume appeal rather than requirements
--- a/skills/architecture-research/SKILL.md
+++ b/skills/architecture-research/SKILL.md
@ -0,0 +1,90 @@
 ---
 name: architecture-research
 description: "Research technical landscape, existing system constraints, and comparable system architectures before designing architecture. This is the Architect pipeline's first step, comparable to market-research in the PM pipeline."
 ---
 Use this skill when the PRD involves significant technical constraints, integration requirements, or architectural decisions that benefit from technical landscape understanding before design begins.
 ## Goals
 Use research to answer:
 - What existing systems, services, and infrastructure constrain this design?
 - What architectural patterns are proven in this problem domain?
 - What are the technical risks and trade-offs for candidate approaches?
 - What storage, scaling, and reliability decisions have been made by comparable systems?
 ## What To Research
 - Existing codebase architecture: service boundaries, data flow, communication patterns, technology stack
 - System constraints: latency requirements, scale expectations, compliance requirements, existing SLAs
 - Comparable system architectures: how similar problems were solved, what patterns succeeded or failed
 - Technology landscape: available options for storage, messaging, compute, and their trade-offs for this use case
 - Integration dependencies: upstream and downstream systems, contracts, protocols, versioning
 ## What Not To Do
 - Do not design architecture yet; this is research only
 - Do not make technology selections; catalog options and trade-offs only
 - Do not reverse-engineer competitor internal implementation details
 - Do not produce architecture decisions or recommendations
 - Do not write code, schemas, or API definitions
 - Do not break down tasks or create milestones
 ## Process
 1. Read the PRD file at `docs/prd/{feature}.md` to understand requirements
 2. Inspect the existing codebase for current architecture, service boundaries, and technology stack
 3. Identify technical constraints and integration dependencies from the PRD and codebase
 4. Research comparable system architectures and proven patterns for this problem domain
 5. Catalog technology options with trade-offs relevant to the PRD requirements
 6. Write a concise research brief
 ## Output
 Save research briefs to `docs/research/{date}-{topic}-architecture.md`.
 This file is an input artifact for downstream Architect stages:
 - `analyze-prd` may use it to identify architectural requirements and constraints
 - `design-architecture` may use it to inform pattern selection and technology decisions
 Use this format:
 ## Research Question
 What architectural question or constraint is being investigated?
 ## Existing System Context
 Current service boundaries, data flow, technology stack, and constraints discovered from codebase inspection.
 ## Comparable Architectures
 How similar problems have been solved, what patterns succeeded, what patterns failed, and why.
 ## Technical Constraints
 Latency, scale, compliance, integration, and infrastructure constraints that bound the design space.
 ## Technology Options
 Candidate technologies or approaches with trade-offs relevant to this use case. Present options, not decisions.
 ## Risks And Trade-offs
 Technical risks, unknowns, and trade-offs the architect must resolve during design.
 ## Implications For Architecture
 What this research means for architectural decisions the architect will make.
 ## Sources
 References, documentation, and evidence supporting the findings.
 ## Guidance
 - Prefer direct evidence from codebase inspection and documented architecture over speculation
 - Prefer 3-5 proven patterns over 20 theoretical possibilities
 - Call out confidence level when evidence is weak
 - Tie findings back to specific PRD requirements and NFRs
 - Do not make architecture decisions in this document; that belongs in `design-architecture`
--- a/skills/async-queue-design/SKILL.md
+++ b/skills/async-queue-design/SKILL.md
@ -0,0 +1,142 @@
 ---
 name: async-queue-design
 description: "Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing asynchronous workflows.
 ## Core Principle
 Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.
 ## When to Use Async
 Use async when:
 - The operation is long-running and cannot complete within the caller's timeout
 - The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
 - Multiple consumers need to react to the same event
 - Throughput requirements exceed synchronous processing capacity
 - Decoupling producer and consumer is architecturally necessary (see `system-decomposition`)
 - The PRD requires eventual consistency across service boundaries
 Do NOT use async when:
 - The operation is fast enough for synchronous handling
 - The caller needs an immediate result
 - The system is simple enough that direct calls suffice
 - Async adds complexity without a corresponding PRD requirement
 ## Queue/Topic Design
 For each queue or topic, define:
 - Name and purpose (traced to PRD requirement)
 - Producer service(s)
 - Consumer service(s)
 - Message schema (payload format, headers, metadata)
 - Ordering guarantee (per-partition ordered, unordered)
 - Delivery guarantee (at-least-once by default; exactly-once only where the PRD requires it)
 - Retention policy (how long messages are kept)
 ### Topic vs Queue
 Use a topic (pub/sub) when:
 - Multiple independent consumers need the same event
 - Consumers have different processing logic
 - Adding new consumers should not require changes to the producer
 Use a queue (point-to-point) when:
 - Exactly one consumer should process each message
 - Work distribution across instances of the same service is needed
 - Ordering within a partition matters
 ### Message Schema
 Define message schemas explicitly:
 - Message type or event name
 - Payload schema (with versioning strategy)
 - Metadata headers (correlation ID, causation ID, timestamp, source)
 - Schema evolution strategy (backward compatibility, versioning)
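
A minimal sketch of an explicit message envelope covering the items listed above. The class name `MessageEnvelope`, its field names, and the helper `parse_envelope` are hypothetical and chosen only for illustration; a real design would likely use whatever schema format the chosen broker or registry supports.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional, Set


@dataclass(frozen=True)
class MessageEnvelope:
    """Hypothetical envelope; field names are illustrative, not a standard."""

    event_type: str              # message type or event name
    schema_version: int          # version of the payload schema
    payload: Dict[str, Any]
    source: str                  # producing service
    correlation_id: str          # shared by all messages of one request
    causation_id: Optional[str] = None  # ID of the message that caused this one
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def parse_envelope(raw: str, supported_versions: Set[int]) -> MessageEnvelope:
    """Reject versions the consumer does not understand instead of guessing."""
    data = json.loads(raw)
    if data["schema_version"] not in supported_versions:
        raise ValueError(f"unsupported schema_version {data['schema_version']}")
    return MessageEnvelope(**data)
```

Rejecting unknown versions outright is one possible evolution policy; a consumer that must tolerate newer producers could instead ignore unknown fields.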
 ## Retry Strategy
 For each async operation, define:
 ### Retry Parameters
 - Maximum retries: typically 3-5 for transient failures
 - Backoff strategy:
  - Fixed interval: simple but may overwhelm recovering service
  - Exponential backoff: recommended default, increasingly longer waits
  - Exponential backoff with jitter: prevents thundering herd
 - Retry budget: maximum concurrent retries per consumer to prevent cascading failure
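
As an illustration of the backoff options above, here is a small sketch of exponential backoff with "full" jitter, where the delay is drawn uniformly between zero and the capped exponential ceiling. The function name `backoff_delay` and the default `base` and `cap` values are assumptions for the example, not recommended settings.

```python
import random
from typing import Optional


def backoff_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Return a delay in seconds for the given zero-based retry attempt.

    The ceiling doubles with each attempt and is capped at `cap`; the actual
    delay is uniform in [0, ceiling] so that consumers do not retry in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)
```

Passing a seeded `random.Random` makes the delays reproducible in tests.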
 ### What to Retry
 - Transient network errors
 - Temporary resource unavailability (503, timeouts)
 - Rate limit exceeded (429, with backoff and Retry-After header)
 - Upstream service failures (502, 504)
 ### What NOT to Retry
 - Business rule violations (non-retryable error codes)
 - Malformed messages (bad schema, missing required fields)
 - Permanent failures (authentication errors, not-found errors)
 - Messages that have exceeded maximum retries (route to DLQ)
 ## Dead-Letter Queue (DLQ) Strategy
 For each queue/topic with retry, define:
 - DLQ name (e.g., `{original-queue}.dlq`)
 - Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
 - DLQ message retention policy
 - Alerting: when messages appear in DLQ, who is notified
 - Recovery process: how DLQ messages are inspected, fixed, and reprocessed
 DLQ design principles:
 - Every retryable queue MUST have a DLQ
 - DLQ messages must include original message, error details, and retry count
 - DLQ must be monitored and alerted on; silent DLQs are a failure mode
 - Recovery from DLQ may require manual intervention or a replay mechanism
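
The routing condition and the required DLQ message contents can be sketched as follows. `PermanentError`, `DeadLetter`, `dlq_name`, and `route_failure` are hypothetical names introduced for this example; the sketch only decides where a failed message goes and does not talk to any real broker.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


class PermanentError(Exception):
    """Hypothetical marker for non-retryable failures (e.g. malformed message)."""


@dataclass
class DeadLetter:
    original_message: Dict[str, Any]
    error: str
    retry_count: int
    dlq: str


def dlq_name(queue: str) -> str:
    # Follows the `{original-queue}.dlq` naming convention.
    return f"{queue}.dlq"


def route_failure(
    message: Dict[str, Any],
    error: Exception,
    retry_count: int,
    max_retries: int,
    queue: str,
) -> Tuple[str, Optional[DeadLetter]]:
    """Return ("retry", None) or ("dlq", DeadLetter) for a failed message."""
    if isinstance(error, PermanentError) or retry_count >= max_retries:
        dead = DeadLetter(message, repr(error), retry_count, dlq_name(queue))
        return "dlq", dead
    return "retry", None
```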
 ## Ordering Guarantees
 For each queue/topic, explicitly state the ordering guarantee:
 - **Per-partition ordered**: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
 - **Unordered**: No ordering guarantee across messages. Use when operations are independent.
 - **Globally ordered**: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).
 If ordering is required:
 - Define the partition key (e.g., `user_id`, `order_id`)
 - Define how out-of-order delivery is handled when it occurs
 - Define whether strict ordering or best-effort ordering is acceptable
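
A rough sketch of the two ordering decisions above: mapping a partition key to a partition, and one possible way to handle out-of-order delivery. It assumes the producer attaches a per-key, monotonically increasing sequence number, which the source does not mandate; `partition_for` and `SequenceGuard` are hypothetical names. A stable digest is used because Python's built-in `hash()` for strings is randomized per process by default.

```python
import hashlib
from typing import Dict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a partition key (e.g. a user_id) to a stable partition index."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class SequenceGuard:
    """Drop messages that arrive late or twice for the same key.

    Assumes each message carries a per-key sequence number that only grows.
    """

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def accept(self, key: str, sequence: int) -> bool:
        if sequence <= self._last_seen.get(key, -1):
            return False  # stale or duplicate; the caller decides what to do
        self._last_seen[key] = sequence
        return True
```

Dropping stale messages is only appropriate when later messages supersede earlier ones; buffering and reordering is the alternative when every message must be applied.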
 ## Timeout Behavior
 For each async operation, define:
 - Processing timeout: maximum time a consumer may take to process a message
 - Visibility timeout: how long a message is invisible to other consumers while being processed
 - What happens on timeout:
  - Message is returned to the queue for retry (if below max retries)
  - Message is routed to DLQ (if max retries exceeded)
  - Alerting is triggered for operational visibility
 Timeout design principles:
 - Always set timeouts; no infinite waits
 - Timeout values must be based on observed processing times, not guesses
 - Document timeout values and adjust based on production metrics
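
One way to derive a timeout from observed processing times rather than a guess is to take a high percentile and add headroom. This is a sketch under assumptions: the function name `suggest_timeout`, the choice of the 99th percentile, and the 1.5x headroom factor are all illustrative and should be tuned against production metrics.

```python
import statistics
from typing import Sequence


def suggest_timeout(
    durations_s: Sequence[float],
    percentile: int = 99,
    headroom: float = 1.5,
) -> float:
    """Suggest a processing timeout from observed durations (in seconds)."""
    if len(durations_s) < 2:
        raise ValueError("need at least two observations")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    cuts = statistics.quantiles(durations_s, n=100, method="inclusive")
    return cuts[percentile - 1] * headroom
```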
 ## Cancellation
 Define whether async operations can be cancelled and how:
 - Cancellation signal mechanism (cancel event, status field, cancel API)
 - What happens to in-progress work when cancellation is received
 - Whether cancellation is best-effort or guaranteed
 - How cancellation is reflected in the operation status
 ## Anti-Patterns
 - Making operations async without a PRD requirement
 - Not defining a DLQ for retryable queues
 - Setting infinite timeouts or no timeouts
 - Assuming global ordering when per-partition ordering suffices
 - Not versioning message schemas
 - Processing messages without idempotency (see `idempotency-design`)
 - Ignoring backpressure when consumers are overwhelmed
--- a/skills/challenge-architecture/SKILL.md
+++ b/skills/challenge-architecture/SKILL.md
@ -0,0 +1,181 @@
 ---
 name: challenge-architecture
 description: "Stress-test architecture decisions, check PRD traceability, detect over-engineering, and validate storage and pattern selections. Comparable to grill-me in the PM pipeline."
 ---
 Interview the architect relentlessly about every aspect of this architecture until it passes quality gates. Walk down each branch of the architecture decision tree, validating traceability, necessity, and soundness one-by-one.
 Focus on system design validation, not implementation details. If a question drifts into code-level patterns, library choices, or implementation specifics, redirect it back to architecture-level concerns.
 **Announce at start:** "I'm using the challenge-architecture skill to validate and stress-test the architecture."
 Ask the questions one at a time.
 ## Primary Input
 - `docs/architecture/{feature}.md`
 - `docs/prd/{feature}.md`
 ## Primary Output
 - Updated `docs/architecture/{feature}.md`
 ## Process
 ### Phase 1: Traceability Audit
 For every architectural element, verify it traces back to at least one PRD requirement:
 - Does every API endpoint serve a PRD functional requirement?
 - Does every DB table serve a data requirement from functional requirements or NFRs?
 - Does every service boundary serve a domain responsibility from the PRD scope?
 - Does every async flow serve a PRD requirement?
 - Does every error handling strategy serve a PRD edge case or NFR?
 - Does every idempotency design serve a PRD requirement?
 Flag any architectural element that exists without PRD traceability as **potential over-engineering**.
 ### Phase 2: Requirement Coverage Audit
 For every PRD requirement, verify it is covered by the architecture:
 - Does every functional requirement have at least one architectural component serving it?
 - Does every NFR have at least one architectural decision addressing it?
 - Does every edge case have an error handling strategy?
 - Does every acceptance criterion have architectural support?
 - Are there PRD requirements that the architecture does not address?
 Flag any uncovered PRD requirement as a **gap**.
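
The two audits above are set comparisons in opposite directions, which a short sketch can make concrete. The function `audit_traceability`, the element names, and the requirement IDs such as `FR-1` are hypothetical; the source does not prescribe an ID scheme.

```python
from typing import Dict, List, Set, Tuple


def audit_traceability(
    elements: Dict[str, Set[str]],
    requirements: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (untraced_elements, uncovered_requirements).

    `elements` maps each architectural element to the PRD requirement IDs it
    claims to serve. An element with no valid requirement is potential
    over-engineering; a requirement served by no element is a gap.
    """
    untraced = sorted(
        name for name, served in elements.items() if not (served & requirements)
    )
    covered: Set[str] = set().union(*elements.values())
    gaps = sorted(requirements - covered)
    return untraced, gaps
```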
 ### Phase 3: Architecture Decision Validation
 For each Architectural Decision Record, challenge:
 - Is the decision necessary, or could a simpler approach work?
 - Are the alternatives fairly evaluated, or is there a strawman?
 - Is the rationale specific to this use case, or generic boilerplate?
 - Are the consequences honestly assessed?
 - Does the decision optimize for maintainability, scalability, reliability, clarity, and bounded responsibilities?
 - Does the decision avoid over-engineering, premature microservices, unnecessary abstractions, and implementation leakage?
 ### Phase 4: Knowledge Domain Review
 For each relevant knowledge domain, validate the architecture:
 #### System Decomposition
 - Are service boundaries aligned with domain responsibilities?
 - Is each service's responsibility single and well-defined?
 - Are there cyclic dependencies?
 - Is coupling minimized while cohesion is maximized?
 #### API & Contract Design
 - Are API contracts complete and unambiguous?
 - Are status codes appropriate and consistent?
 - Is pagination defined for list endpoints?
 - Are error responses consistent?
 #### Data Modeling
 - Are indexes justified by query patterns?
 - Are relationships properly modeled?
 - Is data ownership clear (each data item owned by exactly one service)?
 - Is denormalization intentional and justified?
 #### Distributed System Basics
 - Are retry semantics clearly defined?
 - Is timeout behavior specified?
 - Is partial failure handled?
 - Are consistency guarantees explicit?
 #### Architecture Patterns
 - Is each pattern necessary for the PRD requirements?
 - Are patterns applied because they solve a real problem, not because they are fashionable?
 - Is the chosen pattern the simplest option that works?
 #### Storage Knowledge
 - Is each storage selection justified by query patterns, write patterns, consistency requirements, or scale expectations?
 - Is the storage choice the simplest option that meets requirements?
 - Are there cases where a simpler storage option would suffice?
 #### Async & Queue Design
 - Is asynchronicity justified by PRD requirements?
 - Are retry and DLQ strategies defined for every async operation?
 - Are ordering guarantees specified where needed?
 #### Error Model Design
 - Are error categories complete and non-overlapping?
 - Is the distinction between retryable and non-retryable errors clear?
 - Is partial failure behavior defined?
 - Are fallback strategies specified?
 #### Idempotency Design
 - Are idempotent operations correctly identified from PRD requirements?
 - Is the idempotency key strategy complete (source, format, TTL, storage)?
 - Is duplicate request behavior specified?
 ### Phase 5: Over-Engineering Detection
 Check for common over-engineering patterns:
 - Services that could be modules
 - Patterns applied "just in case" without PRD justification
 - Storage choices that exceed what the requirements demand
 - Async processing where sync would suffice
 - Abstraction layers that add complexity without solving a real problem
 - Idempotency on operations that do not need it
 - Error handling complexity disproportionate to the risk
 ### Phase 6: Under-Engineering Detection
 Check for common under-engineering patterns:
 - Missing error handling for edge cases identified in the PRD
 - Missing idempotency for operations the PRD marks as requiring it
 - Missing NFR accommodations (scaling, latency, availability)
 - Missing async processing for operations that the PRD requires to be non-blocking
 - Missing security boundaries or authentication where the PRD requires it
 - Missing observability for critical operations
 ## Validation Checklist
 After challenging, verify the architecture satisfies:
 1. Every architectural element traces to at least one PRD requirement
 2. Every PRD requirement is covered by at least one architectural element
 3. Every ADR is necessary, well-reasoned, and honestly assessed
 4. No over-engineering without PRD justification
 5. No under-engineering for PRD-identified requirements
 6. All 9 architecture sections are present and substantive (or explicitly N/A with reason)
 7. Service boundaries are aligned with domain responsibilities
 8. API contracts are complete and consistent
 9. Data model is justified by query and write patterns
 10. Storage selections are the simplest option that meets requirements
 11. Async processing is justified by PRD requirements
 12. Error model covers all PRD edge cases
 13. Idempotency is applied where the PRD requires it, and not where it does not
 ## Outcomes
 For each issue found:
 1. Document the issue
 2. Propose a fix
 3. Apply the fix to the architecture document
 4. Re-verify the fix against the PRD
 After all issues are resolved, the architecture is ready for handoff to the Planner.
 ## Guardrails
 This is a pure validation skill.
 Do:
 - Challenge architectural decisions with evidence
 - Validate traceability to PRD requirements
 - Detect over-engineering and under-engineering
 - Propose specific fixes for identified issues
 Do not:
 - Change PRD requirements or scope
 - Design architecture from scratch
 - Make implementation-level decisions
 - Break down tasks or create milestones
 - Write test cases
--- a/skills/data-modeling/SKILL.md
+++ b/skills/data-modeling/SKILL.md
@ -0,0 +1,142 @@
 ---
 name: data-modeling
 description: "Knowledge contract for defining database schemas, partition keys, indexes, query patterns, denormalization strategy, TTL/caching, and data ownership. Referenced by design-architecture when designing data models."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing database schemas and data models.
 ## Core Principles
 - Data models must be driven by query and write patterns, not theoretical purity
 - Each table or collection must serve a clear purpose traced to PRD requirements
 - Indexes must be justified by identified query patterns
 - Data ownership must be unambiguous: each data item belongs to exactly one service
 ## Table Definitions
 For each table or collection, define:
 - Table name and purpose (traced to PRD requirement)
 - Column definitions:
  - Name
  - Data type
  - Nullable or not null
  - Default value (if any)
  - Constraints (unique, check, etc.)
 - Primary key
 - Foreign keys and relationships
 - Data volume estimates (when relevant for storage selection)
 ## Index Design
 Indexes must be justified by query patterns:
 - Identify the queries this table must support
 - Design indexes to cover those queries
 - Avoid speculative indexes "just in case"
 - Consider write amplification: every index slows writes
 Index justification format:
 - Index name
 - Columns (with sort direction)
 - Type (unique, non-unique, partial, composite)
 - Query pattern it serves
 - Estimated selectivity
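
To illustrate tying an index to a query pattern, the sketch below uses SQLite from the Python standard library purely as a convenient stand-in; the table `orders`, the index `idx_orders_user_created`, and the query are hypothetical, and other engines expose query plans differently.

```python
import sqlite3
from typing import List, Sequence

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " user_id TEXT NOT NULL,"
    " created_at TEXT NOT NULL,"
    " total_cents INTEGER NOT NULL)"
)
# Composite index serving the pattern "latest orders for one user".
conn.execute(
    "CREATE INDEX idx_orders_user_created ON orders (user_id, created_at DESC)"
)


def query_plan(db: sqlite3.Connection, sql: str, params: Sequence = ()) -> List[str]:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


plan = query_plan(
    conn,
    "SELECT id, total_cents FROM orders"
    " WHERE user_id = ? ORDER BY created_at DESC LIMIT 20",
    ("u-1",),
)
uses_index = any("idx_orders_user_created" in detail for detail in plan)
```

Checking the plan for each documented query pattern is a cheap way to confirm that an index is actually used before it is justified in the architecture document.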
 ## Partition Keys
 When designing distributed data stores:
 - Partition key must distribute data evenly across nodes
 - Partition key should align with the most common access pattern
 - Consider hot partition risks
 - Define partition strategy (hash, range, composite)
 ## Relationships
 Define relationships explicitly:
 - One-to-one
 - One-to-many (with foreign key placement)
 - Many-to-many (with junction table)
 For each relationship:
 - Direction of access (which side queries the other)
 - Cardinality (exactly N, at most N, unbounded)
 - Nullability (is the relationship optional?)
 - Cascade behavior (what happens on delete?)
 ## Denormalization Strategy
 Denormalize when:
 - A query needs data from multiple entities and joins are expensive or unavailable
 - Read frequency significantly exceeds write frequency
 - The denormalized data has a clear source of truth that can be kept in sync
 Do not denormalize when:
 - The data changes frequently and consistency is critical
 - Joins are cheap and the data store supports them well
 - The denormalization creates complex synchronization logic
 - There is no clear source of truth
 For each denormalized field:
 - Identify the source of truth
 - Define the synchronization mechanism (eventual consistency, sync on read, sync on write)
 - Define the staleness tolerance
 ## TTL and Caching
 ### TTL (Time-To-Live)
 Define TTL for:
 - Ephemeral data (sessions, temporary tokens, idempotency keys)
 - Time-bounded data (logs, analytics, expired records)
 - Data that must be purged after a regulatory period
 For each TTL:
 - Duration and basis (absolute time, sliding window, last access)
 - Action on expiration (delete, archive, revoke)
 ### Caching
 Define caching for:
 - Frequently read, rarely written data
 - Computed aggregates that are expensive to recalculate
 - Data that is accessed across service boundaries
 For each cache:
 - Cache type (in-process, distributed, CDN)
 - Invalidation strategy (TTL-based, event-based, write-through)
 - Staleness tolerance
 - Cache miss behavior (stale-while-recompute, block-and-fetch)
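
A minimal in-process cache showing TTL-based invalidation and block-and-fetch miss behavior. `TTLCache` is a hypothetical class for illustration; it is not thread-safe and has no size limit, both of which a real cache would need to address.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-process cache with TTL-based expiry and block-and-fetch on miss."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, loader: Callable[[str], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh
        value = loader(key)  # miss or expired: block and fetch from the owner
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Event-based invalidation hook, e.g. called when the owner emits a change."""
        self._entries.pop(key, None)
```

Here the TTL doubles as the staleness tolerance: a reader can see a value that is at most `ttl_seconds` old unless `invalidate` is called sooner.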
 ## Data Ownership
 Each piece of data must have exactly one owner:
 - The owning service is the single source of truth
 - Other services access that data via the owner's API or events
 - No service reads directly from another service's data store
 - If data is needed in multiple places, replicate via events with a clear source of truth
 Data ownership format:
 | Data Item | Owning Service | Access Pattern | Replication Strategy |
 |----------|---------------|----------------|---------------------|
 | ... | ... | ... | ... |
 ## Query Pattern Analysis
 For each table, document:
 - Primary query patterns (by which columns/keys is data accessed)
 - Write patterns (insert-heavy, update-heavy, or mixed)
 - Read-to-write ratio (when relevant)
 - Consistency requirements (strong, eventual, or tunable)
 - Scale expectations (rows per day, rows total, growth rate)
 This analysis drives:
 - Index selection
 - Partition key selection
 - Storage engine selection
 - Denormalization decisions
 ## Anti-Patterns
 - Tables without a clear PRD requirement
 - Indexes without a documented query pattern
 - Shared tables across service boundaries
 - Premature denormalization without a read/write justification
 - Missing foreign key constraints where referential integrity is required
 - Data models that assume a specific storage engine without justification
--- a/skills/design-architecture/SKILL.md
+++ b/skills/design-architecture/SKILL.md
@ -0,0 +1,258 @@
 ---
 name: design-architecture
 description: "Design system architecture based on PRD requirements and analysis. This is the Architect pipeline's core step, comparable to write-a-prd in the PM pipeline. Produces the complete architecture document."
 ---
 This skill produces the complete architecture document for a feature.
 **Announce at start:** "I'm using the design-architecture skill to design the system architecture."
 ## Primary Inputs
 - `docs/prd/{feature}.md` (required)
 - `docs/architecture/{date}-{feature}-analysis.md` (from analyze-prd, optional)
 - `docs/research/{date}-{topic}-architecture.md` (from architecture-research, optional)
 ## Primary Output
 - `docs/architecture/{feature}.md`
 **Save architecture to:** `docs/architecture/{feature}.md`
 - (User preferences for architecture location override this default)
 ## Hard Gate
 Do NOT start this skill if the PRD has unresolved ambiguities that block architectural decisions. Resolve them with the PM first.
 ## Process
 You MUST complete these steps in order:
 1. **Read the PRD** end-to-end to understand all requirements
 2. **Read the analysis document** if available, to understand which knowledge domains are relevant
 3. **Read the research brief** if available, to inform technology and pattern selections
 4. **Design each architecture section** based on PRD requirements and relevant knowledge domains
 5. **Apply knowledge domains** as needed - reference relevant knowledge contracts during design:
   - `system-decomposition` when designing service boundaries
   - `api-contract-design` when defining API contracts
   - `data-modeling` when designing database schema
   - `distributed-system-basics` when dealing with distributed concerns
   - `architecture-patterns` when selecting architectural patterns
   - `storage-knowledge` when making storage technology decisions
   - `async-queue-design` when designing asynchronous workflows
   - `error-model-design` when defining error handling
   - `idempotency-design` when designing idempotent operations
 6. **Ensure traceability** - every architectural decision must trace back to at least one PRD requirement
 7. **Run the completeness check** - verify all required sections are present and substantive
 8. **Write the architecture document** to `docs/architecture/{feature}.md`
 ## Architecture Document Template
 ```markdown
 # Architecture: {Feature Name}
 ## System Overview
 High-level description of the system architecture. Map every major PRD requirement to an architectural component. Show component relationships and data flow direction. Identify external system integrations. Document deployment topology when relevant.
 Use text or ASCII diagrams for component relationships.
 ### Requirement Traceability
 | PRD Requirement | Architectural Component |
 |----------------|------------------------|
 | ... | ... |
 ## Frontend Architecture
 Define frontend architecture including component structure, state management, and rendering strategy. If the feature has no frontend component, write `N/A` with a brief reason.
 ### Component Hierarchy
 ### State Management
 ### Routing Structure
 ### Client-Side Caching
 ## Backend Architecture
 Define backend architecture including service layers, module boundaries, and dependency flow. This section MUST be present for all features with backend implications.
 ### Service/Module Boundaries
 ### Layer Separation
 ### Dependency Flow
 ### Shared Utilities
 ## API Definitions
 Define all API endpoints with full specifications.
 For each endpoint:
 - Method and path
 - Request schema (headers, path params, query params, body)
 - Response schema (success and error responses)
 - Status codes
 - Authentication requirements
 - Idempotency requirements (when applicable)
 - Rate limiting expectations (when applicable)
 - PRD functional requirement it satisfies
 ### Endpoint Catalog
 | Method | Path | Description | PRD Requirement |
 |--------|------|-------------|-----------------|
 | ... | ... | ... | ... |
 ### Endpoint Details
 (Define each endpoint in detail)
 ## DB Schema
 Define all database tables, columns, indexes, constraints, and relationships. If the feature requires no database changes, write `N/A` with a brief reason.
 ### Table Definitions
 For each table:
 - Table name and purpose
 - Column definitions (name, type, constraints, defaults)
 - Indexes and their justification
 - Foreign key relationships
 - Data volume estimates (when relevant)
 ### Entity Relationships
 Describe relationships between tables.
 ### Migration Strategy
 Notes on migration approach if schema changes affect existing data.
 ## Service Boundaries
 Define service boundaries with clear responsibilities.
 For each service or module:
 - Name and single responsibility
 - Owned data
 - Communication patterns with other services (sync, async, event-driven)
 - Potential coupling points and mitigation
 ### Communication Matrix
 | From | To | Pattern | Protocol | Purpose |
 |------|----|---------|----------|---------|
 | ... | ... | ... | ... | ... |
 ## Async Model
 Define asynchronous operations and their behavior. If the feature has no asynchronous requirements, write `N/A` with a brief reason.
 ### Async Operations
 For each async operation:
 - Operation name and trigger
 - Queue or event topic
 - Producer and consumer
 - Retry policy (max retries, backoff, DLQ)
 - Ordering guarantees
 - Timeout and cancellation behavior
 ## Error Model
 Define error handling strategy across the system.
 ### Error Categories
 - Client errors (4xx)
 - Server errors (5xx)
 - Business rule violations
 - Timeout errors
 - Cascading failure modes
 ### Error Propagation Strategy
 - Fail-fast vs graceful degradation vs circuit breaker
 - Fallback behavior
 ### Error Response Format
 Consistent error response schema across the system.
 ### Observability Hooks
 - Logging strategy
 - Metrics to track
 - Alerting thresholds
 ### PRD Edge Case Mapping
 | Error Category | PRD Edge Case | Handling Strategy |
 |---------------|---------------|-------------------|
 | ... | ... | ... |
 ## Idempotency Design
 Define idempotent operations and their behavior. If the feature has no idempotency requirements, write `N/A` with a brief reason.
 For each idempotent operation:
 - Operation name
 - Idempotency key source and format
 - Key TTL and storage location
 - Duplicate request behavior
 - Collision handling
 ## Architectural Decision Records
 For each significant architectural decision:
 ### ADR-{N}: {Decision Title}
 - **Decision**: What was decided
 - **Context**: Why this decision was needed, including which PRD requirements drove it
 - **Alternatives**: What other options were considered
 - **Rationale**: Why this option was chosen
 - **Consequences**: What trade-offs or implications result
 ```
 ## Completeness Check
 Before finalizing the architecture document, verify:
 1. Every PRD functional requirement is traced to at least one architectural component
 2. Every PRD NFR is traced to at least one architectural decision
 3. All 9 required sections are present (or explicitly marked N/A with reason)
 4. Every architecture section that is not N/A has substantive content
 5. All API endpoints map to PRD functional requirements
 6. All DB tables map to data requirements from functional requirements or NFRs
 7. All async flows map to PRD requirements
 8. All error handling strategies map to PRD edge cases
 9. ADRs exist for all significant decisions
 10. No architectural element exists without traceability to a PRD requirement
 Add explicit detail for these when relevant:
 - Security boundaries and authentication
 - Scalability considerations
 - Performance-critical paths
 - Data consistency requirements
 ## Guardrails
 This is a pure Architecture skill.
 Do:
 - Design system structure and boundaries
 - Define API contracts and data models
 - Define error handling, retry, and idempotency strategies
 - Make architectural decisions with clear rationale and alternatives
 - Ensure traceability to PRD requirements
 Do not:
 - Change PRD requirements or scope
 - Create task breakdowns, milestones, or deliverables
 - Write test cases or test plans
 - Write implementation code or pseudocode
 - Choose specific libraries or frameworks at the implementation level
 - Prescribe code patterns, class structures, or function-level logic
 The Architect defines HOW the system is structured.
 Engineering defines HOW the code is written.
 ## Transition
 After completing the architecture document, invoke `challenge-architecture` to validate and stress-test the architecture.
--- a/skills/distributed-system-basics/SKILL.md
+++ b/skills/distributed-system-basics/SKILL.md
@ -0,0 +1,163 @@
 ---
 name: distributed-system-basics
 description: "Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing for distributed system concerns.
 ## Delivery Guarantees
 ### At-Most-Once
 - Message may be lost but never delivered twice
 - Use when: loss is acceptable, retries are not, and throughput is the priority
 - Trade-off: simplicity and speed at the cost of reliability
 ### At-Least-Once
 - Message is never lost but may be delivered more than once
 - Use when: loss is unacceptable, consumers are idempotent or can deduplicate
 - Trade-off: reliability at the cost of requiring idempotency handling
 - Most common default for production systems
 ### Exactly-Once
 - Message is delivered once and only once
 - Use when: duplicates are harmful and idempotency is hard or impossible
 - Trade-off: significant complexity, performance overhead, and coordination cost
 - Often achieved via idempotency + at-least-once delivery rather than a true exactly-once protocol
 Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
 ## Retry Behavior
 ### When to Retry
 - Transient network failures
 - Temporary resource unavailability (503, timeouts)
 - Rate limit exceeded (429, with backoff)
 - Upstream service failures (502, 504)
 ### When NOT to Retry
 - Client errors (400, 401, 403, 404, 422)
 - Business rule violations
 - Malformed requests
 - Non-retryable error codes explicitly defined in the API contract
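
The retry and no-retry lists above can be expressed as a small classification helper. This is a sketch: `should_retry` and `retry_after_seconds` are hypothetical names, status codes not listed above are treated as non-retryable by assumption, and only the delay-in-seconds form of the `Retry-After` header is parsed.

```python
from typing import Mapping, Optional

RETRYABLE_STATUS = frozenset({429, 502, 503, 504})
NON_RETRYABLE_STATUS = frozenset({400, 401, 403, 404, 422})


def should_retry(status: int, attempt: int, max_retries: int = 3) -> bool:
    """Decide whether to retry after `attempt` retries have already been made."""
    if attempt >= max_retries:
        return False
    # Assumption: anything not explicitly retryable is not retried.
    return status in RETRYABLE_STATUS


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Parse a Retry-After header given as a number of seconds."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch
```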
 ### Retry Strategy Parameters
 - Maximum retries: define per operation (typically 2-5)
 - Backoff strategy:
  - Fixed interval: predictable but may overwhelm recovering service
  - Exponential backoff: increasingly longer waits (recommended default)
  - Exponential backoff with jitter: adds randomness to avoid thundering herd
 - Retry budget: limit total retries per time window to prevent cascading failure
 ### Retry Anti-Patterns
 - Retrying non-idempotent operations without deduplication
 - Infinite retries without a circuit breaker
 - Synchronous retries that block the caller indefinitely
 - Ignoring Retry-After headers
 ## Duplicate Requests
 Duplicates arise from:
 - Network retries
 - Client timeouts with successful server processing
 - Message queue redelivery
 - User double-submit
 Handling strategies:
 - Idempotency keys (preferred for API operations)
 - Deduplication at consumer level (for event processing)
 - Natural idempotency (read operations, certain write patterns)
 - Idempotency is covered in detail in the `idempotency-design` knowledge contract
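
A minimal illustration of the idempotency-key strategy: the first request with a given key runs the operation, and later requests with the same key get the stored result back. `IdempotencyStore` is a hypothetical in-memory class; a real design would also need TTL, durable storage, and handling of concurrent requests with the same key.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotencyStore:
    """In-memory sketch: replay the stored result for a repeated key."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def execute(self, key: str, operation: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (result, was_duplicate)."""
        if key in self._results:
            return self._results[key], True
        result = operation()
        self._results[key] = result
        return result, False
```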
 ## Timeout vs Failure
 ### Timeout
 - The operation may have succeeded; you just do not know
 - Must be handled as "unknown state", not "failed state"
 - Requires idempotency or state reconciliation
 ### Failure
 - The operation definitively did not succeed
 - Can be safely retried, provided the error itself is retryable (see `error-model-design`)
 Design implications:
 - Always distinguish between timeout and confirmed failure
 - For timeouts, retry with idempotency or check state before retrying
 - Define timeout values per operation type (short for interactive, long for batch)
 - Document timeout values in API contracts
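 One way to keep timeout and confirmed failure apart in code is to classify every call into three outcomes rather than two. This is a sketch only: it assumes the hypothetical `operation` raises `TimeoutError` on timeout and raises other exceptions only when the failure is confirmed, which real clients do not always guarantee.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"      # confirmed: the operation did not happen
    UNKNOWN = "unknown"    # timed out: it may or may not have happened


def call_with_outcome(operation):
    """Run `operation` and classify the result without conflating timeout and failure."""
    try:
        operation()
        return Outcome.SUCCEEDED
    except TimeoutError:
        return Outcome.UNKNOWN
    except Exception:
        # Assumption: any other exception is a confirmed failure.
        return Outcome.FAILED


def next_action(outcome, is_idempotent):
    """Decide the follow-up step for a classified outcome."""
    if outcome is Outcome.SUCCEEDED:
        return "done"
    if outcome is Outcome.FAILED:
        return "retry"
    # UNKNOWN: retry blindly only when the operation is idempotent.
    return "retry" if is_idempotent else "check_state_first"
```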
 ## Partial Failure
 Partial failure occurs when:
 - A multi-step operation fails after some steps succeed
 - A batch operation partially succeeds
 - An upstream dependency fails mid-transaction
 Handling strategies:
 - Compensating transactions (saga pattern) for multi-service operations
 - Partial success responses (207 Multi-Status for batch operations)
 - Atomic operations where possible (single-service transactions)
 - Outbox pattern for ensuring eventual consistency
 Design principles:
 - Define what "partial" means for each operation
 - Define whether partial success is acceptable or must be fully rolled back
 - Document recovery procedures for each partial failure scenario
 - Map partial failure scenarios to PRD edge cases
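 The compensating-transaction strategy can be illustrated with a small in-process sketch. It assumes each step is a pair of hypothetical callables `(action, compensation)` and that compensations themselves do not fail; a production saga would also persist progress and retry compensations.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; undo completed steps on failure.

    Returns True if every action succeeded, False if the saga was rolled back.
    """
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Compensate in reverse order of completion.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

 The failing step's own compensation is never run, because that step did not complete.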
 ## Eventual Consistency
 Eventual consistency means:
 - Updates propagate asynchronously
 - Reads may return stale data for a bounded period
 - All replicas eventually converge
 When to use:
 - Cross-service data synchronization
 - Read replicas and caching
 - Event-driven architectures
 - High-write-throughput scenarios where low write latency matters more than immediately consistent reads
 When NOT to use:
 - Financial balances where immediate consistency is required
 - Inventory counts where overselling is unacceptable
 - Authorization decisions where stale permissions are harmful
 - Any scenario the PRD marks as requiring strong consistency
 Design implications:
 - Define acceptable staleness bounds per data type
 - Define how consumers detect and handle stale data
 - Define convergence guarantees (time-bound, version-bound)
 - Document which data is eventually consistent and which is strongly consistent
 ## Ordering Guarantees
 ### Per-Partition Ordering
 - Messages within a single partition or queue are ordered
 - Use when: operation sequence matters within a context (e.g., per user, per order)
 - Ensure: partition key is set to the context identifier
 ### Global Ordering
 - All messages across all partitions are ordered
 - Use when: global sequence matters (rare)
 - Trade-off: severely limits throughput and availability
 - Avoid unless the PRD explicitly requires it
 ### No Ordering Guarantee
 - Messages may arrive in any order
 - Use when: operations are independent and order does not matter
 - Ensure: consumers can handle out-of-order delivery
 Define ordering guarantees per queue/topic:
 - State the guarantee clearly
 - Define the partition key if per-partition ordering is used
 - Define how out-of-order delivery is handled when ordering is expected but not guaranteed
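 The two mechanics mentioned above, choosing a partition from a context identifier and handling out-of-order delivery, might look like the following sketch. It assumes producers attach a monotonically increasing sequence number per key and that stale events can simply be dropped; `partition_for` and `SequenceGuard` are illustrative names, not part of any broker's API.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a context identifier (e.g. an order ID) to a stable partition index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


class SequenceGuard:
    """Reject events that arrive out of order or more than once, per key."""

    def __init__(self):
        self._last_seen = {}

    def accept(self, key, sequence):
        # Accept only events newer than the last one applied for this key.
        if sequence <= self._last_seen.get(key, -1):
            return False
        self._last_seen[key] = sequence
        return True
```

 Dropping stale events suits "latest state wins" updates. If every event must be applied, a consumer would buffer and reorder instead.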
 ## Anti-Patterns
 - Assuming network calls never fail
 - Retrying without idempotency
 - Treating a timeout as a confirmed failure
 - Ignoring partial failure scenarios
 - Requiring global ordering when per-partition ordering would suffice
 - Using strong consistency when eventual consistency would suffice
 - Using eventual consistency when the PRD requires strong consistency
--- a/skills/error-model-design/SKILL.md
+++ b/skills/error-model-design/SKILL.md
@ -0,0 +1,196 @@
 ---
 name: error-model-design
 description: "Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining error handling strategy.
 ## Core Principle
 Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.
 ## Error Categories
 ### Client Errors (4xx)
 Errors caused by the client sending invalid or incorrect requests.
 Common client errors:
 - `400 Bad Request` - malformed request body, missing required fields
 - `401 Unauthorized` - missing or invalid authentication
 - `403 Forbidden` - authenticated but not authorized for this resource
 - `404 Not Found` - requested resource does not exist
 - `409 Conflict` - state conflict (duplicate, version mismatch, business rule violation)
 - `422 Unprocessable Entity` - valid format but business rule violation
 - `429 Too Many Requests` - rate limit exceeded
 Design principles:
 - Client errors are non-retryable (unless 429 with Retry-After)
 - Error response must include enough detail for the client to correct the request
 - Error codes should be consistent and documented in the API contract (see `api-contract-design`)
 ### Server Errors (5xx)
 Errors caused by the server failing to process a valid request.
 Common server errors:
 - `500 Internal Server Error` - unexpected server failure
 - `502 Bad Gateway` - upstream service failure
 - `503 Service Unavailable` - temporary unavailability
 - `504 Gateway Timeout` - upstream service timeout
 Design principles:
 - Server errors may be retryable (see retryable vs non-retryable)
 - Error response should not leak internal details in production
 - All unexpected server errors must be logged and alerted
 - Circuit breakers should protect against cascading server errors
 ### Business Rule Violations
 Errors where the request is valid but violates a business rule.
 Design principles:
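 - Use 409 when the violation stems from the current state of the resource (duplicate, version mismatch); use 422 when the request content itself breaks a rule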
 - Include the specific business rule that was violated
 - Include enough context for the client to understand and correct the issue
 - Map each business rule violation to a PRD functional requirement
 ### Timeout Errors
 Errors where an operation did not complete within the expected time.
 Design principles:
 - Always distinguish timeout from confirmed failure
 - Timeout means "unknown state" not "failed"
 - Define timeout values per operation type
 - Document recovery procedures for timed-out operations
 - See `distributed-system-basics` for timeout vs failure handling
 ### Cascading Failures
 Failures that propagate from one service to another, potentially bringing down the entire system.
 Design principles:
 - Use circuit breakers to stop cascade propagation
 - Use bulkheads to isolate failure domains
 - Define fallback behavior for each dependency failure
 - Monitor and alert on circuit breaker state changes
 ## Error Propagation Strategy
 ### Fail-Fast
 Immediately return an error to the caller when a dependency fails.
 Use when:
 - The caller cannot proceed without the dependency
 - Partial data is worse than no data
 - The PRD requires immediate feedback
 ### Graceful Degradation
 Continue serving reduced functionality when a dependency fails.
 Use when:
 - The PRD allows partial functionality
 - Some data is better than no data
 - The feature has a clear fallback path
 Define for each graceful degradation:
 - What functionality is reduced
 - What the user sees instead
 - How the system recovers when the dependency returns
 ### Circuit Breaker
 Stop calling a failing dependency after a threshold of failures, allowing it time to recover.
 Define for each circuit breaker:
 - Failure threshold (how many failures before opening)
 - Recovery timeout (how long before trying again)
 - Half-open behavior (how many requests to allow during recovery)
 - Fallback behavior when circuit is open
 Use when:
 - A dependency is experiencing persistent failures
 - Continuing to call will make things worse (cascading failure risk)
 - The system can operate with reduced functionality
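 The four circuit breaker parameters listed above map onto a small state machine. The sketch below is single-threaded and assumes that one trial call is allowed in the half-open state; the class and parameter names are illustrative, and a concurrent implementation would need locking around the state changes.

```python
import time


class CircuitBreaker:
    """Minimal closed / open / half-open breaker with an injectable clock."""

    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                return fallback()  # open: skip the dependency entirely
            # Recovery timeout elapsed: half-open, let this call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```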
 ## Error Response Format
 Define a consistent error response format across the entire system:
 ```json
 {
  "error": {
    "code": "ERROR_CODE",
    "message": "Human-readable message describing what happened",
    "details": [
      {
        "field": "field_name",
        "code": "SPECIFIC_ERROR_CODE",
        "message": "Specific error description"
      }
    ],
    "request_id": "correlation-id-for-tracing"
  }
 }
 ```
 Design principles:
 - `code` is a machine-readable string constant (not HTTP status code)
 - `message` is human-readable and suitable for display or logging
 - `details` provides field-level validation errors when applicable
 - `request_id` enables cross-service error tracing
 - Never include stack traces, internal paths, or implementation details in production error responses
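 As an illustration of these principles, a translation layer can turn exceptions into the envelope while keeping internal details out of the body. This is a sketch under assumptions: `ValidationError`, `to_error_body`, and the codes `VALIDATION_FAILED` and `INTERNAL_ERROR` are hypothetical names chosen for the example.

```python
class ValidationError(Exception):
    """Hypothetical exception carrying one field-level validation problem."""

    def __init__(self, field, code, message):
        super().__init__(message)
        self.field, self.code, self.message = field, code, message


def to_error_body(exc, request_id):
    """Translate an exception into the error envelope without leaking internals."""
    if isinstance(exc, ValidationError):
        return {"error": {
            "code": "VALIDATION_FAILED",
            "message": "One or more fields are invalid",
            "details": [{"field": exc.field, "code": exc.code, "message": exc.message}],
            "request_id": request_id,
        }}
    # Unexpected error: generic text only; str(exc) belongs in server logs.
    return {"error": {
        "code": "INTERNAL_ERROR",
        "message": "An unexpected error occurred",
        "details": [],
        "request_id": request_id,
    }}
```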
 ## Retryable vs Non-Retryable Errors
 ### Retryable Errors
 - Server errors (500, 502, 503, 504) with backoff
 - Timeout errors with backoff
 - Rate limit errors (429) with Retry-After
 - Network connectivity errors
 ### Non-Retryable Errors
 - Client errors (400, 401, 403, 404, 409, 422)
 - Business rule violations
 - Malformed requests
 - Authentication failures
 Define per endpoint whether an error is retryable. Include this in the API contract.
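 The default classification above can be captured in a small helper that endpoints then override where their contract differs. The sketch assumes `Retry-After` arrives as a number of seconds (the header may also carry an HTTP date, which is not handled here); `retry_decision` and its default delay are illustrative.

```python
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def retry_decision(status, retry_after=None, default_delay=1.0):
    """Return (should_retry, delay_seconds) for an HTTP status code."""
    if status not in RETRYABLE_STATUS:
        return (False, 0.0)
    if status == 429 and retry_after is not None:
        # Honor the server's Retry-After value (assumed to be in seconds).
        return (True, float(retry_after))
    return (True, default_delay)
```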
 ## Partial Failure Behavior
 Define partial failure behavior for operations that span multiple steps or services:
 - **All-or-nothing**: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
 - **Best-effort**: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
 - **Compensating transaction (saga)**: Each step has a compensating action. If a step fails, previous steps are undone via compensation. Use for multi-service operations where atomicity is required but distributed transactions are not available.
 For each partial failure scenario:
 - Define what "partial" means in this context
 - Define whether partial success is acceptable or must be fully rolled back
 - Define the recovery procedure
 - Map to a PRD edge case
 ## Fallback Strategy
 For each external dependency, define:
 - What happens when the dependency is unavailable
 - Fallback behavior (cached data, default response, queue and retry, fail with user message)
 - How the system recovers when the dependency returns
 - SLA implications of the fallback
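 A fallback chain of "live call, then cached data, then default response" could be sketched as follows. It assumes a hypothetical `fetch` callable that raises on dependency failure and a plain dict as the cache; the returned `source` label is there so callers and metrics can tell degraded responses apart.

```python
def fetch_with_fallback(fetch, cache, key, default):
    """Return (value, source) where source is 'live', 'cache', or 'default'."""
    try:
        value = fetch(key)
    except Exception:
        if key in cache:
            return cache[key], "cache"
        return default, "default"
    cache[key] = value  # refresh the cache on every successful call
    return value, "live"
```

 Recovery is implicit in this sketch: the next successful call returns live data and refreshes the cache.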
 ## Observability
 For error model design, define:
 - What errors are logged (all unexpected errors, all server errors, sampled client errors)
 - What errors trigger alerts (server error rate, DLQ depth, circuit breaker state)
 - Error metrics (error rate by code, error rate by endpoint, p99 latency)
 - Request tracing (correlation IDs across service boundaries)
 Map observability requirements to PRD NFRs.
 ## Anti-Patterns
 - Returning generic 500 errors for all server failures
 - Not distinguishing timeout from failure
 - Ignoring partial failure scenarios
 - Leaking internal details in error responses
 - Using the same error handling strategy for all operations regardless of criticality
 - Not defining fallback behavior for external dependencies
 - Alerting on all errors instead of actionable thresholds
 - Using circuit breakers without fallback behavior
--- a/skills/idempotency-design/SKILL.md
+++ b/skills/idempotency-design/SKILL.md
@ -0,0 +1,165 @@
 ---
 name: idempotency-design
 description: "Knowledge contract for designing idempotent operations, idempotency keys, TTL, storage, duplicate behavior, and collision handling. Referenced by design-architecture when designing idempotency."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing idempotent operations.
 ## Core Principle
 Idempotency must be driven by PRD requirements. Do not add idempotency to operations that do not need it. Do not skip idempotency on operations that the PRD explicitly requires to be idempotent.
 Common PRD requirements that imply idempotency:
 - "The system must not create duplicates when the same request is submitted twice"
 - "Users should be able to retry failed submissions safely"
 - "Payment processing must be exactly-once"
 - "Webhook deliveries may be retried"
 ## Identifying Idempotent Operations
 An operation needs idempotency when:
 - The client may retry due to network timeout or failure
 - The operation has side effects that must not be duplicated (creating resources, charging money, sending notifications)
 - The PRD explicitly requires safe retry behavior
 - The operation is triggered by an unreliable delivery mechanism (webhooks, message queues)
 An operation is naturally idempotent when:
 - It is a read operation (GET, HEAD, OPTIONS)
 - It is a delete operation where deleting a non-existent resource returns 404 or 204
 - It is a PUT that fully replaces a resource (set state to X)
 - It is an operation where duplicated execution produces the same result
 ## Idempotency Key Strategy
 ### Key Source
 - Client-generated: the client provides a unique key (e.g., UUID, order reference). Preferred for API operations.
 - Deterministic: derived from request content (e.g., hash of user_id + action + parameters). Preferred when the client cannot provide a key.
 - System-generated: the server assigns a key. Only for internal operations where the client does not participate.
 ### Key Format
 - Define the key format explicitly (e.g., `UUID v7`, `{prefix}-{unique-identifier}`, `sha256(payload)`)
 - Keys must be unique across the entire scope of the operation
 - Keys must be reproducible: every retry of the same logical request must carry the same key (a client-generated key is created once and reused on each retry)
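 For the deterministic key source, one possible derivation hashes a canonical serialization of the request. This is a sketch that assumes the parameters are JSON-serializable; the function name and the choice of fields are illustrative.

```python
import hashlib
import json


def derive_idempotency_key(user_id, action, parameters):
    """Derive a deterministic key from request content."""
    # Canonical JSON: sorted keys and fixed separators, so that logically
    # equal requests serialize (and therefore hash) identically.
    canonical = json.dumps(
        {"user_id": user_id, "action": action, "parameters": parameters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

 Note that a content-derived key treats two intentionally identical requests (for example, ordering the same item twice) as duplicates, which is one reason client-generated keys are preferred for API operations.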
 ### Key Scope
 - Per-user: key is unique within the user's context
 - Per-resource-type: key is unique within the resource type (e.g., all payment creation)
 - Global: key is unique across the entire system
 Define the scope based on the PRD requirement. Tighter scope is preferred when possible.
 ## Idempotency Key Storage
 ### Where to Store
 - Database table (preferred for persistent idempotency)
  - Table: `idempotency_keys`
  - Columns: `key`, `operation_type`, `request_hash`, `response_hash`, `status`, `created_at`, `expires_at`
  - Index: unique index on `(key, operation_type)`
 - Redis (preferred for ephemeral idempotency with TTL)
  - Key: `idempotency:{operation_type}:{key}`
  - Value: serialized response or status reference
  - TTL: set to expire after the idempotency window
 ### Storage Decision Framework
 - Use database when: idempotency must survive restarts, keys must be queryable, audit trail is required
 - Use Redis when: idempotency is time-bounded, fast lookup is critical, keys can expire, persistence loss is acceptable
 ## TTL (Time-to-Live)
 Define for each idempotent operation:
 - TTL duration: how long duplicate detection is active
 - TTL basis: when does the clock start (key creation time, last access time)
 - TTL scope: does the key expire or is it permanent
 ### TTL Duration Guidelines
 - API operations: typically 24 hours (allows client retries within a day)
 - Payment operations: typically 30 days (matches settlement windows)
 - Webhook processing: typically 7 days (matches delivery retry windows)
 - Internal operations: match the operation's natural retry window
 ### TTL Behavior
 - After TTL expires, the key is removed and a new request with the same key is processed as a new operation
 - Define whether TTL is strictly enforced (hard delete) or softly enforced (soft delete, kept for audit)
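 The expiry behavior described above can be modeled with an in-memory stand-in. This is a sketch, not a storage recommendation: it assumes the TTL clock starts at key creation and uses hard expiry. In Redis, a comparable atomic "claim if absent, with expiry" is commonly done with `SET` using the `NX` and `EX` options.

```python
import time


class TtlKeyStore:
    """In-memory stand-in for an idempotency key store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expires_at = {}

    def claim(self, key, ttl_seconds):
        """Return True if the key was newly claimed, False if it is a live duplicate."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False
        # Missing or expired: treat as a new operation and restart the TTL.
        self._expires_at[key] = now + ttl_seconds
        return True
```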
 ## Duplicate Request Behavior
 When a duplicate request is detected (key already exists):
 ### During Processing
 - The original request is still being processed
 - Return `202 Accepted` with a status URL (for async operations)
 - Or return `409 Conflict` if the client should not retry yet
 ### After Successful Processing
 - Return the original successful response (stored or reconstructable)
 - Must return the same status code and response body as the original
 - This is the most common and recommended behavior
 ### After Failed Processing
 - If the original processing permanently failed, allow retry with the same key
 - If the original processing was interrupted (timeout, crash), allow retry with the same key
 - Document this in the API contract; if an operation deviates and requires a new key after a failure, state so explicitly
 Define for each idempotent operation:
 - What the client receives when submitting a duplicate during processing
 - What the client receives when submitting a duplicate after success
 - What the client receives when submitting a duplicate after failure
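 The three cases can be expressed as a single decision function over the stored record. The record layout here (`status`, `status_url`, `response_status`, `response_body`) is an assumption for illustration, and the sketch picks the `202 Accepted` variant for in-flight requests.

```python
def duplicate_action(record):
    """Decide how to answer a request whose idempotency key already exists.

    `record` is the stored entry: a dict with "status" set to "processing",
    "succeeded", or "failed", plus the fields used below (all hypothetical).
    Returns (action, http_status, body).
    """
    status = record["status"]
    if status == "processing":
        # Original still running: point the client at a status resource.
        return ("wait", 202, {"status_url": record["status_url"]})
    if status == "succeeded":
        # Replay the original response unchanged.
        return ("replay", record["response_status"], record["response_body"])
    # Failed or interrupted: let the caller re-execute under the same key.
    return ("reprocess", None, None)
```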
 ## Collision Handling
 A key collision occurs when two different logical requests produce the same idempotency key.
 ### Prevention
 - Use UUID v7 or similar globally unique identifiers for client-generated keys
 - Use collision-resistant hash functions (e.g., SHA-256) for content-derived keys
 - Include enough context in content-derived keys (user_id + action + parameters)
 ### Detection
 - Compare the request hash of the new request with the stored request hash
 - If hashes match: this is a true duplicate, return the stored response
 - If hashes differ: this is a collision, different logical requests produced the same key
 ### Resolution
 - Reject the new request with `409 Conflict` and ask the client to use a new key
 - This is the safest and most common approach
 - Never overwrite the original request's result with a different request's result
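 Detection and resolution can be sketched with a stored request hash per key. The example uses a plain dict as the store and hypothetical function names; a real implementation would perform the lookup and insert atomically in the database or Redis.

```python
import hashlib
import json


def request_hash(payload):
    """Hash a canonical JSON form of the request payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_key(store, key, payload):
    """Classify a request as 'new', 'duplicate', or 'collision'."""
    incoming = request_hash(payload)
    stored = store.get(key)
    if stored is None:
        store[key] = incoming
        return "new"
    # Same key, same payload: true duplicate. Same key, different payload:
    # collision, which the caller rejects (409) without touching the stored entry.
    return "duplicate" if stored == incoming else "collision"
```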
 ## Idempotency for Different Operation Types
 ### Create Operations
 - Most common use case for idempotency
 - Key: client-generated UUID or deterministic hash
 - Behavior: return original created resource on duplicate
 - Status codes: `201 Created` on first request, `200 OK` with original resource on duplicate
 ### Update Operations
 - PUT operations that fully replace state are naturally idempotent
 - PATCH operations that set state to a specific value are idempotent
 - PATCH operations that increment or append are NOT naturally idempotent
 - Key: derived from resource ID + operation type if not naturally idempotent
 ### Delete Operations
 - Naturally idempotent: deleting an already-deleted resource returns `204 No Content` or `404 Not Found`
 - Define which behavior the API contract specifies and stick with it consistently
 ### Payment Operations
 - Must be idempotent (regulatory and financial requirement)
 - Key: payment reference or client-generated UUID
 - TTL: match settlement window (typically 30 days)
 - Behavior: return original payment result on duplicate; never double-charge
 ### Webhook Processing
 - Must be idempotent (delivery services may retry)
 - Key: webhook event ID, which stays the same across redeliveries (a per-attempt delivery ID changes on every retry and cannot detect duplicates)
 - TTL: match delivery retry window (typically 7 days)
 - Behavior: skip processing on duplicate, return success
 ## Anti-Patterns
 - Adding idempotency to naturally idempotent operations (wastes resources)
 - Not adding idempotency to operations the PRD requires to be safe for retry
 - Storing idempotency keys with no TTL, causing unbounded table growth
 - Using content-derived keys that include too little request context, causing collisions
 - Overwriting stored results on key collision instead of rejecting
 - Implementing idempotency at the wrong layer (e.g., only at the database level without API-level coordination)
 - Not documenting which operations are idempotent and which are not
--- a/skills/storage-knowledge/SKILL.md
+++ b/skills/storage-knowledge/SKILL.md
@ -0,0 +1,149 @@
 ---
 name: storage-knowledge
 description: "Knowledge contract for selecting storage technologies based on data patterns. Covers relational, wide-column, document, and key-value stores with use-when and avoid-when criteria. Referenced by design-architecture when making storage decisions."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is making storage technology decisions.
 ## Core Principle
 Storage selection must be driven by query patterns, write patterns, consistency requirements, and scale expectations identified in the PRD. Do not choose a storage technology because it is familiar, fashionable, or might be needed someday.
 Every storage choice must be justified. If a simpler option meets the requirements, use it.
 ## Storage Selection Criteria
 For each data entity, answer these questions before selecting storage:
 1. What are the primary query patterns? (by key, by range, by complex filter, by full-text search)
 2. What are the write patterns? (insert-heavy, update-heavy, append-only)
 3. What consistency is required? (strong, eventual, tunable)
 4. What scale is expected? (rows per day, total rows, growth rate)
 5. What are the access latency requirements? (ms, seconds, eventual)
 6. What relationships exist with other entities? (foreign keys, nested documents, graph traversals)
 ## Relational Database (PostgreSQL, MySQL, etc.)
 Use when:
 - Strong consistency is required (ACID transactions)
 - Complex joins are needed for queries
 - Transactional integrity across multiple entities is required
 - Data has well-defined structure with relationships
 - Referential integrity constraints are important
 - Ad-hoc querying on multiple dimensions is common
 Avoid when:
 - Write throughput exceeds what a single relational node can handle and sharding adds unacceptable complexity
 - Data is deeply nested and rarely queried across relationships
 - Schema evolves rapidly and migrations are costly
 - Full-text search is a primary access pattern (use a search engine instead)
 Trade-offs: +strong consistency, +relationships, +ad-hoc queries, +maturity, -scaling complexity, -schema rigidity
 ### Schema Design for Relational
 - Normalize to 3NF by default
 - Denormalize selectively based on query patterns (see `data-modeling`)
 - Define foreign keys with appropriate ON DELETE behavior
 - Define indexes for identified query patterns only
 - Consider partitioning for large tables
 ## Wide-Column / Cassandra
 Use when:
 - High write throughput is required (append-heavy workloads)
 - Query-first modeling (you know all query patterns upfront)
 - Large-scale time-series data
 - Geographic distribution with local writes
 - Linear horizontal scaling is required
 - Availability is prioritized over strong consistency (tunable consistency)
 Avoid when:
 - Ad-hoc queries on arbitrary columns are needed
 - Relational joins across tables are common
 - Strong consistency is required for all operations
 - The data model requires many secondary indexes
 - The team lacks Cassandra modeling experience (data modeling mistakes are costly to fix)
 Trade-offs: +write throughput, +horizontal scaling, +availability, -no joins, -query-first modeling required, -modeling mistakes are expensive
 ### Schema Design for Wide-Column
 - Model around query patterns: each table serves a specific query
 - Partition key must distribute data evenly
 - Clustering columns define sort order within a partition
 - Denormalize aggressively: one table per query pattern
 - Avoid secondary indexes; model queries into the primary key instead
 ## Document / MongoDB
 Use when:
 - Data is document-centric with nested structures
 - Schema flexibility is required (rapidly evolving data)
 - Aggregate boundaries align with document boundaries
 - Single-document atomicity is sufficient
 - Read-heavy workloads with rich query capabilities
 Avoid when:
 - Strong relational constraints between entities are required
 - Multi-document transactions are frequent (MongoDB supports them but they are slower)
 - Data requires complex joins across many collections
 - Strict schema validation is critical
 Trade-offs: +schema flexibility, +nested structures, +rich queries, +easy to start, -relationship handling, -larger storage for indexes, -multi-document transaction overhead
 ### Schema Design for Document
 - Design documents around access patterns
 - Embed data that is always accessed together
 - Reference data that is accessed independently
 - Use indexes for fields that are frequently filtered
 - Consider document size limits (16MB in MongoDB)
 - Use change streams for event-driven patterns
 ## Key-Value / Redis
 Use for:
 - Caching frequently accessed data
 - Rate limiting (counters with TTL)
 - Idempotency keys (set with TTL, check existence)
 - Ephemeral state (sessions, temporary tokens)
 - Distributed locking
 - Sorted sets for leaderboards or priority queues
 - Pub/sub for lightweight messaging
 Avoid when:
 - You need complex queries (no query language)
 - You need durability for primary data (Redis persistence is not ACID)
 - Data size exceeds available memory and eviction is unacceptable
 - You need relationships between entities
 Trade-offs: +speed, +simplicity, +data structures, -memory cost, -durability (with caveats), -no complex queries
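 The "counters with TTL" use case can be illustrated without a Redis dependency. The sketch below is an in-memory fixed-window rate limiter under assumed parameters; in Redis the same idea is typically built from an incrementing counter key with an expiry, and the class name here is illustrative.

```python
import time


class FixedWindowRateLimiter:
    """In-memory sketch of a counter-with-TTL rate limiter."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self._clock = clock
        self._windows = {}  # key -> (window_start, count)

    def allow(self, key):
        now = self._clock()
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            # The counter's TTL has elapsed: start a fresh window.
            start, count = now, 0
        if count >= self.limit:
            return False
        self._windows[key] = (start, count + 1)
        return True
```

 Losing this state on restart only resets the counters, which is why such data fits an ephemeral store.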
 ### Using Redis as Primary Storage
 Only when:
 - Data is inherently ephemeral (sessions, rate limits, idempotency keys)
 - Data loss is acceptable or can be reconstructed
 - The team understands persistence limitations (RDB snapshots, AOF)
 Never use Redis as the primary persistent store for business-critical data unless:
 - Durability requirements are clearly defined
 - Persistence configuration (RDB + AOF) meets those requirements
 - Recovery procedures are tested and documented
 ## Storage Selection Decision Framework
 1. Start with the simplest option that meets requirements
 2. Only add complexity when the PRD justifies it
 3. Prefer one storage technology when it meets all requirements
 4. Add a second storage technology only when a specific PRD requirement demands it
 5. Document every storage choice as an ADR with:
   - The requirement that drives it
   - The alternatives considered
   - Why the chosen option is the simplest that works
 ## Anti-Patterns
 - Using Cassandra for a 10,000-row table with ad-hoc queries
 - Using MongoDB for highly relational data requiring joins
 - Using Redis as a primary persistent store without understanding durability
 - Using multiple storage technologies when one suffices
 - Choosing storage based on familiarity rather than query/write patterns
 - Premature optimization: selecting distributed storage before single-node is proven insufficient
--- a/skills/system-decomposition/SKILL.md
+++ b/skills/system-decomposition/SKILL.md
@ -0,0 +1,100 @@
 ---
 name: system-decomposition
 description: "Knowledge contract for splitting systems into services or modules, defining boundaries, data ownership, and dependency direction. Referenced by design-architecture when designing service boundaries."
 ---
 This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing service boundaries and system decomposition.
 ## Core Principles
 - Each service or module must have a single, well-defined responsibility
 - Data ownership must be clear: each piece of data belongs to exactly one service
 - Dependencies must flow in one direction; cyclic dependencies are forbidden
 - Boundaries must be drawn around domain responsibilities, not technical layers
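 The rule against cyclic dependencies can be checked mechanically once the dependency direction is written down. The following sketch runs a depth-first search over a hypothetical map of service name to the services it calls; the service names in the usage note are made up for illustration.

```python
def find_cycle(dependencies):
    """Return one dependency cycle as a list of service names, or None.

    `dependencies` maps each service to the services it calls.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in dependencies.get(node, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # Back edge: the cycle is the path from `dep` to here.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for service in dependencies:
        if color.get(service, WHITE) == WHITE:
            found = visit(service)
            if found:
                return found
    return None
```

 For example, `find_cycle({"orders": ["payments"], "payments": ["orders"]})` reports the loop between the two services, while an acyclic graph yields `None`.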
 ## Decomposition Decisions
 ### Modular Monolith vs Microservices
 Choose modular monolith when:
 - The team is small (roughly fewer than 5-8 engineers per prospective service)
 - Domain boundaries are still evolving
 - Deployment simplicity is a priority
 - The latency and failure modes added by inter-service calls would outweigh the benefits of separating the services
 - The PRD does not require independent scaling of individual services
 Choose microservices when:
 - Individual services have different scaling requirements stated in the PRD
 - Team ownership aligns with service boundaries
 - Domain boundaries are stable and well-understood
 - Independent deployment of services is required
 - The PRD explicitly requires isolation for reliability or security
 Do not choose microservices solely because they are fashionable or because the team might need them someday. YAGNI applies.
 ### Domain Boundaries
 Identify domain boundaries by looking for:
 - Entities that change together
 - Business rules that are cohesive
 - Data that is accessed together
 - User workflows that span a consistent context
 A good boundary:
 - Has high internal cohesion (related logic stays together)
 - Has low external coupling (minimal cross-boundary calls)
 - Can be understood independently
 - Can be deployed independently if needed
 A bad boundary:
 - Requires frequent cross-boundary calls to complete a workflow
 - Splits closely related entities across services
 - Exists because of technical layering rather than domain logic
 - Requires distributed transactions to maintain consistency
 ### Coupling vs Cohesion
 Favor high cohesion within a boundary:
 - Related business rules live together
 - Related data is owned by the same service
 - Related workflows are handled end-to-end
 Minimize coupling between boundaries:
 - Communicate via well-defined contracts (APIs, events)
 - Avoid sharing database tables between services
 - Avoid synchronous call chains longer than 2 services deep when possible
 - Prefer eventual consistency for cross-boundary state updates
 ### State Ownership
 Each piece of state must have exactly one owner:
 - The owning service is the single source of truth
 - Other services access that state via the owner's API or events
 - No service reads directly from another service's database
 - If data is needed in multiple places, replicate via events with a clear source of truth
 ## Communication Patterns
 ### Synchronous
 - Use when the caller needs an immediate response
 - Use for queries and command validation
 - Avoid for long-running operations
 - Consider timeouts and circuit breakers
 ### Asynchronous
 - Use when the caller does not need an immediate response
 - Use for events, notifications, and eventual consistency
 - Use when decoupling producer and consumer is valuable
 - Consider ordering, retry, and DLQ requirements
 ### Event-Driven
 - Use when multiple consumers need to react to state changes
 - Use for cross-boundary consistency (eventual)
 - Define event schemas explicitly
 - Consider event versioning and backward compatibility
 ## Anti-Patterns
 - Distributed monolith: microservices that must be deployed together
 - Shared database: multiple services reading/writing the same tables
 - Synchronous chain: 3+ services in a synchronous call chain
 - Leaky domain: business rules that require data from other services directly instead of via APIs or events
 - Premature decomposition: splitting before boundaries are understood