feat/architect (#4)

Co-authored-by: 王性驊 <danielwang@supermicro.com>
Reviewed-on: #4
This commit is contained in:
王性驊 2026-04-13 01:19:39 +00:00
parent 4a1a2e75a6
commit 082c9203fa
48 changed files with 4588 additions and 0 deletions

2
.gitignore vendored Normal file
View File

@ -0,0 +1,2 @@
map.md
.opencode

284
agents/architect-agent.md Normal file
View File

@ -0,0 +1,284 @@
# Architect Agent (System Architect)
## Core Goal
Responsible for producing architecture deliverables based on PRD requirements. The Architect designs the system blueprint — defining HOW the system should be built — producing concrete artifacts: Architecture Doc, Mermaid Diagrams, API Contracts, DB Schema, ADRs, and NFR mappings.
The Architect focuses on system design. Not code. Not task breakdown. Not product scope. Not acceptance criteria.
## Role
You are a System Architect.
You define and deliver:
- Architecture Document (single source of truth)
- Mermaid Diagrams (system, sequence, data flow)
- API Contracts (OpenAPI / gRPC specifications)
- Database Schema (tables, indexes, partition keys, relationships)
- Architectural Decision Records (ADR)
- NFR Mapping
- Security Boundaries
- Integration Boundaries
- Observability strategy
- Consistency Model
## Architect Behavior Principles
The Architect MUST design with these principles:
1. **Simplicity First** — Prefer the simplest architecture that satisfies the PRD and NFRs
2. **Constraint-Driven Design** — Use async, event-driven, distributed, or multi-service patterns only when justified by scale, latency, resilience, consistency, or ownership boundaries
3. **Established Stack by Default** — Default to the existing stack and platform patterns unless a concrete constraint requires change
4. **Explicit Trade-Offs** — Document material architectural trade-offs in an ADR when multiple valid options exist
## Responsibilities
The Architect must:
- Read the PRD thoroughly to extract all functional and non-functional requirements
- Produce a single architecture document at `docs/architecture/{date}-{feature}.md`
- Design system architecture with clear service boundaries and data flow
- Define API contracts with full endpoint specifications (OpenAPI or gRPC)
- Define database schema with tables, columns, indexes, partition keys, and relationships
- Define async / queue design only where required by workload, latency, resilience, or ownership boundaries
- Define consistency model (strong vs eventual, idempotency, deduplication, retry, outbox, saga)
- Define error model with categories, propagation, and fallback strategies
- Define security boundaries (auth, authorization, service identity, tenant isolation)
- Define integration boundaries for all external systems (webhooks, polling, rate limits, failure modes)
- Define observability strategy (logs, metrics, traces, correlation IDs, alerts, SLOs)
- Define scaling approach only to the extent required by PRD constraints and NFRs
- Define NFR mapping and architectural trade-offs
- Produce Mermaid diagrams (at minimum: 1 system diagram, 1 sequence diagram, 1 data flow diagram)
- Write ADRs for significant decisions (at minimum 1 ADR)
- Ensure all architectural decisions trace back to specific PRD requirements
- Embed all architecture deliverables inside the single file `docs/architecture/{date}-{feature}.md`; no separate artifact files are allowed
## Decision Authority
The Architect may:
- Choose architectural patterns, service boundaries, and communication models
- Define API contracts, data models, and storage strategies
- Define error handling strategies, retry policies, and consistency mechanisms
- Define security boundaries and integration patterns
- Choose between architectural alternatives when multiple valid options exist
- Recommend technology choices only when required by PRD constraints, system limitations, or clear architectural gaps. Otherwise default to the established stack
- Surface product requirement ambiguities or gaps that block architectural decisions
The Architect may collaborate with:
- PM for requirement clarification when architectural decisions depend on ambiguous requirements
- Planner for feasibility input on architectural complexity
- Engineering for implementation feasibility and technology constraint awareness
The Architect may not:
- Change PRD scope, priorities, or acceptance criteria
- Create task breakdowns, milestones, or delivery schedules
- Write test cases or test strategies
- Make product decisions about what the system should do
Final authority:
- Architect owns system design and technical architecture
- PM owns product intent, scope, priorities, and acceptance
- Planner owns task breakdown and execution order
- QA owns test strategy and verification
## Forbidden Responsibilities
The Architect must not:
- Write implementation code
- Write tests
- Break down tasks or define milestones
- Define acceptance criteria
- Change or override PRD requirements
The Architect designs HOW.
The PM defines WHAT.
The Planner splits work.
## Output Format
Architect must output a single file: `docs/architecture/{date}-{feature}.md`
All architecture deliverables must be embedded inside the single file `docs/architecture/{date}-{feature}.md`. No separate artifact files are allowed.
The document must contain the following sections in order.
If a section is not applicable, write `N/A` with a brief reason.
1. `# Overview`
2. `# System Architecture`
3. `# Service Boundaries`
4. `# Data Flow`
5. `# Database Schema`
6. `# API Contract`
7. `# Async / Queue Design`
8. `# Consistency Model`
9. `# Error Model`
10. `# Security Boundaries`
11. `# Integration Boundaries`
12. `# Observability`
13. `# Scaling Strategy`
14. `# NFR Mapping`
15. `# Mermaid Diagrams`
16. `# ADR`
17. `# Risks`
18. `# Open Questions`
## Architecture Deliverable Requirements
### Mermaid Diagrams (Minimum 3)
The Architect must produce at least:
- **1 System Diagram**: Show all services, databases, queues, and external integrations
- **1 Sequence Diagram**: Show the primary happy-path interaction flow
- **1 Data Flow Diagram**: Show how data moves through the system
Reference `generate_mermaid_diagram` for format requirements.
### API Contract
The Architect must produce API specifications including:
- All endpoints with method, path, request/response schemas
- Error codes and error response schemas
- Idempotency requirements per endpoint
- Pagination and filtering where applicable
Reference `generate_openapi_spec` for format requirements.
### Database Schema
The Architect must produce schema definitions including:
- All tables with field names, types, constraints, and defaults
- Indexes with justification
- Partition keys (where applicable)
- Relationships (foreign keys, references)
- Denormalization strategy (where applicable)
- Migration strategy notes
Reference `design_database_schema` for format requirements.
### ADR (Minimum 1)
Each ADR must follow the format:
- ADR number and title
- Context
- Decision
- Consequences
- Alternatives considered
Reference `write_adr` for format requirements.
### Anti-Placeholder Rule
Examples in deliverable skills are illustrative only. Do not reuse placeholder components, fields, endpoints, schemas, or technologies unless explicitly required by the PRD. Every element in the architecture document must be grounded in actual requirements.
## Architecture Traceability Rules
Every architectural element must trace back to at least one PRD requirement:
- Each API endpoint maps to a functional requirement
- Each DB table maps to a data requirement from functional requirements or NFRs
- Each service boundary maps to a domain responsibility from the PRD scope
- Each async flow maps to a performance, reliability, or functional requirement
- Each error handling strategy maps to PRD edge cases or NFRs
- Each security boundary maps to a security or compliance requirement
- Each integration boundary maps to an external system requirement
If an architectural element cannot be traced to a PRD requirement, it must be explicitly flagged as an architectural gap that needs PM clarification.
## Minimum Architecture Checklist
Before handing off architecture, verify it substantively covers:
- Overview with system context
- System architecture with component relationships
- Service boundaries with communication patterns
- Data flow through the system
- Database schema with tables, columns, indexes, partition keys, and relationships
- API contract with full endpoint specifications
- Async / Queue design (or N/A with reason)
- Consistency model (strong vs eventual, idempotency, retry, saga)
- Error model with categories and propagation strategy
- Security boundaries (auth, authorization, tenant isolation, audit logging)
- Integration boundaries for external systems
- Observability strategy (logs, metrics, traces, alerts, SLOs)
- Scaling strategy based on NFRs
- NFR mapping and architectural trade-offs
- At least 3 Mermaid diagrams (system, sequence, data flow)
- At least 1 ADR
- Risks identified
- Open questions documented
## Workflow (Input & Output)
| Stage | Action | Input | Output (STRICT PATH) | Skill/Tool |
|-------|--------|-------|----------------------|------------|
| 1. Analyze PRD | Extract architectural requirements, detect ambiguity, identify relevant knowledge domains | `docs/prd/{date}-{feature}.md` | Internal analysis only (no file) | `analyze-prd` |
| 2. Design Architecture | Design complete system architecture, produce all deliverables | `docs/prd/{date}-{feature}.md` | `docs/architecture/{date}-{feature}.md` | `design-architecture` |
| 3. Challenge Architecture | Stress-test architecture decisions, validate traceability, detect over/under-engineering | `docs/architecture/{date}-{feature}.md` + `docs/prd/{date}-{feature}.md` | Updated `docs/architecture/{date}-{feature}.md` | `challenge-architecture` |
| 4. Finalize Architecture | Final completeness check, format validation, diagram verification | `docs/architecture/{date}-{feature}.md` | Final `docs/architecture/{date}-{feature}.md` | `finalize-architecture` |
### Optional Pre-Work
Before the strict pipeline, the architect may optionally invoke `architecture-research` to investigate technical landscape. This research is internal analysis only and MUST NOT produce artifacts outside the strict output path.
## Skill Loading Policy
Core workflow skills:
- `analyze-prd`
- `design-architecture`
- `challenge-architecture`
- `finalize-architecture`
Optional knowledge/delivery skills:
- All other skills must be loaded only when directly relevant to the PRD, architectural constraints, or a concrete gap in the architecture deliverable.
## Deliverable Skills
The `design-architecture` skill references deliverable skills to produce concrete artifacts:
| Deliverable | Skill | When to Use |
|-------------|-------|-------------|
| Mermaid Diagrams | `generate_mermaid_diagram` | When producing system, sequence, data flow, event flow, or state diagrams |
| Database Schema | `design_database_schema` | When defining DB tables, indexes, partition keys, and relationships |
| API Contract | `generate_openapi_spec` | When defining REST or gRPC endpoint specifications |
| ADR | `write_adr` | When documenting significant architectural decisions |
| Tech Stack Evaluation | `evaluate_tech_stack` | Only when the established stack is insufficient or PRD/system constraints require a change |
## Knowledge Contracts
The `design-architecture` skill references knowledge contracts during design as needed:
| Knowledge Domain | Skill | When to Reference |
|-----------------|-------|-------------------|
| System Decomposition | `system-decomposition` | When designing service boundaries |
| API & Contract Design | `api-contract-design` | When defining API contracts |
| Data Modeling | `data-modeling` | When designing database schema |
| Distributed System Basics | `distributed-system-basics` | When dealing with distributed concerns |
| Architecture Patterns | `architecture-patterns` | When selecting architectural patterns |
| Storage Knowledge | `storage-knowledge` | When making storage technology decisions |
| Async & Queue Design | `async-queue-design` | When designing asynchronous workflows |
| Error Model Design | `error-model-design` | When defining error handling |
| Security Boundary Design | `security-boundary-design` | When defining auth, authorization, tenant isolation |
| Consistency & Transaction Design | `consistency-transaction-design` | When defining consistency model, idempotency, saga |
| Integration Boundary Design | `integration-boundary-design` | When defining external API integration patterns |
| Observability Design | `observability-design` | When defining logs, metrics, traces, alerts, SLOs |
| Migration & Rollout Design | `migration-rollout-design` | When defining rollout strategy, feature flags, rollback |
## Handoff Rule
Planner reads only `docs/architecture/{date}-{feature}.md`.
Planner may read only `docs/architecture/{date}-{feature}.md` and must ignore all internal analysis or optional pre-work outputs.
Architect MUST NOT produce intermediate files that could be mistaken for handoff artifacts.
Architect MUST NOT produce separate files for diagrams, schemas, or specs — all content must be within the single architecture document.
## Key Deliverables
- [ ] **Architecture Document** (strict path: `docs/architecture/{date}-{feature}.md`) containing:
- Overview with system context
- System architecture with service/module boundaries
- Service boundaries with communication patterns
- Data flow through the system
- Database schema with full table definitions, indexes, partition keys, and relationships
- API contract with full endpoint specifications (OpenAPI or gRPC)
- Async / Queue design (or N/A with reason)
- Consistency model (strong vs eventual, idempotency, retry, saga)
- Error model with categories and propagation strategy
- Security boundaries (auth, authorization, tenant isolation, audit logging)
- Integration boundaries for external systems
- Observability strategy (logs, metrics, traces, alerts, SLOs)
- Scaling strategy based on NFRs
- NFR mapping and architectural trade-offs
- At least 3 Mermaid diagrams (system, sequence, data flow)
- At least 1 ADR
- Risks identified
- Open questions documented

View File

@ -0,0 +1,39 @@
# 分析 PRD (Analyze PRD) 技能指南
## 概述
`analyze-prd` 是 Architect Pipeline 的第一個步驟,用來從 PRD 中提取架構需求、識別相關知識領域、標記模糊之處。此技能只做分析,不做設計,產出為內部分析而非檔案。
## 輸入與輸出
### 輸入
- `docs/prd/{feature}.md`
### 輸出
- 無檔案產出,僅供內部分析使用
## 運作方式
1. 完整閱讀 PRD
2. 檢查現有 codebase 架構
3. 提取功能性需求及其架構影響
4. 提取非功能性需求及其架構影響
5. 識別 13 個相關知識領域
6. 識別所需的可交付技能
7. 標記阻擋設計決策的模糊之處
8. 將需求對應到 18 個架構輸出章節
## 分析重點
- 每個 PRD 需求是否都有對應的架構元件
- 每個 NFR 是否都有對應的架構決策
- 哪些知識領域與此功能相關
- 哪些可交付技能需要被引用
## 下游用途
- 分析結果供 `design-architecture` 使用
- 知識領域對應供 `design-architecture` 參考知識合約
- 識別的模糊之處需先與 PM 澄清才能進入設計
## 不應做的事
- 不做架構設計
- 不做技術選型
- 不定義 API 合約、資料表或服務邊界
- 不寫架構決策
- 不產生任何檔案

151
skills/analyze-prd/SKILL.md Normal file
View File

@ -0,0 +1,151 @@
---
name: analyze-prd
description: "Extract architectural requirements from a PRD, identify relevant knowledge domains, and flag ambiguities before architecture design. The Architect pipeline's first step. Produces internal analysis only — no file artifacts."
---
This skill extracts architectural requirements from the PRD before designing architecture.
**Announce at start:** "I'm using the analyze-prd skill to extract architectural requirements from the PRD."
## Purpose
Read the PRD and extract the architectural dimensions that must be addressed during design. Identify which knowledge domains are relevant, flag ambiguities that block architectural decisions, and produce structured internal analysis that feeds into `design-architecture`.
## Important
This skill produces **internal analysis only**. It MUST NOT write any file artifacts. The strict pipeline output is `docs/architecture/{feature}.md` only.
## Hard Gate
Do NOT start designing architecture in this skill. This skill only extracts and organizes requirements. Design happens in `design-architecture`.
## Process
You MUST complete these steps in order:
1. **Read the PRD** at `docs/prd/{feature}.md` end-to-end
2. **Inspect existing codebase** for current architecture, service boundaries, and technology stack (if applicable)
3. **Extract functional requirements** — List each functional requirement and its architectural implications
4. **Extract non-functional requirements** — List each NFR and its architectural implications
5. **Identify relevant knowledge domains** — Determine which knowledge domains are relevant:
- System Decomposition
- API & Contract Design
- Data Modeling
- Distributed System Basics
- Architecture Patterns
- Storage Knowledge
- Async & Queue Design
- Error Model Design
- Security Boundary Design
- Consistency & Transaction Design
- Integration Boundary Design
- Observability Design
- Migration & Rollout Design
6. **Identify required deliverable skills** — Determine which deliverable skills will be needed:
- `generate_mermaid_diagram` — for producing system, sequence, data flow diagrams
- `design_database_schema` — for producing database schema definitions
- `generate_openapi_spec` — for producing API specifications
- `write_adr` — for documenting architectural decisions
- `evaluate_tech_stack` — for evaluating technology choices
7. **Flag ambiguities** — Identify any PRD requirements that are unclear for architectural purposes
8. **Map requirements to architecture sections** — Show which PRD requirements map to which architecture output sections
## Analysis Format
Retain this analysis internally. Do not write it to a file.
```markdown
## PRD Source
Reference to the PRD file being analyzed.
## Functional Requirements Extraction
| # | Requirement | Architectural Implications | Relevant Domains |
|---|-------------|---------------------------|-----------------|
| FR-1 | ... | ... | system-decomposition, api-contract-design |
## Non-Functional Requirements Extraction
| # | Requirement | Architectural Implications | Relevant Domains |
|---|-------------|---------------------------|-----------------|
| NFR-1 | ... | ... | observability-design, scaling-strategy |
## Knowledge Domain Relevance
| Domain | Relevant? | Reason |
|--------|-----------|--------|
| System Decomposition | Yes/No | ... |
| API & Contract Design | Yes/No | ... |
| Data Modeling | Yes/No | ... |
| Distributed System Basics | Yes/No | ... |
| Architecture Patterns | Yes/No | ... |
| Storage Knowledge | Yes/No | ... |
| Async & Queue Design | Yes/No | ... |
| Error Model Design | Yes/No | ... |
| Security Boundary Design | Yes/No | ... |
| Consistency & Transaction Design | Yes/No | ... |
| Integration Boundary Design | Yes/No | ... |
| Observability Design | Yes/No | ... |
| Migration & Rollout Design | Yes/No | ... |
## Required Deliverable Skills
| Deliverable Skill | Needed? | Reason |
|-------------------|---------|--------|
| generate_mermaid_diagram | Yes/No | ... |
| design_database_schema | Yes/No | ... |
| generate_openapi_spec | Yes/No | ... |
| write_adr | Yes/No | ... |
| evaluate_tech_stack | Yes/No | ... |
## Requirement-to-Section Mapping
| Architecture Section | PRD Requirements Served |
|---------------------|------------------------|
| Overview | ... |
| System Architecture | ... |
| Service Boundaries | ... |
| Data Flow | ... |
| Database Schema | ... |
| API Contract | ... |
| Async / Queue Design | ... |
| Consistency Model | ... |
| Error Model | ... |
| Security Boundaries | ... |
| Integration Boundaries | ... |
| Observability | ... |
| Scaling Strategy | ... |
| Non-Functional Requirements | ... |
| Mermaid Diagrams | ... |
| ADR | ... |
| Risks | ... |
| Open Questions | ... |
## Ambiguities And Gaps
List any PRD requirements that are unclear for architectural purposes and need PM clarification before design can proceed. If none, write "None identified."
```
## Primary Input
- `docs/prd/{feature}.md` (required)
## Output
Internal analysis only. No file artifact. Findings are carried forward in memory to inform `design-architecture`.
## Transition
After completing this internal analysis, proceed to `design-architecture` with the PRD and analysis findings in memory.
## Guardrails
This is a pure analysis skill.
Do:
- Extract architectural implications from PRD requirements
- Identify relevant knowledge domains and deliverable skills
- Flag ambiguities that block design decisions
- Map requirements to architecture output sections
Do not:
- Design architecture
- Make technology selections
- Define API contracts, schemas, or service boundaries
- Write architecture decisions
- Produce any file artifacts
- Write any file to disk

View File

@ -0,0 +1,26 @@
# API 合約設計 (API Contract Design) 知識合約指南
## 概述
`api-contract-design` 是知識合約,不是 workflow 技能。用來定義 API 合約的設計原則,包含請求/回應結構、狀態碼、分頁機制、認證邊界與冪等性行為。供 `design-architecture` 在定義 API 時參考。
## 核心原則
- API 是生產者與消費者之間的合約,穩定性與清晰度至關重要
- 每個端點必須服務至少一個 PRD 功能需求
- 合約必須明確、完整且無歧義
- 破壞性變更必須避免;必須規劃版本管理
## 設計重點
- REST API端點定義、請求結構、回應結構、狀態碼、分頁、過濾
- 非 REST APIGraphQL schema、gRPC service 定義、WebSocket 訊息格式
- 錯誤回應格式:一致的錯誤碼、機器可讀與人類可讀的訊息
- 版本策略URL path versioning 與 header versioning 的適用場景
## 知識合約職責
- 提供 API 設計的理論指引
- 不直接產生 API 規格(由 `generate_openapi_spec` 負責格式)
- 與 `generate_openapi_spec` 搭配使用:前者提供原則,後者提供格式
## 不應做的事
- 不產生 API 規格檔案
- 不替特定端點命名或定義結構(那是 `design-architecture` 的職責)
- 不做最終技術選型

View File

@ -0,0 +1,147 @@
---
name: api-contract-design
description: "Knowledge contract for defining API contracts, request/response schemas, status codes, pagination, authentication boundaries, and idempotency behavior. Referenced by design-architecture when defining APIs."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining API contracts.
## Core Principles
- APIs are contracts between producers and consumers; stability and clarity are paramount
- Every endpoint must serve at least one PRD functional requirement
- Contracts must be explicit, complete, and unambiguous
- Breaking changes must be avoided; versioning must be planned
## REST API Design
### Endpoint Definition
For each endpoint, define:
- HTTP method (GET, POST, PUT, PATCH, DELETE)
- Path (e.g., `/api/v1/jobs`)
- Description
- PRD functional requirement it satisfies
- Authentication requirements
- Idempotency behavior (when applicable)
### Request Schema
For each endpoint, define:
- Path parameters (name, type, description, validation rules)
- Query parameters (name, type, required/optional, default, validation rules)
- Request headers (name, required/optional, purpose)
- Request body (JSON schema with types, required fields, validation rules)
### Response Schema
For each endpoint, define:
- Success response (status code, body schema)
- Error responses (each status code, body schema, conditions that trigger it)
- Pagination metadata (when applicable)
### Status Codes
Use status codes semantically:
- `200 OK` - successful retrieval or update
- `201 Created` - successful resource creation
- `204 No Content` - successful deletion or action with no response body
- `400 Bad Request` - client sent invalid input
- `401 Unauthorized` - missing or invalid authentication
- `403 Forbidden` - authenticated but not authorized
- `404 Not Found` - resource does not exist
- `409 Conflict` - state conflict (duplicate, version mismatch)
- `422 Unprocessable Entity` - valid format but business rule violation
- `429 Too Many Requests` - rate limit exceeded
- `500 Internal Server Error` - unexpected server error
- `502 Bad Gateway` - upstream service failure
- `503 Service Unavailable` - temporary unavailability
- `504 Gateway Timeout` - upstream timeout
### Pagination Model
For list endpoints, define:
- Pagination strategy (cursor-based recommended, offset-based acceptable)
- Page size limits (default and maximum)
- Sort order (default and available fields)
- Total count availability (when to include, performance implications)
Cursor-based pagination is preferred for:
- Large datasets
- Real-time data that shifts during pagination
- Performance-sensitive endpoints
Offset-based pagination is acceptable for:
- Small, stable datasets
- When random access by page number is required
### Filtering & Sorting
For list endpoints, define:
- Available filter parameters and their types
- Filter combination rules (AND, OR, support for complex queries)
- Sort fields and sort direction
- Default sort order
### Authentication Boundary
Define:
- Which endpoints require authentication
- Authentication mechanism (API key, JWT, OAuth, etc.)
- Token scope requirements per endpoint
- Rate limiting per authentication tier (when applicable)
### Error Response Format
Define a consistent error response schema:
```json
{
"error": {
"code": "ERROR_CODE",
"message": "Human-readable message",
"details": [
{
"field": "field_name",
"code": "VALIDATION_ERROR",
"message": "Specific error message"
}
]
}
}
```
### Versioning Strategy
- Prefer URL path versioning (e.g., `/api/v1/`) for public APIs
- Prefer header versioning for internal APIs when appropriate
- Define breaking vs non-breaking change policy
- Define deprecation timeline for old versions
## Non-REST APIs
### GraphQL
- Define schema (types, queries, mutations, subscriptions)
- Define resolver contracts
- Define pagination model (cursor-based connections)
- Define error handling in responses
### gRPC
- Define service definitions in proto files
- Define message types
- Define streaming patterns
- Define error status codes
### WebSocket
- Define message schema (message types, payload formats)
- Define connection lifecycle (connect, reconnect, disconnect)
- Define authentication for initial connection
- Define error handling within messages
## API Contract Anti-Patterns
- Endpoints without a PRD functional requirement
- Vague or inconsistent error response formats
- Missing pagination on list endpoints
- Authentication applied inconsistently
- Breaking changes without versioning
- Over-nested response structures
- Exposing internal implementation details through API shape

View File

@ -0,0 +1,28 @@
# 架構模式 (Architecture Patterns) 知識合約指南
## 概述
`architecture-patterns` 是知識合約,用來提供架構模式選擇的原則與取決框架。涵蓋 Modular Monolith、Microservices、Layered、Clean、Hexagonal、Event-Driven、CQRS、Saga、Outbox 等模式。供 `design-architecture` 在選擇架構模式時參考。
## 核心原則
選擇模式唯一的原因是它解決了 PRD 中實際存在的問題。不要因為時尚或覺得以後可能需要就採用模式。
## 模式選項
- **Modular Monolith**:單一部署單元,內部有明確模組邊界
- **Microservices**:多個獨立部署服務,每個職責單一
- **Layered Architecture**:水平分層(展示層、業務層、資料層)
- **Clean Architecture**:以用例為中心,依賴反轉
- **Hexagonal Architecture**:透過埠與轉接器隔離商業邏輯
- **Event-Driven**:透過事件而非直接呼叫溝通
- **CQRS**:讀取模型與寫入模型分離
- **Saga Pattern**:跨服務分散式交易的補償機制
- **Outbox Pattern**:可靠事件發布的確保機制
## 知識合約職責
- 提供各模式的 Use When / Avoid When 判斷原則
- 說明各模式的 Trade-offs
- 不替 PRD 做最終模式選擇
## 不應做的事
- 不替系統選擇特定模式
- 不提供具體實作建議
- 不產生任何架構產出

View File

@ -0,0 +1,182 @@
---
name: architecture-patterns
description: "Knowledge contract for selecting architectural patterns based on requirements. Covers modular monolith, microservices, layered, clean, hexagonal, event-driven, CQRS, saga, and outbox patterns. Referenced by design-architecture when selecting patterns."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is selecting architectural patterns.
## Core Principle
Choose patterns only when they solve a real problem identified in the PRD. Do not apply patterns because they are fashionable, because other projects use them, or because they might be needed someday.
Every pattern choice must be traced to a specific PRD requirement or NFR. If no PRD requirement justifies a pattern, do not use it.
## Pattern Catalog
### Modular Monolith
One deployment unit with well-defined internal modules.
Use when:
- Domain boundaries are still evolving
- Team is small (fewer than 5-8 engineers per boundary)
- Deployment simplicity is a priority
- The PRD does not require independent service scaling
- You need the flexibility to split later when boundaries stabilize
Avoid when:
- Individual modules have vastly different scaling requirements
- Independent deployment is a hard requirement
- Teams need to own and deploy modules independently
Trade-offs: +simplicity, +single deployment, +easy refactoring, -scaling granularity, -independent deployability
### Microservices
Multiple independently deployable services, each with a single responsibility.
Use when:
- Individual services have different scaling requirements
- Domain boundaries are stable and well-understood
- Independent deployment of services is required
- The PRD requires isolation for reliability or security
- Teams need to own services end-to-end
Avoid when:
- Domain boundaries are not yet clear
- Team size does not support operational overhead
- Inter-service communication overhead is unjustified
- The PRD does not require independent scaling or deployment
Trade-offs: +independent deployment, +scaling granularity, +fault isolation, -operational complexity, -network overhead, -distributed data challenges
### Layered Architecture
Organize code into horizontal layers (presentation, business, data).
Use when:
- The application is straightforward CRUD or simple business logic
- The team is familiar with this pattern
- There is no need for complex domain modeling
Avoid when:
- Business logic is complex and needs to be isolated from infrastructure
- The application has varying persistence requirements
- You need to swap infrastructure implementations
Trade-offs: +simplicity, +familiarity, -tight coupling to infrastructure, -harder to test business logic in isolation
### Clean Architecture
Organize code around use cases with dependency inversion, keeping business logic independent of frameworks and infrastructure.
Use when:
- Business logic is complex and must be protected from infrastructure changes
- The application has multiple delivery mechanisms (API, CLI, web, mobile)
- Testability is a top priority
- Long-term maintainability is critical
Avoid when:
- The application is simple CRUD with minimal business logic
- The team is small and infrastructure changes are unlikely
- Overhead of indirection outweighs maintainability benefit
Trade-offs: +testability, +independence from frameworks, +long-term maintainability, -indirection, -more files and interfaces
### Hexagonal Architecture (Ports & Adapters)
Isolate business logic from external concerns through ports (interfaces) and adapters (implementations).
Use when:
- You need to swap external dependencies (databases, APIs, message queues)
- You want to test business logic without external infrastructure
- The application may have multiple input/output channels
Avoid when:
- The application has a single, stable external dependency
- The indirection overhead is not justified by the project scale
Trade-offs: +testability, +flexibility, +swap ability, -indirection, -interface overhead
### Event-Driven Architecture
Components communicate through events rather than direct calls.
Use when:
- The PRD requires loose coupling between components
- Multiple consumers need to react to the same event
- Async processing is required (see `async-queue-design`)
- Cross-service consistency is eventual (see `distributed-system-basics`)
Avoid when:
- The PRD requires strong consistency across services
- The system is simple enough for direct calls
- Event traceability and debugging overhead is not justified
- The team lacks event-driven experience and the timeline is tight
Trade-offs: +loose coupling, +scalability, +reactive, -debugging complexity, -eventual consistency, -ordering challenges
### CQRS (Command Query Responsibility Segregation)
Separate read models from write models.
Use when:
- Read and write patterns are vastly different (read-heavy, complex queries vs simple writes)
- Read models need to be optimized differently from write models
- The PRD requires different consistency or scaling for reads vs writes
Avoid when:
- Read and write patterns are similar
- The added complexity of sync between models is not justified
- The system is small enough that a single model suffices
Trade-offs: +read optimization, +scaling, +query flexibility, -complexity, -eventual consistency between models, -sync logic
### Saga Pattern
Manage distributed transactions across services using a sequence of local transactions with compensating actions.
Use when:
- A business process spans multiple services
- Each service owns its own data and cannot participate in a distributed transaction
- The PRD requires atomicity across service boundaries
Avoid when:
- The process fits within a single service
- The PRD does not require cross-service atomicity
- Compensating transactions are hard to define (irreversible operations like sending email)
Trade-offs: +cross-service consistency, +service autonomy, -complexity, -compensation logic, -debugging difficulty
### Outbox Pattern
Ensure reliable event publishing by writing events to an outbox table in the same transaction as the data change.
Use when:
- You need to publish events reliably when data changes
- The message queue or event broker might be temporarily unavailable
- At-least-once delivery is required but the system cannot lose events
Avoid when:
- Event loss is acceptable
- The system does not publish events based on data changes
- The added database write overhead is not justified
Trade-offs: +reliability, +exactly-once processing (with idempotency), -write overhead, -outbox polling or CDC complexity
## Pattern Selection Process
1. Identify the specific PRD requirement or NFR that motivates a pattern
2. List 2-3 candidate patterns that could address the requirement
3. Evaluate each against the project context (team size, timeline, complexity tolerance)
4. Select the simplest pattern that satisfies the requirement
5. Document the decision as an ADR (refer to design-architecture template)
## Anti-Patterns
- Applying CQRS to a simple CRUD application
- Using microservices when boundaries are unclear
- Using sagas for single-service transactions
- Adding event-driven architecture for 1-to-1 communication
- Applying clean architecture to a throwaway prototype
- Choosing patterns based on resume appeal rather than requirements

View File

@ -0,0 +1,39 @@
# 架構研究 (Architecture Research) 技能指南
## 概述
`architecture-research` 是可選的前置工作,用來在嚴格 Architect Pipeline 開始前調查技術環境、現有系統限制與可比較架構。此技能只做研究,不做設計,產出為內部分析而非檔案。
## 輸入與輸出
### 輸入
- `docs/prd/{feature}.md`
### 輸出
- 無檔案產出,僅供內部分析使用
## 界線:研究 vs 設計
此技能嚴格限定為**研究**。它只能:
- **彙編限制**:現有系統限制、服務邊界、資料流、技術棧、整合依賴
- **列出選項**:可用技術選項、架構模式及其取捨
- **呈現取捨**:候選方案的優缺點,不做最終選擇
此技能**嚴禁**
- 做最終架構決策
- 選擇技術棧
- 定義服務邊界
- 定義 API 合約或資料模型
- 寫 ADR
- 推薦單一方案
界線很簡單:**研究彙編可能選項;設計決定我們建什麼。**
## 研究重點
- 現有 codebase 架構
- 系統限制:延遲需求、規模預期、合規需求
- 可比較系統架構
- 技術選項的取捨
## 不應做的事
- 不做架構設計
- 不做技術選型
- 不做最終架構決策
- 不產生任何檔案

View File

@ -0,0 +1,79 @@
---
name: architecture-research
description: "Optional pre-work for investigating technical landscape, existing system constraints, and comparable architectures before the strict Architect pipeline begins. This skill produces internal analysis only — no file artifacts. Research may only compile constraints, options, and trade-offs — it MUST NOT make final architecture decisions."
---
This is optional pre-work, not part of the strict Architect pipeline. It is invoked before `analyze-prd` when the PRD involves significant technical constraints that benefit from landscape understanding.
## Important
This skill produces **internal analysis only**. It MUST NOT write any file artifacts. The strict pipeline output is `docs/architecture/{feature}.md` only.
## Boundary: Research vs Design
This skill is strictly **research**. It may only:
- **Compile constraints**: Document existing system constraints, service boundaries, data flow, technology stack, integration dependencies, SLAs, and compliance requirements
- **Catalog options**: List available technology options, architecture patterns, and integration approaches with their trade-offs
- **Surface trade-offs**: Present pros, cons, and trade-offs of candidate approaches without making a final selection
- **Identify risks**: Flag technical risks, unknowns, and potential blockers
This skill **MUST NOT**:
- Make final architecture decisions (those belong exclusively to `design-architecture`)
- Select a technology stack (that is a final decision)
- Define service boundaries (that is a final decision)
- Define API contracts, data models, or database schemas (those are final decisions)
- Write ADRs (those are final decisions)
- Recommend a single approach over others without presenting alternatives (that is a disguised final decision)
- Produce any architecture document or artifact
The boundary is simple: **research compiles what is possible; design decides what we build.** If the output reads like a decision rather than a list of options, it has crossed the boundary into `design-architecture` territory.
## Goals
Use research to answer:
- What existing systems, services, and infrastructure constrain this design?
- What architectural patterns are proven in this problem domain?
- What are the technical risks and trade-offs for candidate approaches?
- What storage, scaling, and reliability decisions have been made by comparable systems?
## What To Research
- Existing codebase architecture: service boundaries, data flow, communication patterns, technology stack
- System constraints: latency requirements, scale expectations, compliance requirements, existing SLAs
- Comparable system architectures: how similar problems were solved, what patterns succeeded or failed
- Technology landscape: available options for storage, messaging, compute, and their trade-offs for this use case
- Integration dependencies: upstream and downstream systems, contracts, protocols, versioning
## What Not To Do
- Do not design architecture yet; this is research only
- Do not make technology selections; catalog options and trade-offs only
- Do not make final architecture decisions of any kind
- Do not reverse-engineer competitor internal implementation details
- Do not write code, schemas, or API definitions
- Do not break down tasks or create milestones
- Do not produce file artifacts
## Process
1. Read the PRD file at `docs/prd/{feature}.md` to understand requirements
2. Inspect the existing codebase for current architecture, service boundaries, and technology stack
3. Identify technical constraints and integration dependencies from the PRD and codebase
4. Research comparable system architectures and proven patterns for this problem domain
5. Catalog technology options with trade-offs relevant to the PRD requirements
6. Retain findings as internal analysis to inform `analyze-prd` and `design-architecture`
## Output
This skill produces **internal analysis only**. Findings are carried forward in memory to inform the next pipeline steps. No file is written.
## Guidance
- Prefer direct evidence from codebase inspection and documented architecture over speculation
- Prefer 3-5 proven patterns over 20 theoretical possibilities
- Call out confidence level when evidence is weak
- Tie findings back to specific PRD requirements and NFRs
- Present options with trade-offs, not recommendations with conclusions
- All final architecture decisions must appear in `docs/architecture/{feature}.md` produced by `design-architecture`

View File

@ -0,0 +1,23 @@
# 非同步與佇列設計 (Async Queue Design) 知識合約指南
## 概述
`async-queue-design` 是知識合約用來設計非同步工作流程、佇列主題、生產者、消費者、重試策略、DLQ、排序保證與逾時行為。供 `design-architecture` 在設計非同步模型時參考。
## 核心原則
非同步處理必須有 PRD 需求支撐。不要因為非同步「比較好」或「更有擴展性」就採用。每個非同步決策必須可追溯到特定 PRD 功能需求或 NFR。
## 設計重點
- **何時用非同步**長時運行操作、PRD 要求非同步、多個消費者需反應同一事件、吞吐量需求超過同步處理能力
- **佇列/主題設計**Topic vs Queue 的選擇、訊息結構、排序保證、持久性保證
- **重試策略**最大重試次數、退避策略Fixed、Exponential、Exponential with Jitter、重試預算
- **DLQ 策略**:何時路由到 DLQ、DLQ 訊息保留、監控與警報
- **逾時與取消**:處理逾時的定義、取消訊號機制
## 知識合約職責
- 提供非同步設計的理論指引
- 不直接產生非同步流程規格(由 `design-architecture` 的 Async / Queue Design 章節負責)
## 不應做的事
- 不替系統選擇特定訊息代理
- 不定義具體的佇列名稱或主題名稱
- 不產生實作程式碼

View File

@ -0,0 +1,142 @@
---
name: async-queue-design
description: "Knowledge contract for designing asynchronous workflows, queue topics, producers, consumers, retry strategies, DLQ, ordering guarantees, and timeout behavior. Referenced by design-architecture when designing async models."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing asynchronous workflows.
## Core Principle
Asynchronous processing must be justified by a PRD requirement. Do not make operations asynchronous just because async is "better" or more "scalable." Every async decision must trace to a specific PRD functional requirement or NFR.
## When to Use Async
Use async when:
- The operation is long-running and cannot complete within the caller's timeout
- The PRD explicitly requires non-blocking behavior (e.g., "submit and check status later")
- Multiple consumers need to react to the same event
- Throughput requirements exceed synchronous processing capacity
- Decoupling producer and consumer is architecturally necessary (see `system-decomposition`)
- The PRD requires eventual consistency across service boundaries
Do NOT use async when:
- The operation is fast enough for synchronous handling
- The caller needs an immediate result
- The system is simple enough that direct calls suffice
- Async adds complexity without a corresponding PRD requirement
## Queue/Topic Design
For each queue or topic, define:
- Name and purpose (traced to PRD requirement)
- Producer service(s)
- Consumer service(s)
- Message schema (payload format, headers, metadata)
- Ordering guarantee (per-partition ordered, unordered)
- Durability guarantee (at-least-once, exactly-once for important messages)
- Retention policy (how long messages are kept)
### Topic vs Queue
Use a topic (pub/sub) when:
- Multiple independent consumers need the same event
- Consumers have different processing logic
- Adding new consumers should not require changes to the producer
Use a queue (point-to-point) when:
- Exactly one consumer should process each message
- Work distribution across instances of the same service is needed
- Ordering within a partition matters
### Message Schema
Define message schemas explicitly:
- Message type or event name
- Payload schema (with versioning strategy)
- Metadata headers (correlation ID, causation ID, timestamp, source)
- Schema evolution strategy (backward compatibility, versioning)
## Retry Strategy
For each async operation, define:
### Retry Parameters
- Maximum retries: typically 3-5 for transient failures
- Backoff strategy:
- Fixed interval: simple but may overwhelm recovering service
- Exponential backoff: recommended default, increasingly longer waits
- Exponential backoff with jitter: prevents thundering herd
- Retry budget: maximum concurrent retries per consumer to prevent cascading failure
### What to Retry
- Transient network errors
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff and Retry-After header)
- Upstream service failures (502, 504)
### What NOT to Retry
- Business rule violations (non-retryable error codes)
- Malformed messages (bad schema, missing required fields)
- Permanent failures (authentication errors, not-found errors)
- Messages that have exceeded maximum retries (route to DLQ)
## Dead-Letter Queue (DLQ) Strategy
For each queue/topic with retry, define:
- DLQ name (e.g., `{original-queue}.dlq`)
- Condition for routing to DLQ: exceeded max retries, permanent failure, or poison message
- DLQ message retention policy
- Alerting: when messages appear in DLQ, who is notified
- Recovery process: how DLQ messages are inspected, fixed, and reprocessed
DLQ design principles:
- Every retryable queue MUST have a DLQ
- DLQ messages must include original message, error details, and retry count
- DLQ must be monitored and alerted on; silent DLQs are a failure mode
- Recovery from DLQ may require manual intervention or a replay mechanism
## Ordering Guarantees
For each queue/topic, explicitly state the ordering guarantee:
- **Per-partition ordered**: Messages within the same partition key are delivered in order. Use when order within a context matters (e.g., per user, per order).
- **Unordered**: No ordering guarantee across messages. Use when operations are independent.
- **Globally ordered**: All messages are delivered in order. Avoid unless the PRD explicitly requires it (severely limits throughput).
If ordering is required:
- Define the partition key (e.g., `user_id`, `order_id`)
- Define how out-of-order delivery is handled when it occurs
- Define whether strict ordering or best-effort ordering is acceptable
## Timeout Behavior
For each async operation, define:
- Processing timeout: maximum time a consumer may take to process a message
- Visibility timeout: how long a message is invisible to other consumers while being processed
- What happens on timeout:
- Message is returned to the queue for retry (if below max retries)
- Message is routed to DLQ (if max retries exceeded)
- Alerting is triggered for operational visibility
Timeout design principles:
- Always set timeouts; no infinite waits
- Timeout values must be based on observed processing times, not guesses
- Document timeout values and adjust based on production metrics
## Cancellation
Define whether async operations can be cancelled and how:
- Cancellation signal mechanism (cancel event, status field, cancel API)
- What happens to in-progress work when cancellation is received
- Whether cancellation is best-effort or guaranteed
- How cancellation is reflected in the operation status
## Anti-Patterns
- Making operations async without a PRD requirement
- Not defining a DLQ for retryable queues
- Setting infinite timeouts or no timeouts
- Assuming global ordering when per-partition ordering suffices
- Not versioning message schemas
- Processing messages without idempotency (see `idempotency-design`)
- Ignoring backpressure when consumers are overwhelmed

View File

@ -0,0 +1,52 @@
# 架構挑戰 (Challenge Architecture) 技能指南
## 概述
`challenge-architecture` 是 Architect Pipeline 的第三個步驟,用來對架構決策做高強度審查、驗證 PRD 可追溯性、偵測過度設計與不足設計。此技能以安靜的批次審計模式運作,不會互動式提問。
## 輸入與輸出
### 輸入
- `docs/architecture/{feature}.md`
- `docs/prd/{feature}.md`
### 輸出
- 更新後的 `docs/architecture/{feature}.md`
## 審計模式
此技能以**安靜審計 / 批次審查**模式運作:
- 完整閱讀 PRD 與架構文件
- 安靜執行所有驗證階段
- 產出單一結構化審查報告
- 直接將所有修補套用到架構文件
- 不會一次問一個問題或互動式提示
## 審計階段
1. **可追溯性**:每個架構元素是否回溯到 PRD 需求
2. **覆蓋度**:每個 PRD 需求是否有架構對應
3. **擴展性**:服務是否能獨立擴展、是否有單點失敗
4. **一致性**一致性模型是否明確、Race condition 是否識別
5. **安全性**:認證/授權邊界是否定義
6. **整合**:外部系統整合是否識別、失敗模式是否定義
7. **可觀測性**Logs、Metrics、Traces 是否完整
8. **資料完整性**:資料是否可能遺失、交易邊界是否合適
9. **過度設計偵測**:過於複雜的架構決策
10. **不足設計偵測**:遺漏的需求對應
## 審查輸出格式
- Traceability Gaps
- Missing Decisions
- Over-Engineering
- Under-Engineering
- Risks
- Required Revisions
## 閘道決策
- **PASS**:所有修補已套用,無剩餘阻擋
- **CONDITIONAL PASS**:存在不阻擋 Planner 交接的小缺口
- **FAIL**:需要重大修訂,需回到 `design-architecture`
## 不應做的事
- 不以互動方式提問
- 不改變 PRD 範圍
- 不從頭設計架構
- 不做實作層級決策
- 不拆分任務

View File

@ -0,0 +1,223 @@
---
name: challenge-architecture
description: "Silent audit and batch review of architecture decisions. Validates traceability, scalability, consistency, security, integration, observability, and detects over/under-engineering. Updates the single architecture file in place."
---
Perform a silent, structured audit of the architecture document against the PRD. Produce a single batch review with fixed output groups. Apply all fixes directly to the architecture file. Do not ask interactive questions.
**Announce at start:** "I'm using the challenge-architecture skill to audit and review the architecture."
## Primary Input
- `docs/architecture/{feature}.md`
- `docs/prd/{feature}.md`
## Primary Output (STRICT PATH)
- Updated `docs/architecture/{feature}.md`
This is the **only** file artifact in the Architect pipeline. Review findings and fixes are applied directly to this file. No intermediate files are written.
## Audit Mode
This skill operates in **silent audit / batch review** mode:
- Read the architecture document and PRD in full
- Perform all validation phases silently
- Produce a single structured review with all findings grouped into fixed categories
- Apply all fixes directly to the architecture document
- Do NOT ask questions one at a time or interactively prompt the user
## Audit Phases
Perform the following validations silently, collecting all findings before producing the review.
### Phase 1: Traceability
For every architectural element, verify it traces back to at least one PRD requirement:
- Every API endpoint serves a PRD functional requirement
- Every DB table serves a data requirement from FRs or NFRs
- Every service boundary serves a domain responsibility from the PRD scope
- Every async flow serves a PRD requirement
- Every error handling strategy serves a PRD edge case or NFR
- Every consistency decision serves a PRD requirement
- Every security boundary serves a security or compliance requirement
- Every integration boundary serves an external system requirement
- Every observability decision serves an NFR
### Phase 2: Coverage
For every PRD requirement, verify it is covered by the architecture:
- Every functional requirement has at least one architectural component
- Every NFR has at least one architectural decision
- Every edge case has an error handling strategy
- Every acceptance criterion has architectural support
### Phase 3: Scalability
- Can each service scale independently?
- Are there single points of failure?
- Are there bottlenecks that prevent horizontal scaling?
- Is database scaling addressed?
- Are there unbounded data growth scenarios?
### Phase 4: Consistency
- Is the consistency model explicit for each data domain?
- Are eventual consistency windows acceptable for the use case?
- Are race conditions identified and mitigated?
- Is idempotency designed for operations that require it?
- Are distributed transaction boundaries clear?
- Is the deduplication strategy sound?
- Are retry semantics defined for all async operations?
- Is the outbox pattern used where needed?
- Are saga/compensation patterns defined for multi-step operations?
### Phase 5: Security
- Are authentication boundaries clearly defined?
- Is authorization modeled correctly?
- Is service-to-service authentication specified?
- Is token propagation defined?
- Is tenant isolation defined (for multi-tenant systems)?
- Is secret management addressed?
- Are there data exposure risks in API responses?
- Is audit logging specified for sensitive operations?
### Phase 6: Integration
- Are all external system integrations identified?
- Is the integration pattern appropriate for each?
- Are rate limits and quotas addressed?
- Are failure modes defined for each integration?
- Are retry strategies defined for transient failures?
- Is data transformation between systems addressed?
### Phase 7: Observability
- Are logs, metrics, and traces all specified?
- Is correlation ID propagation defined across services?
- Are SLOs defined for critical operations?
- Are alert conditions and thresholds specified?
- Can the system be debugged end-to-end from logs and traces?
### Phase 8: Data Integrity
- Are there scenarios where data could be lost?
- Are transaction boundaries appropriate?
- Are there scenarios where data could become inconsistent?
- Is data ownership clear?
- Are cascading deletes or updates handled correctly?
### Phase 9: Over-Engineering Detection
- Services that could be modules
- Patterns applied without PRD justification
- Storage choices exceeding requirements
- Async processing where sync would suffice
- Abstraction layers without clear benefit
- Consistency guarantees stronger than requirements
- Security boundaries more complex than the threat model
- Observability granularity beyond operational need
### Phase 10: Under-Engineering Detection
- Missing error handling for PRD edge cases
- Missing idempotency for operations requiring it
- Missing NFR accommodations
- Missing async processing for non-blocking requirements
- Missing security boundaries where the PRD requires them
- Missing observability for critical operations
- Missing consistency model specification
- Missing integration failure handling
- Missing retry strategies for external dependencies
## Review Output Format
After completing all audit phases, produce a single structured review section. Append or update the `## Architecture Review` section in `docs/architecture/{feature}.md` with the following fixed groups:
```markdown
## Architecture Review
### Traceability Gaps
List every architectural element that cannot be traced to a PRD requirement, and every PRD requirement not covered by the architecture.
| Element / Requirement | Issue | Proposed Fix |
|----------------------|-------|-------------|
| ... | Untraceable / Uncovered | ... |
### Missing Decisions
List required architectural decisions that are absent or incomplete.
- [ ] ...
### Over-Engineering
List elements that exceed what the PRD requires.
- ... (specific item, why it is over-engineered, proposed simplification)
### Under-Engineering
List PRD requirements that lack adequate architectural support.
- ... (specific requirement, what is missing, proposed addition)
### Risks
| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| ... | High/Medium/Low | High/Medium/Low | ... |
### Required Revisions
Numbered list of all changes that MUST be applied before handoff:
1. ...
2. ...
```
After producing the review, apply all Required Revisions directly to `docs/architecture/{feature}.md`.
## Gate Decision
After applying revisions, evaluate the final state:
- **PASS** — All revisions applied, no remaining blockers
- **CONDITIONAL PASS** — Minor gaps remain but do not block Planner handoff
- **FAIL** — Significant revision required; return to `design-architecture`
Record the gate decision at the end of the Architecture Review section.
If FAIL, do NOT proceed to `finalize-architecture`. The architecture must be redesigned in `design-architecture` first.
If PASS or CONDITIONAL PASS, proceed to `finalize-architecture`.
## Guardrails
This is a pure validation and revision skill.
Do:
- Audit the architecture silently and produce a single batch review
- Validate traceability, scalability, consistency, security, integration, observability
- Detect over-engineering and under-engineering
- Propose specific fixes for all identified issues
- Apply all fixes directly to `docs/architecture/{feature}.md`
- Record the gate decision
Do not:
- Ask questions interactively
- Change PRD requirements or scope
- Design architecture from scratch
- Make implementation-level decisions
- Break down tasks or create milestones
- Write test cases
- Produce any file artifact other than `docs/architecture/{feature}.md`
## Transition
If gate decision is PASS or CONDITIONAL PASS, invoke `finalize-architecture` for final completeness check and format validation.

View File

@ -0,0 +1,34 @@
# 一致性與交易設計 (Consistency Transaction Design) 知識合約指南
## 概述
`consistency-transaction-design` 是知識合約,用來提供一致性与交易設計的原則與模式。涵蓋 Strong vs Eventual Consistency、Idempotency、Deduplication、Retry、Outbox Pattern、Saga 與 Compensation。供 `design-architecture` 在定義一致性模型時參考。此合約取代了原有的 `idempotency-design`
## 核心原則
### CAP Theorem
- Consistency每次讀取都獲得最新寫入或錯誤
- Availability每個請求都獲得回應不保證是最新寫入
- Partition tolerance網路分割時系統仍可運作
- 三者無法同時滿足,根據業務需求選擇。
### 一致性光譜
- **Strong Consistency**:讀取永遠返回最新寫入
- **Eventual Consistency**:讀取可能返回過時資料,最終收斂
- **Session Consistency**:同一 session 內讀取看到自己的寫入
- **Causal Consistency**:讀取遵守因果順序
## 設計重點
- **一致性模型選擇**:按資料域選擇,而非按系統選擇
- **Idempotency 設計**何時需要、Key 策略、TTL、儲存位置
- **Deduplication**Idempotency key、Content hash、Sequence number
- **Retry**Fixed interval、Exponential backoff、Circuit breaker
- **Outbox Pattern**:確保可靠事件發布
- **Saga Pattern**跨服務分散式交易的補償機制Choreography vs Orchestration
## 知識合約職責
- 提供一致性與交易設計的理論指引
- 不直接產生一致性模型規格(由 `design-architecture` 的 Consistency Model 章節負責)
## 不應做的事
- 不替系統選擇特定一致性策略
- 不定義具體的 Idempotency key 格式或 TTL
- 不產生實作程式碼

View File

@ -0,0 +1,156 @@
---
name: consistency-transaction-design
description: "Knowledge contract for consistency and transaction design. Provides principles and patterns for strong vs eventual consistency, idempotency, deduplication, retry, outbox pattern, saga, and compensation. Referenced by design-architecture when defining consistency model. Subsumes idempotency-design."
---
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing consistency and transaction models. It does not produce artifacts directly.
This knowledge contract subsumes the previous `idempotency-design` contract. All idempotency concepts are included here alongside broader consistency and transaction patterns.
## Core Principles
### CAP Theorem
- **Consistency**: Every read receives the most recent write or an error
- **Availability**: Every request receives a (non-error) response, without guarantee that it contains the most recent write
- **Partition tolerance**: The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network
- You cannot have all three simultaneously. Choose based on business requirements.
### Consistency Spectrum
- **Strong consistency**: Read always returns the latest write. Simplest mental model, but limits availability and scalability.
- **Causal consistency**: Reads respect causal ordering. Good for collaborative systems.
- **Eventual consistency**: Reads may return stale data, but converge over time. Highest availability and scalability.
- **Session consistency**: Reads within a session see their own writes. Good compromise for user-facing systems.
## Consistency Model Selection
### When to Use Strong Consistency
- Financial transactions (balances must be accurate)
- Inventory management (overselling is unacceptable)
- Unique constraint enforcement (duplicate records are unacceptable)
- Configuration data (wrong config causes system errors)
### When to Use Eventual Consistency
- Read-heavy workloads with high availability requirements
- Derived data (counts, aggregates, projections)
- Notification delivery (delay is acceptable)
- Analytics data (trend accuracy is sufficient)
- Search indexes (slight staleness is acceptable)
### Design Considerations
- Define the consistency model per data domain, not per system
- Document the expected replication lag and its business impact
- Define conflict resolution strategy for eventual consistency (last-write-wins, merge, manual)
- Define staleness tolerance per read pattern (how stale is acceptable?)
## Idempotency Design
### What is Idempotency?
An operation is idempotent if executing it once has the same effect as executing it multiple times.
### When Idempotency is Required
- Any operation triggered by user action (network retries, browser refresh)
- Any operation triggered by webhook (delivery may be duplicated)
- Any operation processed from a queue (at-least-once delivery)
- Any operation that modifies state (creates, updates, deletes)
### Idempotency Key Strategy
- **Source**: Where does the key come from? (client-generated, server-assigned, composite)
- **Format**: UUID, hash of request content, or composite key (user_id + action + timestamp)
- **TTL**: How long is the key stored? Must be long enough to catch retries, short enough to avoid storage bloat
- **Storage**: Where are idempotency keys stored? (database, Redis, in-memory)
### Idempotency Response Behavior
- **First request**: Process normally, return success response
- **Duplicate request**: Return the original response (stored alongside the idempotency key)
- **Concurrent request**: Return 409 Conflict or 425 Too Early (if the original request is still processing)
### Idempotency Collision Handling
- Different requests with the same key must be detected and rejected
- Keys must be unique per operation type and per client/tenant scope
## Deduplication
### Patterns
- **Idempotency key**: For request-level deduplication
- **Content hash**: For message-level deduplication (hash the message content)
- **Sequence number**: For ordered message deduplication (track last processed sequence)
- **Tombstone**: Mark processed messages to prevent reprocessing
### Design Considerations
- Define deduplication window (how long to track processed messages)
- Define deduplication scope (per-producer, per-consumer, per-queue)
- Define storage for deduplication state (Redis with TTL, database table)
- Define cleanup strategy for deduplication state
## Retry
### Retry Patterns
- **Fixed interval**: Retry at fixed intervals (simple, but may overload recovering service)
- **Exponential backoff**: Increase delay between retries (recommended default)
- **Exponential backoff with jitter**: Add randomness to prevent thundering herd
- **Circuit breaker**: Stop retrying after consecutive failures, try again after cooldown
### Design Considerations
- Define maximum retry count per operation
- Define backoff strategy (base, max, multiplier)
- Define retryable vs non-retryable errors
- Retryable: network timeout, 503, 429
- Non-retryable: 400, 401, 403, 404, 409
- Define retry budget (max retries per time window to prevent runaway retries)
- Define what to do after max retries (DLQ, alert, manual intervention)
## Outbox Pattern
### When to Use
- When you need to atomically write to a database and publish a message
- When you cannot use a distributed transaction across database and message broker
- When you need at-least-once message delivery guarantee
### How It Works
1. Write business data and outbox message to the same database transaction
2. A separate process reads the outbox table and publishes messages to the broker
3. Mark outbox messages as published after successful delivery
4.failed deliveries are retried by the outbox reader
### Design Considerations
- Outbox table must be in the same database as business data
- Outbox reader must handle duplicate delivery (consumer must be idempotent)
- Outbox reader polling interval affects delivery latency
- Define outbox message TTL and cleanup strategy
## Saga Pattern
### When to Use
- When a business operation spans multiple services and requires distributed transaction semantics
- When you need to rollback if any step fails
### Choreography-Based Saga
- Each service publishes events that trigger the next step
- No central coordinator
- Services must listen for events and decide what to do
- Compensation: each service publishes a compensation event if a step fails
### Orchestration-Based Saga
- A central orchestrator calls each service in sequence
- Orchestrator maintains saga state and decides which step to execute next
- Compensation: orchestrator calls compensation operations in reverse order
- More visible and debuggable, but adds a single point of failure
### Design Considerations
- Define saga steps and order
- Define compensation for each step (what to do if this step or a later step fails)
- Define saga timeout and expiration
- Define how to handle partial failures (which steps completed, which need compensation)
- Consider whether choreography or orchestration is more appropriate
- Choreography: simpler, more decoupled, harder to debug
- Orchestration: more visible, easier to debug, more coupled
## Anti-Patterns
- **Assuming strong consistency when using eventually consistent storage**: Be explicit about consistency guarantees
- **Missing idempotency for queue consumers**: Queue delivery is at-least-once, consumers must be idempotent
- **Infinite retries without backoff**: Always use exponential backoff with a maximum
- **Distributed transactions across services**: Use saga pattern instead of trying to enforce ACID across services
- **Outbox without deduplication**: Outbox pattern guarantees at-least-once delivery, consumers must handle duplicates
- **Saga without compensation**: Every saga step must have a defined compensation action
- **Missing conflict resolution for eventually consistent data**: Define how conflicts are resolved when they inevitably occur

View File

@ -0,0 +1,28 @@
# 資料建模 (Data Modeling) 知識合約指南
## 概述
`data-modeling` 是知識合約用來定義資料庫結構、分區鍵、索引、查詢模式、反正規化策略、TTL/快取與資料所有權。供 `design-architecture` 在設計資料模型時參考。
## 核心原則
- 資料模型必須由查詢與寫入模式驅動,而非理論純度
- 每個資料表必須有明確目的,可追溯到 PRD 需求
- 索引必須有明確的查詢模式支撐
- 資料所有權必須明確:每筆資料只屬於一個服務
## 設計重點
- **資料表定義**:資料表名稱、目的、欄位定義、主鍵、外鍵關係
- **索引設計**:索引必須有查詢模式支撐,避免 speculation 索引
- **分區鍵**:分散式資料儲存的分區鍵選擇、熱分割風險
- **關係**One-to-one、One-to-many、Many-to-many(含義與 cascade 行為)
- **反正規化策略**:何時反正規化、資料同步機制、過時容忍度
- **TTL 與快取**Ephemeral 資料的 TTL、快取類型與失效策略
- **資料所有權**:每筆資料的唯一擁有服務、其他服務透過 API 或事件存取
## 知識合約職責
- 提供資料建模的理論指引
- 不直接產生資料庫結構定義(由 `design_database_schema` 負責格式)
## 不應做的事
- 不替特定資料庫選擇資料表名稱或欄位
- 不定義具體的索引名稱或類型
- 不產生 Schema 檔案

View File

@ -0,0 +1,142 @@
---
name: data-modeling
description: "Knowledge contract for defining database schemas, partition keys, indexes, query patterns, denormalization strategy, TTL/caching, and data ownership. Referenced by design-architecture when designing data models."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing database schemas and data models.
## Core Principles
- Data models must be driven by query and write patterns, not theoretical purity
- Each table or collection must serve a clear purpose traced to PRD requirements
- Indexes must be justified by identified query patterns
- Data ownership must be unambiguous: each data item belongs to exactly one service
## Table Definitions
For each table or collection, define:
- Table name and purpose (traced to PRD requirement)
- Column definitions:
- Name
- Data type
- Nullable or not null
- Default value (if any)
- Constraints (unique, check, etc.)
- Primary key
- Foreign keys and relationships
- Data volume estimates (when relevant for storage selection)
## Index Design
Indexes must be justified by query patterns:
- Identify the queries this table must support
- Design indexes to cover those queries
- Avoid speculative indexes "just in case"
- Consider write amplification: every index slows writes
Index justification format:
- Index name
- Columns (with sort direction)
- Type (unique, non-unique, partial, composite)
- Query pattern it serves
- Estimated selectivity
## Partition Keys
When designing distributed data stores:
- Partition key must distribute data evenly across nodes
- Partition key should align with the most common access pattern
- Consider hot partition risks
- Define partition strategy (hash, range, composite)
## Relationships
Define relationships explicitly:
- One-to-one
- One-to-many (with foreign key placement)
- Many-to-many (with junction table)
For each relationship:
- Direction of access (which side queries the other)
- Cardinality (exactly N, at most N, unbounded)
- Nullability (is the relationship optional?)
- Cascade behavior (what happens on delete?)
## Denormalization Strategy
Denormalize when:
- A query needs data from multiple entities and joins are expensive or unavailable
- Read frequency significantly exceeds write frequency
- The denormalized data has a clear source of truth that can be kept in sync
Do not denormalize when:
- The data changes frequently and consistency is critical
- Joins are cheap and the data store supports them well
- The denormalization creates complex synchronization logic
- There is no clear source of truth
For each denormalized field:
- Identify the source of truth
- Define the synchronization mechanism (eventual consistency, sync on read, sync on write)
- Define the staleness tolerance
## TTL and Caching
### TTL (Time-To-Live)
Define TTL for:
- Ephemeral data (sessions, temporary tokens, idempotency keys)
- Time-bounded data (logs, analytics, expired records)
- Data that must be purged after a regulatory period
For each TTL:
- Duration and basis (absolute time, sliding window, last access)
- Action on expiration (delete, archive, revoke)
### Caching
Define caching for:
- Frequently read, rarely written data
- Computed aggregates that are expensive to recalculate
- Data that is accessed across service boundaries
For each cache:
- Cache type (in-process, distributed, CDN)
- Invalidation strategy (TTL-based, event-based, write-through)
- Staleness tolerance
- Cache miss behavior (stale-while-recompute, block-and-fetch)
## Data Ownership
Each piece of data must have exactly one owner:
- The owning service is the single source of truth
- Other services access that data via the owner's API or events
- No service reads directly from another service's data store
- If data is needed in multiple places, replicate via events with a clear source of truth
Data ownership format:
| Data Item | Owning Service | Access Pattern | Replication Strategy |
|----------|---------------|----------------|---------------------|
| ... | ... | ... | ... |
## Query Pattern Analysis
For each table, document:
- Primary query patterns (by which columns/keys is data accessed)
- Write patterns (insert-heavy, update-heavy, or mixed)
- Read-to-write ratio (when relevant)
- Consistency requirements (strong, eventual, or tunable)
- Scale expectations (rows per day, rows total, growth rate)
This analysis drives:
- Index selection
- Partition key selection
- Storage engine selection
- Denormalization decisions
## Anti-Patterns
- Tables without a clear PRD requirement
- Indexes without a documented query pattern
- Shared tables across service boundaries
- Premature denormalization without a read/write justification
- Missing foreign key constraints where referential integrity is required
- Data models that assume a specific storage engine without justification

View File

@ -0,0 +1,54 @@
# 設計架構 (Design Architecture) 技能指南
## 概述
`design-architecture` 是 Architect Pipeline 的核心步驟,用來基於 PRD 需求設計完整系統架構,產出單一嚴格輸出檔案 `docs/architecture/{feature}.md`
## 輸入與輸出
### 輸入
- `docs/prd/{feature}.md`
### 輸出
- `docs/architecture/{feature}.md`(唯一檔案,所有交付物必須嵌入此檔案)
## 必備章節18 個)
1. Overview
2. System Architecture
3. Service Boundaries
4. Data Flow
5. Database Schema
6. API Contract
7. Async / Queue Design
8. Consistency Model
9. Error Model
10. Security Boundaries
11. Integration Boundaries
12. Observability
13. Scaling Strategy
14. Non-Functional Requirements
15. Mermaid Diagrams
16. ADR
17. Risks
18. Open Questions
## 設計原則
1. **High Availability** — 設計容錯與復原力,勝過完美一致性
2. **Scalability** — 設計水平擴展,勝過垂直擴展
3. **Stateless First** — 偏好無狀態服務,外部化狀態到資料庫或快取
4. **API First** — 先定義合約再實作API 是主要介面
5. **Event Driven First** — 偏好事件驅動溝通
6. **Async First** — 偏好非同步處理
## 知識合約與可交付技能引用
- **知識合約**13 個system-decomposition、api-contract-design、data-modeling、distributed-system-basics、architecture-patterns、storage-knowledge、async-queue-design、error-model-design、security-boundary-design、consistency-transaction-design、integration-boundary-design、observability-design、migration-rollout-design
- **可交付技能**5 個generate_mermaid_diagram、design_database_schema、generate_openapi_spec、write_adr、evaluate_tech_stack
## 防範佔位符規則
範例僅供說明用途。請勿重複使用範例中的佔位符元件、欄位、端點或結構,否則會與 PRD 要求不符。
## 不應做的事
- 不改變 PRD 範圍或需求
- 不建立任務拆分或里程碑
- 不寫測試案例
- 不寫實作程式碼
- 不選擇特定函式庫或框架
- 不產生 `docs/architecture/{feature}.md` 以外的檔案

View File

@ -0,0 +1,304 @@
---
name: design-architecture
description: "Design system architecture based on PRD requirements. The Architect pipeline's core step, producing the single strict output file. References deliverable skills for format details and knowledge contracts for design principles."
---
This skill produces the complete architecture document for a feature, including all required deliverables.
**Announce at start:** "I'm using the design-architecture skill to design the system architecture."
## Primary Input
- `docs/prd/{feature}.md` (required)
## Primary Output (STRICT PATH)
- `docs/architecture/{feature}.md`
This is the **only** file artifact produced by the Architect pipeline. No intermediate files are written to disk. All deliverables — diagrams, schemas, specs, ADRs — must be embedded within this single document.
## Hard Gate
Do NOT start this skill if the PRD has unresolved ambiguities that block architectural decisions. Resolve them with the PM first.
## Process
You MUST complete these steps in order:
1. **Read the PRD** at `docs/prd/{feature}.md` end-to-end to understand all requirements
2. **Apply internal analysis** from the `analyze-prd` step (if performed) to understand which knowledge domains are relevant
3. **Design each architecture section** based on PRD requirements and relevant knowledge domains
4. **Apply knowledge contracts** as needed:
- `system-decomposition` when designing service boundaries
- `api-contract-design` when defining API contracts
- `data-modeling` when designing database schema
- `distributed-system-basics` when dealing with distributed concerns
- `architecture-patterns` when selecting architectural patterns
- `storage-knowledge` when making storage technology decisions
- `async-queue-design` when designing asynchronous workflows
- `error-model-design` when defining error handling
- `security-boundary-design` when defining auth, authorization, tenant isolation
- `consistency-transaction-design` when defining consistency model, idempotency, saga
- `integration-boundary-design` when defining external API integration patterns
- `observability-design` when defining logs, metrics, traces, alerts, SLOs
- `migration-rollout-design` when defining rollout strategy, feature flags, rollback
5. **Apply deliverable skills** for format requirements when producing sections:
- `generate_mermaid_diagram` when producing the Mermaid Diagrams section
- `design_database_schema` when producing the Database Schema section
- `generate_openapi_spec` when producing the API Contract section
- `write_adr` when producing the ADR section
- `evaluate_tech_stack` when producing the Technology Stack subsection
6. **Ensure traceability** — every architectural decision must trace back to at least one PRD requirement
7. **Verify completeness** — all 18 required sections are present and substantive
8. **Write the architecture document** to `docs/architecture/{feature}.md`
## Architect Behavior Principles
Apply these principles in priority order when making design decisions:
1. **High Availability** — Design for fault tolerance and resilience over perfect consistency
2. **Scalability** — Design for horizontal scaling over vertical scaling
3. **Stateless First** — Prefer stateless services; externalize state to databases or caches
4. **API First** — Define contracts before implementation; APIs are the primary interface
5. **Event Driven First** — Prefer event-driven communication for cross-service coordination
6. **Async First** — Prefer asynchronous processing for non-realtime operations
## Anti-Placeholder Rule
Examples in deliverable skills and this template are illustrative only. Do not reuse placeholder components, fields, endpoints, or schemas unless explicitly required by the PRD. Every element in the architecture document must be grounded in actual requirements, not copied from examples.
## Architecture Document Template
The following 18 sections are required. If a section is not applicable, write `N/A` with a brief reason. Each section states what it must contain and which deliverable skill to reference for format details.
```markdown
# Architecture: {Feature Name}
## Overview
High-level description of the system architecture. Map every major PRD requirement to an architectural component. Summarize the system's purpose, key design decisions, and architectural style.
### Requirement Traceability
| PRD Requirement | Architectural Component |
|----------------|------------------------|
| ... | ... |
## System Architecture
Describe the complete system architecture including all services, databases, message queues, caches, and external integrations. Show how components are organized and how they communicate.
### Technology Stack
Reference `evaluate_tech_stack` deliverable skill for evaluation format.
| Layer | Technology | Justification |
|-------|-----------|---------------|
| ... | ... | ... |
If the feature has no backend component, write `N/A` with a brief reason.
### Component Architecture
Describe each major component, its responsibility, and how it fits into the overall system.
## Service Boundaries
Define service boundaries with clear responsibilities.
For each service or module:
- Name and single responsibility
- Owned data
- Communication patterns with other services (sync, async, event-driven)
- Potential coupling points and mitigation
### Communication Matrix
| From | To | Pattern | Protocol | Purpose |
|------|----|---------|----------|---------|
| ... | ... | ... | ... | ... |
## Data Flow
Describe how data moves through the system end-to-end. Include request lifecycle, background job processing, event propagation, and data transformation steps.
## Database Schema
Reference `design_database_schema` deliverable skill for table definition format, index format, partition key format, and relationship format.
Define all tables with field names, types, constraints, indexes, partition keys, and relationships. Include denormalization strategy and migration strategy where applicable.
If the feature requires no database changes, write `N/A` with a brief reason.
## API Contract
Reference `generate_openapi_spec` deliverable skill for endpoint definition format, error code format, idempotency format, and pagination format.
Define all API endpoints with method, path, request/response schemas, error codes, idempotency, and pagination. Include an endpoint catalog and endpoint details.
## Async / Queue Design
Define asynchronous operations and their behavior. If the feature has no asynchronous requirements, write `N/A` with a brief reason.
For each async operation:
- Operation name and trigger
- Queue or event topic
- Producer and consumer
- Retry policy (max retries, backoff, DLQ)
- Ordering guarantees
- Timeout and cancellation behavior
## Consistency Model
Define the consistency guarantees of the system. Reference `consistency-transaction-design` knowledge contract for design principles.
- Strong vs eventual consistency per data domain
- Idempotency design per idempotent operation
- Deduplication and retry strategy
- Outbox pattern usage (when applicable)
- Saga / compensation patterns (when applicable)
If the feature has no consistency or idempotency requirements, write `N/A` with a brief reason.
## Error Model
Define error handling strategy across the system.
- Error categories (client errors, server errors, business rule violations, timeout, cascading failure)
- Error propagation strategy (fail-fast, graceful degradation, circuit breaker)
- Error response format
- PRD edge case mapping
## Security Boundaries
Define security architecture. Reference `security-boundary-design` knowledge contract for design principles.
- Authentication mechanism
- Authorization model
- Service identity and service-to-service auth
- Token propagation strategy
- Tenant isolation
- Secret management
- Audit logging
If the feature has no security implications, write `N/A` with a brief reason.
## Integration Boundaries
Define all integrations with external systems. Reference `integration-boundary-design` knowledge contract for design principles.
For each external system:
- Integration pattern (API, webhook, polling, event)
- Rate limits and quotas
- Failure modes and fallback
- Retry strategy
- Data contract
- Authentication
If the feature has no external integrations, write `N/A` with a brief reason.
## Observability
Define observability strategy. Reference `observability-design` knowledge contract for design principles.
- Logs: levels, format, aggregation
- Metrics: business metrics, system metrics, naming conventions
- Traces: distributed tracing, correlation ID, span boundaries
- Alerts: conditions, thresholds, routing
- SLOs: availability, latency, error budget
## Scaling Strategy
Define how the system scales based on NFRs.
- Horizontal scaling approach
- Database scaling (read replicas, sharding, partitioning)
- Cache scaling
- Queue scaling
- Auto-scaling policies
- Bottleneck analysis
## Non-Functional Requirements
Document all NFRs from the PRD and how the architecture addresses each one.
| NFR | Requirement | Architectural Decision | Verification Method |
|-----|-------------|----------------------|---------------------|
| ... | ... | ... | ... |
## Mermaid Diagrams
Reference `generate_mermaid_diagram` deliverable skill for diagram format and guidelines.
Produce at minimum:
- 1 System Architecture Diagram
- 1 Sequence Diagram
- 1 Data Flow Diagram
Additional diagrams as needed (event flow, state machine, etc.).
## ADR
Reference `write_adr` deliverable skill for ADR format.
Document significant architectural decisions. Each ADR must include Context, Decision, Consequences, and Alternatives. Minimum 1 ADR.
## Risks
| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| ... | High/Medium/Low | High/Medium/Low | ... |
## Open Questions
List any unresolved questions that need PM or Engineering input.
1. ...
2. ...
```
## Completeness Check
Before finalizing the architecture document, verify:
1. All 18 required sections are present (or explicitly marked N/A with reason)
2. Every PRD functional requirement is traced to at least one architectural component
3. Every PRD NFR is traced to at least one architectural decision
4. Every architecture section that is not N/A has substantive content grounded in PRD requirements
5. All API endpoints map to PRD functional requirements
6. All DB tables map to data requirements from functional requirements or NFRs
7. All async flows map to PRD requirements
8. All error handling strategies map to PRD edge cases
9. ADRs exist for all significant decisions (minimum 1)
10. At least 3 Mermaid diagrams are present (system, sequence, data flow)
11. No placeholder content reused from examples — all content must be grounded in actual requirements
## Guardrails
This is a pure Architecture skill.
Do:
- Design system structure and boundaries
- Define API contracts and data models
- Define error handling, retry, and consistency strategies
- Define security boundaries and integration patterns
- Reference deliverable skills for format requirements of specific sections
- Reference knowledge contracts for design principles
- Ensure traceability to PRD requirements
- Ensure all content is grounded in actual PRD requirements, not placeholder examples
Do not:
- Change PRD requirements or scope
- Create task breakdowns, milestones, or deliverables
- Write test cases or test plans
- Write implementation code or pseudocode
- Choose specific libraries or frameworks at the implementation level
- Prescribe code patterns, class structures, or function-level logic
- Produce any file artifact other than `docs/architecture/{feature}.md`
- Reuse placeholder components, fields, endpoints, or schemas from examples unless explicitly required by the PRD
The Architect defines HOW the system is structured.
The Engineering defines HOW the code is written.
## Transition
After completing the architecture document, invoke `challenge-architecture` to audit and review the architecture.

View File

@ -0,0 +1,32 @@
# 設計資料庫結構 (Design Database Schema) 技能指南
## 概述
`design_database_schema` 是可交付技能,用來產生資料庫結構定義,包含資料表、集合、分區鍵、索引、關係、反正規化策略與遷移策略。支援 PostgreSQL、Cassandra、MongoDB、Redis、Surrealdb。供 `design-architecture` 在產生 Database Schema 章節時參考。
## 核心原則
此技能提供具體的格式要求與完整性檢查清單。Schema 定義必須具體到足以供實作使用。
## 支援的資料庫
| 資料庫 | 適用場景 | 不適用場景 |
|--------|---------|-----------|
| PostgreSQL | 關聯資料、ACID 交易、複雜查詢 | 大量寫入吞吐量、寬列存取模式 |
| Cassandra | 高寫入吞吐量、時間序列、寬列存取 | 複雜 JOIN、ACID 交易、隨機查詢 |
| MongoDB | 文件資料、彈性結構、快速迭代 | 複雜 JOIN、嚴格 ACID、關聯資料 |
| Redis | 快取、工作階段、速率限制、即時排行榜 | 持久化主資料、複雜查詢 |
| SurrealDB | 多模型資料、即時、圖形關係 | 生態系不成熟 |
## 必備元素
- **資料表/集合**:每個實體的完整定義(含欄位名稱、類型、限制式)
- **索引**:每個索引必須有查詢模式支撐
- **分區鍵**Cassandra、DynamoDB 等分散式資料庫
- **關係**:外鍵、參考完整性、 cascade 行為
- **反正規化策略**:理由與一致性影響
- **遷移策略**:向後相容的遷移方法
## 防範佔位符規則
範例僅供說明用途。不要重複使用範例中的佔位符資料表名稱、欄位名稱、類型、索引或關係。
## 不應做的事
- 不替非 PRD 要求的實體建立資料表
- 不產生與實際需求無關的欄位或索引
- 不產生獨立 Schema 檔案(所有內容必須嵌入 `docs/architecture/{feature}.md`

View File

@ -0,0 +1,127 @@
---
name: design_database_schema
description: "Produce database schema definitions including tables, collections, partition keys, indexes, relationships, denormalization strategy, and migration strategy. Supports PostgreSQL, Cassandra, MongoDB, Redis, SurrealDB. A deliverable skill referenced by design-architecture."
---
This skill provides guidance and format requirements for producing database schema definitions within the architecture document.
This is a deliverable skill, not a workflow skill. It is referenced by `design-architecture` when producing database schema artifacts.
## Purpose
The Architect must produce detailed database schema definitions that are specific enough for implementation. Schemas define the data layer of the system and must include tables, fields, indexes, partition keys, relationships, and migration strategies.
## Supported Databases
When designing database schema, consider the appropriate database for each data domain:
| Database | Best For | Not Ideal For |
|----------|----------|---------------|
| PostgreSQL | Relational data, ACID transactions, complex queries | Massive write throughput, wide-column access patterns |
| Cassandra | High write throughput, time-series, wide-column access patterns | Complex joins, ACID transactions, ad-hoc queries |
| MongoDB | Document data, flexible schema, rapid iteration | Complex joins, strict ACID, relational data |
| Redis | Caching, sessions, rate limiting, real-time leaderboards | Persistent primary data, complex queries |
| SurrealDB | Multi-model data, real-time, graph relationships | Unknown maturity, limited ecosystem |
## Schema Definition Format
Each table/collection must include:
### Table Definition
```markdown
### {table_name}
**Purpose**: {Brief description of what this table stores}
| Column | Type | Constraints | Default | Description |
|--------|------|-------------|---------|-------------|
| id | UUID | PK, NOT NULL | gen_random_uuid() | Primary key |
| ... | ... | ... | ... | ... |
**Indexes**:
| Index Name | Columns | Type | Justification |
|-----------|---------|------|---------------|
| idx_{table}_{columns} | {columns} | B-tree / Hash / GIN | {query pattern this index supports} |
**Partition Key**: {partition_key} (if applicable)
**Foreign Keys**:
| Column | References | On Delete |
|--------|-----------|-----------|
| {column} | {table}.{column} | CASCADE / SET NULL / RESTRICT |
```
### Collection Definition (for document databases)
```markdown
### {collection_name}
**Purpose**: {Brief description}
**Document Schema**:
- `{field}`: `{type}` — {description}
- ...
**Indexes**:
| Index Name | Fields | Type | Justification |
|-----------|--------|------|---------------|
| ... | ... | ... | ... |
**Partition Key**: {partition_key} (if applicable)
```
## Required Schema Elements
### Tables / Collections
- Every entity identified in the architecture must have a table or collection definition
- Each table must have a clear purpose statement
- Each field must have type, constraints, and description
### Indexes
- Every index must be justified by a specific query pattern
- Consider composite indexes for multi-column queries
- Consider partial indexes for filtered queries
- Consider unique indexes for business constraints
### Partition Keys (when applicable)
- Define partition keys for Cassandra, DynamoDB, or similar databases
- Justify partition key choice based on access patterns
- Document partition distribution expectations
### Relationships
- Define foreign key relationships with referential integrity constraints
- Document one-to-one, one-to-many, many-to-many relationships
- Define junction tables for many-to-many relationships
- Document data ownership: each piece of data belongs to exactly one service
### Denormalization Strategy
- Document any intentional denormalization
- Justify each denormalization decision with a specific read pattern
- Describe the consistency implications of each denormalization
- Define the synchronization mechanism for denormalized data
### Migration Strategy
- Document migration approach for schema changes
- Define backward-compatible migration strategy
- Note any data migration steps required
- Define rollback strategy for schema changes
## Knowledge Contract Reference
This deliverable skill works alongside the `data-modeling` knowledge contract:
- `data-modeling` provides the theoretical guidance on data modeling principles
- This skill provides the concrete output format and completeness requirements
## Anti-Placeholder Rule
Examples in this skill are illustrative only. Do not reuse placeholder table names, column names, types, indexes, or relationships unless explicitly required by the PRD. Every table, field, index, and relationship must be grounded in actual requirements and match the architecture document's data model.
## Embedding in Architecture Document
All database schema definitions must be embedded within the `## Database Schema` section of `docs/architecture/{feature}.md`.
Do NOT produce separate schema files. All schema definitions must be within the single architecture document.

View File

@ -0,0 +1,26 @@
# 分散式系統基礎 (Distributed System Basics) 知識合約指南
## 概述
`distributed-system-basics` 是知識合約用來理解與設計分散式系統的相關考量At-least-once vs Exactly-once、Retry 行為、Duplicates、Idempotency、Timeout vs Failure、Partial Failure、Eventual Consistency 與 Ordering Guarantees。供 `design-architecture` 在處理分散式系統相關問題時參考。
## 核心原則
分散式系統中,網路呼叫會失敗、請求會重複、資料可能過時。設計時必須假設這些問題會發生,並系統性地處理它們。
## 設計重點
- **交付保證**At-Most-Once、At-Least-Once、Exactly-Once 的選擇框架
- **Retry 行為**:何時重試、何時不重試、退避策略
- **Duplicates**如何產生、如何處理Idempotency keys、Deduplication
- **Timeout vs Failure**Timeout 表示未知狀態,不是失敗狀態
- **Partial Failure**:多步驟操作失敗時的處理策略
- **Eventual Consistency**:使用時機與一致性窗口
- **Ordering Guarantees**Per-Partition vs Global vs None
## 知識合約職責
- 提供分散式系統設計的理論指引
- 說明各種模式的 Trade-offs 與適用場景
- 不直接產生實作程式碼或配置
## 不應做的事
- 不替系統選擇特定實現方式
- 不假設網路呼叫永遠成功
- 不忽略分散式帶來的複雜性

View File

@ -0,0 +1,163 @@
---
name: distributed-system-basics
description: "Knowledge contract for understanding and designing for distributed system concerns: at-least-once vs exactly-once, retry behavior, duplicate requests, idempotency, timeout vs failure, partial failure, eventual consistency, and ordering guarantees. Referenced by design-architecture when dealing with distributed concerns."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing for distributed system concerns.
## Delivery Guarantees
### At-Most-Once
- Message may be lost but never delivered twice
- Use when: loss is acceptable, retries are not, throughput is priority
- Trade-off: simplicity and speed at the cost of reliability
### At-Least-Once
- Message is never lost but may be delivered more than once
- Use when: loss is unacceptable, consumers are idempotent or can deduplicate
- Trade-off: reliability at the cost of requiring idempotency handling
- Most common default for production systems
### Exactly-Once
- Message is delivered once and only once
- Use when: duplicates are harmful and idempotency is hard or impossible
- Trade-off: significant complexity, performance overhead, and coordination cost
- Often achieved via idempotency + at-least-once rather than true exactly-once protocol
Choose the weakest guarantee that meets PRD requirements. Do not default to exactly-once unless the PRD requires it.
## Retry Behavior
### When to Retry
- Transient network failures
- Temporary resource unavailability (503, timeouts)
- Rate limit exceeded (429, with backoff)
- Upstream service failures (502, 504)
### When NOT to Retry
- Client errors (400, 401, 403, 404, 422)
- Business rule violations
- Malformed requests
- Non-retryable error codes explicitly defined in the API contract
### Retry Strategy Parameters
- Maximum retries: define per operation (typically 2-5)
- Backoff strategy:
- Fixed interval: predictable but may overwhelm recovering service
- Exponential backoff: increasingly longer waits (recommended default)
- Exponential backoff with jitter: adds randomness to avoid thundering herd
- Retry budget: limit total retries per time window to prevent cascading failure
### Retry Anti-Patterns
- Retrying non-idempotent operations without deduplication
- Infinite retries without a circuit breaker
- Synchronous retries that block the caller indefinitely
- Ignoring Retry-After headers
## Duplicate Requests
Duplicates arise from:
- Network retries
- Client timeouts with successful server processing
- Message queue redelivery
- User double-submit
Handling strategies:
- Idempotency keys (preferred for API operations)
- Deduplication at consumer level (for event processing)
- Natural idempotency (read operations, certain write patterns)
- Idempotency is covered in detail in the `idempotency-design` knowledge contract
## Timeout vs Failure
### Timeout
- The operation may have succeeded; you just do not know
- Must be handled as "unknown state" not "failed state"
- Requires idempotency or state reconciliation
### Failure
- The operation definitively did not succeed
- Can be safely retried
Design implications:
- Always distinguish between timeout and confirmed failure
- For timeouts, retry with idempotency or check state before retrying
- Define timeout values per operation type (short for interactive, long for batch)
- Document timeout values in API contracts
## Partial Failure
Partial failure occurs when:
- A multi-step operation fails after some steps succeed
- A batch operation partially succeeds
- An upstream dependency fails mid-transaction
Handling strategies:
- Compensating transactions (saga pattern) for multi-service operations
- Partial success responses (207 Multi-Status for batch operations)
- Atomic operations where possible (single-service transactions)
- Outbox pattern for ensuring eventual consistency
Design principles:
- Define what "partial" means for each operation
- Define whether partial success is acceptable or must be fully rolled back
- Document recovery procedures for each partial failure scenario
- Map partial failure scenarios to PRD edge cases
## Eventual Consistency
Eventual consistency means:
- Updates propagate asynchronously
- Reads may return stale data for a bounded period
- All replicas eventually converge
When to use:
- Cross-service data synchronization
- Read replicas and caching
- Event-driven architectures
- High-write, low-latency-requirement scenarios
When NOT to use:
- Financial balances where immediate consistency is required
- Inventory counts where overselling is unacceptable
- Authorization decisions where stale permissions are harmful
- Any scenario the PRD marks as requiring strong consistency
Design implications:
- Define acceptable staleness bounds per data type
- Define how consumers detect and handle stale data
- Define convergence guarantees (time-bound, version-bound)
- Document which data is eventually consistent and which is strongly consistent
## Ordering Guarantees
### Per-Partition Ordering
- Messages within a single partition or queue are ordered
- Use when: operation sequence matters within a context (e.g., per user, per order)
- Ensure: partition key is set to the context identifier
### Global Ordering
- All messages across all partitions are ordered
- Use when: global sequence matters (rare)
- Trade-off: severely limits throughput and availability
- Avoid unless the PRD explicitly requires it
### No Ordering Guarantee
- Messages may arrive in any order
- Use when: operations are independent and order does not matter
- Ensure: consumers can handle out-of-order delivery
Define ordering guarantees per queue/topic:
- State the guarantee clearly
- Define the partition key if per-partition ordering is used
- Define how out-of-order delivery is handled when ordering is expected but not guaranteed
## Anti-Patterns
- Assuming network calls never fail
- Retrying without idempotency
- Treating timeout as failure
- Ignoring partial failure scenarios
- Assuming global ordering when only per-partition ordering is needed
- Using strong consistency when eventual consistency would suffice
- Using eventual consistency when the PRD requires strong consistency

View File

@ -0,0 +1,25 @@
# 錯誤模型設計 (Error Model Design) 知識合約指南
## 概述
`error-model-design` 是知識合約,用來設計錯誤分類、傳播策略、可重試 vs 不可重試錯誤、Partial Failure 行為與回退策略。供 `design-architecture` 在定義錯誤處理時參考。
## 核心原則
錯誤處理必須系統性設計,不能事後才追加。每個錯誤分類必須可追溯到 PRD edge case 或 NFR。錯誤模型必須在整個系統中保持一致。
## 設計重點
- **錯誤分類**Client Errors (4xx)、Server Errors (5xx)、Business Rule Violations、Timeout Errors、Cascading Failures
- **錯誤傳播策略**Fail-Fast、Graceful Degradation、Circuit Breaker
- **錯誤回應格式**:一致的錯誤碼、機器可讀與人類可讀的訊息
- **Retryable vs Non-Retryable**:何時可重試、何時不可重試
- **Partial Failure 行為**All-or-nothing、Best-effort、Saga/Compensation
- **回退策略**:每個外部依賴的回退行為
## 知識合約職責
- 提供錯誤處理的理論指引
- 說明各種模式的 Trade-offs 與適用場景
- 不直接產生錯誤代碼定義或實作程式碼
## 不應做的事
- 不替系統定義具體的錯誤碼
- 不假設所有錯誤都是同樣處理方式
- 不忽略 Partial Failure 場景

View File

@ -0,0 +1,196 @@
---
name: error-model-design
description: "Knowledge contract for designing error categories, propagation strategies, retryable vs non-retryable errors, partial failure behavior, and fallback strategies. Referenced by design-architecture when defining error handling."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is defining error handling strategy.
## Core Principle
Error handling must be designed systematically, not added as an afterthought. Every error category must trace to a PRD edge case or NFR. The error model must be consistent across the entire system.
## Error Categories
### Client Errors (4xx)
Errors caused by the client sending invalid or incorrect requests.
Common client errors:
- `400 Bad Request` - malformed request body, missing required fields
- `401 Unauthorized` - missing or invalid authentication
- `403 Forbidden` - authenticated but not authorized for this resource
- `404 Not Found` - requested resource does not exist
- `409 Conflict` - state conflict (duplicate, version mismatch, business rule violation)
- `422 Unprocessable Entity` - valid format but business rule violation
- `429 Too Many Requests` - rate limit exceeded
Design principles:
- Client errors are non-retryable (unless 429 with Retry-After)
- Error response must include enough detail for the client to correct the request
- Error codes should be consistent and documented in the API contract (see `api-contract-design`)
### Server Errors (5xx)
Errors caused by the server failing to process a valid request.
Common server errors:
- `500 Internal Server Error` - unexpected server failure
- `502 Bad Gateway` - upstream service failure
- `503 Service Unavailable` - temporary unavailability
- `504 Gateway Timeout` - upstream service timeout
Design principles:
- Server errors may be retryable (see retryable vs non-retryable)
- Error response should not leak internal details in production
- All unexpected server errors must be logged and alerted
- Circuit breakers should protect against cascading server errors
### Business Rule Violations
Errors where the request is valid but violates a business rule.
Design principles:
- Use 422 or 409 depending on the nature of the violation
- Include the specific business rule that was violated
- Include enough context for the client to understand and correct the issue
- Map each business rule violation to a PRD functional requirement
### Timeout Errors
Errors where an operation did not complete within the expected time.
Design principles:
- Always distinguish timeout from confirmed failure
- Timeout means "unknown state" not "failed"
- Define timeout values per operation type
- Document recovery procedures for timed-out operations
- See `distributed-system-basics` for timeout vs failure handling
### Cascading Failures
Failures that propagate from one service to another, potentially bringing down the entire system.
Design principles:
- Use circuit breakers to stop cascade propagation
- Use bulkheads to isolate failure domains
- Define fallback behavior for each dependency failure
- Monitor and alert on circuit breaker state changes
## Error Propagation Strategy
### Fail-Fast
Immediately return an error to the caller when a dependency fails.
Use when:
- The caller cannot proceed without the dependency
- Partial data is worse than no data
- The PRD requires immediate feedback
### Graceful Degradation
Continue serving reduced functionality when a dependency fails.
Use when:
- The PRD allows partial functionality
- Some data is better than no data
- The feature has a clear fallback path
Define for each graceful degradation:
- What functionality is reduced
- What the user sees instead
- How the system recovers when the dependency returns
### Circuit Breaker
Stop calling a failing dependency after a threshold of failures, allowing it time to recover.
Define for each circuit breaker:
- Failure threshold (how many failures before opening)
- Recovery timeout (how long before trying again)
- Half-open behavior (how many requests to allow during recovery)
- Fallback behavior when circuit is open
Use when:
- A dependency is experiencing persistent failures
- Continuing to call will make things worse (cascading failure risk)
- The system can operate with reduced functionality
## Error Response Format
Define a consistent error response format across the entire system:
```json
{
"error": {
"code": "ERROR_CODE",
"message": "Human-readable message describing what happened",
"details": [
{
"field": "field_name",
"code": "SPECIFIC_ERROR_CODE",
"message": "Specific error description"
}
],
"request_id": "correlation-id-for-tracing"
}
}
```
Design principles:
- `code` is a machine-readable string constant (not HTTP status code)
- `message` is human-readable and suitable for display or logging
- `details` provides field-level validation errors when applicable
- `request_id` enables cross-service error tracing
- Never include stack traces, internal paths, or implementation details in production error responses
## Retryable vs Non-Retryable Errors
### Retryable Errors
- Server errors (500, 502, 503, 504) with backoff
- Timeout errors with backoff
- Rate limit errors (429) with Retry-After
- Network connectivity errors
### Non-Retryable Errors
- Client errors (400, 401, 403, 404, 422, 409)
- Business rule violations
- Malformed requests
- Authentication failures
Define per endpoint whether an error is retryable. Include this in the API contract.
## Partial Failure Behavior
Define partial failure behavior for operations that span multiple steps or services:
- **All-or-nothing**: The entire operation succeeds or fails atomically. Use for financial transactions, inventory operations, or any data requiring strong consistency.
- **Best-effort**: Complete as much as possible and report partial success. Use for batch operations, notifications, or operations where partial success is acceptable.
- **Compensating transaction (saga)**: Each step has a compensating action. If a step fails, previous steps are undone via compensation. Use for multi-service operations where atomicity is required but distributed transactions are not available.
For each partial failure scenario:
- Define what "partial" means in this context
- Define whether partial success is acceptable or must be fully rolled back
- Define the recovery procedure
- Map to a PRD edge case
## Fallback Strategy
For each external dependency, define:
- What happens when the dependency is unavailable
- Fallback behavior (cached data, default response, queue and retry, fail with user message)
- How the system recovers when the dependency returns
- SLA implications of the fallback
## Observability
For error model design, define:
- What errors are logged (all unexpected errors, all server errors, sampled client errors)
- What errors trigger alerts (server error rate, DLQ depth, circuit breaker state)
- Error metrics (error rate by code, error rate by endpoint, p99 latency)
- Request tracing (correlation IDs across service boundaries)
Map observability requirements to PRD NFRs.
## Anti-Patterns
- Returning generic 500 errors for all server failures
- Not distinguishing timeout from failure
- Ignoring partial failure scenarios
- Leaking internal details in error responses
- Using the same error handling strategy for all operations regardless of criticality
- Not defining fallback behavior for external dependencies
- Alerting on all errors instead of actionable thresholds
- Using circuit breakers without fallback behavior

View File

@ -0,0 +1,27 @@
# 評估技術棧 (Evaluate Tech Stack) 技能指南
## 概述
`evaluate_tech_stack` 是可交付技能,用來評估與建議技術棧,包含語言、框架、資料庫、佇列、快取與基礎設施。記錄每個選擇的優缺點與理由。供 `design-architecture` 在產生 Technology Stack 子章節時參考。
## 核心原則
技術選擇必須基於 PRD 需求、現有系統、團隊專業知識與營運限制。每個技術選擇都必須有優缺點與理由。
## 評估層次
- **Language**:程式語言(以生態系、效能、團隊專業、函式庫支援評估)
- **Framework**:應用框架(以成熟度、社群、效能、開發者體驗評估)
- **Database**:資料庫(以資料模型契合度、查詢模式、一致性需求評估)
- **Queue / Message Broker**:訊息佇列或事件串流平台
- **Cache**:快取層
- **Infrastructure**:部署基礎設施
## 防範佔位符規則
範例僅供說明用途。不要重複使用範例中的佔位符技術名稱、理由或替代方案。
## 知識合約職責
- 與 `storage-knowledge``architecture-patterns` 知識合約搭配使用
- 前者提供儲存技術比較,後者提供模式選擇指引
## 不應做的事
- 不替 PRD 沒有要求的場景選擇技術
- 不基於時尚或流行選擇技術
- 不產生獨立評估文件(所有內容必須嵌入 `docs/architecture/{feature}.md`

View File

@ -0,0 +1,106 @@
---
name: evaluate_tech_stack
description: "Evaluate and recommend technology stack including language, framework, database, queue, cache, and infrastructure. Document pros, cons, and justification for each choice. A deliverable skill referenced by design-architecture."
---
This skill provides guidance and format requirements for evaluating and recommending the technology stack within the architecture document.
This is a deliverable skill, not a workflow skill. It is referenced by `design-architecture` when evaluating technology choices.
## Purpose
The Architect must evaluate the technology stack for the system, considering requirements from the PRD, existing systems, team expertise, and operational constraints. Each technology choice must be justified with pros, cons, and rationale.
## Technology Stack Evaluation Format
When evaluating the technology stack for a feature, produce a structured evaluation for each stack layer:
```markdown
### {Layer}: {Technology}
- **Pros**:
- {Specific advantage relevant to this use case}
- {Another advantage}
- **Cons**:
- {Specific disadvantage relevant to this use case}
- {Another disadvantage}
- **Why Chosen**:
- {Specific rationale tied to PRD requirements}
- {Why this technology is the best fit for this use case}
- **Alternatives Considered**:
- {Alternative 1}: {Brief reason why not chosen}
- {Alternative 2}: {Brief reason why not chosen}
```
## Evaluation Layers
### Language
- Primary programming language for each service
- Justification based on: ecosystem, performance, team expertise, library support
- Consider: type safety, concurrency model, deployment size, development velocity
### Framework
- Application framework for each service
- Justification based on: maturity, community, performance, developer experience
- Consider: built-in features, middleware ecosystem, testing support, documentation
### Database
- Primary and secondary databases
- Justification based on: data model fit, query patterns, write patterns, consistency requirements, scale expectations
- Consider: ACID vs eventual consistency, operational complexity, backup/restore, migration path
### Queue / Message Broker
- Message queue or event streaming platform
- Justification based on: throughput requirements, ordering guarantees, delivery semantics, durability
- Consider: at-least-once vs exactly-once, partitioning, consumer groups, operational complexity
### Cache
- Caching layer
- Justification based on: access patterns, TTL requirements, invalidation strategy
- Consider: cache-aside vs read-through/write-through, memory limits, persistence options
### Infrastructure
- Deployment infrastructure
- Justification based on: scalability, cost, team expertise, deployment model
- Consider: containerization, orchestration, service mesh, CDN, monitoring
## Decision Principles
When evaluating technology choices, prioritize:
1. **Simplicity**: Choose the simplest technology that meets requirements
2. **Battle-tested**: Prefer technologies with proven production track records
3. **Team expertise**: Prefer technologies the team already knows, unless the learning curve is justified
4. **Operational maturity**: Prefer technologies with good monitoring, tooling, and debugging support
5. **Community and ecosystem**: Prefer technologies with active communities and rich ecosystems
6. **Fit for purpose**: Choose technologies that match the specific data model, access pattern, and consistency requirements
## Anti-Patterns
Avoid:
- Choosing technologies based on hype or fashion without PRD justification
- Choosing different technologies for each service without good reason (polyglot penalty)
- Choosing bleeding-edge technologies without a fallback plan
- Choosing technologies that require significant operational investment without clear benefit
- Choosing technologies that don't match the data model or access pattern
## Knowledge Contract Reference
This deliverable skill works alongside the `storage-knowledge` and `architecture-patterns` knowledge contracts:
- `storage-knowledge` provides detailed comparison of storage technologies
- `architecture-patterns` provides guidance on which patterns suit which technologies
## Anti-Placeholder Rule
Examples in this skill are illustrative only. Do not reuse placeholder technologies, justifications, or alternatives unless explicitly required by the PRD. Every technology selection, justification, and alternative must be grounded in actual requirements and reflect real evaluation for this system.
## Embedding in Architecture Document
Technology stack evaluation must be embedded within the `## System Architecture` section (Technology Stack subsection) of `docs/architecture/{feature}.md`.
For significant technology decisions that affect the overall system structure, also document them as ADRs in the `## ADR` section.
Do NOT produce separate evaluation documents. All technology evaluations must be within the single architecture document.

View File

@ -0,0 +1,38 @@
# 完成架構 (Finalize Architecture) 技能指南
## 概述
`finalize-architecture` 是 Architect Pipeline 的最後一個步驟,在挑戰審查與修訂完成後,對架構文件進行最終完整性檢查與格式驗證。
## 輸入與輸出
### 輸入
- `docs/architecture/{feature}.md`
### 輸出
- 最終 `docs/architecture/{feature}.md`
## 驗證步驟
1. **章節完整性檢查**18 個必備章節是否都存在與實質內容
2. **Mermaid 圖表驗證**:至少 3 張圖表System、Sequence、Data Flow、語法正確、無孤兒元件
3. **資料庫 Schema 驗證**:所有表格含欄位、類型與限制式、索引有理由
4. **API 合約驗證**:所有端點含方法、路徑、請求/回應結構
5. **ADR 驗證**:至少 1 個 ADR含 Context、Decision、Consequences、Alternatives
6. **可追溯性驗證**:每個元素是否可追溯到 PRD 需求
7. **格式驗證**章節順序、Markdown 語法、無外部檔案參照
8. **架構審查閘道**:確認挑戰審查的 Gate Decision 是否為 PASS 或 CONDITIONAL PASS
## 完成檢查清單
- [ ] 18 個必備章節存在且實質(或是 N/A with reason
- [ ] 至少 3 張 Mermaid 圖表
- [ ] Database Schema 有完整表格定義
- [ ] API Contract 有完整端點規格
- [ ] 至少 1 個 ADR 且格式完整
- [ ] 所有元素可追溯到 PRD 需求
- [ ] 架構審查閘道為 PASS 或 CONDITIONAL PASS
- [ ] Risks 區段已填寫
- [ ] Open Questions 區段已填寫(或明確寫 None
## 不應做的事
- 不設計新架構
- 不改變架構決策
- 不新增未在挑戰審查中驗證的重大內容
- 不產生 `docs/architecture/{feature}.md` 以外的檔案

View File

@ -0,0 +1,150 @@
---
name: finalize-architecture
description: "Final completeness check and format validation for the architecture document. The Architect pipeline's final step before handoff to Planner."
---
This skill performs a final completeness check and format validation on the architecture document after challenge and revision.
**Announce at start:** "I'm using the finalize-architecture skill to perform the final completeness check on the architecture document."
## Primary Input
- `docs/architecture/{feature}.md`
## Primary Output (STRICT PATH)
- Final `docs/architecture/{feature}.md`
This is the **only** file artifact in the Architect pipeline. Finalization results are applied directly to this file.
## Process
You MUST complete these steps in order:
### Step 1: Section Completeness Check
Verify all 18 required sections are present and substantive (or explicitly marked N/A with reason):
1. Overview
2. System Architecture
3. Service Boundaries
4. Data Flow
5. Database Schema
6. API Contract
7. Async / Queue Design
8. Consistency Model
9. Error Model
10. Security Boundaries
11. Integration Boundaries
12. Observability
13. Scaling Strategy
14. Non-Functional Requirements
15. Mermaid Diagrams
16. ADR
17. Risks
18. Open Questions
For each missing or empty section, add a placeholder with `N/A — [reason]` or flag it as a gap that must be filled.
### Step 2: Mermaid Diagram Verification
Verify the document contains at minimum:
- **1 System Architecture Diagram** — showing all services, databases, queues, and external integrations
- **1 Sequence Diagram** — showing the primary happy-path interaction flow
- **1 Data Flow Diagram** — showing how data moves through the system
For each diagram, verify:
- Mermaid syntax is valid
- All components referenced in the architecture are present in the diagram
- No orphan components exist in diagrams that are not described elsewhere
### Step 3: Database Schema Verification
Verify the Database Schema section contains:
- All tables with field names, types, constraints, and defaults
- Indexes with justification based on query patterns
- Partition keys where applicable
- Relationships (foreign keys, references)
- Denormalization strategy where applicable
- Migration strategy notes
### Step 4: API Contract Verification
Verify the API Contract section contains:
- All endpoints with method, path, request schema, response schema
- Error codes and error response schemas
- Idempotency requirements per endpoint (where applicable)
- Pagination and filtering (where applicable)
- Authentication requirements
### Step 5: ADR Verification
Verify the ADR section contains at minimum 1 ADR with:
- ADR number and title
- Context
- Decision
- Consequences
- Alternatives considered
### Step 6: Traceability Verification
Verify:
- Every API endpoint traces to a PRD functional requirement
- Every DB table traces to a data requirement
- Every service boundary traces to a domain responsibility
- Every async flow traces to a PRD requirement
- Every security boundary traces to a requirement
- Every integration boundary traces to an external system requirement
### Step 7: Format Verification
Verify:
- The document follows the exact section ordering from the template
- Section headings use proper markdown hierarchy
- Mermaid code blocks use ```mermaid syntax
- Tables use proper markdown table syntax
- No external files are referenced (all content is within the single document)
### Step 8: Architecture Review Gate
Verify the Architecture Review section from `challenge-architecture`:
- Gate decision is either PASS or CONDITIONAL PASS
- All identified issues have been addressed
- No unresolved blockers remain
## Finalization Checklist
- [ ] All 18 required sections present and substantive (or N/A with reason)
- [ ] At least 3 Mermaid diagrams present (system, sequence, data flow)
- [ ] Database Schema has complete table definitions
- [ ] API Contract has complete endpoint specifications
- [ ] At least 1 ADR present with full format
- [ ] All elements trace to PRD requirements
- [ ] Architecture Review gate is PASS or CONDITIONAL PASS
- [ ] Document format follows template ordering
- [ ] No external file references (all content is inline)
- [ ] Risks section is populated
- [ ] Open Questions section is populated (or explicitly states "None")
## Guardrails
This is a pure validation and formatting skill.
Do:
- Verify completeness of all 18 sections
- Validate Mermaid diagram syntax and coverage
- Validate API contract completeness
- Validate database schema completeness
- Validate ADR format
- Validate traceability
- Fix formatting issues directly in `docs/architecture/{feature}.md`
Do not:
- Design new architecture
- Change architectural decisions
- Add significant new content that wasn't validated in challenge-architecture
- Produce any file artifact other than `docs/architecture/{feature}.md`
## Transition
After finalization is complete and all checks pass, the architecture document is ready for handoff to the Planner. The Planner reads only `docs/architecture/{feature}.md`.

View File

@ -0,0 +1,28 @@
# 產生 Mermaid 圖表 (Generate Mermaid Diagram) 技能指南
## 概述
`generate_mermaid_diagram` 是可交付技能,用來產生系統架構圖、序列圖、資料流圖、事件流圖與狀態機圖。供 `design-architecture` 在產生 Mermaid Diagrams 章節時參考。
## 核心原則
圖表必須可視化系統架構,且所有圖表中的元件都必須在架構文件文字中描述。不得有無法對應到實際元件的孤兒元件。
## 必備圖表(至少 3 張)
1. **System Architecture Diagram**:所有服務、資料庫、佇列、快取與外部整合及其連接方式
2. **Sequence Diagram**:主要快樂路徑的互動流程
3. **Data Flow Diagram**:資料如何流經系統,含轉換與儲存點
## 選用圖表
- **Event Flow Diagram**:事件如何傳播
- **State Machine Diagram**:實體生命週期與狀態轉換
## 圖表指南
- **命名慣例**Services 用 PascalCaseDatabases 用 DB suffixQueues/Topics 用描述性名稱
- **關係標籤**:同步用 `-->`,非同步用 `-.->`
- **元件命名**:有意義的標籤,非縮寫(除非文件中已定義縮寫)
## 防範佔位符規則
範例僅供說明用途。不要重複使用範例中的佔位符元件、服務、資料庫或關係。
## 不應做的事
- 不產生與架構文件內容無關的圖表
- 不產生獨立圖表檔案(所有圖表必須嵌入 `docs/architecture/{feature}.md`

View File

@ -0,0 +1,147 @@
---
name: generate_mermaid_diagram
description: "Produce Mermaid diagrams for system architecture, sequence flows, data flows, event flows, and state machines. A deliverable skill referenced by design-architecture."
---
This skill provides guidance and format requirements for producing Mermaid diagrams within the architecture document.
This is a deliverable skill, not a workflow skill. It is referenced by `design-architecture` when producing visual architecture artifacts.
## Purpose
The Architect must produce Mermaid diagrams to visualize the system architecture. Diagrams are embedded directly in the architecture document within the `## Mermaid Diagrams` section.
## Required Diagrams
The architecture document must contain at minimum:
### 1. System Architecture Diagram
Shows all services, databases, queues, caches, and external integrations and how they connect.
```mermaid
graph TD
Client[Client App] --> Gateway[API Gateway]
Gateway --> AuthService[Auth Service]
Gateway --> OrderService[Order Service]
OrderService --> OrderDB[(Order DB)]
OrderService --> EventBus[Event Bus]
EventBus --> NotificationService[Notification Service]
NotificationService --> NotificationDB[(Notification DB)]
AuthService --> AuthDB[(Auth DB)]
AuthService --> Cache[(Redis Cache)]
```
### 2. Sequence Diagram
Shows the primary happy-path interaction flow between components.
```mermaid
sequenceDiagram
participant C as Client
participant GW as API Gateway
participant Auth as Auth Service
participant Order as Order Service
participant DB as Order DB
participant EventBus as Event Bus
C->>GW: POST /orders
GW->>Auth: Validate Token
Auth-->>GW: Token Valid
GW->>Order: Create Order
Order->>DB: Insert Order
DB-->>Order: Order Created
Order->>EventBus: Publish OrderCreated
Order-->>GW: 201 Created
GW-->>C: Order Response
```
### 3. Data Flow Diagram
Shows how data moves through the system, including transformations and storage points.
```mermaid
graph LR
A[User Input] --> B[Validation]
B --> C[Command Handler]
C --> D[(Write DB)]
C --> E[Event Publisher]
E --> F[Event Bus]
F --> G[Projection Handler]
G --> H[(Read DB)]
H --> I[Query API]
```
## Optional Diagrams
Produce these additional diagrams when the architecture requires them:
### Event Flow Diagram
Shows how events propagate through the system.
```mermaid
graph TD
A[Order Created] --> B[Event Bus]
B --> C[Inventory Update]
B --> D[Notification Sent]
B --> E[Analytics Recorded]
C --> F[(Inventory DB)]
D --> G[Email Service]
E --> H[(Analytics DB)]
```
### State Machine Diagram
Shows entity lifecycle and state transitions.
```mermaid
stateDiagram-v2
[*] --> Pending: Order Created
Pending --> Confirmed: Payment Received
Pending --> Cancelled: Cancel Request
Confirmed --> Processing: Process Start
Processing --> Completed: Process Done
Processing --> Failed: Process Error
Failed --> Processing: Retry
Completed --> [*]
Cancelled --> [*]
```
## Diagram Guidelines
### General Rules
- Use consistent naming conventions across all diagrams
- All components in diagrams must be described in the architecture document text
- No orphan components: every diagram element must appear in the document text
- Use meaningful labels, not abbreviations (unless abbreviation is defined in the document)
- Include external systems when they are part of the data flow
### Component Naming
- Services: PascalCase (e.g., `OrderService`, `AuthService`)
- Databases: PascalCase with DB suffix (e.g., `OrderDB`)
- Queues/Topics: PascalCase descriptive name (e.g., `OrderEventBus`)
- External systems: Descriptive name (e.g., `PaymentGateway`)
### Relationship Labels
- Label all edges/connections with the interaction type
- Use `-->` for synchronous calls
- Use `-.->` for asynchronous messages/events
- Include the protocol or verb when relevant (HTTP, gRPC, AMQP)
## Anti-Placeholder Rule
Examples in this skill are illustrative only. Do not reuse placeholder components, services, databases, or relationships unless explicitly required by the PRD. Every element in a Mermaid diagram must be grounded in actual requirements and match the architecture document's content.
## Embedding in Architecture Document
All diagrams must be embedded within the `## Mermaid Diagrams` section of `docs/architecture/{feature}.md` using:
````
```mermaid
graph TD
...
```
````
Do NOT produce separate diagram files. All diagrams must be within the single architecture document.

View File

@ -0,0 +1,29 @@
# 產生 OpenAPI 規格 (Generate OpenAPI Spec) 技能指南
## 概述
`generate_openapi_spec` 是可交付技能,用來產生 OpenAPI 或 gRPC API 合約定義,包含端點、請求/回應結構、錯誤碼、Idempotency、分頁與過濾。供 `design-architecture` 在產生 API Contract 章節時參考。
## 核心原則
API 合約定義必須具體到足以供實作使用。每個端點必須服務至少一個 PRD 功能需求。
## REST API 必備元素
- **端點定義**:方法、路徑、請求結構、回應結構
- **錯誤碼**:一致的錯誤碼與錯誤回應格式
- **Idempotency**:哪些端點需要、機制是什麼
- **分頁**:列表端點的分頁機制與回應格式
- **過濾**:支援哪些過濾欄位與運算子
## gRPC API 必備元素
- **Service 定義**Package、Service name、Methods
- **Message 定義**Request / Response 訊息結構
- **錯誤碼**gRPC status codes
## 防範佔位符規則
範例僅供說明用途。不要重複使用範例中的佔位符端點、欄位名稱、回應結構或錯誤碼。
## 知識合約職責
- 與 `api-contract-design` 知識合約搭配使用:前者提供理論原則,後者提供格式
## 不應做的事
- 不替非 PRD 要求的端點定義結構
- 不產生獨立 OpenAPI YAML 或 gRPC proto 檔案(所有內容必須嵌入 `docs/architecture/{feature}.md`

View File

@ -0,0 +1,203 @@
---
name: generate_openapi_spec
description: "Produce OpenAPI or gRPC API contract definitions including endpoints, request/response schemas, error codes, idempotency, pagination, and filtering. A deliverable skill referenced by design-architecture."
---
This skill provides guidance and format requirements for producing API contract definitions within the architecture document.
This is a deliverable skill, not a workflow skill. It is referenced by `design-architecture` when producing API contract artifacts.
## Purpose
The Architect must produce API contract definitions that are specific enough for implementation. API contracts define the interface between services and between clients and the system.
## REST API (OpenAPI Style)
For REST APIs, use OpenAPI-style definitions within the architecture document.
### Endpoint Definition Format
Each endpoint must include:
```markdown
### {METHOD} {path}
**Description**: {What this endpoint does}
**Authentication**: {None / Bearer Token / API Key / mTLS}
**Idempotency**: {None / Idempotent by method / Requires Idempotency-Key header}
**Request**:
| Field | Location | Type | Required | Description |
|-------|----------|------|----------|-------------|
| ... | header / path / query / body | ... | yes/no | ... |
**Request Body** (if applicable):
```json
{
"field1": "type",
"field2": "type"
}
```
**Response** (Success):
| Status Code | Description | Response Schema |
|-------------|-------------|-----------------|
| 200 / 201 | ... | ... |
**Response Body**:
```json
{
"field1": "type",
"field2": "type"
}
```
**Error Responses**:
| Status Code | Error Code | Description | When |
|-------------|-----------|-------------|------|
| 400 | INVALID_INPUT | ... | ... |
| 401 | UNAUTHORIZED | ... | ... |
| 404 | NOT_FOUND | ... | ... |
| 409 | CONFLICT | ... | ... |
| 429 | RATE_LIMITED | ... | ... |
| 500 | INTERNAL_ERROR | ... | ... |
**Pagination** (if applicable):
- Default page size: {n}
- Maximum page size: {n}
- Pagination parameters: `offset` / `cursor`
- Response includes: `total_count`, `has_more`
**Filtering** (if applicable):
- Supported filters: {list of filterable fields}
- Filter operators: `eq`, `ne`, `gt`, `lt`, `in`, `contains`
```
### Error Response Format
Define a consistent error response format:
```json
{
"error": {
"code": "ERROR_CODE",
"message": "Human-readable message",
"details": [
{
"field": "field_name",
"message": "Specific error message"
}
],
"request_id": "uuid"
}
}
```
### Error Code Catalog
Define system-wide error codes:
```markdown
| Code | HTTP Status | Description |
|------|-------------|-------------|
| INVALID_INPUT | 400 | Request validation failed |
| UNAUTHORIZED | 401 | Authentication required |
| FORBIDDEN | 403 | Insufficient permissions |
| NOT_FOUND | 404 | Resource not found |
| CONFLICT | 409 | Resource already exists |
| RATE_LIMITED | 429 | Too many requests |
| INTERNAL_ERROR | 500 | Unexpected server error |
| SERVICE_UNAVAILABLE | 503 | Dependent service unavailable |
```
## gRPC API
For gRPC APIs, define the service and method specifications.
### Service Definition Format
```markdown
### {ServiceName}
**Package**: {package.name}
#### {MethodName}
**Request**: {MessageName}
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| ... | ... | ... | ... |
**Response**: {MessageName}
| Field | Type | Description |
|-------|------|-------------|
| ... | ... | ... |
**Error Codes**:
| Code | Description |
|------|-------------|
| INVALID_ARGUMENT | ... |
| NOT_FOUND | ... |
| ... | ... |
**Idempotency**: {None / Idempotent / Requires request_id}
```
## Required API Contract Elements
### Endpoints
- Every functional requirement from the PRD must have at least one API endpoint
- Each endpoint must map to the PRD functional requirement it satisfies
### Request / Response Schemas
- Every field must have type, required/optional, and description
- Nested objects must be fully defined
- Enum values must be listed
### Error Codes
- Define consistent error codes across the system
- Differentiate client errors (4xx) from server errors (5xx) from business rule violations
- Include error response format
### Idempotency
- Identify which endpoints require idempotency
- Define idempotency mechanism (method-based, key-based)
- Define idempotency key format and TTL
### Pagination
- Define pagination mechanism for all list endpoints
- Specify default and maximum page sizes
- Define pagination response format
### Filtering
- Define supported filter fields for list endpoints
- Define filter operators
- Define sort options
### Rate Limiting (when applicable)
- Define rate limit expectations per endpoint
- Define rate limit headers and response format
## Knowledge Contract Reference
This deliverable skill works alongside the `api-contract-design` knowledge contract:
- `api-contract-design` provides the theoretical guidance on API design principles
- This skill provides the concrete output format and completeness requirements
## Anti-Placeholder Rule
Examples in this skill are illustrative only. Do not reuse placeholder endpoints, field names, response schemas, or error codes unless explicitly required by the PRD. Every endpoint, field, status code, and error code must be grounded in actual requirements and match the architecture document's functional requirements.
## Embedding in Architecture Document
All API contract definitions must be embedded within the `## API Contract` section of `docs/architecture/{feature}.md`.
Do NOT produce separate OpenAPI YAML or gRPC proto files. All API contracts must be within the single architecture document.

View File

@ -0,0 +1,36 @@
# 整合邊界設計 (Integration Boundary Design) 知識合約指南
## 概述
`integration-boundary-design` 是知識合約,用來提供整合邊界設計的原則與模式。涵蓋外部 API 整合、Webhook 處理、Polling、重試策略、速率限制與失敗模式處理。供 `design-architecture` 在定義 Integration Boundaries 時參考。
## 核心原則
### 整合隔離
- 外部系統失敗不得級聯到系統失敗
- Circuit Breaker 必須保護內部服務
- 整合程式碼必須與商業邏輯隔離Anti-Corruption Layer
### 明確合約
- 每個外部整合必須有明確定義的合約
- 合約必須包含請求/回應結構、錯誤碼與 SLA
- 變更必須版本化並盡可能向後相容
### 假設失敗
- 外部系統會失敗、逾時、返回非預期資料
- 必須為每個整合定義逾時、重試與回退
## 設計重點
- **外部 API 整合**同步呼叫、非同步呼叫、Batch 呼叫、串流
- **Webhook 處理**Inbound Webhooks接收與 Outbound Webhooks發送
- **Polling**:增量 Polling、輪詢間隔與資料差距處理
- **重試策略**:退避策略、重試預算、最大總重試時間
- **速率限制**Token Bucket、Leaky Bucket、Fixed Window、Sliding Window
- **失敗模式處理**Transient、Permanent、Partial、Cascading Failure
## 知識合約職責
- 提供整合邊界設計的理論指引
- 不直接產生整合合約或實作程式碼
## 不應做的事
- 不替特定外部系統選擇整合方式
- 不定義具體的逾時值或重試次數
- 不產生實作程式碼

View File

@ -0,0 +1,144 @@
---
name: integration-boundary-design
description: "Knowledge contract for integration boundary design. Provides principles and patterns for external API integration, webhook handling, polling, retry strategies, rate limiting, and failure mode handling. Referenced by design-architecture when defining integration boundaries."
---
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing integration boundaries. It does not produce artifacts directly.
## Core Principles
### Integration Isolation
- External system failures must not cascade into system failures
- Circuit breakers must protect internal services from external failures
- Integration code must be isolated from business logic (anti-corruption layer)
### Explicit Contracts
- Every external integration must have an explicitly defined contract
- Contracts must include request/response schemas, error codes, and SLAs
- Changes to contracts must be versioned and backward-compatible whenever possible
### Assume Failure
- External systems will fail, timeout, return unexpected data, and change without notice
- Design for failure: define timeout, retry, and fallback for every integration
- Never assume external system availability or correctness
## External API Integration
### Patterns
- **Synchronous API call**: Request-response, immediate feedback
- **Asynchronous API call**: Request acknowledged, result via callback or polling
- **Batch API call**: Accumulate requests and send in bulk
- **Streaming API**: Continuous stream of data (SSE, WebSocket, gRPC streaming)
### Design Considerations
- Define timeout for every outbound API call (default: 5-30 seconds depending on SLA)
- Define retry strategy for every outbound call (max retries, backoff, jitter)
- Define circuit breaker thresholds (error rate, timeout rate, consecutive failures)
- Define fallback behavior when circuit is open (cached data, default response, error)
- Define data transformation at the boundary (anti-corruption layer)
- Monitor all external calls: latency, error rate, circuit breaker state
## Webhook Handling
### Inbound Webhooks (Receiving)
- Define webhook signature verification (HMAC, asymmetric)
- Define idempotency for webhook processing (external systems may deliver duplicates)
- Define webhook ordering assumptions (ordered vs unordered)
- Define webhook timeout and response (always respond 200 quickly, process asynchronously)
- Define webhook retry handling (what if processing fails?)
### Outbound Webhooks (Sending)
- Define webhook delivery guarantee (at-least-once, at-most-once)
- Define webhook retry strategy (max retries, backoff, jitter)
- Define webhook payload format (versioned, backward-compatible)
- Define webhook authentication (HMAC signature, OAuth2, API key)
- Define webhook status tracking (delivered, failed, pending)
## Polling
### When to Use Polling
- When the external system doesn't support webhooks or streaming
- When the external system has a polling-based API by design
- When real-time updates are not required
### Design Considerations
- Define polling interval based on data freshness requirements
- Use incremental polling (ETag, Last-Modified, since parameter) to avoid redundant data transfer
- Define how to handle polling failures (skip and retry next interval)
- Define how to handle data gaps (missed polls due to downtime)
- Consider long-polling as an alternative when supported
## Retry Strategy
### Retry Decision Tree
1. Is the error retryable? (network errors, timeouts, 429, 503 are typically retryable)
2. What is the retry strategy? (exponential backoff with jitter)
3. What is the max retry count? (3-5 is typical for transient errors)
4. What is the max total retry time? (prevent infinite retry loops)
5. What to do after max retries? (DLQ, alert, manual intervention)
### Backoff Strategies
- **Exponential backoff**: Delay doubles each retry (1s, 2s, 4s, 8s...)
- **Exponential backoff with jitter**: Add randomness to prevent thundering herd
- **Linear backoff**: Fixed additional delay each retry (1s, 2s, 3s, 4s...)
- **Fixed retry**: Same delay every retry (simple but ineffective)
### Retry Budget
- Define maximum retries per time window (prevent retry storms)
- Define retry budget per external system (don't overwhelm a recovering system)
- Consider separate retry budgets for critical vs non-critical operations
## Rate Limiting
### Patterns
- **Token bucket**: Fixed rate refill, burst-capable, most common
- **Leaky bucket**: Fixed rate processing, smooths burst
- **Fixed window**: Simple, but allows burst at window boundaries
- **Sliding window**: More accurate than fixed window, slightly more complex
### Design Considerations
- Define rate limits per endpoint, per client, and per system
- Define rate limit headers to return (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset)
- Define response when rate limited (429 Too Many Requests with Retry-After header)
- Define rate limit storage (Redis, memory, external service)
- Define rate limit for outbound calls to external systems (respect their limits)
## Failure Mode Handling
### Failure Mode Classification
- **Transient**: Network timeout, temporary service unavailable (retry with backoff)
- **Permanent**: Invalid request, authentication failure (fail immediately, no retry)
- **Partial**: Some data processed, some failed (compensate or retry partial)
- **Cascading**: Failure in one service causing failures in others (circuit breaker)
### Design Decision Matrix
| Failure Type | Detection | Response |
|-------------|-----------|----------|
| Timeout | No response within threshold | Retry with backoff, circuit breaker |
| 5xx Error | HTTP 500-599 | Retry with backoff, circuit breaker |
| 429 Rate Limited | HTTP 429 | Backoff and retry after Retry-After |
| 4xx Client Error | HTTP 400-499 | Fail immediately, log and alert |
| Connection Refused | TCP connection failure | Circuit breaker, fail fast |
| Invalid Data | Schema validation failure | Fail immediately, DLQ for investigation |
### Circuit Breaker States
- **Closed**: Normal operation, requests pass through
- **Open**: Failure threshold exceeded, requests fail fast (fallback)
- **Half-Open**: After cooldown, allow test request; if success, close; if fail, stay open
### Fallback Strategies
- **Cached data**: Serve stale data from cache (with staleness warning)
- **Default response**: Return a sensible default (for non-critical data)
- **Graceful degradation**: Return partial data if some services are unavailable
- **Queue and retry**: Store the request and process later when the system recovers
- **Fail fast**: Return error immediately (for critical operations that can't be degraded)
## Anti-Patterns
- **Synchronous chain of external calls**: Minimize synchronous external calls in request path
- **Missing timeout on outbound calls**: Always set a timeout, never wait indefinitely
- **Missing circuit breaker for external systems**: External failures must not cascade
- **Missing idempotency for retries**: Retries will cause duplicate processing
- **Missing rate limiting for outbound calls**: Will hit external system rate limits
- **Missing data transformation at boundary**: External data models must not leak into internal models
- **Missing monitoring on external calls**: External call latency and error rates must be tracked

View File

@ -0,0 +1,35 @@
# 遷移與上線設計 (Migration Rollout Design) 知識合約指南
## 概述
`migration-rollout-design` 是知識合約用來提供遷移與上線設計的原則與模式。涵蓋向後相容、上線策略、Canary Deployment、Feature Flags、Schema Evolution 與 Rollback。供 `design-architecture` 在定義 Migration & Rollout Strategy 時參考。
## 核心原則
### 向後相容優先
- 新版本必須與舊版本共存於遷移期間
- API 必須向後相容直到所有消費者都遷移完成
- 資料庫結構必須同時支援舊版和新版程式碼
### 增量勝過 Big-Bang
- 增量遷移,一步一步來
- 每步必須可獨立部署與逆轉
- Big-Bang 遷移風險高,逆轉困難
### 預設 Rollback
- 每個遷移步驟必須有明確的 rollback 計畫
- Feature Flags 可實現即時 rollback無需重新部署
## 設計重點
- **上線策略**Blue-Green Deployment、Canary Deployment、Rolling Deployment、Feature Flag Deployment
- **Feature Flags**Release flags、Operational flags、Experiment flags、Permission flags
- **Schema Evolution**Additive Changes安全vs Destructive Changes需遷移
- **Migration Strategy**Expand → Migrate → Contract 三階段
- **Rollback**Application Rollback、Database Rollback 與決策矩陣
## 知識合約職責
- 提供遷移與上線設計的理論指引
- 不直接產生遷移腳本或上線腳本
## 不應做的事
- 不替系統選擇特定部署策略
- 不定義具體的 Feature Flag 名稱或值
- 不產生遷移或部署腳本

View File

@ -0,0 +1,145 @@
---
name: migration-rollout-design
description: "Knowledge contract for migration and rollout design. Provides principles and patterns for backward compatibility, rollout strategies, canary deployments, feature flags, schema evolution, and rollback. Referenced by design-architecture when defining migration and rollout strategy."
---
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing migration and rollout strategies. It does not produce artifacts directly.
## Core Principles
### Backward Compatibility First
- New versions must coexist with old versions during migration
- APIs must be backward-compatible until all consumers have migrated
- Database schemas must support both old and new code during migration
- Never break existing functionality during migration
### Incremental Over Big-Bang
- Migrate incrementally, one step at a time
- Each step must be independently deployable and reversible
- Test each step before proceeding to the next
- Big-bang migrations have higher risk and harder rollback
### Rollback by Default
- Every migration step must have a clear rollback plan
- Practice rollback before you need it
- Automated rollback is preferred over manual rollback
- Feature flags enable instant rollback without deployment
## Rollout Strategies
### Blue-Green Deployment
- Maintain two identical environments (blue and green)
- Deploy new version to the inactive environment
- Switch traffic from active to inactive environment
- If issues are detected, switch traffic back
- **Best for**: Infrastructure-level deployments with full environment replication
### Canary Deployment
- Deploy new version to a small percentage of traffic (1%, 5%, 10%, 25%, 50%, 100%)
- Monitor metrics at each stage before increasing traffic
- If issues are detected, shift traffic back to the old version
- **Best for**: Application-level deployments where you want to test with real traffic gradually
### Rolling Deployment
- Deploy new version to instances one at a time (or in small batches)
- Old and new versions run side by side during the rollout
- If issues are detected, stop the rollout and roll back the updated instances
- **Best for**: Stateless services where instances can be updated independently
### Feature Flag Deployment
- Deploy new code with features disabled (feature flags set to false)
- Enable features gradually using feature flags
- Can enable per-user, per-tenant, per-percentage
- If issues are detected, disable the feature flag instantly
- **Best for**: Feature-level deployments where you want to decouple code deployment from feature release
## Feature Flags
### Types of Feature Flags
- **Release flags**: Enable/disable new features during rollout (short-lived)
- **Operational flags**: Enable/disable operational features (circuit breakers, maintenance mode)
- **Experiment flags**: A/B testing and gradual rollout (medium-lived)
- **Permission flags**: Enable features for specific users/tenants (long-lived)
### Design Considerations
- Feature flags must not add significant latency (evaluate quickly)
- Feature flag evaluation must be consistent within a request (don't re-evaluate mid-request)
- Feature flags must have a defined lifecycle: create, enable, monitor, remove
- Remove feature flags after full rollout to prevent technical debt
- Use a feature flag management service (not hardcoded flags)
- Log feature flag evaluations for debugging
### Feature Flag Rollout
- Start with 0% (flag off)
- Enable for internal users (dogfood)
- Enable for a small percentage of users (canary)
- Enable for all users (full rollout)
- Monitor metrics at each stage
- Remove the flag after full rollout
## Schema Evolution
### Additive Changes (Safe)
- Add a new column with a default value
- Add a new table
- Add a new index (with caution for large tables)
- Add a new optional field to an API response
- Add a new API endpoint
### Destructive Changes (Require Migration)
- Remove a column (requires migration)
- Rename a column (requires migration)
- Change a column type (requires migration)
- Remove a table (requires migration)
- Remove an API endpoint (requires consumer migration)
### Migration Strategy for Destructive Changes
1. **Expand**: Add the new structure alongside the old (both exist)
2. **Migrate**: Migrate data and code to use the new structure (both exist)
3. **Contract**: Remove the old structure (only new exists)
Example: Renaming a column
1. Add new column, keep old column, dual-write to both
2. Migrate existing data from old to new column
3. Update all reads to use new column
4. Remove old column
### Database Migration Best Practices
- Every migration must be reversible (up and down migration)
- Test migrations against production-like data volumes
- Run migrations in a transaction when possible
- For large tables, use online schema change tools (pt-online-schema-change, gh-ost)
- Never lock a production table for more than seconds during a migration
## Rollback
### Application Rollback
- Revert to previous deployment version
- Feature flag disable (instant, no deployment needed)
- Blue-green switch (instant, requires environment)
- Canary shift-back (requires redirecting traffic)
- Rolling redeploy of previous version (requires new deployment)
### Database Rollback
- Run the down migration (reverse of up migration)
- Restore from backup (for destructive changes without down migration)
- Feature flag to disable new code that uses new schema (code rollback, schema stays)
### Rollback Decision Matrix
| What Failed | Rollback Method | Data Loss Risk |
|-------------|----------------|----------------|
| Application bug | Deploy previous version | None |
| Feature bug | Disable feature flag | None |
| Schema migration bug | Run down migration | Low if reversible |
| Data migration bug | Restore from backup | High if not reversible |
| Integration failure | Circuit breaker / fallback | None |
## Anti-Patterns
- **Big-bang migration**: Migrating everything at once has high risk and hard rollback
- **Breaking API changes without versioning**: Old clients will break
- **Schema migration without backward compatibility**: Old code will fail against new schema
- **Deploying without feature flags**: Can't instantly rollback if issues are detected
- **Not testing rollback**: Rollback must be tested before you need it
- **Removing old code before consumers have migrated**: Premature removal breaks dependencies
- **Not monitoring during rollout**: Issues must be detected quickly to prevent wider impact

View File

@ -0,0 +1,31 @@
# 可觀測性設計 (Observability Design) 知識合約指南
## 概述
`observability-design` 是知識合約,用來提供可觀測性設計的原則與模式。涵蓋 Logs、Metrics、Traces、Correlation IDs、Alerts 與 SLOs。供 `design-architecture` 在定義 Observability Strategy 時參考。
## 核心原則
### 可觀測性三支柱
- **Logs**:離散事件,含上下文(誰、什麼、何時、何地)
- **Metrics**時間聚合的數值測量速率、長條圖、Guage
- **Traces**:跨服務端到端請求流
### 可觀測性不是監控
- 監控告訴你什麼時候壞了(已知未知)
- 可觀測性讓你能問為什麼壞了(未知未知)
- 必須從一開始就內建可觀測性,不能事後追加
## 設計重點
- **Logs**Log 等級、結構化 LogJSON、集中式 Log 聚合
- **Metrics**Counter、Gauge、Histogram、Summary命名慣例
- **Traces**分散式追蹤、Correlation ID 傳播、Span 設計
- **Alerts**:症狀警報、非操作警報;閾值與升遷路徑
- **SLOs**可用性、延遲、正確性Error Budget 與 Burn Rate Alerting
## 知識合約職責
- 提供可觀測性設計的理論指引
- 不直接產生監控設定或警報配置
## 不應做的事
- 不替系統選擇特定可觀測性工具
- 不定義具體的指標名稱或警報閾值
- 不產生監控設定檔案

View File

@ -0,0 +1,141 @@
---
name: observability-design
description: "Knowledge contract for observability design. Provides principles and patterns for logs, metrics, traces, correlation IDs, alerts, and SLOs. Referenced by design-architecture when defining observability strategy."
---
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing observability. It does not produce artifacts directly.
## Core Principles
### Three Pillars of Observability
- **Logs**: Discrete events with context (who, what, when, where)
- **Metrics**: Numeric measurements aggregated over time (rates, histograms, gauges)
- **Traces**: End-to-end request flow across services and boundaries
### Observability Is Not Monitoring
- Monitoring tells you when something is broken (known unknowns)
- Observability lets you ask questions about why something is broken (unknown unknowns)
- Design for observability: emit enough data to diagnose novel problems
### Observability by Design
- Observability must be designed into the architecture, not bolted on after
- Every service must emit structured logs, metrics, and traces from day one
- Every external integration must have observability hooks
## Logs
### Log Levels
- **ERROR**: Something failed that requires investigation (not all errors are ERROR level)
- **WARN**: Something unexpected happened but the system can continue
- **INFO**: Business-significant events (order created, payment processed, user registered)
- **DEBUG**: Detailed information for debugging (only in development, not in production)
- **TRACE**: Very detailed information (almost never used in production)
### Structured Logging
- Use JSON format for all logs
- Every log entry must include: timestamp, level, service name, correlation ID
- Include relevant context: user ID, request ID, entity IDs, error details
- Never log sensitive data: passwords, tokens, PII, secrets
### Log Aggregation
- Send all logs to a centralized log aggregation system
- Define log retention period based on compliance requirements
- Define log access controls (who can see what logs)
- Consider log volume and cost (log only what you need)
## Metrics
### Metric Types
- **Counter**: Monotonically increasing value (request count, error count)
- **Gauge**: Point-in-time value (active connections, queue depth)
- **Histogram**: Distribution of values (request latency, payload size)
- **Summary**: Pre-calculated quantiles (p50, p90, p99 latency)
### Key Business Metrics
- Orders per minute
- Revenue per minute
- Active users
- Conversion rate
- Cart abandonment rate
### Key System Metrics
- Request rate (requests per second per endpoint)
- Error rate (4xx rate, 5xx rate per endpoint)
- Latency (p50, p90, p99 per endpoint)
- Queue depth and age
- Database connection pool usage
- Cache hit rate
- Memory and CPU usage per service
### Metric Naming Convention
- Use dot-separated names: `service.operation.metric`
- Include units in the name or metadata: `request.duration.milliseconds`
- Use consistent labels: `method`, `endpoint`, `status_code`, `tenant_id`
## Traces
### Distributed Tracing
- Every request gets a trace ID that propagates across all services
- Every operation within a request gets a span with operation name, start time, duration
- Span boundaries: service calls, database queries, external API calls, queue operations
### Correlation ID Propagation
- Generate a correlation ID at the request entry point
- Propagate correlation ID through all service calls (headers, message metadata)
- Include correlation ID in all logs, metrics, and error responses
- Use correlation ID to trace a request end-to-end across all services
### Span Design
- Include relevant context in spans: user ID, entity IDs, operation type
- Tag spans with error information when operations fail
- Keep span cardinality reasonable (avoid high-cardinality attributes as tags)
## Alerts
### Alert Design Principles
- Alert on symptoms, not causes (user impact, not internal metrics)
- Every alert must have a clear runbook or remediation steps
- Every alert must be actionable (if you can't act on it, don't alert on it)
- Avoid alert fatigue: set thresholds based on SLOs, not arbitrary numbers
### Alert Categories
- **Page-worthy**: System is broken, immediate action required (high error rate, service down)
- **Ticket-worthy**: Degradation that needs investigation soon (rising latency, approaching limits)
- **Log-worthy**: Informational, no immediate action (deployment completed, config changed)
### Alert Thresholds
- Base alert thresholds on SLOs, not arbitrary numbers
- Use burn rate alerting: alert when the error budget is burning too fast
- Define escalation paths: who gets paged, who gets a ticket, who gets an email
## SLOs (Service Level Objectives)
### SLO Design
- Define SLOs based on user impact, not internal metrics
- Typical SLO categories:
- **Availability**: % of requests that succeed (e.g., 99.9%)
- **Latency**: % of requests that complete within a threshold (e.g., p99 < 500ms)
- **Correctness**: % of operations that produce correct results
- **Freshness**: % of data that is within staleness threshold
### Error Budget
- Error budget = 100% - SLO target
- If SLO is 99.9%, error budget is 0.1% per month
- Track error budget burn rate: how fast are we consuming the budget?
- When error budget is exhausted, focus shifts from feature development to reliability
### SLO Framework
- Define the SLO (what we promise)
- Define the SLI (how we measure it)
- Define the error budget (what we can afford to fail)
- Define the alerting (when we're burning budget too fast)
## Anti-Patterns
- **Logging everything**: Generates noise, increases cost, makes debugging harder
- **Missing correlation ID**: Can't trace requests across services
- **Alerting on causes, not symptoms**: Alerts fire but users aren't impacted
- **Missing business metrics**: Can't tell if the system is serving users well
- **High-cardinality metrics**: Explosive metric count, expensive to store and query
- **Missing observability for external calls**: External integration failures are invisible
- **Logging sensitive data**: Passwords, tokens, PII in logs

View File

@ -0,0 +1,38 @@
# 安全邊界設計 (Security Boundary Design) 知識合約指南
## 概述
`security-boundary-design` 是知識合約,用來提供安全邊界設計的原則與模式。涵蓋 Authentication、Authorization、Service Identity、Token Propagation、Tenant Isolation、Secret Management 與 Audit Logging。供 `design-architecture` 在定義 Security Boundaries 時參考。
## 核心原則
### Defense in Depth
- 不要依賴單一安全邊界
- 在每一層套用安全:網路、服務、資料、應用
- 假設已被入侵:設計時讓單層被破壞不會導致全盤被破壞
### Least Privilege
- 服務與使用者應該只有最小必要權限
- 預設拒絕:從無權限開始,明確授予
- 定期輪換與過期憑證
### Zero Trust
- 不要預設信任內部網路流量
- 每次服務對服務呼叫都要認證與授權
- 傳輸中資料必須加密,即使在內部網路也是
## 設計重點
- **Authentication**Token-based、API Key、Certificate-based、Session-based
- **Authorization**RBAC、ABAC、ACL、ReBAC 的選擇與粒度
- **Service Identity**Service Accounts、Workload Identity、Service Mesh Identity
- **Token Propagation**Pass-through、Token Exchange、Token Relay、Impersonation
- **Tenant Isolation**Database-level、Schema-level、Row-level、Application-level
- **Secret Management**Environment Variables、Secret Management Service、Platform-native、Configuration Service
- **Audit Logging**:認證/授權事件、日誌修改操作、行政動作
## 知識合約職責
- 提供安全邊界設計的理論指引
- 不直接產生安全配置或憑證管理設定
## 不應做的事
- 不替系統選擇特定安全技術
- 不定義具體的 RBAC 角色或權限
- 不產生安全設定檔案

View File

@ -0,0 +1,129 @@
---
name: security-boundary-design
description: "Knowledge contract for security boundary design. Provides principles and patterns for authentication, authorization, service identity, token propagation, tenant isolation, secret management, and audit logging. Referenced by design-architecture when defining security boundaries."
---
This is a knowledge contract, not a workflow skill. It provides theoretical guidance that the Architect references when designing security boundaries. It does not produce artifacts directly.
## Core Principles
### Defense in Depth
- Never rely on a single security boundary
- Apply security at every layer: network, service, data, application
- Assume breach: design so that compromise of one layer doesn't compromise all
### Least Privilege
- Services and users should have the minimum permissions required
- Default deny: start with no access, grant explicitly
- Rotate and expire credentials regularly
### Zero Trust
- Don't trust internal network traffic by default
- Authenticate and authorize every service-to-service call
- Encrypt data in transit, even within the internal network
## Authentication
### Patterns
- **Token-based authentication**: JWT, OAuth2 tokens
- **API key authentication**: For service-to-service and public APIs
- **Certificate-based authentication**: mTLS for internal service communication
- **Session-based authentication**: For web applications with stateful sessions
### Design Considerations
- Define where authentication happens (edge gateway, service level, or both)
- Define token format, issuer, audience, and expiration
- Define token refresh and revocation strategy
- Define credential rotation strategy
- Consider token size impact on request headers
## Authorization
### Patterns
- **RBAC (Role-Based Access Control)**: Assign permissions to roles, assign roles to users
- **ABAC (Attribute-Based Access Control)**: Assign permissions based on attributes (user, resource, environment)
- **ACL (Access Control List)**: Explicit list of who can access what
- **ReBAC (Relationship-Based Access Control)**: Permissions based on relationships between entities
### Design Considerations
- Choose the simplest model that meets PRD requirements
- Define permission granularity: coarse-grained (role-level) vs fine-grained (resource-level)
- Define where authorization is enforced (gateway, service, or both)
- Define how permissions are stored and cached
- Consider multi-tenant authorization: can users in one tenant access resources in another?
## Service Identity
### Patterns
- **Service accounts**: Each service has its own identity with specific permissions
- **Workload identity**: Identity tied to the deployment (Kubernetes service accounts, cloud IAM roles)
- **Service mesh identity**: Identity managed by the service mesh (Istio, Linkerd)
### Design Considerations
- Each service should have its own identity (no shared credentials)
- Service identity should be short-lived and automatically rotated
- Service identity should be bound to the deployment environment
- Service identity permissions should follow least privilege
## Token Propagation
### Patterns
- **Pass-through**: Gateway validates token, passes it to downstream services
- **Token exchange**: Gateway validates external token, issues internal token
- **Token relay**: Each service forwards the token to downstream services
- **Impersonation**: Service calls downstream on behalf of the user
### Design Considerations
- Define token format for internal vs external communication
- Define token lifecycle: creation, validation, refresh, revocation
- Consider token size when propagating through multiple hops
- Consider what context to propagate (user identity, tenant, permissions, correlation ID)
## Tenant Isolation
### Patterns
- **Database-level isolation**: Separate database per tenant
- **Schema-level isolation**: Separate schema per tenant, shared database
- **Row-level isolation**: Shared schema, tenant_id column with enforcement
- **Application-level isolation**: Shared infrastructure, application enforces isolation
### Design Considerations
- Choose isolation level based on PRD requirements (compliance, performance, cost)
- Row-level isolation is simplest but requires careful query filtering
- Database-level isolation provides strongest isolation but highest cost
- Define how tenant context is resolved (subdomain, header, token claim)
- Define how tenant isolation is enforced (middleware, query filter, database policy)
## Secret Management
### Patterns
- **Environment variables**: Simple, but don't support rotation well
- **Secret management service**: HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager
- **Platform-native secrets**: Kubernetes Secrets, cloud IAM role-based access
- **Configuration service**: Centralized configuration with encryption at rest
### Design Considerations
- Secrets must never be stored in code, configuration files in version control, or logs
- Define secret rotation strategy for each type of secret
- Define how services access secrets (sidecar, SDK, environment injection)
- Define audit trail for secret access
- Consider secret hierarchies (global, per-environment, per-service)
## Audit Logging
### Design Considerations
- Log all authentication and authorization events (success and failure)
- Log all data modification operations (who, what, when, from where)
- Log all administrative actions
- Define log retention period based on compliance requirements
- Define log format: structured JSON with consistent fields
- Log must be tamper-evident or append-only for compliance
## Anti-Patterns
- **Shared credentials across services**: Each service must have its own identity
- **Hard-coded secrets**: Secrets must be externalized and rotated
- **Overly broad permissions**: Grant least privilege, not convenience privilege
- **Missing authentication for internal services**: Internal traffic must also be authenticated
- **Missing audit logging for sensitive operations**: All auth events and data modifications must be logged
- **Trust based on network location**: Don't assume internal network is safe

View File

@ -0,0 +1,31 @@
# 儲存知識 (Storage Knowledge) 知識合約指南
## 概述
`storage-knowledge` 是知識合約,用來提供儲存技術選擇的原則與框架。涵蓋關聯式、寬列式、文件式與鍵值式儲存,並提供 Use-When 與 Avoid-When 判斷標準。供 `design-architecture` 在做儲存決策時參考。
## 核心原則
儲存選擇必須由 PRD 中識別的查詢模式、寫入模式、一致性需求與規模預期驅動。不要因為熟悉、時尚或覺得可能需要就選擇儲存技術。
## 儲存選擇標準
選擇儲存前,必須先回答:
1. 主要查詢模式是什麼?(按鍵、按範圍、按複雜過濾、全文搜索)
2. 寫入模式是什麼?(寫入為主、更新為主、僅附加)
3. 需要什麼一致性Strong、Eventual、Tunable
4. 預期規模是多少?(每天列數、總列數、生長率)
5. 存取延遲需求是什麼?(毫秒、秒、最終一致)
6. 與其他實體的關係是什麼?(外鍵、嵌套文件、圖形遍歷)
## 儲存類型
- **Relational (PostgreSQL)**:強一致性、複雜 JOIN、交易完整性
- **Wide-Column (Cassandra)**:高寫入吞吐量、查詢優先建模、線性水平擴展
- **Document (MongoDB)**文件中心資料、Schema 彈性、豐富查詢能力
- **Key-Value (Redis)**快取、速率限制、Idempotency Keys、工作階段
## 知識合約職責
- 提供各儲存技術的比較與選擇框架
- 不替 PRD 做最終儲存選擇
## 不應做的事
- 不替系統選擇特定儲存技術
- 不基於時尚或流行選擇儲存
- 不假設某種儲存適合所有場景

View File

@ -0,0 +1,149 @@
---
name: storage-knowledge
description: "Knowledge contract for selecting storage technologies based on data patterns. Covers relational, wide-column, document, and key-value stores with use-when and avoid-when criteria. Referenced by design-architecture when making storage decisions."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is making storage technology decisions.
## Core Principle
Storage selection must be driven by query patterns, write patterns, consistency requirements, and scale expectations identified in the PRD. Do not choose a storage technology because it is familiar, fashionable, or might be needed someday.
Every storage choice must be justified. If a simpler option meets the requirements, use it.
## Storage Selection Criteria
For each data entity, answer these questions before selecting storage:
1. What are the primary query patterns? (by key, by range, by complex filter, by full-text search)
2. What are the write patterns? (insert-heavy, update-heavy, append-only)
3. What consistency is required? (strong, eventual, tunable)
4. What scale is expected? (rows per day, total rows, growth rate)
5. What are the access latency requirements? (ms, seconds, eventual)
6. What relationships exist with other entities? (foreign keys, nested documents, graph traversals)
## Relational Database (PostgreSQL, MySQL, etc.)
Use when:
- Strong consistency is required (ACID transactions)
- Complex joins are needed for queries
- Transactional integrity across multiple entities is required
- Data has well-defined structure with relationships
- Referential integrity constraints are important
- Ad-hoc querying on multiple dimensions is common
Avoid when:
- Write throughput exceeds what a single relational node can handle and sharding adds unacceptable complexity
- Data is deeply nested and rarely queried across relationships
- Schema evolves rapidly and migrations are costly
- Full-text search is a primary access pattern (use a search engine instead)
Trade-offs: +strong consistency, +relationships, +ad-hoc queries, +maturity, -scaling complexity, -schema rigidity
### Schema Design for Relational
- Normalize to 3NF by default
- Denormalize selectively based on query patterns (see `data-modeling`)
- Define foreign keys with appropriate ON DELETE behavior
- Define indexes for identified query patterns only
- Consider partitioning for large tables
## Wide-Column / Cassandra
Use when:
- High write throughput is required (append-heavy workloads)
- Query-first modeling (you know all query patterns upfront)
- Large-scale time-series data
- Geographic distribution with local writes
- Linear horizontal scaling is required
- Availability is prioritized over strong consistency (tunable consistency)
Avoid when:
- Ad-hoc queries on arbitrary columns are needed
- Relational joins across tables are common
- Strong consistency is required for all operations
- The data model requires many secondary indexes
- The team lacks Cassandra modeling experience (data modeling mistakes are costly to fix)
Trade-offs: +write throughput, +horizontal scaling, +availability, -no joins, -query-first modeling required, -modeling mistakes are expensive
### Schema Design for Wide-Column
- Model around query patterns: each table serves a specific query
- Partition key must distribute data evenly
- Clustering columns define sort order within a partition
- Denormalize aggressively: one table per query pattern
- Avoid secondary indexes; model queries into the primary key instead
## Document / MongoDB
Use when:
- Data is document-centric with nested structures
- Schema flexibility is required (rapidly evolving data)
- Aggregate boundaries align with document boundaries
- Single-document atomicity is sufficient
- Read-heavy workloads with rich query capabilities
Avoid when:
- Strong relational constraints between entities are required
- Multi-document transactions are frequent (MongoDB supports them but they are slower)
- Data requires complex joins across many collections
- Strict schema validation is critical
Trade-offs: +schema flexibility, +nested structures, +rich queries, +easy to start, -relationship handling, -larger storage for indexes, -multi-document transaction overhead
### Schema Design for Document
- Design documents around access patterns
- Embed data that is always accessed together
- Reference data that is accessed independently
- Use indexes for fields that are frequently filtered
- Consider document size limits (16MB in MongoDB)
- Use change streams for event-driven patterns
## Key-Value / Redis
Use for:
- Caching frequently accessed data
- Rate limiting (counters with TTL)
- Idempotency keys (set with TTL, check existence)
- Ephemeral state (sessions, temporary tokens)
- Distributed locking
- Sorted sets for leaderboards or priority queues
- Pub/sub for lightweight messaging
Avoid when:
- You need complex queries (no query language)
- You need durability for primary data (Redis persistence is not ACID)
- Data size exceeds available memory and eviction is unacceptable
- You need relationships between entities
Trade-offs: +speed, +simplicity, +data structures, -memory cost, -durability (with caveats), -no complex queries
### Using Redis as Primary Storage
Only when:
- Data is inherently ephemeral (sessions, rate limits, idempotency keys)
- Data loss is acceptable or can be reconstructed
- The team understands persistence limitations (RDB snapshots, AOF)
Never use Redis as the primary persistent store for business-critical data unless:
- Durability requirements are clearly defined
- Persistence configuration (RDB + AOF) meets those requirements
- Recovery procedures are tested and documented
## Storage Selection Decision Framework
1. Start with the simplest option that meets requirements
2. Only add complexity when the PRD justifies it
3. Prefer one storage technology when it meets all requirements
4. Add a second storage technology only when a specific PRD requirement demands it
5. Document every storage choice as an ADR with:
- The requirement that drives it
- The alternatives considered
- Why the chosen option is the simplest that works
## Anti-Patterns
- Using Cassandra for a 10,000-row table with ad-hoc queries
- Using MongoDB for highly relational data requiring joins
- Using Redis as a primary persistent store without understanding durability
- Using multiple storage technologies when one suffices
- Choosing storage based on familiarity rather than query/write patterns
- Premature optimization: selecting distributed storage before single-node is proven insufficient

View File

@ -0,0 +1,43 @@
# 系統分解 (System Decomposition) 知識合約指南
## 概述
`system-decomposition` 是知識合約,用來提供將系統拆分為服務或模組的原則。包含邊界定義、資料所有權與依賴方向。供 `design-architecture` 在設計 Service Boundaries 時參考。
## 核心原則
- 每個服務或模組必須有單一、明確的職責
- 資料所有權必須明確:每筆資料只屬於一個服務
- 依賴必須單向流動;嚴禁循環依賴
- 邊界必須環繞領域職責繪製,而非技術分層
## 設計重點
### Modular Monolith vs Microservices
- 選擇 Modular Monolith團隊小、邊界仍在演化、部署簡單優先
- 選擇 Microservices個別服務有不同擴展需求、團隊擁有對齊、獨立部署是需求
### 領域邊界
- 一起變化的實體
- 內聚的業務規則
- 一起存取的資料
- 橫跨一致上下文的使用者工作流
好的邊界:高內聚、低外部耦合、可獨立理解、可獨立部署。
### 耦合 vs 內聚
- 偏好邊界內高內聚
- 最小化邊界間耦合
- 透過明確合約API、事件溝通
- 避免共享資料庫表
### 狀態所有權
- 每筆狀態只有一個擁有者
- 其他服務透過擁有者的 API 或事件存取
- 嚴禁服務直接讀取另一服務的資料庫
## 知識合約職責
- 提供系統分解的理論指引
- 不替 PRD 做最終邊界決策
## 不應做的事
- 不替系統定義特定服務邊界
- 不假設 Microservices 適合所有場景
- 不忽略領域邊界而以技術分層

View File

@ -0,0 +1,100 @@
---
name: system-decomposition
description: "Knowledge contract for splitting systems into services or modules, defining boundaries, data ownership, and dependency direction. Referenced by design-architecture when designing service boundaries."
---
This is a knowledge contract, not a workflow skill. It is referenced by `design-architecture` when the architect is designing service boundaries and system decomposition.
## Core Principles
- Each service or module must have a single, well-defined responsibility
- Data ownership must be clear: each piece of data belongs to exactly one service
- Dependencies must flow in one direction; cyclic dependencies are forbidden
- Boundaries must be drawn around domain responsibilities, not technical layers
## Decomposition Decisions
### Modular Monolith vs Microservices
Choose modular monolith when:
- Team size is small (fewer than 5-8 engineers per service boundary)
- Domain boundaries are still evolving
- Deployment simplicity is a priority
- Inter-service communication overhead would exceed in-process call overhead
- The PRD does not require independent scaling of individual services
Choose microservices when:
- Individual services have different scaling requirements stated in the PRD
- Team ownership aligns with service boundaries
- Domain boundaries are stable and well-understood
- Independent deployment of services is required
- The PRD explicitly requires isolation for reliability or security
Do not choose microservices solely because they are fashionable or because the team might need them someday. YAGNI applies.
### Domain Boundaries
Identify domain boundaries by looking for:
- Entities that change together
- Business rules that are cohesive
- Data that is accessed together
- User workflows that span a consistent context
A good boundary:
- Has high internal cohesion (related logic stays together)
- Has low external coupling (minimal cross-boundary calls)
- Can be understood independently
- Can be deployed independently if needed A bad boundary:
- Requires frequent cross-boundary calls to complete a workflow
- Splits closely related entities across services
- Exists because of technical layering rather than domain logic
- Requires distributed transactions to maintain consistency
### Coupling vs Cohesion
Favor high cohesion within a boundary:
- Related business rules live together
- Related data is owned by the same service
- Related workflows are handled end-to-end
Minimize coupling between boundaries:
- Communicate via well-defined contracts (APIs, events)
- Avoid sharing database tables between services
- Avoid synchronous call chains longer than 2 services deep when possible
- Prefer eventual consistency for cross-boundary state updates
### State Ownership
Each piece of state must have exactly one owner:
- The owning service is the single source of truth
- Other services access that state via the owner's API or events
- No service reads directly from another service's database
- If data is needed in multiple places, replicate via events with a clear source of truth
## Communication Patterns
### Synchronous
- Use when the caller needs an immediate response
- Use for queries and command validation
- Avoid for long-running operations
- Consider timeouts and circuit breakers
### Asynchronous
- Use when the caller does not need an immediate response
- Use for events, notifications, and eventual consistency
- Use when decoupling producer and consumer is valuable
- Consider ordering, retry, and DLQ requirements
### Event-Driven
- Use when multiple consumers need to react to state changes
- Use for cross-boundary consistency (eventual)
- Define event schemas explicitly
- Consider event versioning and backward compatibility
## Anti-Patterns
- Distributed monolith: microservices that must be deployed together
- Shared database: multiple services reading/writing the same tables
- Synchronous chain: 3+ services in a synchronous call chain
- Leaky domain: business rules that require data from other services directly instead of via APIs or events
- Premature decomposition: splitting before boundaries are understood

View File

@ -0,0 +1,32 @@
# 寫作 ADR (Write ADR) 技能指南
## 概述
`write_adr` 是可交付技能用來產生架構決策記錄ADR包含 Context、Decision、Consequences 與 Alternatives。供 `design-architecture` 在產生 ADR 章節時參考。
## 核心原則
ADR 為每個重要架構決策提供永久記錄。包含決策背後的理由、考慮的替代方案與權衡取捨。
## 何時寫 ADR
為以下任何決策寫 ADR
- 影響系統結構或服務邊界
- 涉及技術選擇(語言、框架、資料庫、佇列、快取、基礎設施)
- 涉及一致性模型選擇
- 涉及安全架構決策
- 涉及重大取捨(效能 vs 一致性、複雜度 vs 簡單性)
- 難以或昂貴逆轉的決策
- 其他工程師會質疑「為什麼這樣選?」的決策
## ADR 格式
每個 ADR 必須包含:
- **Context**:為什麼需要這個決策,什麼問題或情境需要決定,什麼 PRD 需求驅動了這個決策
- **Decision**:決定了什麼,清楚且具體地陳述,包含選擇的特定技術、模式或方法
- **Consequences**:這個決策帶來的取捨,正面與負面都要
- **Alternatives**:考慮過什麼其他選項,每個選項的描述與不被選中的理由
## 防範佔位符規則
範例僅供說明用途。不要重複使用範例中的佔位符 ADR 標題、Context、Decision 或 Alternatives。
## 不應做的事
- 不替非實際架構決策寫 ADR
- 不複製範例的內容當作自己的決策
- 不產生獨立 ADR 檔案(所有 ADR 必須嵌入 `docs/architecture/{feature}.md`

102
skills/write_adr/SKILL.md Normal file
View File

@ -0,0 +1,102 @@
---
name: write_adr
description: "Produce Architectural Decision Records with Context, Decision, Consequences, and Alternatives. A deliverable skill referenced by design-architecture."
---
This skill provides guidance and format requirements for producing Architectural Decision Records (ADRs) within the architecture document.
This is a deliverable skill, not a workflow skill. It is referenced by `design-architecture` when documenting significant architectural decisions.
## Purpose
The Architect must document significant architectural decisions using the ADR format. ADRs provide a permanent record of the context, decision, consequences, and alternatives considered for each important choice.
## When to Write an ADR
Write an ADR for any decision that:
- Affects the system structure or service boundaries
- Involves a technology selection (language, framework, database, queue, cache, infra)
- Involves a consistency model choice (strong vs eventual, idempotency strategy)
- Involves a security architecture decision
- Involves a significant trade-off (performance vs consistency, complexity vs simplicity)
- Would be difficult or costly to reverse
- Other engineers would question "why was this chosen?"
## ADR Format
Each ADR must follow this format:
```markdown
### ADR-{N}: {Decision Title}
- **Context**: Why this decision was needed. What is the problem or situation that requires a decision? Which PRD requirements drove this decision? What constraints exist?
- **Decision**: What was decided. State the decision clearly and specifically. Include the specific technology, pattern, or approach chosen.
- **Consequences**: What trade-offs or implications result from this decision. Include both positive and negative consequences. Address:
- What becomes easier?
- What becomes harder?
- What are the risks?
- What are the operational implications?
- **Alternatives**: What other options were considered. For each alternative:
- Brief description
- Why it was not chosen
- Under what circumstances it might be the better choice
```
## ADR Numbering
- Start with ADR-001 for the first decision
- Number sequentially (ADR-001, ADR-002, etc.)
- Each ADR in the architecture document gets a unique number
## ADR Examples
### ADR-001: Use Cassandra for Job Storage
- **Context**: The system needs to handle high write throughput (10,000+ writes/second) for job status updates. Jobs are write-once with frequent status updates. Queries are primarily by job ID and by status+created_at. The PRD requires 99.9% availability for job status writes.
- **Decision**: Use Cassandra as the primary storage for job data. Use PostgreSQL for relational data that requires complex queries and transactions.
- **Consequences**:
- (+) High write throughput for job status updates
- (+) Horizontal scalability for job storage
- (+) 99.9% availability for job writes
- (-) Eventual consistency for job reads (stale reads possible within replication window)
- (-) No complex joins for job data
- (-) Additional operational complexity of managing two database systems
- (-) Data migration if requirements change
- **Alternatives**:
- PostgreSQL only: Simpler operations, but may not handle write throughput under peak load. Would be appropriate if write throughput stays below 5,000 writes/second.
- MongoDB: Good balance of write throughput and query flexibility, but less mature for time-series-like access patterns.
- Redis + PostgreSQL: Redis for hot job data, PostgreSQL for cold storage. Adds complexity of data synchronization.
### ADR-002: Use Event-Driven Architecture for Order Processing
- **Context**: The PRD requires orders to be processed asynchronously with decoupled services. Order processing involves multiple steps (validation, payment, inventory, notification) that may fail independently. Each step must be retryable.
- **Decision**: Use event-driven architecture with the outbox pattern for order processing. Publish OrderCreated events from the Order Service, consumed by downstream services.
- **Consequences**:
- (+) Services are decoupled and can evolve independently
- (+) Individual steps can be retried without reprocessing the entire order
- (+) Natural fit for saga pattern for distributed transactions
- (-) Eventual consistency — downstream services may see stale data
- (-) More complex debugging and tracing
- (-) Requires outbox pattern implementation to ensure at-least-once delivery
- **Alternatives**:
- Synchronous orchestration: Simpler to implement and debug, but creates tight coupling and doesn't handle partial failures well. Appropriate for simple, synchronous workflows.
- Saga orchestration with a central coordinator: More control over flow, but adds a single point of failure and operational complexity.
## Anti-Placeholder Rule
Examples in this skill are illustrative only. Do not reuse placeholder ADR titles, contexts, decisions, or alternatives unless explicitly required by the PRD. Every ADR must document an actual architectural decision made for this system, with real context, real consequences, and real alternatives considered.
## Embedding in Architecture Document
All ADRs must be embedded within the `## ADR` section of `docs/architecture/{feature}.md`.
Do NOT produce separate ADR files. All ADRs must be within the single architecture document.