# Eval Command

Manages the eval-driven development workflow.

## Usage

`/eval [define|check|report|list|clean] [feature-name]`

## Define Evals

`/eval define feature-name`

Create a new eval definition:

1. Create `.claude/evals/feature-name.md` from the template:

```markdown
## EVAL: feature-name
Created: $(date)

### Capability Evals
- [ ] [Description of capability 1]
- [ ] [Description of capability 2]

### Regression Evals
- [ ] [Existing behavior 1 still works]
- [ ] [Existing behavior 2 still works]

### Success Criteria
- Capability evals: pass@3 > 90%
- Regression evals: pass^3 = 100%
```

2. Prompt the user to fill in the specific criteria.

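The define step could be scripted roughly as follows. This is a minimal sketch, not the command's actual implementation: `define_eval`, `TEMPLATE`, and the `root` parameter are hypothetical names, and the sketch assumes that `$(date)` in the template stands for the current date and that an existing definition should not be overwritten.

```python
from datetime import date
from pathlib import Path

# Hypothetical template constant mirroring the eval definition template.
TEMPLATE = """## EVAL: {name}
Created: {created}

### Capability Evals
- [ ] [Description of capability 1]
- [ ] [Description of capability 2]

### Regression Evals
- [ ] [Existing behavior 1 still works]
- [ ] [Existing behavior 2 still works]

### Success Criteria
- Capability evals: pass@3 > 90%
- Regression evals: pass^3 = 100%
"""


def define_eval(name: str, root: Path = Path(".")) -> Path:
    """Create .claude/evals/<name>.md from the template and return its path."""
    path = root / ".claude" / "evals" / f"{name}.md"
    # Assumption: refuse to clobber a definition the user already filled in.
    if path.exists():
        raise FileExistsError(f"eval definition already exists: {path}")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        TEMPLATE.format(name=name, created=date.today().isoformat()),
        encoding="utf-8",
    )
    return path
```

Calling `define_eval("feature-auth")` from the project root would then leave a fresh definition file for the user to fill in.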
## Check Evals

`/eval check feature-name`

Run the evals for a feature:

1. Read the eval definition from `.claude/evals/feature-name.md`
2. For each capability eval:
   - Attempt to verify the criterion
   - Record PASS/FAIL
   - Log the attempt in `.claude/evals/feature-name.log`
3. For each regression eval:
   - Run the relevant tests
   - Compare against the baseline
   - Record PASS/FAIL
4. Report the current status:

```
EVAL CHECK: feature-name
========================
Capability: X/Y passing
Regression: X/Y passing
Status: IN PROGRESS / READY
```

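The final reporting step can be illustrated with a small tally function. This is only a sketch: the `EvalResult` type and `summarize` function are hypothetical, and the rule that the status is READY only when every capability and regression eval passes is an assumption, since the document does not define the threshold.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    kind: str      # "capability" or "regression"
    passed: bool


def summarize(feature: str, results: list[EvalResult]) -> str:
    """Render the status block for one feature from recorded PASS/FAIL results."""

    def tally(kind: str) -> tuple[int, int]:
        subset = [r for r in results if r.kind == kind]
        return sum(r.passed for r in subset), len(subset)

    cap_pass, cap_total = tally("capability")
    reg_pass, reg_total = tally("regression")
    # Assumption: READY requires at least one result and no failures at all.
    ready = bool(results) and all(r.passed for r in results)
    title = f"EVAL CHECK: {feature}"
    return "\n".join([
        title,
        "=" * len(title),
        f"Capability: {cap_pass}/{cap_total} passing",
        f"Regression: {reg_pass}/{reg_total} passing",
        f"Status: {'READY' if ready else 'IN PROGRESS'}",
    ])
```

A stricter or looser readiness rule (for example, one tied to the success criteria in the definition file) would only change the `ready` line.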
## Report Evals

`/eval report feature-name`

Generate a comprehensive eval report:

```
EVAL REPORT: feature-name
=========================
Generated: $(date)

CAPABILITY EVALS
----------------
[eval-1]: PASS (pass@1)
[eval-2]: PASS (pass@2) - required retry
[eval-3]: FAIL - see notes

REGRESSION EVALS
----------------
[test-1]: PASS
[test-2]: PASS
[test-3]: PASS

METRICS
-------
Capability pass@1: 67%
Capability pass@3: 100%
Regression pass^3: 100%

NOTES
-----
[Any issues, edge cases, or observations]

RECOMMENDATION
--------------
[SHIP / NEEDS WORK / BLOCKED]
```

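The `pass@k` and `pass^k` metrics in the report can be computed from per-eval attempt histories. The sketch below assumes the common reading of these metrics, which this document does not spell out: `pass@k` is the share of evals that pass at least once within the first k attempts, and `pass^k` is the share that pass on all of the first k attempts. The function names and the `capability` sample data are hypothetical.

```python
def pass_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of evals with at least one pass among their first k attempts."""
    if not attempts:
        return 0.0
    hits = sum(any(runs[:k]) for runs in attempts.values())
    return hits / len(attempts)


def pass_pow_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of evals whose first k attempts all passed (needs k attempts)."""
    if not attempts:
        return 0.0
    hits = sum(len(runs) >= k and all(runs[:k]) for runs in attempts.values())
    return hits / len(attempts)


# Hypothetical attempt history: one list of PASS/FAIL outcomes per eval.
capability = {
    "eval-1": [True],                 # passed on the first attempt
    "eval-2": [False, True],          # passed on the second attempt
    "eval-3": [False, False, True],   # passed on the third attempt
}
print(f"Capability pass@1: {pass_at_k(capability, 1):.0%}")  # 33%
print(f"Capability pass@3: {pass_at_k(capability, 3):.0%}")  # 100%
```

The distinction matters for the success criteria: `pass@k` rewards eventually getting it right, while `pass^k` demands consistency, which is why it suits regression evals.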
## List Evals

`/eval list`

Show all eval definitions:

```
EVAL DEFINITIONS
================
feature-auth      [3/5 passing] IN PROGRESS
feature-search    [5/5 passing] READY
feature-export    [0/4 passing] NOT STARTED
```

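One way to produce such a listing is to scan the definition files and count checkboxes. This sketch rests on an assumption the document does not confirm: that a passing eval is recorded by ticking its checkbox (`- [x]`) in `.claude/evals/<name>.md`. The `list_evals` function and the status rule (NOT STARTED at zero, READY when all pass) are hypothetical.

```python
import re
from pathlib import Path

# Matches markdown task-list items and captures the mark inside the brackets.
CHECKBOX = re.compile(r"^\s*- \[( |x|X)\]", re.MULTILINE)


def list_evals(root: Path = Path(".")) -> list[str]:
    """Return one summary row per eval definition under .claude/evals/."""
    rows = []
    for path in sorted((root / ".claude" / "evals").glob("*.md")):
        marks = CHECKBOX.findall(path.read_text(encoding="utf-8"))
        passed = sum(mark != " " for mark in marks)
        total = len(marks)
        # Assumption: status is derived purely from the checkbox counts.
        if passed == 0:
            status = "NOT STARTED"
        elif passed == total:
            status = "READY"
        else:
            status = "IN PROGRESS"
        rows.append(f"{path.stem:<20} [{passed}/{total} passing] {status}")
    return rows
```

If pass state lives elsewhere (for example in the `.log` files), only the counting logic inside the loop would need to change.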
## Arguments

$ARGUMENTS:
- `define <name>` - Create a new eval definition
- `check <name>` - Run and check evals
- `report <name>` - Generate a full report
- `list` - Show all evals
- `clean` - Remove old eval logs (keeps the last 10 runs)
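
The argument forms listed above could be validated along these lines. This is a hedged sketch rather than how the command actually parses `$ARGUMENTS`: `parse_arguments` and the two sets are hypothetical names, and rejecting extra arguments is an assumption.

```python
from typing import Optional

# Actions that operate on a single named feature, per the list above.
NEEDS_NAME = {"define", "check", "report"}
# Actions that take no feature name.
NO_NAME = {"list", "clean"}


def parse_arguments(arguments: str) -> tuple[str, Optional[str]]:
    """Split the raw argument string into (action, feature name or None)."""
    parts = arguments.split()
    if not parts:
        raise ValueError("expected an action: define, check, report, list, or clean")
    action, rest = parts[0], parts[1:]
    if action in NEEDS_NAME:
        # Assumption: exactly one feature name, nothing more.
        if len(rest) != 1:
            raise ValueError(f"'{action}' requires exactly one feature name")
        return action, rest[0]
    if action in NO_NAME:
        if rest:
            raise ValueError(f"'{action}' takes no arguments")
        return action, None
    raise ValueError(f"unknown action: {action}")
```

For example, `parse_arguments("check feature-auth")` yields `("check", "feature-auth")`, while `parse_arguments("list")` yields `("list", None)`.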