8 AI Agents Modifying Code.
How Do We Ensure Quality?
The Harness Engineering system, evolved across 6 versions, answers this question
3-Layer Quality Assurance Architecture
Each layer operates independently, implementing a Defense in Depth strategy
L0 — Safety Net
Automated defense via pre-commit hooks
- Secret protection (protect-secrets)
- Dangerous command blocking (block-dangerous-commands)
- Edit/Write tool monitoring
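The L0 checks listed above could be sketched roughly as follows. This is a minimal illustration, not the actual protect-secrets or block-dangerous-commands implementation: the regex patterns, the function names, and the `check_tool_call` entry point are all hypothetical, since the source does not give the real rule sets.

```python
import re
from typing import Tuple

# Hypothetical deny-list; the real block-dangerous-commands rules are not in the source.
DANGEROUS_COMMANDS = [
    re.compile(r"\brm\s+-[a-zA-Z]*r[a-zA-Z]*f|\brm\s+-[a-zA-Z]*f[a-zA-Z]*r"),
    re.compile(r"\bgit\s+push\s+.*--force\b"),
    re.compile(r"\bgit\s+reset\s+--hard\b"),
]

# Hypothetical protected paths; the real protect-secrets rules are not in the source.
SECRET_PATHS = [
    re.compile(r"(^|/)\.env(\..+)?$"),
    re.compile(r"\.pem$"),
    re.compile(r"(^|/)id_rsa$"),
]


def is_dangerous_command(command: str) -> bool:
    """Return True if a shell command matches any deny-list pattern."""
    return any(p.search(command) for p in DANGEROUS_COMMANDS)


def touches_secret(path: str) -> bool:
    """Return True if a file path looks like a protected secret file."""
    return any(p.search(path) for p in SECRET_PATHS)


def check_tool_call(tool: str, payload: str) -> Tuple[bool, str]:
    """Decide whether a tool call may proceed.

    `payload` is the shell command for Bash, or the target path for Write/Edit.
    """
    if tool == "Bash" and is_dangerous_command(payload):
        return False, "blocked: dangerous command"
    if tool in ("Write", "Edit") and touches_secret(payload):
        return False, "blocked: protected secret file"
    return True, "ok"
```

A hook script would call `check_tool_call` before the tool runs and refuse the call when the first element is `False`. Keeping the rules as plain data makes them easy to review and extend independently of the other layers.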
L1 — Enablement
Custom Subagent Architecture
- agent.md tool permission matrix
- core SKILL.md + references/ separation architecture
- SKILL.md streamlined for all 7 agents
L2 — Traceable Ops
Traceable operations via run_id
- run_id tracking system
- Log collection pipeline
- KPI measurement and 2nd review
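One way to picture run_id-based traceability is the sketch below. It assumes a run_id is a random UUID and that log records are JSON lines collected in an in-memory list; the record fields (`agent`, `action`, `ts`) and all function names are hypothetical, as the source does not describe the actual log schema or pipeline.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Dict, List


def new_run_id() -> str:
    """Create a unique identifier for one agent run (assumed here to be a UUID)."""
    return uuid.uuid4().hex


def log_event(sink: List[str], run_id: str, agent: str, action: str) -> Dict[str, str]:
    """Append one JSON-encoded record tagged with the run_id to the sink."""
    record = {
        "run_id": run_id,
        "agent": agent,
        "action": action,
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    sink.append(json.dumps(record))
    return record


def events_for_run(sink: List[str], run_id: str) -> List[Dict[str, str]]:
    """Collect every record that belongs to a single run, in logged order."""
    return [r for r in map(json.loads, sink) if r["run_id"] == run_id]


# Usage: two runs share a sink, and each can still be reconstructed separately.
sink: List[str] = []
run_a, run_b = new_run_id(), new_run_id()
log_event(sink, run_a, "QA Engineer", "ran smoke tests")
log_event(sink, run_b, "Historian", "updated changelog")
log_event(sink, run_a, "QA Engineer", "filed report")
```

Because every record carries the same run_id, a later review or KPI computation can filter the collected logs down to exactly one run without relying on timestamps or agent names.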
Agent Permission Matrix
7 agents × Write/Edit · Bash · Coding model permissions
| Agent | Write/Edit | Bash | Coding Model |
|---|---|---|---|
| CEO Agent | Blocked | Allowlist | — |
| Fullstack Dev | Blocked | Open | Subcontract (codex) |
| QA Engineer | Blocked | Open | Subcontract (codex) |
| DevOps Engineer | Allowed | Open | Direct edit |
| Board Advisor | Blocked | Limited | Subcontract (codex) |
| Historian | Allowed | Limited | Direct edit |
| Content Writer | Allowed | Limited | Direct edit |
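The matrix above can also be expressed as data, which makes it checkable in code. The sketch below copies the table values as-is; the `Permission` class, the string encodings, and the helper functions are hypothetical and do not reflect how agent.md actually stores permissions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Permission:
    write_edit: bool              # True = Allowed, False = Blocked
    bash: str                     # "allowlist", "limited", or "open"
    coding_model: Optional[str]   # "direct", "codex" (subcontract), or None


# Values transcribed from the table; the encoding itself is an assumption.
MATRIX: Dict[str, Permission] = {
    "CEO Agent": Permission(False, "allowlist", None),
    "Fullstack Dev": Permission(False, "open", "codex"),
    "QA Engineer": Permission(False, "open", "codex"),
    "DevOps Engineer": Permission(True, "open", "direct"),
    "Board Advisor": Permission(False, "limited", "codex"),
    "Historian": Permission(True, "limited", "direct"),
    "Content Writer": Permission(True, "limited", "direct"),
}


def may_write(agent: str) -> bool:
    """Return True if the agent may use Write/Edit; unknown agents are denied."""
    perm = MATRIX.get(agent)
    return perm is not None and perm.write_edit


def bash_level(agent: str) -> Optional[str]:
    """Return the agent's Bash restriction level, or None for unknown agents."""
    perm = MATRIX.get(agent)
    return perm.bash if perm else None


def matrix_is_consistent() -> bool:
    """Check that 'direct edit' appears exactly on the rows where Write/Edit is allowed."""
    return all((p.coding_model == "direct") == p.write_edit for p in MATRIX.values())
```

The consistency check encodes a pattern visible in the table: agents with Write/Edit blocked either subcontract coding to codex or do no coding, while agents with Write/Edit allowed edit directly.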
Version History
Evolution from v1.0 to v5.1 across 6 versions
v1.0 — Initial Safety Net
Introduced hook-based secret protection and dangerous-command blocking
v2.0 — Role Definition System
Introduced SKILL.md to specify per-agent roles and behavior rules
v3.0 — Safety Net Reinforcement
Completed the L0 layer with protect-secrets and block-dangerous-commands
v4.0 — Custom Subagent Architecture
Introduced the agent.md-based tool permission matrix and reached 100% role awareness
v5.0 — Traceable Operations
Completed the run_id tracking system, log collection, and KPI measurement framework
v5.1 — SKILL Diet
Introduced the core SKILL.md + references/ separation. Streamlined SKILL.md from 680 to 134 lines (about an 80% reduction), completed for all 7 agents. Migrated to the .claude/skills/ path
KPI — Internal Validation Benchmarks
All figures are based on internal smoke tests and Sprint D benchmarks