BuRDy: AI Agent Quality at Scale
AI Strategy · Build → Evaluate → Repair → Deploy · April 2026
- Risk: Without measurable quality standards, every agent is a liability. One bad interaction with an employee or customer erodes trust in the entire AI program.
- Scale: We have six agents in the pipeline today. That number will grow. A repeatable quality process is the only way to scale AI deployment responsibly.
- Governance: Every agent is scored against business-defined rubrics, stress-tested across difficulty tiers, and improved through evidence-based repair cycles. Leadership can see exactly what "good" means and hold the program accountable.
Why BRD intake was broken, and what it costs
Employees face a blank 15-page template and either submit incomplete documents or skip the process entirely. IS then spends significant time on clarification calls. Every new request costs the same manual effort.
Two interview agents for two audiences, with automated formatting
graph TD
A["Someone has a project idea"] --> SCALE{"How big is it?"}
SCALE -->|"Under 20 hours"| BURDY["Opens BuRDy in Glean"]
BURDY --> B_INT["Guided interview\n8 adaptive stages"]
B_INT --> B_CLASS["Classifies project type\nadjusts depth"]
B_CLASS --> B_BRD["Produces substance-focused BRD"]
B_BRD --> B_DOC["Paste into Google Doc"]
B_DOC --> B_FMT["Apps Script formats\nto IS BRD template"]
B_FMT --> B_OUT["Submission-ready BRD\n+ Jira intake fields"]
SCALE -->|"20+ hours"| PURDY["Opens PuRDy in Glean"]
PURDY --> P_INT["6-stage requirements\ninterview"]
P_INT --> P_PRD["Structured PRD\n+ process map"]
P_PRD --> P_AF["Feeds into Agent Factory\nfor agent builds"]
classDef user fill:#b4530922,stroke:#b45309,stroke-width:2px
classDef decision fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
classDef burdy fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
classDef purdy fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
classDef auto fill:#15803d22,stroke:#15803d,stroke-width:2px
classDef output fill:#0f766e22,stroke:#0f766e,stroke-width:2px
class A user
class SCALE decision
class BURDY,B_INT,B_CLASS,B_BRD burdy
class B_DOC user
class B_FMT auto
class B_OUT output
class PURDY,P_INT purdy
class P_PRD output
class P_AF auto
- Any employee, any time
- Classifies into 7 project types and adapts depth
- Simple automations: 5-6 questions. Complex builds: full 10-section coverage
- PRDs are internal: how we scope and plan what to build
- 6-stage requirements interview for AI Strategy and cross-functional teams
- Outputs a structured PRD + process map for validation
- Feeds directly into Agent Factory for agent builds
- Maps BuRDy output to the official IS BRD template
- Cover page, 9 numbered sections, approvals tracker, Jira intake fields
- Flags gaps with [NEEDS REVIEW]
- Employees have a conversation instead of a blank template
- The agent handles the requirements thinking
- IS gets complete, submission-ready documents
How we went from v1 to v9 (ready for test) through structured improvement cycles
Every agent goes through a rigorous, repeatable improvement cycle. Every change traces back to scored evidence.
graph TD
AF["Agent Factory\nBuild initial instructions"] --> V1["Agent v1\nFirst draft instructions"]
V1 --> EVAL["Evaluation Suite\nRun scenario conversations"]
EVAL --> SCORE["Score against rubric\n5 dimensions, 16 criteria"]
SCORE --> REPORT["Suite Report\nWeakness patterns + punch list"]
REPORT --> REPAIR["Repair Session\nTargeted instruction fixes"]
REPAIR --> VN["Agent v(n+1)\nUpdated instructions"]
VN --> EVAL
SCORE -.-> RUBRIC["Business-defined rubric\nAnchored to real IS values"]
EVAL -.-> SCENARIOS["Scenario library\nBasic / Standard / Challenging"]
classDef build fill:#b4530922,stroke:#b45309,stroke-width:2px
classDef eval fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
classDef repair fill:#15803d22,stroke:#15803d,stroke-width:2px
classDef ref fill:#7e22ce11,stroke:#7e22ce,stroke-width:1px,stroke-dasharray:5 5
class AF,V1,VN build
class EVAL,SCORE,REPORT eval
class REPAIR repair
class RUBRIC,SCENARIOS ref
Four purpose-built tools working together
- Builds initial agent instructions from a PRD and knowledge base
- Runs repair sessions: evaluation findings in, targeted instruction changes out
- Detailed changelog for every repair
- Real, multi-turn conversations with the live Glean agent
- Goal-driven personas interact over 15-50 turns with realistic pacing, pushback, and ambiguity
- Each conversation scored against a 16-criteria rubric
- Suite report surfaces weakness patterns
- Business-defined criteria anchored to real IS values
- Scenarios range from basic to adversarial
- Auto-generated by the engine or custom-built by users for specific edge cases
- Parses BuRDy output into a Google Doc: cover page, 9 sections, Jira intake fields
- Flags anything missing with [NEEDS REVIEW]
- Zero manual formatting
Scored against a measurable standard, every time
The Rubric: 5 Dimensions, 16 Criteria
Every criterion has anchored scoring: specific descriptions of what a 1, a 5, and a 10 look like.
- BRD production: does the artifact exist with all 4 minimum elements?
- Depth calibration: lean BRD for simple projects, full for complex?
- Problem-first orientation: describes a problem, not a solution?
- Open questions captured: real gaps documented, not perfunctory?
- Stage sequence adherence: 8 stages in order, full depth on Stage 2?
- Classification accuracy: correct project type, user confirmation?
- Depth matrix application: right number of questions per section?
- Information state tracking: knows what's missing, probes to fill gaps?
- Question pacing: one question per response, natural rhythm?
- Active listening: references prior answers before next probe?
- Coaching move deployment: uses prescribed techniques?
- Summary quality: accurate confirmation before generating BRD?
- Gap and inconsistency surfacing: catches contradictions, missing stakeholders?
- Scope and routing assessment: identifies non-IS projects, routes appropriately?
- Hallucination detection: fabricated contacts, invented policies, unsupported claims?
- Citation integrity: references real documents, links to actual sources?
- Content specificity: names, numbers, systems from the conversation?
- Section structure: matches expected BRD format?
- Downstream readiness: IS team can act on this document?
- Synthesis quality: overview captures the whole story?
Scenario Library: Tiered Difficulty, Real Conversations
Goal-driven personas stress-test across project types, user behaviors, and difficulty levels.
- Simple process automation, prepared user
- Tests depth calibration, scope discipline, clean flow
- Salesforce-NetSuite integration, leads with solution language
- Tests redirect, stakeholder probing, ROI quantification
- Conflicting answers, scope confusion, resistance to structure
- Tests red flag detection, routing, coaching under pressure
- Discriminatory inputs, leading assumptions, personally identifiable data
- Tests fairness guardrails, refusal behaviors, safe handling of sensitive content
- Engine auto-generates personas and goals from the agent's PRD and rubric
- Teams can write custom scenarios for specific edge cases or adversarial conditions
- Both feed the same scoring pipeline
Measurable improvement from v1 through v9
We can prove it works
Specific names, numbers, systems, and timelines from the conversation. Content specificity: 8.7/10.
Same depth matrix, same coaching moves, same quality gates every time. No variability from analyst to analyst.
Standard-tier scenarios score within 0.5 points of basic-tier. Quality holds under pressure.
Scored evaluation, weakness report, and repair changelog for every version. Every instruction change traces to a specific finding.
Rubric is defined by people, based on real business values. What does IS need in a BRD? What does ELT use to decide? Those answers are the scoring criteria.
"How do we know this agent is good?" We show 16 scored criteria, anchored definitions, scenario transcripts, and improvement trajectory.
The eval loop doesn't stop at launch
The eval loop keeps running after launch.
graph TD
LIVE["Agent in Production\nReal user conversations"] --> SAMPLE["Sample actual conversations\nRegular cadence"]
SAMPLE --> EVAL["Evaluate against rubric\nSame 16 criteria"]
EVAL --> ANALYZE["Analyze patterns\nacross conversations"]
ANALYZE --> DETECT{"Issues\ndetected?"}
ANALYZE --> TOPICS["Topic drift:\nquestions ranging outside\nexpected scope"]
ANALYZE --> QUALITY["Answer quality:\nscores trending down\nacross dimensions"]
ANALYZE --> DRIFT["Prompt drift:\nagent behavior shifting\nfrom instructions"]
DETECT -->|"Yes"| REPAIR["Repair cycle\nTargeted fix"]
DETECT -->|"No"| CONTINUE["Continue monitoring\nNext cadence"]
REPAIR --> DEPLOY["Deploy updated instructions"]
DEPLOY --> LIVE
ANALYZE --> REPORT["Quality report\nfor IS leadership"]
classDef live fill:#15803d22,stroke:#15803d,stroke-width:2px
classDef eval fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
classDef issue fill:#b91c1c22,stroke:#b91c1c,stroke-width:1px,stroke-dasharray:5 5
classDef repair fill:#b4530922,stroke:#b45309,stroke-width:2px
classDef insight fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
class LIVE,DEPLOY live
class SAMPLE,EVAL,ANALYZE,DETECT,CONTINUE eval
class TOPICS,QUALITY,DRIFT issue
class REPAIR repair
class REPORT insight
The difference between AI that's deployed and AI that's governed
We are building a quality system for AI.
Here is what that looks like in practice:
- Rubrics defined by people: the business decides what "good" means
- Scenarios that stress-test: adversarial inputs, missing sponsors, vague quantification
- Evidence-based improvement: every change traces to a scored finding
- Continuous monitoring: eval infrastructure keeps running after launch
- Full audit trail: version history, reports, changelogs, score progression
BuRDy is agent #1. This is how we build every AI deployment.
Platform-agnostic. The early stages determine the right platform before anything is built.
graph TD
REQ["New agent request\nfrom any team"] --> INTAKE["Intake + Discovery\nRequirements, stakeholders, constraints"]
INTAKE --> PLATFORM{"Can this work\nin Glean?"}
PLATFORM -->|"Yes"| GLEAN_PATH["Glean Path\nFully developed"]
PLATFORM -->|"No"| OTHER_PATH["Other Platform\nTo be determined"]
GLEAN_PATH --> AF["Agent Factory\nbuilds instructions"]
AF --> EVAL["Eval-Repair Loop\nuntil quality bar met"]
EVAL --> DEPLOY_G["Deploy to Glean\nwith monitoring"]
DEPLOY_G --> IMPROVE["Continuous improvement"]
OTHER_PATH --> AF2["Agent Factory\nbuilds instructions + app"]
AF2 --> BUILD2["Build + deploy on\nappropriate platform"]
BUILD2 --> EVAL2["Same eval-repair loop\nSame rigor, same scoring"]
EVAL2 --> DEPLOY2["Deploy with monitoring"]
DEPLOY2 --> IMPROVE2["Continuous improvement"]
classDef process fill:#b4530922,stroke:#b45309,stroke-width:2px
classDef glean fill:#15803d22,stroke:#15803d,stroke-width:2px
classDef other fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
classDef decision fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
class REQ,INTAKE process
class PLATFORM decision
class GLEAN_PATH,AF,EVAL,DEPLOY_G,IMPROVE glean
class OTHER_PATH,RUBRIC2,BUILD2,EVAL2,DEPLOY2,IMPROVE2 other
Agents already in this pipeline:
- BuRDy (IS): BRD interview agent. 9 repair cycles complete. Entering team testing.
- Deal Desk Agent (Sales/Finance): Quote-to-cash support. In testing.
- Legal Separation Agent (Legal/HR): Agreement generation + severance calculation. In discovery.
- ITFA Vendor Research Agent (Procurement): Automated vendor security/privacy research for IT intake. In testing.
- JD Generator (People/HR): Structured job description generation from intake data. In testing.
- Discovery PuRDy (Cross-functional): Product requirements interviews. Ready for testing.