BuRDy: AI Agent Quality at Scale

AI Strategy · Build → Evaluate → Repair → Deploy · April 2026

AI agents are already handling real work across the company. The question leadership should be asking: how do we know they're good?
  • Risk: Without measurable quality standards, every agent is a liability. One bad interaction with an employee or customer erodes trust in the entire AI program.
  • Scale: We have six agents in the pipeline today. That number will grow. A repeatable quality process is the only way to scale AI deployment responsibly.
  • Governance: Every agent is scored against business-defined rubrics, stress-tested across difficulty tiers, and improved through evidence-based repair cycles. Leadership can see exactly what "good" means and hold the program accountable.
BuRDy (our BRD interview agent) is the first agent through this process. The methodology applies to Deal Desk, Legal Separation, and every future deployment.
1 - The Problem We Solved

Why BRD intake was broken, and what it costs

  • BRD template: 15 pages
  • Required sections: 9
  • Incomplete submissions: 60%+
  • Bottleneck scaling: linear

Employees face a blank 15-page template and either submit incomplete documents or skip the process entirely. IS then spends significant time on clarification calls. Every new request costs the same manual effort.

The deeper issue: incomplete requirements lead to mis-scoped projects, scope creep, and rework. The real cost shows up downstream, in projects that go wrong because they started with bad requirements.
2 - What We Built

Two interview agents for two audiences, with automated formatting

        graph TD
          A["Someone has a project idea"] --> SCALE{"How big is it?"}

          SCALE -->|"Under 20 hours"| BURDY["Opens BuRDy in Glean"]
          BURDY --> B_INT["Guided interview\n8 adaptive stages"]
          B_INT --> B_CLASS["Classifies project type\nadjusts depth"]
          B_CLASS --> B_BRD["Produces substance-focused BRD"]
          B_BRD --> B_DOC["Paste into Google Doc"]
          B_DOC --> B_FMT["Apps Script formats\nto IS BRD template"]
          B_FMT --> B_OUT["Submission-ready BRD\n+ Jira intake fields"]

          SCALE -->|"20+ hours"| PURDY["Opens PuRDy in Glean"]
          PURDY --> P_INT["6-stage requirements\ninterview"]
          P_INT --> P_PRD["Structured PRD\n+ process map"]
          P_PRD --> P_AF["Feeds into Agent Factory\nfor agent builds"]

          classDef user fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef decision fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
          classDef burdy fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef purdy fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
          classDef auto fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef output fill:#0f766e22,stroke:#0f766e,stroke-width:2px

          class A user
          class SCALE decision
          class BURDY,B_INT,B_CLASS,B_BRD burdy
          class B_DOC user
          class B_FMT auto
          class B_OUT output
          class PURDY,P_INT purdy
          class P_PRD output
          class P_AF auto
      
Glean Agent (BuRDy)
  • Any employee, any time
  • Classifies into 7 project types and adapts depth
  • Simple automations: 5-6 questions. Complex builds: full 10-section coverage
Glean Agent (Discovery PuRDy)
  • PRDs are internal: how we scope and plan what to build
  • 6-stage requirements interview for AI Strategy and cross-functional teams
  • Outputs a structured PRD + process map for validation
  • Feeds directly into Agent Factory for agent builds
Google Apps Script
  • Maps BuRDy output to the official IS BRD template
  • Cover page, 9 numbered sections, approvals tracker, Jira intake fields
  • Flags gaps with [NEEDS REVIEW]
BRD vs. PRD: A BRD (Business Requirements Document) is what employees submit to IS when they need something built. A PRD (Product Requirements Document) is what our team uses internally to scope and design the solution. BuRDy captures the request. PuRDy captures the plan.
What changed:
  • Employees have a conversation instead of a blank template
  • The agent handles the requirements thinking
  • IS gets complete, submission-ready documents
3 - The Build-Eval-Repair Loop

How we went from v1 to v9 (ready for test) through structured improvement cycles

Every agent goes through a rigorous, repeatable improvement cycle. Every change traces back to scored evidence.

        graph TD
          AF["Agent Factory\nBuild initial instructions"] --> V1["Agent v1\nFirst draft instructions"]
          V1 --> EVAL["Evaluation Suite\nRun scenario conversations"]
          EVAL --> SCORE["Score against rubric\n5 dimensions, 16 criteria"]
          SCORE --> REPORT["Suite Report\nWeakness patterns + punch list"]
          REPORT --> REPAIR["Repair Session\nTargeted instruction fixes"]
          REPAIR --> VN["Agent v(n+1)\nUpdated instructions"]
          VN --> EVAL

          SCORE -.-> RUBRIC["Business-defined rubric\nAnchored to real IS values"]
          EVAL -.-> SCENARIOS["Scenario library\nBasic / Standard / Challenging"]

          classDef build fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef eval fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef repair fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef ref fill:#7e22ce11,stroke:#7e22ce,stroke-width:1px,stroke-dasharray:5 5

          class AF,V1,VN build
          class EVAL,SCORE,REPORT eval
          class REPAIR repair
          class RUBRIC,SCENARIOS ref
      
No version ships without evidence. Every instruction change traces to a scored finding, and every finding traces to a real conversation.
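
To make that audit trail concrete, here is a minimal sketch in TypeScript of the records such traceability implies. The type and field names are illustrative assumptions, not the Agent Factory's actual schema.

        // Hypothetical record types; names are illustrative, not the tooling's real schema.
        interface ScoredFinding {
          findingId: string;
          conversationId: string;  // the evaluation conversation the finding came from
          criterion: string;       // e.g. "Question pacing"
          score: number;           // 1-10 against the anchored rubric
          evidence: string;        // excerpt from the transcript
        }

        interface RepairChange {
          changeId: string;
          agentVersion: string;        // version the change ships in, e.g. "v5"
          instructionDiff: string;     // the targeted instruction edit
          resolvesFindings: string[];  // findingIds this change traces back to
        }

        // "Why did this instruction change?" is answered by following
        // resolvesFindings -> findingId -> conversationId -> the real conversation.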
4 - The Agentic Toolchain

Four purpose-built tools working together

Agent Factory
Build · Repair
  • Builds initial agent instructions from a PRD and knowledge base
  • Runs repair sessions: evaluation findings in, targeted instruction changes out
  • Detailed changelog for every repair
Evaluation Engine
Evaluate · Score
  • Real, multi-turn conversations with the live Glean agent
  • Goal-driven personas interact over 15-50 turns with realistic pacing, pushback, and ambiguity
  • Each conversation scored against a 16-criteria rubric
  • Suite report surfaces weakness patterns
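
As a rough illustration of how one persona-driven run could be wired up, the sketch below drives a goal-driven persona against the live agent for a bounded number of turns and returns the transcript for scoring. Function names and signatures are assumptions, not the engine's real API.

        // Minimal sketch of one evaluation run; names and signatures are assumptions.
        interface Persona {
          goal: string;        // what the simulated user is trying to accomplish
          behaviors: string[]; // pushback, ambiguity, and pacing quirks to exhibit
        }

        async function runScenario(
          persona: Persona,
          maxTurns: number,                                      // e.g. somewhere in the 15-50 range
          agentReply: (history: string[]) => Promise<string>,    // the live Glean agent
          personaReply: (history: string[]) => Promise<string>,  // goal-driven user simulator
        ): Promise<string[]> {
          const transcript: string[] = [];
          for (let turn = 0; turn < maxTurns; turn++) {
            transcript.push(`user: ${await personaReply(transcript)}`);
            transcript.push(`agent: ${await agentReply(transcript)}`);
          }
          return transcript; // scored afterwards against the 16-criteria rubric
        }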
Rubric + Scenarios
Define · Red Team
  • Business-defined criteria anchored to real IS values
  • Scenarios range from basic to adversarial
  • Auto-generated by the engine or custom-built by users for specific edge cases
Apps Script Pipeline
Format · Deliver
  • Parses BuRDy output into a Google Doc: cover page, 9 sections, Jira intake fields
  • Flags anything missing with [NEEDS REVIEW]
  • Zero manual formatting
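
The production pipeline is a Google Apps Script writing into a Google Doc; the TypeScript below is a platform-neutral sketch of the gap-flagging step only. The section names and parsing are illustrative, not the official IS template or the script itself.

        // Sketch of the gap-flagging step; section names and parsing are illustrative.
        const TEMPLATE_SECTIONS = ["Executive Summary", "Problem Statement", "Business Impact"]; // ...through all 9

        function toTemplateSections(burdyOutput: Map<string, string>): string[] {
          return TEMPLATE_SECTIONS.map((section) => {
            const body = burdyOutput.get(section)?.trim();
            // Anything the interview did not cover is flagged rather than silently dropped.
            return `${section}\n${body ? body : "[NEEDS REVIEW]"}`;
          });
        }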
5 - How Evaluations Work

Scored against a measurable standard, every time

The Rubric: 5 Dimensions, 16 Criteria

Every criterion has anchored scoring: specific descriptions of what a 1, a 5, and a 10 look like.

Goal Completion
25% weight
  • BRD production: does the artifact exist with all 4 minimum elements?
  • Depth calibration: lean BRD for simple projects, full for complex?
  • Problem-first orientation: describes a problem, not a solution?
  • Open questions captured: real gaps documented, not perfunctory?
Process Quality
20% weight
  • Stage sequence adherence: 8 stages in order, full depth on Stage 2?
  • Classification accuracy: correct project type, user confirmation?
  • Depth matrix application: right number of questions per section?
  • Information state tracking: knows what's missing, probes to fill gaps?
Conversational Skill
10% weight
  • Question pacing: one question per response, natural rhythm?
  • Active listening: references prior answers before next probe?
  • Coaching move deployment: uses prescribed techniques?
  • Summary quality: accurate confirmation before generating BRD?
Red Flag Detection
20% weight
  • Gap and inconsistency surfacing: catches contradictions, missing stakeholders?
  • Scope and routing assessment: identifies non-IS projects, routes appropriately?
  • Hallucination detection: fabricated contacts, invented policies, unsupported claims?
  • Citation integrity: references real documents, links to actual sources?
Output Quality
25% weight
  • Content specificity: names, numbers, systems from the conversation?
  • Section structure: matches expected BRD format?
  • Downstream readiness: IS team can act on this document?
  • Synthesis quality: overview captures the whole story?
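
For clarity on how the weights above roll up, here is a minimal scoring sketch in TypeScript. It assumes criteria within a dimension are averaged with equal weight, which is our assumption rather than a documented rule.

        // Dimension weights as listed above (sum to 1.0).
        const WEIGHTS = {
          goalCompletion: 0.25,
          processQuality: 0.20,
          conversationalSkill: 0.10,
          redFlagDetection: 0.20,
          outputQuality: 0.25,
        } as const;

        type Dimension = keyof typeof WEIGHTS;

        // Assumption: each dimension's score is the plain average of its criterion scores (1-10).
        function dimensionScore(criterionScores: number[]): number {
          return criterionScores.reduce((a, b) => a + b, 0) / criterionScores.length;
        }

        // Overall conversation score: weighted sum of dimension scores.
        function overallScore(scores: Record<Dimension, number[]>): number {
          return (Object.keys(WEIGHTS) as Dimension[])
            .reduce((sum, d) => sum + WEIGHTS[d] * dimensionScore(scores[d]), 0);
        }

A conversation that scores 8 on every dimension lands at 8.0 overall; the weights only shift the result when dimensions diverge.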

Scenario Library: Tiered Difficulty, Real Conversations

Goal-driven personas stress-test across project types, user behaviors, and difficulty levels.

Basic Tier
  • Simple process automation, prepared user
  • Tests depth calibration, scope discipline, clean flow
Standard Tier
  • Salesforce-NetSuite integration, leads with solution language
  • Tests redirect, stakeholder probing, ROI quantification
Challenging Tier
  • Conflicting answers, scope confusion, resistance to structure
  • Tests red flag detection, routing, coaching under pressure
Ethics, Bias + Safety
  • Discriminatory inputs, leading assumptions, personally identifiable data
  • Tests fairness guardrails, refusal behaviors, safe handling of sensitive content
Scenario generation is flexible.
  • Engine auto-generates personas and goals from the agent's PRD and rubric
  • Teams can write custom scenarios for specific edge cases or adversarial conditions
  • Both feed the same scoring pipeline
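
To show what a single scenario definition might contain, here is an illustrative sketch. The field names are assumptions, not the engine's actual schema, and the example values echo the Standard-tier description above.

        // Illustrative scenario shape; auto-generated and hand-written scenarios would share it.
        type Tier = "basic" | "standard" | "challenging" | "ethics-bias-safety";

        interface Scenario {
          tier: Tier;
          projectType: string;
          persona: { role: string; quirks: string[] };
          goals: string[];   // what the simulated user wants out of the interview
          probes: string[];  // behaviors the agent must handle well
        }

        const standardExample: Scenario = {
          tier: "standard",
          projectType: "Salesforce-NetSuite integration",
          persona: { role: "operations analyst", quirks: ["leads with solution language"] },
          goals: ["get IS to commit to building the integration"],
          probes: ["redirect from solution to problem", "stakeholder probing", "ROI quantification"],
        };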
6 - Score Progression Across Versions

Measurable improvement from v1 through v9

v1 - Initial Build
Agent Factory generates first instructions
Avg ~4.5/10. Hallucinated routing details, no classification gate, missing metadata, no exec summary.
v2 - First Repair Cycle
Eliminated hallucinations, added classification gate
Fixed fabricated contacts and Slack channels. Added confirmation gate, metadata, exec summary. One-question-per-response rule.
v3 - Template Alignment
Closed structural gaps against IS BRD template
"Why Now" urgency question, constraints, Enablement section. Template coverage: ~60% to ~95%.
v4-v5 - Coaching Precision
Targeted conversational discipline
Solution-first redirect enforced. Multi-question batching eliminated. Red flag detection: 7.3 to 8.0.
v6-v8 - Cross-Scenario Hardening
Tested against Standard tier, tightened coaching moves
Active listening callbacks, metadata pre-verification, ROI validation, named routing destinations. Scores stabilized at 8.0+ across all dimensions.
v9 - All-Scenario Validation
Ran full suite across all scenario tiers
Red flag detection climbed to 8.3. Process quality dipped to 7.5 on challenging tier (question batching, classification gate timing). Punch list: enforce one-question-per-response, add explicit classification confirmation gate, apply depth-matrix probing as floor.
7 - What the Scores Mean for the Business

We can prove it works

  • v1 average: 4.5
  • Latest average: 8.2 (+82% improvement)
  • Repair cycles: 9
  • Findings resolved: 30+
Quality

Specific names, numbers, systems, and timelines from the conversation. Content specificity: 8.7/10.

Consistency

Same depth matrix, same coaching moves, same quality gates every time. No variability from analyst to analyst.

Predictability

Standard-tier scenarios score within 0.5 points of basic-tier. Quality holds under pressure.

Auditability

Scored evaluation, weakness report, and repair changelog for every version. Every instruction change traces to a specific finding.

Governance

Rubric is defined by people, based on real business values. What does IS need in a BRD? What does ELT use to decide? Those answers are the scoring criteria.

Explainability

"How do we know this agent is good?" We show 16 scored criteria, anchored definitions, scenario transcripts, and improvement trajectory.

8 - Post-Deploy: Continuous Monitoring

The eval loop doesn't stop at launch

Once an agent is live, production conversations are sampled on a regular cadence and scored against the same 16-criteria rubric used during development.

        graph TD
          LIVE["Agent in Production\nReal user conversations"] --> SAMPLE["Sample actual conversations\nRegular cadence"]
          SAMPLE --> EVAL["Evaluate against rubric\nSame 16 criteria"]
          EVAL --> ANALYZE["Analyze patterns\nacross conversations"]
          ANALYZE --> DETECT{"Issues\ndetected?"}

          ANALYZE --> TOPICS["Topic drift:\nquestions ranging outside\nexpected scope"]
          ANALYZE --> QUALITY["Answer quality:\nscores trending down\nacross dimensions"]
          ANALYZE --> DRIFT["Prompt drift:\nagent behavior shifting\nfrom instructions"]

          DETECT -->|"Yes"| REPAIR["Repair cycle\nTargeted fix"]
          DETECT -->|"No"| CONTINUE["Continue monitoring\nNext cadence"]
          REPAIR --> DEPLOY["Deploy updated instructions"]
          DEPLOY --> LIVE
          ANALYZE --> REPORT["Quality report\nfor IS leadership"]

          classDef live fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef eval fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef issue fill:#b91c1c22,stroke:#b91c1c,stroke-width:1px,stroke-dasharray:5 5
          classDef repair fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef insight fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px

          class LIVE,DEPLOY live
          class SAMPLE,EVAL,ANALYZE,DETECT,CONTINUE eval
          class TOPICS,QUALITY,DRIFT issue
          class REPAIR repair
          class REPORT insight
      
Agents don't degrade silently. The same eval infrastructure that got BuRDy to 8.2 keeps running after launch. Leadership sees scores, trends, and fixes each month.
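
A minimal sketch of what one monitoring pass could look like in code, assuming per-dimension baselines from the pre-launch suite and an arbitrary drift threshold; both are assumptions, not settled parameters.

        // Hypothetical monitoring pass; cadence, baselines, and threshold are assumptions.
        interface ScoredConversation {
          conversationId: string;
          dimensionScores: Record<string, number>; // same dimensions as the rubric
        }

        function monitoringPass(
          sample: ScoredConversation[],
          baseline: Record<string, number>,  // e.g. averages from the pre-launch suite
          threshold = 0.5,                   // how far scores may drift before we act
        ) {
          const flagged: string[] = [];
          for (const dimension of Object.keys(baseline)) {
            const avg =
              sample.reduce((sum, c) => sum + c.dimensionScores[dimension], 0) / sample.length;
            if (baseline[dimension] - avg > threshold) {
              flagged.push(`${dimension}: ${avg.toFixed(1)} vs baseline ${baseline[dimension].toFixed(1)}`);
            }
          }
          return flagged.length > 0
            ? { action: "open repair cycle", flagged }
            : { action: "continue monitoring", flagged };
        }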
9 - Why This Matters

The difference between AI that's deployed and AI that's governed

We are building a quality system for AI.

Here is what that looks like in practice:

  • Rubrics defined by people: the business decides what "good" means
  • Scenarios that stress-test: adversarial inputs, missing sponsors, vague quantification
  • Evidence-based improvement: every change traces to a scored finding
  • Continuous monitoring: eval infrastructure keeps running after launch
  • Full audit trail: version history, reports, changelogs, score progression
10 - The Model Going Forward

BuRDy is agent #1. This is how we build every AI deployment.

Platform-agnostic. The early stages determine the right platform before anything is built.

        graph TD
          REQ["New agent request\nfrom any team"] --> INTAKE["Intake + Discovery\nRequirements, stakeholders, constraints"]
          INTAKE --> PLATFORM{"Can this work\nin Glean?"}
          PLATFORM -->|"Yes"| GLEAN_PATH["Glean Path\nFully developed"]
          PLATFORM -->|"No"| OTHER_PATH["Other Platform\nTo be determined"]
          GLEAN_PATH --> AF["Agent Factory\nbuilds instructions"]
          AF --> EVAL["Eval-Repair Loop\nuntil quality bar met"]
          EVAL --> DEPLOY_G["Deploy to Glean\nwith monitoring"]
          DEPLOY_G --> IMPROVE["Continuous improvement"]
          OTHER_PATH --> AF2["Agent Factory\nbuilds instructions + app"]
          AF2 --> BUILD2["Build + deploy on\nappropriate platform"]
          BUILD2 --> EVAL2["Same eval-repair loop\nSame rigor, same scoring"]
          EVAL2 --> DEPLOY2["Deploy with monitoring"]
          DEPLOY2 --> IMPROVE2["Continuous improvement"]

          classDef process fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef glean fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef other fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef decision fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px

          class REQ,INTAKE process
          class PLATFORM decision
          class GLEAN_PATH,AF,EVAL,DEPLOY_G,IMPROVE glean
          class OTHER_PATH,AF2,BUILD2,EVAL2,DEPLOY2,IMPROVE2 other
      
Platform changes. Quality doesn't. The eval-repair loop applies the same rigor whether an agent runs in Glean, a custom app, or a third-party tool.

Agents already in this pipeline: