BuRDy: AI Agent Quality at Scale

AI Strategy · Build → Evaluate → Repair → Deploy · April 2026

AI agents are already handling real work across the company. The question leadership should be asking: how do we know they're good?
  • Risk: Without measurable quality standards, every agent is a liability. One bad interaction with an employee or customer erodes trust in the entire AI program.
  • Scale: We have six agents in the pipeline today. That number will grow. A repeatable quality process is the only way to scale AI deployment responsibly.
  • Governance: Every agent is scored against business-defined rubrics, stress-tested across difficulty tiers, and improved through evidence-based repair cycles. Leadership can see exactly what "good" means and hold the program accountable.
BuRDy (our BRD interview agent) is the first agent through this process. The methodology applies to Deal Desk, Legal Separation, and every future deployment.
1 - The Problem We Solved

Why BRD intake was broken, and what it costs

  • BRD template: 15 pages
  • Required sections: 9
  • Incomplete submissions: 60%+
  • Bottleneck scaling: linear

Employees face a blank 15-page template and either submit incomplete documents or skip the process entirely. IS then spends significant time on clarification calls. Every new request costs the same manual effort.

The deeper issue: incomplete requirements lead to mis-scoped projects, scope creep, and rework. The real cost shows up downstream, in projects that go wrong because they started with bad requirements.
2 - What We Built

Two interview agents for two audiences, with automated formatting

        graph TD
          A["Someone has a project idea"] --> SCALE{"How big is it?"}

          SCALE -->|"Under 20 hours"| BURDY["Opens BuRDy in Glean"]
          BURDY --> B_INT["Guided interview\n8 adaptive stages"]
          B_INT --> B_CLASS["Classifies project type\nadjusts depth"]
          B_CLASS --> B_BRD["Produces substance-focused BRD"]
          B_BRD --> B_DOC["Paste into Google Doc"]
          B_DOC --> B_FMT["Apps Script formats\nto IS BRD template"]
          B_FMT --> B_OUT["Submission-ready BRD\n+ Jira intake fields"]

          SCALE -->|"20+ hours"| PURDY["Opens PuRDy in Glean"]
          PURDY --> P_INT["6-stage requirements\ninterview"]
          P_INT --> P_PRD["Structured PRD\n+ process map"]
          P_PRD --> P_AF["Feeds into Agent Factory\nfor agent builds"]

          classDef user fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef decision fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
          classDef burdy fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef purdy fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px
          classDef auto fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef output fill:#0f766e22,stroke:#0f766e,stroke-width:2px

          class A user
          class SCALE decision
          class BURDY,B_INT,B_CLASS,B_BRD burdy
          class B_DOC user
          class B_FMT auto
          class B_OUT output
          class PURDY,P_INT purdy
          class P_PRD output
          class P_AF auto
      
Glean Agent (BuRDy)
  • Any employee, any time
  • Classifies into 7 project types and adapts depth
  • Simple automations: 5-6 questions. Complex builds: full 10-section coverage
Glean Agent (Discovery PuRDy)
  • PRDs are internal: how we scope and plan what to build
  • 6-stage requirements interview for AI Strategy and cross-functional teams
  • Outputs a structured PRD + process map for validation
  • Feeds directly into Agent Factory for agent builds
Google Apps Script
  • Maps BuRDy output to the official IS BRD template
  • Cover page, 9 numbered sections, approvals tracker, Jira intake fields
  • Flags gaps with [NEEDS REVIEW]
BRD vs. PRD: A BRD (Business Requirements Document) is what employees submit to IS when they need something built. A PRD (Product Requirements Document) is what our team uses internally to scope and design the solution. BuRDy captures the request. PuRDy captures the plan.
What changed:
  • Employees have a conversation instead of a blank template
  • The agent handles the requirements thinking
  • IS gets complete, submission-ready documents
3 - The Build-Eval-Repair Loop

How we went from v1 to v9 (ready for test) through structured improvement cycles

Every agent goes through a rigorous, repeatable improvement cycle. Every change traces back to scored evidence.

        graph TD
          AF["Agent Factory\nBuild initial instructions"] --> V1["Agent v1\nFirst draft instructions"]
          V1 --> EVAL["Evaluation Suite\nRun scenario conversations"]
          EVAL --> SCORE["Score against rubric\n5 dimensions, 16 criteria"]
          SCORE --> REPORT["Suite Report\nWeakness patterns + punch list"]
          REPORT --> REPAIR["Repair Session\nTargeted instruction fixes"]
          REPAIR --> VN["Agent v(n+1)\nUpdated instructions"]
          VN --> EVAL

          SCORE -.-> RUBRIC["Business-defined rubric\nAnchored to real IS values"]
          EVAL -.-> SCENARIOS["Scenario library\nBasic / Standard / Challenging"]

          classDef build fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef eval fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef repair fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef ref fill:#7e22ce11,stroke:#7e22ce,stroke-width:1px,stroke-dasharray:5 5

          class AF,V1,VN build
          class EVAL,SCORE,REPORT eval
          class REPAIR repair
          class RUBRIC,SCENARIOS ref
      
No version ships without evidence. Every instruction change traces to a scored finding, and every finding traces to a real conversation.
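
To make that audit trail concrete, here is a minimal sketch in TypeScript of the records such traceability implies. The type and field names are illustrative assumptions, not the Agent Factory's actual schema.

        // Hypothetical record types; names are illustrative, not the tooling's real schema.
        interface ScoredFinding {
          findingId: string;
          conversationId: string;  // the evaluation conversation the finding came from
          criterion: string;       // e.g. "Question pacing"
          score: number;           // 1-10 against the anchored rubric
          evidence: string;        // excerpt from the transcript
        }

        interface RepairChange {
          changeId: string;
          agentVersion: string;        // version the change ships in, e.g. "v5"
          instructionDiff: string;     // the targeted instruction edit
          resolvesFindings: string[];  // findingIds this change traces back to
        }

        // "Why did this instruction change?" is answered by following
        // resolvesFindings -> findingId -> conversationId -> the real conversation.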
4 - The Agentic Toolchain

Four purpose-built tools working together

Agent Factory
Build · Repair
  • Builds initial agent instructions from a PRD and knowledge base
  • Runs repair sessions: evaluation findings in, targeted instruction changes out
  • Detailed changelog for every repair
Evaluation Engine
Evaluate · Score
  • Real, multi-turn conversations with the live Glean agent
  • Goal-driven personas interact over 15-50 turns with realistic pacing, pushback, and ambiguity
  • Each conversation scored against a 16-criteria rubric
  • Suite report surfaces weakness patterns
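
As a rough illustration of how one persona-driven run could be wired up, the sketch below drives a goal-driven persona against the live agent for a bounded number of turns and returns the transcript for scoring. Function names and signatures are assumptions, not the engine's real API.

        // Minimal sketch of one evaluation run; names and signatures are assumptions.
        interface Persona {
          goal: string;        // what the simulated user is trying to accomplish
          behaviors: string[]; // pushback, ambiguity, and pacing quirks to exhibit
        }

        async function runScenario(
          persona: Persona,
          maxTurns: number,                                      // e.g. somewhere in the 15-50 range
          agentReply: (history: string[]) => Promise<string>,    // the live Glean agent
          personaReply: (history: string[]) => Promise<string>,  // goal-driven user simulator
        ): Promise<string[]> {
          const transcript: string[] = [];
          for (let turn = 0; turn < maxTurns; turn++) {
            transcript.push(`user: ${await personaReply(transcript)}`);
            transcript.push(`agent: ${await agentReply(transcript)}`);
          }
          return transcript; // scored afterwards against the 16-criteria rubric
        }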
Rubric + Scenarios
Define · Red Team
  • Business-defined criteria anchored to real IS values
  • Scenarios range from basic to adversarial
  • Auto-generated by the engine or custom-built by users for specific edge cases
Apps Script Pipeline
Format · Deliver
  • Parses BuRDy output into a Google Doc: cover page, 9 sections, Jira intake fields
  • Flags anything missing with [NEEDS REVIEW]
  • Zero manual formatting
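
The production pipeline is a Google Apps Script writing into a Google Doc; the TypeScript below is a platform-neutral sketch of the gap-flagging step only. The section names and parsing are illustrative, not the official IS template or the script itself.

        // Sketch of the gap-flagging step; section names and parsing are illustrative.
        const TEMPLATE_SECTIONS = ["Executive Summary", "Problem Statement", "Business Impact"]; // ...through all 9

        function toTemplateSections(burdyOutput: Map<string, string>): string[] {
          return TEMPLATE_SECTIONS.map((section) => {
            const body = burdyOutput.get(section)?.trim();
            // Anything the interview did not cover is flagged rather than silently dropped.
            return `${section}\n${body ? body : "[NEEDS REVIEW]"}`;
          });
        }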
5 - How Evaluations Work

Scored against a measurable standard, every time

The Rubric: 5 Dimensions, 16 Criteria

Every criterion has anchored scoring: specific descriptions of what a 1, a 5, and a 10 look like.

Goal Completion
25% weight
  • BRD production: does the artifact exist with all 4 minimum elements?
  • Depth calibration: lean BRD for simple projects, full for complex?
  • Problem-first orientation: describes a problem, not a solution?
  • Open questions captured: real gaps documented, not perfunctory?
Process Quality
20% weight
  • Stage sequence adherence: 8 stages in order, full depth on Stage 2?
  • Classification accuracy: correct project type, user confirmation?
  • Depth matrix application: right number of questions per section?
  • Information state tracking: knows what's missing, probes to fill gaps?
Conversational Skill
10% weight
  • Question pacing: one question per response, natural rhythm?
  • Active listening: references prior answers before next probe?
  • Coaching move deployment: uses prescribed techniques?
  • Summary quality: accurate confirmation before generating BRD?
Red Flag Detection
20% weight
  • Gap and inconsistency surfacing: catches contradictions, missing stakeholders?
  • Scope and routing assessment: identifies non-IS projects, routes appropriately?
  • Hallucination detection: fabricated contacts, invented policies, unsupported claims?
  • Citation integrity: references real documents, links to actual sources?
Output Quality
25% weight
  • Content specificity: names, numbers, systems from the conversation?
  • Section structure: matches expected BRD format?
  • Downstream readiness: IS team can act on this document?
  • Synthesis quality: overview captures the whole story?
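
For clarity on how the weights above roll up, here is a minimal scoring sketch in TypeScript. It assumes criteria within a dimension are averaged with equal weight, which is our assumption rather than a documented rule.

        // Dimension weights as listed above (sum to 1.0).
        const WEIGHTS = {
          goalCompletion: 0.25,
          processQuality: 0.20,
          conversationalSkill: 0.10,
          redFlagDetection: 0.20,
          outputQuality: 0.25,
        } as const;

        type Dimension = keyof typeof WEIGHTS;

        // Assumption: each dimension's score is the plain average of its criterion scores (1-10).
        function dimensionScore(criterionScores: number[]): number {
          return criterionScores.reduce((a, b) => a + b, 0) / criterionScores.length;
        }

        // Overall conversation score: weighted sum of dimension scores.
        function overallScore(scores: Record<Dimension, number[]>): number {
          return (Object.keys(WEIGHTS) as Dimension[])
            .reduce((sum, d) => sum + WEIGHTS[d] * dimensionScore(scores[d]), 0);
        }

A conversation that scores 8 on every dimension lands at 8.0 overall; the weights only shift the result when dimensions diverge.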

Scenario Library: Tiered Difficulty, Real Conversations

Goal-driven personas stress-test across project types, user behaviors, and difficulty levels.

Basic Tier
  • Simple process automation, prepared user
  • Tests depth calibration, scope discipline, clean flow
Standard Tier
  • Salesforce-NetSuite integration, leads with solution language
  • Tests redirect, stakeholder probing, ROI quantification
Challenging Tier
  • Conflicting answers, scope confusion, resistance to structure
  • Tests red flag detection, routing, coaching under pressure
Ethics, Bias + Safety
  • Discriminatory inputs, leading assumptions, personally identifiable data
  • Tests fairness guardrails, refusal behaviors, safe handling of sensitive content
Scenario generation is flexible.
  • Engine auto-generates personas and goals from the agent's PRD and rubric
  • Teams can write custom scenarios for specific edge cases or adversarial conditions
  • Both feed the same scoring pipeline
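
To show what a single scenario definition might contain, here is an illustrative sketch. The field names are assumptions, not the engine's actual schema, and the example values echo the Standard-tier description above.

        // Illustrative scenario shape; auto-generated and hand-written scenarios would share it.
        type Tier = "basic" | "standard" | "challenging" | "ethics-bias-safety";

        interface Scenario {
          tier: Tier;
          projectType: string;
          persona: { role: string; quirks: string[] };
          goals: string[];   // what the simulated user wants out of the interview
          probes: string[];  // behaviors the agent must handle well
        }

        const standardExample: Scenario = {
          tier: "standard",
          projectType: "Salesforce-NetSuite integration",
          persona: { role: "operations analyst", quirks: ["leads with solution language"] },
          goals: ["get IS to commit to building the integration"],
          probes: ["redirect from solution to problem", "stakeholder probing", "ROI quantification"],
        };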
6 - Score Progression Across Versions

Measurable improvement from v1 through v9

v1 - Initial Build
Agent Factory generates first instructions
Avg ~4.5/10. Hallucinated routing details, no classification gate, missing metadata, no exec summary.
v2 - First Repair Cycle
Eliminated hallucinations, added classification gate
Fixed fabricated contacts and Slack channels. Added confirmation gate, metadata, exec summary. One-question-per-response rule.
v3 - Template Alignment
Closed structural gaps against IS BRD template
"Why Now" urgency question, constraints, Enablement section. Template coverage: ~60% to ~95%.
v4-v5 - Coaching Precision
Targeted conversational discipline
Solution-first redirect enforced. Multi-question batching eliminated. Red flag detection: 7.3 to 8.0.
v6-v8 - Cross-Scenario Hardening
Tested against Standard tier, tightened coaching moves
Active listening callbacks, metadata pre-verification, ROI validation, named routing destinations. Scores stabilized at 8.0+ across all dimensions.
v9 - All-Scenario Validation
Ran full suite across all scenario tiers
Red flag detection climbed to 8.3. Process quality dipped to 7.5 on challenging tier (question batching, classification gate timing). Punch list: enforce one-question-per-response, add explicit classification confirmation gate, apply depth-matrix probing as floor.
7 - What the Scores Mean for the Business

We can prove it works

  • v1 average: 4.5
  • Latest average: 8.2 (+82% improvement)
  • Repair cycles: 9
  • Findings resolved: 30+
Quality

Specific names, numbers, systems, and timelines from the conversation. Content specificity: 8.7/10.

Consistency

Same depth matrix, same coaching moves, same quality gates every time. No variability from analyst to analyst.

Predictability

Standard-tier scenarios score within 0.5 points of basic-tier. Quality holds under pressure.

Auditability

Scored evaluation, weakness report, and repair changelog for every version. Every instruction change traces to a specific finding.

Governance

Rubric is defined by people, based on real business values. What does IS need in a BRD? What does ELT use to decide? Those answers are the scoring criteria.

Explainability

"How do we know this agent is good?" We show 16 scored criteria, anchored definitions, scenario transcripts, and improvement trajectory.

8 - Post-Deploy: Continuous Monitoring

The eval loop doesn't stop at launch

Once an agent is live, production conversations are sampled on a regular cadence and scored against the same 16-criteria rubric used during development.

        graph TD
          LIVE["Agent in Production\nReal user conversations"] --> SAMPLE["Sample actual conversations\nRegular cadence"]
          SAMPLE --> EVAL["Evaluate against rubric\nSame 16 criteria"]
          EVAL --> ANALYZE["Analyze patterns\nacross conversations"]
          ANALYZE --> DETECT{"Issues\ndetected?"}

          ANALYZE --> TOPICS["Topic drift:\nquestions ranging outside\nexpected scope"]
          ANALYZE --> QUALITY["Answer quality:\nscores trending down\nacross dimensions"]
          ANALYZE --> DRIFT["Prompt drift:\nagent behavior shifting\nfrom instructions"]

          DETECT -->|"Yes"| REPAIR["Repair cycle\nTargeted fix"]
          DETECT -->|"No"| CONTINUE["Continue monitoring\nNext cadence"]
          REPAIR --> DEPLOY["Deploy updated instructions"]
          DEPLOY --> LIVE
          ANALYZE --> REPORT["Quality report\nfor IS leadership"]

          classDef live fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef eval fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef issue fill:#b91c1c22,stroke:#b91c1c,stroke-width:1px,stroke-dasharray:5 5
          classDef repair fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef insight fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px

          class LIVE,DEPLOY live
          class SAMPLE,EVAL,ANALYZE,DETECT,CONTINUE eval
          class TOPICS,QUALITY,DRIFT issue
          class REPAIR repair
          class REPORT insight
      
Agents don't degrade silently. The same eval infrastructure that got BuRDy to 8.2 keeps running after launch. Leadership sees scores, trends, and fixes each month.
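
A minimal sketch of what one monitoring pass could look like in code, assuming per-dimension baselines from the pre-launch suite and an arbitrary drift threshold; both are assumptions, not settled parameters.

        // Hypothetical monitoring pass; cadence, baselines, and threshold are assumptions.
        interface ScoredConversation {
          conversationId: string;
          dimensionScores: Record<string, number>; // same dimensions as the rubric
        }

        function monitoringPass(
          sample: ScoredConversation[],
          baseline: Record<string, number>,  // e.g. averages from the pre-launch suite
          threshold = 0.5,                   // how far scores may drift before we act
        ) {
          const flagged: string[] = [];
          for (const dimension of Object.keys(baseline)) {
            const avg =
              sample.reduce((sum, c) => sum + c.dimensionScores[dimension], 0) / sample.length;
            if (baseline[dimension] - avg > threshold) {
              flagged.push(`${dimension}: ${avg.toFixed(1)} vs baseline ${baseline[dimension].toFixed(1)}`);
            }
          }
          return flagged.length > 0
            ? { action: "open repair cycle", flagged }
            : { action: "continue monitoring", flagged };
        }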
9 - Why This Matters

The difference between AI that's deployed and AI that's governed

We are building a quality system for AI.

Here is what that looks like in practice:

  • Rubrics defined by people: the business decides what "good" means
  • Scenarios that stress-test: adversarial inputs, missing sponsors, vague quantification
  • Evidence-based improvement: every change traces to a scored finding
  • Continuous monitoring: eval infrastructure keeps running after launch
  • Full audit trail: version history, reports, changelogs, score progression
10 - The Model Going Forward

BuRDy is agent #1. This is how we build every AI deployment.

Platform-agnostic. The early stages determine the right platform before anything is built.

        graph TD
          REQ["New agent request\nfrom any team"] --> INTAKE["Intake + Discovery\nRequirements, stakeholders, constraints"]
          INTAKE --> PLATFORM{"Can this work\nin Glean?"}
          PLATFORM -->|"Yes"| GLEAN_PATH["Glean Path\nFully developed"]
          PLATFORM -->|"No"| OTHER_PATH["Other Platform\nTo be determined"]
          GLEAN_PATH --> AF["Agent Factory\nbuilds instructions"]
          AF --> EVAL["Eval-Repair Loop\nuntil quality bar met"]
          EVAL --> DEPLOY_G["Deploy to Glean\nwith monitoring"]
          DEPLOY_G --> IMPROVE["Continuous improvement"]
          OTHER_PATH --> AF2["Agent Factory\nbuilds instructions + app"]
          AF2 --> BUILD2["Build + deploy on\nappropriate platform"]
          BUILD2 --> EVAL2["Same eval-repair loop\nSame rigor, same scoring"]
          EVAL2 --> DEPLOY2["Deploy with monitoring"]
          DEPLOY2 --> IMPROVE2["Continuous improvement"]

          classDef process fill:#b4530922,stroke:#b45309,stroke-width:2px
          classDef glean fill:#15803d22,stroke:#15803d,stroke-width:2px
          classDef other fill:#1d4ed822,stroke:#1d4ed8,stroke-width:2px
          classDef decision fill:#7e22ce22,stroke:#7e22ce,stroke-width:2px

          class REQ,INTAKE process
          class PLATFORM decision
          class GLEAN_PATH,AF,EVAL,DEPLOY_G,IMPROVE glean
          class OTHER_PATH,AF2,BUILD2,EVAL2,DEPLOY2,IMPROVE2 other
      
Platform changes. Quality doesn't. The eval-repair loop applies the same rigor whether an agent runs in Glean, a custom app, or a third-party tool.

Agents already in this pipeline: