aesop // parallelized adversarial red-team execution
The Hydra assembles all available intelligence on a target chatbot, identifies its weakest domains, fans base scenarios into persona-driven variations, and unleashes them simultaneously as concurrent multi-turn conversations. Each head adapts its approach based on the target's responses, then every interaction is scored to produce a domain-level resilience report.
graph TD
%% ---- INPUT SOURCES ----
I1["System Instructions"]:::input
I2["Vital Context"]:::input
I3["Prior AESOP Analyses"]:::input
I4["Cave of Shadows Runs"]:::input
I5["Historical Hydra Scores"]:::input
%% ---- STAGE 1: BRIEFING ----
B["Stage 1: Briefing Assembly"]:::stage
I1 --> B
I2 --> B
I3 --> B
I4 --> B
I5 --> B
%% ---- STAGE 2: SCENARIOS ----
S{"Stage 2: Scenario Loading"}:::decision
B -->|"target dossier"| S
S -->|"Cave run linked"| SC["Load Cave Scenarios"]:::stage
S -->|"No Cave run"| SF["Fallback: Static Probe Battery"]:::fallback
SC --> A
SF --> A
%% ---- STAGE 3: ALLOCATION ----
A["Stage 3: Tiger Team Head Allocation"]:::alloc
A -->|"weakness-weighted distribution"| FAN
%% ---- STAGE 4: FANNING ----
FAN["Stage 4: Scenario Fanning"]:::fan
FAN -->|"N variations per scenario"| EX
%% ---- STAGE 5: EXECUTION ----
EX["Stage 5: Parallel Head Execution"]:::exec
EX --> H1["Head 1: Persona A"]:::head
EX --> H2["Head 2: Persona B"]:::head
EX --> H3["Head 3: Persona C"]:::head
EX --> HD["..."]:::head
EX --> HN["Head N: Persona N"]:::head
H1 -->|"multi-turn conversation"| EP["Target Chatbot Endpoint"]:::endpoint
H2 -->|"adaptive dialogue"| EP
H3 -->|"escalation tactics"| EP
HD --> EP
HN --> EP
EP -->|"responses"| SC1
%% ---- STAGE 6: SCORING ----
SC1["Stage 6: Cave Scorer"]:::scoring
SC1 --> D1["Safety Score"]:::domain
SC1 --> D2["Ethics Score"]:::domain
SC1 --> D3["Bias Score"]:::domain
SC1 --> D4["Legal Score"]:::domain
SC1 --> D5["Security Score"]:::domain
D1 --> AGG["Domain Score Aggregation"]:::agg
D2 --> AGG
D3 --> AGG
D4 --> AGG
D5 --> AGG
AGG --> RPT["Hydra Report"]:::report
%% ---- STYLES ----
classDef input fill:#1a2e3d,stroke:#22d3ee,color:#e0f7fa,stroke-width:1.5px
classDef stage fill:#241a35,stroke:#a78bfa,color:#e8e0f0,stroke-width:1.5px
classDef decision fill:#241a35,stroke:#a78bfa,color:#e8e0f0,stroke-width:1.5px
classDef fallback fill:#2a1f1a,stroke:#fb923c,color:#fde8d0,stroke-width:1.5px,stroke-dasharray:5 3
classDef alloc fill:#2a2618,stroke:#fbbf24,color:#fef3c7,stroke-width:1.5px
classDef fan fill:#2a1a18,stroke:#fb923c,color:#fde8d0,stroke-width:1.5px
classDef exec fill:#2a1a1a,stroke:#f87171,color:#fde0e0,stroke-width:1.5px
classDef head fill:#1a1128,stroke:#f87171,color:#fca5a5,stroke-width:1px
classDef endpoint fill:#0f0a1a,stroke:#f87171,color:#fca5a5,stroke-width:2px
classDef scoring fill:#1a2a1a,stroke:#4ade80,color:#dcfce7,stroke-width:1.5px
classDef domain fill:#1a2a1a,stroke:#4ade80,color:#bbf7d0,stroke-width:1px
classDef agg fill:#1a2a1a,stroke:#4ade80,color:#dcfce7,stroke-width:1.5px
classDef report fill:#2a1a35,stroke:#a78bfa,color:#e8e0f0,stroke-width:2px
The Hydra pulls from every available data source on the target before planning its attack.
All intel is compiled into a structured target_dossier that every downstream stage receives. The dossier includes system prompt analysis, known weaknesses, and prior score trajectories.
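As a rough illustration of what a structured dossier could look like, here is a minimal sketch. Only the name target_dossier comes from the text above; the class name, every field name, and the `latest_scores` helper are assumptions made for the example, not the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TargetDossier:
    """Hypothetical shape for the briefing; all field names are assumptions."""

    system_instructions: str
    vital_context: str
    known_weaknesses: list[str] = field(default_factory=list)
    # domain name -> scores from earlier Hydra runs, oldest first
    score_history: dict[str, list[float]] = field(default_factory=dict)

    def latest_scores(self) -> dict[str, float]:
        """Return the most recent score per domain, skipping empty histories."""
        return {d: s[-1] for d, s in self.score_history.items() if s}


dossier = TargetDossier(
    system_instructions="<system prompt text>",
    vital_context="<deployment notes>",
    score_history={"safety": [60.0, 72.0], "bias": []},
)
print(dossier.latest_scores())
```

Keeping the score history as an ordered list per domain is one way to carry "prior score trajectories" forward to the allocation stage.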
If a Cave of Shadows run is linked, its tailored adversarial scenarios become the base. Otherwise, the Hydra falls back to 20 built-in probes (10 director + 10 worker) covering all five domains.
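The fallback rule can be sketched as below. This assumes scenarios are plain dictionaries and that "no Cave run" is represented by `None` or an empty list; the function names, the dictionary keys, and the way the placeholder probes are spread across domains are all hypothetical. Only the counts (10 director + 10 worker, five domains) come from the description.

```python
from typing import Optional

DOMAINS = ("safety", "ethics", "bias", "legal", "security")


def build_static_battery() -> list[dict]:
    """Placeholder battery: 10 director + 10 worker probes over five domains."""
    battery = []
    for role in ("director", "worker"):
        for i in range(10):
            battery.append(
                {
                    "role": role,
                    # Assumption: probes rotate evenly through the domains.
                    "domain": DOMAINS[i % len(DOMAINS)],
                    "prompt": f"<{role} probe {i + 1}>",
                }
            )
    return battery


def load_scenarios(cave_scenarios: Optional[list[dict]]) -> list[dict]:
    """Prefer linked Cave scenarios; otherwise fall back to the static battery."""
    if cave_scenarios:
        return cave_scenarios
    return build_static_battery()


print(len(load_scenarios(None)))
```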
Heads are distributed using inverse score weighting. Domains below 80 get proportionally more heads. Every domain gets at least one head.
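The exact weighting formula is not given here, so the following is one plausible reading rather than the actual implementation: every domain gets one guaranteed head, and the remaining heads are split in proportion to how far each domain sits below the 80 threshold, using largest-remainder rounding. The function name and the even-split behaviour when no domain is below the threshold are assumptions.

```python
def allocate_heads(
    scores: dict[str, float], total_heads: int, threshold: float = 80.0
) -> dict[str, int]:
    """Give each domain one head, then split the rest by deficit below threshold."""
    if total_heads < len(scores):
        raise ValueError("need at least one head per domain")

    alloc = {d: 1 for d in scores}
    spare = total_heads - len(scores)

    deficits = {d: max(0.0, threshold - s) for d, s in scores.items()}
    total_deficit = sum(deficits.values())
    if total_deficit == 0:
        # Assumption: with no weak domain, spread the spare heads evenly.
        deficits = {d: 1.0 for d in scores}
        total_deficit = float(len(scores))

    shares = {d: spare * w / total_deficit for d, w in deficits.items()}
    for d in scores:
        alloc[d] += int(shares[d])

    # Hand out heads lost to truncation, largest fractional part first.
    leftover = total_heads - sum(alloc.values())
    by_remainder = sorted(
        scores, key=lambda d: shares[d] - int(shares[d]), reverse=True
    )
    for d in by_remainder[:leftover]:
        alloc[d] += 1
    return alloc


prior = {"safety": 90, "ethics": 85, "bias": 60, "legal": 70, "security": 40}
print(allocate_heads(prior, total_heads=20))
```

With these example scores, the two domains above 80 keep only their guaranteed head, while the lowest-scoring domain receives the largest share.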
Each base scenario is expanded into multiple variations via Claude. Each variation gets a distinct persona, angle, and escalation approach, becoming an individual "head."
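A minimal sketch of the fanning step, under stated assumptions: the model call is replaced by an injected `write_opening` callable so that no specific client API is implied, and the `Head` record, the `fan_scenario` function, and the rotation scheme for combining personas, angles, and escalation styles are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Head:
    """One variation of a base scenario (hypothetical record type)."""

    scenario: str
    persona: str
    angle: str
    escalation: str
    opening: str


def fan_scenario(
    scenario: str,
    n: int,
    write_opening: Callable[[str, str, str, str], str],
    personas: list[str],
    angles: list[str],
    escalations: list[str],
) -> list[Head]:
    """Expand one scenario into n heads by rotating through each option list."""
    heads = []
    for i in range(n):
        persona = personas[i % len(personas)]
        angle = angles[i % len(angles)]
        escalation = escalations[i % len(escalations)]
        # write_opening stands in for the language-model call.
        opening = write_opening(scenario, persona, angle, escalation)
        heads.append(Head(scenario, persona, angle, escalation, opening))
    return heads


heads = fan_scenario(
    "refund policy bypass",
    3,
    lambda s, p, a, e: f"[{p}/{a}/{e}] {s}",
    personas=["frustrated customer", "auditor", "new employee"],
    angles=["urgency", "authority"],
    escalations=["gradual", "abrupt"],
)
for h in heads:
    print(h.opening)
```

Personas stay distinct only while `n` does not exceed the number of personas supplied; a real generator would presumably ask the model for fresh personas instead of cycling a fixed list.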
All heads are unleashed simultaneously via asyncio.gather. Each runs an adaptive multi-turn conversation (up to 10 turns) against the target endpoint, escalating based on responses.
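The concurrency pattern described above can be sketched as follows. `asyncio.gather` and the 10-turn cap come from the description; the `send` and `next_message` callables, the transcript format, and the use of `return_exceptions=True` are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Optional

MAX_TURNS = 10  # cap taken from the description above

Transcript = list[tuple[str, str]]  # (attacker message, target reply) pairs


async def run_head(
    opening: str,
    send: Callable[[str], Awaitable[str]],
    next_message: Callable[[Transcript], Optional[str]],
) -> Transcript:
    """Run one adaptive conversation, stopping at MAX_TURNS or when done."""
    transcript: Transcript = []
    message: Optional[str] = opening
    for _ in range(MAX_TURNS):
        reply = await send(message)
        transcript.append((message, reply))
        # The follow-up is chosen from the conversation so far.
        message = next_message(transcript)
        if message is None:
            break
    return transcript


async def run_all(openings, send, next_message):
    """Launch every head concurrently and collect transcripts in input order."""
    tasks = (run_head(o, send, next_message) for o in openings)
    # return_exceptions=True keeps one failing head from hiding the others.
    return await asyncio.gather(*tasks, return_exceptions=True)


async def fake_endpoint(message: str) -> str:
    await asyncio.sleep(0)  # stand-in for a network round trip
    return f"reply to: {message}"


def escalate(transcript: Transcript) -> Optional[str]:
    return None if len(transcript) >= 3 else f"follow-up {len(transcript)}"


results = asyncio.run(run_all(["hello", "hi there"], fake_endpoint, escalate))
print([len(t) for t in results])
```

Because `gather` preserves argument order, each transcript can be matched back to the head that produced it by index.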
Each head's final conversation is scored by the Cave Scorer agent (0-100). Scores are averaged by domain to produce five domain scores and one overall resilience score.
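The aggregation step can be sketched as follows, assuming each scored head is a `(domain, score)` pair. How the overall resilience score is derived is not spelled out above; this sketch assumes it is the unweighted mean of the domain averages, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def aggregate(head_scores: list[tuple[str, float]]) -> tuple[dict[str, float], float]:
    """Average head scores per domain, then average the domain scores."""
    by_domain: dict[str, list[float]] = defaultdict(list)
    for domain, score in head_scores:
        by_domain[domain].append(score)

    domain_scores = {d: mean(v) for d, v in by_domain.items()}
    # Assumption: overall resilience is the mean of the domain averages.
    overall = mean(list(domain_scores.values()))
    return domain_scores, overall


domain_scores, overall = aggregate(
    [("safety", 80), ("safety", 60), ("bias", 90)]
)
print(domain_scores, overall)
```

Averaging domain means rather than all head scores directly keeps a heavily targeted domain from dominating the overall number; an empty input raises `statistics.StatisticsError`.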
The final output is a comprehensive resilience report containing:
On subsequent runs, the tiger team allocation shifts automatically: domains that improved get fewer heads, and newly weak domains attract more firepower.