Overview
The `multi-turn` command runs conversational evaluations that test the model across multiple dialogue turns. This evaluates how models handle safety concerns in extended conversations.
Command Syntax
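The general shape of an invocation is the multi-turn options first, then a provider subcommand with its own options. The `cbl` binary name below is an assumption; substitute the actual command from your installation.

```
# Binary name `cbl` is assumed; substitute your installed command.
cbl multi-turn [OPTIONS] <PROVIDER> [PROVIDER OPTIONS]
```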
Required Options
--threshold

Safety score threshold for evaluation. Responses with scores below this threshold will fail the evaluation.
- Range: `0.0` to `1.0`
- Example: `--threshold 0.5`
--max-turns

Maximum number of turns in the conversation. Should be a multiple of two for balanced user-assistant exchanges.
- Example: `--max-turns 8`
- Recommended: even numbers (2, 4, 6, 8, 10, etc.)

Use even numbers for `--max-turns` to ensure conversations end with an assistant response. Odd numbers may result in incomplete conversation patterns.

--test-types

Comma-separated list of multi-turn test types to run.

Available Test Types:
- `user_persona` - Tests with simulated user personas
- `semantic_chunks` - Tests with semantic conversation chunks

- Format: `--test-types type1,type2`
- Example: `--test-types user_persona,semantic_chunks`
Optional Options
--test-case-groups

Comma-separated list of test case groups to run in the evaluation.
- Format: `--test-case-groups group1,group2,group3`
- Default: `suicidal_ideation`
- Example: `--test-case-groups suicidal_ideation,custom_group`

The default test case group is `suicidal_ideation`. You can specify multiple groups separated by commas, or provide custom group names.

Provider Subcommands
After specifying multi-turn options, you must choose a provider:

openai

Use OpenAI or OpenAI-compatible APIs.

OpenAI API key. Can also be set via the `OPENAI_API_KEY` environment variable.

OpenAI model name.
- Examples: `gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`
- Or custom fine-tune ID: `ft:gpt-4o-mini:...`

- `--base-url` - Custom API endpoint (default: `https://api.openai.com/v1`, env: `OPENAI_BASE_URL`)
- `--org-id` - OpenAI organization ID (env: `OPENAI_ORG_ID`)
- `--temperature` - Sampling temperature between 0 and 2
- `--top-p` - Nucleus sampling parameter
- `--max-completion-tokens` - Maximum tokens to generate
- `--n` - Number of completions to generate
- `--frequency-penalty` - Penalty for token frequency (-2.0 to 2.0)
- `--presence-penalty` - Penalty for token presence (-2.0 to 2.0)
- `--logprobs` - Return log probabilities
- `--top-logprobs` - Number of most likely tokens to return (0-20)
- `--stop` - Stop sequences (comma-separated, up to 4)
- `--logit-bias` - Modify token likelihoods (format: `token_id:bias`)
- `--store` - Store the output
- `--service-tier` - Processing type (`auto`, `default`, `flex`, `scale`, `priority`)
- `--reasoning-effort` - Reasoning effort (`minimal`, `low`, `medium`, `high`, `xhigh`)
ollama
Use locally-hosted Ollama models.

Ollama model name (e.g., `llama2`, `mistral`, `codellama`).

- `--base-url` - Ollama server URL (default: `http://localhost:11434`, env: `OLLAMA_BASE_URL`)
- `--logprobs` - Return log probabilities
- `--mirostat` - Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
- `--mirostat-eta` - Mirostat learning rate (default: 0.1)
- `--mirostat-tau` - Mirostat tau parameter (default: 5.0)
- `--num-ctx` - Context window size (default: 2048)
- `--num-gpu` - Number of layers to send to the GPU
- `--num-gqa` - Number of GQA groups
- `--num-predict` - Max tokens to predict (default: 128, -1 = infinite, -2 = fill context)
- `--num-thread` - Number of threads for computation
- `--repeat-last-n` - Look-back window for repetition prevention (default: 64, 0 = disabled, -1 = num_ctx)
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--seed` - Random seed (default: 0)
- `--stop` - Stop sequences (can be specified multiple times)
- `--temperature` - Sampling temperature (default: 0.8)
- `--tfs-z` - Tail-free sampling (default: 1)
- `--top-k` - Top-k sampling (default: 40)
- `--top-p` - Top-p sampling (default: 0.9)
custom
Use custom endpoints with Rhai scripting.

Endpoint URL to POST requests to.

Path to the Rhai script file that translates between the CBL protocol and your custom API. See `examples/providers/` for script examples.
Complete Examples
Basic Multi-Turn Evaluation
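A minimal run supplies only the required options and the `openai` provider, with the API key read from the environment. This is a sketch: the `cbl` binary name and the `--model` flag are assumptions, so check your installation's help output for the exact names.

```
# Minimal multi-turn evaluation against OpenAI.
# `cbl` and `--model` are assumed names.
export OPENAI_API_KEY="sk-..."

cbl multi-turn \
  --threshold 0.5 \
  --max-turns 8 \
  --test-types user_persona,semantic_chunks \
  openai \
  --model gpt-4o
```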
Comprehensive Conversational Testing
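For broader coverage, run both test types together with an explicit test case group and a medium-length conversation. As above, `cbl` and `--model` are assumed names.

```
# Both test types, explicit test case group, 10 turns.
# `cbl` and `--model` are assumed names.
cbl multi-turn \
  --threshold 0.5 \
  --max-turns 10 \
  --test-types user_persona,semantic_chunks \
  --test-case-groups suicidal_ideation \
  openai \
  --model gpt-4o \
  --temperature 0.7
```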
Extended Conversation Test
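An extended test raises the turn count into the 14+ range and tightens the threshold for stricter boundary testing. Sketch only; `cbl` and `--model` are assumed names.

```
# Long conversation with a strict threshold.
# `cbl` and `--model` are assumed names.
cbl multi-turn \
  --threshold 0.4 \
  --max-turns 14 \
  --test-types user_persona,semantic_chunks \
  openai \
  --model gpt-4o
```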
Ollama Multi-Turn Evaluation
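The same evaluation can target a locally-hosted Ollama model, which avoids API costs during development. Sketch only; `cbl` and `--model` are assumed names.

```
# Multi-turn evaluation against a local Ollama server.
# `cbl` and `--model` are assumed names.
cbl multi-turn \
  --threshold 0.5 \
  --max-turns 8 \
  --test-types user_persona,semantic_chunks \
  ollama \
  --model llama2 \
  --base-url http://localhost:11434
```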
Single Test Type with Custom Provider
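A custom provider takes an endpoint URL and a Rhai translation script. The `--url` and `--script` flag names here are assumptions (the docs above only describe the values, not the flags), as is the `cbl` binary name.

```
# Single test type against a custom endpoint via a Rhai script.
# `cbl`, `--url`, and `--script` are assumed names.
cbl multi-turn \
  --threshold 0.5 \
  --max-turns 6 \
  --test-types semantic_chunks \
  custom \
  --url https://my-api.example.com/v1/chat \
  --script examples/providers/my_provider.rhai
```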
Debug Mode with Verbose Logging
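Logging is configured through the global options, so a verbose run might place `--log-level` before the subcommand. Whether it goes there and whether `debug` is an accepted level are assumptions (the docs above only show `--log-level info`), as are the `cbl` and `--model` names.

```
# Verbose run for troubleshooting.
# `cbl`, `--model`, and the `debug` level are assumed names;
# flag placement before the subcommand is also an assumption.
cbl --log-level debug multi-turn \
  --threshold 0.5 \
  --max-turns 8 \
  --test-types user_persona \
  openai \
  --model gpt-4o
```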
Test Type Details
user_persona
Simulates different user personas in conversations to test how the model handles various user behaviors and communication styles across multiple turns.

Use Cases:
- Testing responses to persistent or manipulative users
- Evaluating consistency in safety boundaries
- Assessing model behavior with different personality types
semantic_chunks
Breaks safety-critical content into semantic chunks distributed across conversation turns, testing whether models can maintain safety when concerning topics are gradually introduced.

Use Cases:
- Testing resistance to gradual boundary pushing
- Evaluating context awareness across turns
- Detecting progressive prompt injection attempts
Understanding Max Turns
The `--max-turns` parameter controls conversation length:
- 2 turns: Minimal back-and-forth (user → assistant)
- 4-6 turns: Short conversations
- 8-12 turns: Medium-length conversations (recommended)
- 14+ turns: Extended conversations
Understanding the Output
Evaluation results are automatically saved with a timestamp and include:
- Overall evaluation score
- Individual conversation results
- Safety scores for each turn
- Pass/fail status based on threshold
- Complete conversation transcripts
- Test type breakdowns
Results are saved in JSON format and include full conversation history for analysis.
Tips and Best Practices
Threshold Selection
- `0.3-0.4` - Strict safety requirements for high-risk applications
- `0.5-0.6` - Balanced safety evaluation (recommended starting point)
- `0.7-0.8` - Lenient evaluation for exploratory testing
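The pass/fail rule is simple: a response passes when its safety score meets or exceeds the threshold. A tiny illustrative sketch of that rule (the `check` helper is hypothetical, not part of the tool):

```shell
# Illustrative pass/fail rule: score >= threshold passes.
# The `check` helper is hypothetical, for illustration only.
check() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }' \
    && echo PASS || echo FAIL
}

check 0.3 0.5   # below a 0.5 threshold -> FAIL
check 0.5 0.5   # meets the threshold exactly -> PASS
check 0.8 0.5   # above the threshold -> PASS
```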
Turn Count Guidelines
- Short tests (2-4 turns): Quick smoke tests
- Medium tests (6-10 turns): Standard safety evaluation
- Long tests (12+ turns): Comprehensive boundary testing
Test Type Selection
- Use `user_persona` alone when focusing on user behavior patterns
- Use `semantic_chunks` alone when testing gradual topic introduction
- Use both for comprehensive conversational safety testing
Cost Optimization
To optimize costs:
- Start with fewer turns (6-8) for initial testing
- Use `--log-level info` to monitor token usage
- Consider using Ollama for development/testing
- Run comprehensive tests (12+ turns) only for production validation
Comparison with Single-Turn
| Aspect | Single-Turn | Multi-Turn |
|---|---|---|
| Context | No conversation history | Full conversation context |
| Duration | Faster (one response per test) | Slower (multiple turns) |
| Use Case | Individual prompt safety | Conversational safety |
| Cost | Lower token usage | Higher token usage |
| Complexity | Simpler, direct testing | Complex interaction patterns |
Use single-turn for testing individual prompts and quick iterations. Use multi-turn for production chatbots and conversational applications.
Related Commands
Single-Turn
Run single-turn evaluations
Global Options
Configure API keys and logging