Overview
The `multi-turn` command runs conversational evaluations that test the model across multiple dialogue turns. This evaluates how models handle safety concerns in extended conversations.
Command Syntax
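The general shape of an invocation is the multi-turn options first, then a provider subcommand with its own options. The `cbl` binary name below is an assumption; substitute the actual command from your installation.

```
# Binary name `cbl` is assumed; substitute your installed command.
cbl multi-turn [OPTIONS] <PROVIDER> [PROVIDER OPTIONS]
```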
Required Options
--threshold

Safety score threshold for evaluation. Responses with scores below this threshold will fail the evaluation.
- Range: `0.0` to `1.0`
- Example: `--threshold 0.5`
--max-turns

Maximum number of turns in the conversation. Should be a multiple of two for balanced user-assistant exchanges.
- Example: `--max-turns 8`
- Recommended: even numbers (2, 4, 6, 8, 10, etc.)

Use even numbers for `--max-turns` to ensure conversations end with an assistant response. Odd numbers may result in incomplete conversation patterns.

--test-types

Comma-separated list of multi-turn test types to run.

Available Test Types:
- `user_persona` - Tests with simulated user personas
- `semantic_chunks` - Tests with semantic conversation chunks

- Format: `--test-types type1,type2`
- Example: `--test-types user_persona,semantic_chunks`
Optional Options
--test-case-groups

Comma-separated list of test case groups to run in the evaluation.
- Format: `--test-case-groups group1,group2,group3`
- Default: `suicidal_ideation`
- Example: `--test-case-groups suicidal_ideation,custom_group`

The default test case group is `suicidal_ideation`. You can specify multiple groups separated by commas, or provide custom group names.

Provider Subcommands
After specifying multi-turn options, you must choose a provider:

openai

Use OpenAI or OpenAI-compatible APIs.

OpenAI API key. Can also be set via the `OPENAI_API_KEY` environment variable.

OpenAI model name.
- Examples: `gpt-4o`, `gpt-4-turbo`, `gpt-3.5-turbo`
- Or custom fine-tune ID: `ft:gpt-4o-mini:...`

- `--base-url` - Custom API endpoint (default: `https://api.openai.com/v1`, env: `OPENAI_BASE_URL`)
- `--org-id` - OpenAI organization ID (env: `OPENAI_ORG_ID`)
- `--temperature` - Sampling temperature between 0 and 2
- `--top-p` - Nucleus sampling parameter
- `--max-completion-tokens` - Maximum tokens to generate
- `--n` - Number of completions to generate
- `--frequency-penalty` - Penalty for token frequency (-2.0 to 2.0)
- `--presence-penalty` - Penalty for token presence (-2.0 to 2.0)
- `--logprobs` - Return log probabilities
- `--top-logprobs` - Number of most likely tokens to return (0-20)
- `--stop` - Stop sequences (comma-separated, up to 4)
- `--logit-bias` - Modify token likelihoods (format: `token_id:bias`)
- `--store` - Store the output
- `--service-tier` - Processing type (`auto`, `default`, `flex`, `scale`, `priority`)
- `--reasoning-effort` - Reasoning effort (`minimal`, `low`, `medium`, `high`, `xhigh`)
ollama
Use locally-hosted Ollama models.

Ollama model name (e.g., `llama2`, `mistral`, `codellama`).

- `--base-url` - Ollama server URL (default: `http://localhost:11434`, env: `OLLAMA_BASE_URL`)
- `--logprobs` - Return log probabilities
- `--mirostat` - Mirostat sampling mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
- `--mirostat-eta` - Mirostat learning rate (default: 0.1)
- `--mirostat-tau` - Mirostat tau parameter (default: 5.0)
- `--num-ctx` - Context window size (default: 2048)
- `--num-gpu` - Number of layers to send to the GPU
- `--num-gqa` - Number of GQA groups
- `--num-predict` - Max tokens to predict (default: 128, -1 = infinite, -2 = fill context)
- `--num-thread` - Number of threads for computation
- `--repeat-last-n` - Look-back window for repetition prevention (default: 64, 0 = disabled, -1 = num_ctx)
- `--repeat-penalty` - Repetition penalty (default: 1.1)
- `--seed` - Random seed (default: 0)
- `--stop` - Stop sequences (can be specified multiple times)
- `--temperature` - Sampling temperature (default: 0.8)
- `--tfs-z` - Tail-free sampling (default: 1)
- `--top-k` - Top-k sampling (default: 40)
- `--top-p` - Top-p sampling (default: 0.9)
custom
Use custom endpoints with Rhai scripting.

Endpoint URL to POST requests to.

Path to the Rhai script file that translates between the CBL protocol and your custom API. See `examples/providers/` for script examples.
Complete Examples
Basic Multi-Turn Evaluation
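A minimal run supplies only the required options and the `openai` provider, with the API key read from the environment. This is a sketch: the `cbl` binary name and the `--model` flag are assumptions, so check your installation's help output for the exact names.

```
# Minimal multi-turn evaluation against OpenAI.
# `cbl` and `--model` are assumed names.
export OPENAI_API_KEY="sk-..."

cbl multi-turn \
  --threshold 0.5 \
  --max-turns 8 \
  --test-types user_persona,semantic_chunks \
  openai \
  --model gpt-4o
```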
Comprehensive Conversational Testing
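For broader coverage, run both test types together with an explicit test case group and a medium-length conversation. As above, `cbl` and `--model` are assumed names.

```
# Both test types, explicit test case group, 10 turns.
# `cbl` and `--model` are assumed names.
cbl multi-turn \
  --threshold 0.5 \
  --max-turns 10 \
  --test-types user_persona,semantic_chunks \
  --test-case-groups suicidal_ideation \
  openai \
  --model gpt-4o \
  --temperature 0.7
```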
Extended Conversation Test
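An extended test raises the turn count into the 14+ range and tightens the threshold for stricter boundary testing. Sketch only; `cbl` and `--model` are assumed names.

```
# Long conversation with a strict threshold.
# `cbl` and `--model` are assumed names.
cbl multi-turn \
  --threshold 0.4 \
  --max-turns 14 \
  --test-types user_persona,semantic_chunks \
  openai \
  --model gpt-4o
```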
Ollama Multi-Turn Evaluation
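The same evaluation can target a locally-hosted Ollama model, which avoids API costs during development. Sketch only; `cbl` and `--model` are assumed names.

```
# Multi-turn evaluation against a local Ollama server.
# `cbl` and `--model` are assumed names.
cbl multi-turn \
  --threshold 0.5 \
  --max-turns 8 \
  --test-types user_persona,semantic_chunks \
  ollama \
  --model llama2 \
  --base-url http://localhost:11434
```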
Single Test Type with Custom Provider
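A custom provider takes an endpoint URL and a Rhai translation script. The `--url` and `--script` flag names here are assumptions (the docs above only describe the values, not the flags), as is the `cbl` binary name.

```
# Single test type against a custom endpoint via a Rhai script.
# `cbl`, `--url`, and `--script` are assumed names.
cbl multi-turn \
  --threshold 0.5 \
  --max-turns 6 \
  --test-types semantic_chunks \
  custom \
  --url https://my-api.example.com/v1/chat \
  --script examples/providers/my_provider.rhai
```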
Debug Mode with Verbose Logging
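Logging is configured through the global options, so a verbose run might place `--log-level` before the subcommand. Whether it goes there and whether `debug` is an accepted level are assumptions (the docs above only show `--log-level info`), as are the `cbl` and `--model` names.

```
# Verbose run for troubleshooting.
# `cbl`, `--model`, and the `debug` level are assumed names;
# flag placement before the subcommand is also an assumption.
cbl --log-level debug multi-turn \
  --threshold 0.5 \
  --max-turns 8 \
  --test-types user_persona \
  openai \
  --model gpt-4o
```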
Test Type Details
user_persona
Simulates different user personas in conversations to test how the model handles various user behaviors and communication styles across multiple turns.

Use Cases:
- Testing responses to persistent or manipulative users
- Evaluating consistency in safety boundaries
- Assessing model behavior with different personality types
semantic_chunks
Breaks safety-critical content into semantic chunks distributed across conversation turns, testing whether models can maintain safety when concerning topics are gradually introduced.

Use Cases:
- Testing resistance to gradual boundary pushing
- Evaluating context awareness across turns
- Detecting progressive prompt injection attempts
Understanding Max Turns
The `--max-turns` parameter controls conversation length:
- 2 turns: Minimal back-and-forth (user → assistant)
- 4-6 turns: Short conversations
- 8-12 turns: Medium-length conversations (recommended)
- 14+ turns: Extended conversations
Understanding the Output
Evaluation results are automatically saved with a timestamp and include:
- Overall evaluation score
- Individual conversation results
- Safety scores for each turn
- Pass/fail status based on threshold
- Complete conversation transcripts
- Test type breakdowns
Results are saved in JSON format and include full conversation history for analysis.
Tips and Best Practices
Threshold Selection
- `0.3-0.4` - Strict safety requirements for high-risk applications
- `0.5-0.6` - Balanced safety evaluation (recommended starting point)
- `0.7-0.8` - Lenient evaluation for exploratory testing
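The pass/fail rule is simple: a response passes when its safety score meets or exceeds the threshold. A tiny illustrative sketch of that rule (the `check` helper is hypothetical, not part of the tool):

```shell
# Illustrative pass/fail rule: score >= threshold passes.
# The `check` helper is hypothetical, for illustration only.
check() {
  awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }' \
    && echo PASS || echo FAIL
}

check 0.3 0.5   # below a 0.5 threshold -> FAIL
check 0.5 0.5   # meets the threshold exactly -> PASS
check 0.8 0.5   # above the threshold -> PASS
```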
Turn Count Guidelines
- Short tests (2-4 turns): Quick smoke tests
- Medium tests (6-10 turns): Standard safety evaluation
- Long tests (12+ turns): Comprehensive boundary testing
Test Type Selection
- Use `user_persona` alone when focusing on user behavior patterns
- Use `semantic_chunks` alone when testing gradual topic introduction
- Use both for comprehensive conversational safety testing
Cost Optimization
To optimize costs:
- Start with fewer turns (6-8) for initial testing
- Use `--log-level info` to monitor token usage
- Consider using Ollama for development/testing
- Run comprehensive tests (12+ turns) only for production validation
Comparison with Single-Turn
| Aspect | Single-Turn | Multi-Turn |
|---|---|---|
| Context | No conversation history | Full conversation context |
| Duration | Faster (one response per test) | Slower (multiple turns) |
| Use Case | Individual prompt safety | Conversational safety |
| Cost | Lower token usage | Higher token usage |
| Complexity | Simpler, direct testing | Complex interaction patterns |
Use single-turn for testing individual prompts and quick iterations. Use multi-turn for production chatbots and conversational applications.
Related Commands
Single-Turn
Run single-turn evaluations
Global Options
Configure API keys and logging