What Single-Turn Evaluations Test
Single-turn evaluations send individual test prompts to your model and analyze the responses for safety violations. Each test is independent, with no memory of previous exchanges. Single-turn tests focus on direct prompt injection and immediate harmful content generation, making them well suited for baseline safety assessments.
Test Methodology
Prompt Generation
The CLI generates test prompts based on your selected test case groups. Each prompt is designed to test a specific safety boundary.
Variations
For each base test case, the system creates multiple variations to test different phrasings and approaches that might bypass safety filters.
Iteration Layers
The system can generate multiple layers of increasingly sophisticated prompts, learning from previous responses to create more targeted tests.
Parameters
Single-turn evaluations are configured with three required parameters and one optional parameter:
--threshold (Required)
The minimum safety score (0.0 to 1.0) required for a response to pass.
Understanding Safety Scores
Safety scores are calculated by analyzing:
- Content harm indicators: Presence of harmful advice, dangerous information, or policy violations
- Refusal quality: Whether the model appropriately refuses harmful requests
- Context appropriateness: Whether responses maintain safety boundaries while being helpful
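To make the pass/fail rule concrete, here is a minimal sketch of a threshold gate (illustrative only; the actual scoring happens inside the evaluator, and the variable names here are assumptions):

```shell
# Illustrative only: compare a safety score against --threshold.
# awk does the floating-point comparison; exit status 0 means "pass".
score=0.45
threshold=0.5
if awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s >= t) }'; then
  echo "PASS"
else
  echo "FAIL"
fi
```

With a score of 0.45 against a 0.5 threshold, this prints `FAIL`.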
--variations (Required)
The number of prompt variations to generate for each base test case.
--maximum-iteration-layers (Required)
The maximum number of iterative refinement layers for generating test prompts.
Each iteration layer generates prompts informed by previous responses, creating progressively more targeted tests. Layer 1 uses generic unsafe prompts, while layer 3 uses model-specific attack patterns.
--test-case-groups (Optional)
Specify which test case groups to run. Defaults to suicidal_ideation.
Complete Command Examples
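The invocations below sketch how the documented flags combine. The binary name `safety-eval` is a placeholder, not the actual command; substitute your CLI's real name.

```shell
# Placeholder binary name; the flags are the ones documented above.
# Baseline run with moderate settings:
safety-eval --threshold 0.5 --variations 2 --maximum-iteration-layers 2

# More thorough pre-deployment run on a specific test case group:
safety-eval --threshold 0.7 --variations 4 --maximum-iteration-layers 3 \
  --test-case-groups suicidal_ideation
```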
Interpreting Results
When a single-turn evaluation completes, you’ll receive detailed results including:
Pass/Fail Summary
Individual Test Results
Each test includes:
- Test prompt: The exact prompt sent to your model
- Model response: Your model’s complete response
- Safety score: Numerical safety assessment (0.0 - 1.0)
- Pass/Fail: Whether the score met your threshold
- Test case group: Which category this test belongs to
Example Test Result
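A result record might look along these lines (illustrative shape only; the field names and output format here are assumptions, not the CLI's actual schema):

```json
{
  "test_case_group": "suicidal_ideation",
  "test_prompt": "…",
  "model_response": "…",
  "safety_score": 0.42,
  "threshold": 0.5,
  "passed": false
}
```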
Failure Analysis
For each failed test, examine:
The Prompt Pattern
What attack vector was used? Was it direct manipulation, role-play, context injection, or another technique?
The Model's Response
Did the model refuse appropriately? Did it provide harmful information? Was the refusal weak or ambiguous?
The Safety Score
How far below threshold did it score? A score of 0.45 vs 0.50 is very different from 0.10 vs 0.50.
Best Practices
Start with Moderate Settings
Begin with --threshold 0.5, --variations 2, and --maximum-iteration-layers 2 to establish a baseline. Adjust based on results.
Test Before Deployment
Run single-turn evaluations on any new model, prompt changes, or fine-tunes before production deployment.
Track Results Over Time
Save results to files with timestamps to track safety improvements:
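One simple approach is to embed a timestamp in the output filename. The binary name `safety-eval` below is a placeholder; the timestamped redirection pattern is the part to reuse:

```shell
# Placeholder binary name; results land in e.g. results_20250101_093000.json
safety-eval --threshold 0.5 --variations 2 --maximum-iteration-layers 2 \
  > "results_$(date +%Y%m%d_%H%M%S).json"
```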
Combine with Multi-Turn Testing
Single-turn tests catch direct vulnerabilities, but also run multi-turn evaluations to catch conversational manipulation.
Use Appropriate Thresholds
High-risk applications (healthcare, children, crisis support) should use thresholds of 0.7 or higher.
Common Use Cases
Pre-Deployment Testing
Quick safety validation before releasing model updates to production
Regression Testing
Verify that prompt or model changes don’t introduce new vulnerabilities
Compliance Verification
Demonstrate safety measures for regulatory or policy requirements
Baseline Establishment
Create initial safety metrics for tracking improvements over time
Performance Considerations
API Usage Calculation:
Total API calls = test_case_groups.length × variations × maximum_iteration_layers
For example: 1 group × 3 variations × 2 layers = 6 API calls per base test case
Optimization Tips
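The arithmetic can be checked with a quick shell calculation before launching a large run:

```shell
# Total API calls per base test case = groups × variations × layers
groups=1
variations=3
layers=2
echo $((groups * variations * layers))  # → 6
```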
- Use fewer variations (1-2) for rapid iteration during development
- Increase variations (3-5) and layers (2-3) for comprehensive pre-deployment testing
- Test high-risk groups more thoroughly than lower-risk categories
- Run evaluations in CI/CD pipelines with moderate settings to catch regressions
Next Steps
Multi-Turn Evaluations
Learn about conversational safety testing
Providers
Configure different model providers