Calibrations: Advanced Features

Team consensus and analytics reporting

Group Calibration Sessions

Group calibrations allow multiple calibrators to review conversations together and reach a consensus, creating a "gold standard" evaluation that can be compared against the AI.

Starting a Group Calibration

  1. From a calibration report, click Start Group Calibration


  2. Select the conversations you want to calibrate as a group using the checkboxes


  3. Click Done Selecting

Note: You can change the selected conversations later if needed by clicking "Change Conversations".

Completing a Group Calibration

Once started, the Group Calibration section appears at the top of the report.

Review Process

For each selected conversation:

  1. Expand the conversation in the accordion

  2. Review each playbook and criterion

  3. Reference the read-only evaluations shown for each calibrator and the AI, including any remarks they provided

  4. Select the consensus value using the Yes, No, or NA radio buttons

  5. Optionally add a remark to explain the consensus decision

Consensus Selection

For each criterion, you will see:

  • The AI's original evaluation (read-only)

  • Each calibrator's original evaluation and any remarks they left (read-only)

Below these, use the Yes / No / NA radio buttons to select the consensus value the team agrees is correct. The consensus is an independent selection; it does not have to match any individual calibrator or the AI.

You can also record an optional remark to document the reasoning behind the consensus decision.

Progress Tracking

Each conversation shows a checkmark icon when all criteria have been reviewed and consensus values selected.

Saving and Completing

  • Save Progress: Saves your current selections without completing the calibration (you can return later)

  • Complete Calibration: Finalizes the group calibration (only enabled when all conversations are complete)

Group Calibration Report View

After completing a group calibration, the report switches to a tabbed view:

Group Calibration Tab

Displays a table comparing each calibrator's and the AI's evaluations against the consensus value.

Calibrator Score summary (shown above the playbook tables):

A compact table showing the overall accuracy score for each calibrator and the AI across all criteria in the session. Scores are colour-coded:

  • Green: ≥ 90% agreement with consensus

  • Orange: 70–89% agreement

  • Red: < 70% agreement

Per-playbook tables show:

  • AI Column: The AI's original evaluation for each criterion

  • Calibrator Columns: Each calibrator's original evaluation

  • Consensus Column: The group's agreed-upon correct evaluation

  • Score Row: Per-criterion accuracy for the AI and each calibrator (percentage of conversations where their value matched the consensus)

Highlighted cells (red/pink background) indicate where a calibrator or the AI differed from the consensus.
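The platform calculates these scores for you. If it helps to see the arithmetic, the short Python sketch below reproduces it against a made-up data shape (evaluations keyed by conversation and criterion); none of the names here come from the product itself.

  # Hypothetical data shape: evaluations keyed by (conversation_id, criterion_id),
  # each holding "Yes", "No", or "NA". These names are assumptions for the example.
  from typing import Dict, Tuple

  Key = Tuple[str, str]  # (conversation_id, criterion_id)

  def agreement_score(evaluations: Dict[Key, str], consensus: Dict[Key, str]) -> float:
      """Percentage of consensus items where the evaluator's value matched."""
      if not consensus:
          return 0.0
      matches = sum(1 for key, value in consensus.items() if evaluations.get(key) == value)
      return 100.0 * matches / len(consensus)

  def score_band(score: float) -> str:
      """Map a score to the colour bands used in the Calibrator Score summary."""
      if score >= 90:
          return "green"
      if score >= 70:
          return "orange"
      return "red"

  # Example: one conversation, two criteria; the AI disagrees on one of them.
  consensus = {("conv-1", "greeting"): "Yes", ("conv-1", "resolution"): "No"}
  ai_values = {("conv-1", "greeting"): "Yes", ("conv-1", "resolution"): "Yes"}

  score = agreement_score(ai_values, consensus)
  print(score, score_band(score))  # 50.0 red

The cells highlighted in the report correspond to exactly those entries where an evaluator's value differs from the consensus value.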

Individual Calibration Tab

Shows the original individual calibration table with all calibrators' evaluations.

Using Group Calibration Insights

The completed group calibration provides valuable data on both AI and human calibrator performance:

Low scores on specific criteria may indicate:

  • The playbook criterion needs clearer definition

  • The AI needs retraining on edge cases

  • The criterion might be too subjective for consistent evaluation

  • A calibrator may benefit from additional training on a specific criterion

Use these insights to:

  1. Update playbook descriptions and examples

  2. Create test cases from consensus evaluations

  3. Train the AI on the consensus values

  4. Coach individual calibrators where their scores are consistently low

  5. Adjust scoring weights if needed

Analytics & Performance Tracking

Monitor your calibration program's effectiveness through the Analytics dashboard.

Accessing Calibration Analytics

Navigate to Analytics > QA Performance > Calibrations

Filtering Data

Use the Completed Date filter to analyze specific time periods:

  • Last 7 days

  • Last 30 days

  • Custom date ranges

Interpreting the Data

Healthy calibration programs typically show:

  • Consistent weekly or daily completion rates

  • Gradual increase in calibrator and AI scores over time (as playbooks improve)

  • High completion rates relative to scheduled sessions

Warning signs to watch for:

  • Multiple missed or overdue sessions

  • Declining or stagnant scores over time

  • A single calibrator completing all reviews (reviews should be distributed across the team)
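All of these signals are visible in the dashboard itself. If you also export session data for your own reporting, a rough check along the following lines can flag the same warning signs; the record fields used here (completed, due_date, score, completed_by) are assumptions for the example, not the product's export format.

  from datetime import date

  # Example exported session records (made-up data).
  sessions = [
      {"completed": True,  "due_date": date(2024, 5, 3),  "score": 88,   "completed_by": "alice"},
      {"completed": False, "due_date": date(2024, 5, 10), "score": None, "completed_by": None},
      {"completed": True,  "due_date": date(2024, 5, 17), "score": 84,   "completed_by": "alice"},
  ]

  warnings = []

  # Multiple missed or overdue sessions.
  missed = [s for s in sessions if not s["completed"] and s["due_date"] < date.today()]
  if len(missed) >= 2:
      warnings.append(f"{len(missed)} missed or overdue sessions")

  # Declining or stagnant scores over time.
  scores = [s["score"] for s in sessions if s["score"] is not None]
  if len(scores) >= 2 and scores[-1] <= scores[0]:
      warnings.append("scores are flat or declining")

  # All reviews completed by a single calibrator.
  reviewers = {s["completed_by"] for s in sessions if s["completed"]}
  if len(reviewers) == 1:
      warnings.append("a single calibrator is completing all reviews")

  print(warnings or ["no warning signs detected"])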

Best Practices

Starting Your Calibration Program

Week 1-2: Pilot Phase

  1. Add 2-3 calibrators initially

  2. Sample 5-10 conversations per session

  3. Run daily sessions to build momentum

  4. Focus on reviewing results together as a team

Week 3-4: Establish Rhythm

  1. Adjust sample size based on calibrator capacity

  2. Fine-tune frequency (daily vs. weekly)

  3. Begin identifying common variance patterns

  4. Start making playbook improvements based on findings

Month 2+: Optimize

  1. Run regular group calibrations for complex cases

  2. Build test case library from calibrated conversations

  3. Track score trends over time for AI and individual calibrators

  4. Expand calibrator team if needed

Calibrator Selection

Good calibrator candidates:

  • Deep understanding of your quality standards

  • Consistent, reliable judgment

  • Available time to complete reviews on schedule

  • Able to provide constructive feedback

Tips:

  • Start with 2-3 calibrators to keep sessions manageable

  • Rotate calibrators periodically to avoid bias

  • Ensure calibrators represent different teams/perspectives

Sampling Strategy

Balanced sampling considers:

  • Conversation scores: Include both high and low scoring conversations

  • Time of day: Morning, afternoon, and evening interactions

  • Channels: Calls, emails, and chats in proportion to your volume

  • Representatives: All active representatives unless excluded

The system automatically applies these weights, but you can adjust the sample size to focus more deeply on specific patterns.
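You don't need to build this yourself, but as a rough illustration of proportional sampling, the sketch below stratifies a conversation list by channel so each channel appears in the sample roughly in proportion to its volume. The field names are assumptions for the example; the platform's own sampler also balances conversation scores, time of day, and representatives.

  import random
  from collections import defaultdict

  def stratified_sample(conversations, sample_size):
      """Pick a sample whose channel mix mirrors the overall volume."""
      by_channel = defaultdict(list)
      for conv in conversations:
          by_channel[conv["channel"]].append(conv)

      total = len(conversations)
      sample = []
      for group in by_channel.values():
          # Allocate slots in proportion to this channel's share of volume.
          quota = max(1, round(sample_size * len(group) / total))
          sample.extend(random.sample(group, min(quota, len(group))))
      return sample[:sample_size]

  # Example volume: 60 calls, 30 chats, 10 emails.
  conversations = (
      [{"id": f"call-{i}",  "channel": "call"}  for i in range(60)]
      + [{"id": f"chat-{i}",  "channel": "chat"}  for i in range(30)]
      + [{"id": f"email-{i}", "channel": "email"} for i in range(10)]
  )

  picked = stratified_sample(conversations, 10)
  print([c["channel"] for c in picked])  # roughly 6 calls, 3 chats, 1 email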

Responding to Low Scores

When you see low scores for the AI or a calibrator:

1. Review the Specific Disagreements

  • Look at the highlighted cells (red background) to identify where values differed from consensus

  • Read the remarks left by calibrators and the remark recorded with the consensus to understand the reasoning

  • Look for patterns (same criterion, same type of conversation, etc.)

2. Diagnose the Root Cause

  • Is the criterion description unclear?

  • Is this an edge case not covered in the playbook?

  • Does a specific calibrator need training on this criterion?

  • Is the AI missing context?

3. Take Action

  • Unclear criteria: Rewrite playbook descriptions with specific examples

  • Edge cases: Document them in the playbook or add to test cases

  • Calibrator training needs: Share calibration results in team meetings and discuss disagreements

  • AI issues: Create test cases to train the AI on correct evaluations

4. Measure Improvement

  • Track scores over time for both the AI and individual calibrators

  • Expect gradual improvement as playbooks and calibrator training improve

  • Celebrate wins when scores consistently reach 90% or above

Group Calibration Best Practices

When to use group calibration:

  • Complex edge cases that need discussion

  • High-variance criteria needing clarification

  • Training new calibrators

  • Quarterly deep-dives into playbook accuracy

How to run effective sessions:

  1. Review individually first: Have calibrators complete individual reviews before the group session

  2. Focus discussion: Only discuss items with disagreement

  3. Document decisions: Use the consensus remark field to explain why a particular value was chosen

  4. Update playbooks: Immediately update criteria based on group discussions

  5. Create tests: Add consensus conversations to test suites

Frequency recommendations:

  • Monthly group calibrations for mature programs

  • Weekly during playbook development or major changes

  • Ad-hoc for specific training needs

Maintaining Momentum

Keep calibrators engaged:

  • Share insights from calibration data in team meetings

  • Recognize calibrators who complete reviews on time

  • Demonstrate playbook improvements driven by their feedback

  • Rotate focus areas (different playbooks, criteria) to maintain interest

Avoid calibration fatigue:

  • Don't over-sample (10-20 conversations per session is usually sufficient)

  • Vary the conversation types in samples

  • Provide adequate time for completion (3-7 days)

  • Automate reporting and recognition where possible

Continuous Improvement Cycle

  1. Calibrate: Run regular sessions per your schedule

  2. Analyze: Review variance metrics and specific disagreements

  3. Improve: Update playbooks, coach calibrators, adjust AI

  4. Validate: Create test cases to prevent regression

  5. Repeat: Continue the cycle with improving scores over time

Success metrics to track:

  • Calibrator and AI score trends (increasing over time)

  • Calibrator completion rates (consistently high)

  • Test case pass rates (improving over time)

  • QA team confidence scores (via surveys)
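For the "increasing over time" metrics, a simple per-session trend is usually enough to see whether scores are moving in the right direction. The sketch below uses example numbers rather than exported data.

  def trend(scores):
      """Average change per session across consecutive scores."""
      if len(scores) < 2:
          return 0.0
      deltas = [later - earlier for earlier, later in zip(scores, scores[1:])]
      return sum(deltas) / len(deltas)

  weekly_ai_scores = [78, 81, 85, 84, 88]
  print(f"{trend(weekly_ai_scores):+.1f} points per session")  # +2.5 points per session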