Agent Evaluation

MixedVoices lets you evaluate agents through simulated text-to-text conversations, so you can test your agent's behavior across different scenarios before deploying to production.

Quick Start

import mixedvoices as mv
from mixedvoices.metrics import empathy, Metric

# Create project with metrics
hangup_metric = Metric(
    name="call_hangup",
    definition="FAILS if the bot faces problems in ending the call",
    scoring="binary"
)
project = mv.create_project("dental_clinic", metrics=[empathy, hangup_metric])

# Create version
v1 = project.create_version(
    "v1", 
    prompt="You are a friendly dental receptionist...",
    metadata={"model": "gpt-4", "deployment_date": "2024-01-15"}
)

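# Generate test cases
# existing_conversation is a placeholder for a transcript you already have
test_generator = mv.TestCaseGenerator(v1.prompt)
# Parentheses allow the chained calls to span multiple lines
test_cases = (
    test_generator.add_from_transcripts([existing_conversation])
    .add_edge_cases(2)
    .add_from_descriptions(["An elderly patient", "A rushed parent"])
    .generate()
)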

# Create and run evaluator
# MyAgent is your own BaseAgent subclass (see Implementing Your Agent)
evaluator = project.create_evaluator(test_cases, metric_names=["empathy", "call_hangup"])
evaluator.run(v1, MyAgent, agent_starts=False)

Test Case Generation

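The TestCaseGenerator class provides several methods for building a diverse set of test cases, including add_from_transcripts, add_edge_cases, and add_from_descriptions. Call generate() to produce the final test cases.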
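
As a rough sketch, the generation methods can be combined behind a small helper. The helper name build_test_cases and its parameters are hypothetical and not part of MixedVoices. It uses only the generator methods that appear in the Quick Start, and it assumes each method can also be called as a separate statement rather than chained.

```python
from typing import Any, Sequence


def build_test_cases(
    prompt: str,
    transcripts: Sequence[str] = (),
    descriptions: Sequence[str] = (),
    edge_cases: int = 0,
) -> Any:
    """Hypothetical helper: mix several generation methods in one place."""
    # Imported lazily so this module can be loaded without the library installed.
    import mixedvoices as mv

    generator = mv.TestCaseGenerator(prompt)
    if transcripts:
        # Seed test cases from real conversations when they are available.
        generator.add_from_transcripts(list(transcripts))
    if edge_cases > 0:
        # Ask for a number of edge-case scenarios.
        generator.add_edge_cases(edge_cases)
    if descriptions:
        # Add cases described in plain language, e.g. "A rushed parent".
        generator.add_from_descriptions(list(descriptions))
    return generator.generate()
```

Keeping the mix in one function makes it easier to reuse the same blend of transcripts, edge cases, and descriptions across versions of a prompt.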

Implementing Your Agent

Create a class that inherits from BaseAgent, then pass that class to evaluator.run, as MyAgent is passed in the Quick Start.
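
The following is a minimal sketch of what such an agent might look like, not the library's definitive interface. To stay self-contained it defines a stand-in BaseAgent instead of importing the real one. The respond method name and its (reply, ended) return shape are assumptions inferred from the best practices on this page (handle empty input at conversation start, return an accurate end status), and the keyword-based ending logic is purely illustrative.

```python
from abc import ABC, abstractmethod
from typing import Tuple


class BaseAgent(ABC):
    """Stand-in for the library's BaseAgent; the real interface may differ."""

    @abstractmethod
    def respond(self, input_text: str) -> Tuple[str, bool]:
        """Return (reply, conversation_ended)."""


class MyAgent(BaseAgent):
    GREETING = "Hello, this is the dental clinic. How can I help you?"
    FAREWELL = "Thank you for calling. Goodbye!"
    # Hypothetical phrases that signal the caller wants to end the call.
    END_PHRASES = ("bye", "goodbye", "that's all")

    def respond(self, input_text: str) -> Tuple[str, bool]:
        text = input_text.strip().lower()
        if not text:
            # Empty input: the agent is opening the conversation.
            return self.GREETING, False
        if any(phrase in text for phrase in self.END_PHRASES):
            # Report the end status so the caller can stop the conversation.
            return self.FAREWELL, True
        return "Could you tell me a bit more about what you need?", False
```

In a real agent, the body of respond would call your model or dialogue system; the point of the sketch is that every turn returns both the reply and whether the conversation is over.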

Running Evaluations

Best Practices

  1. Test Case Diversity

    • Mix different generation methods

    • Include edge cases and failure scenarios

    • Use real conversation transcripts when available

  2. Agent Implementation

    • Handle empty input for conversation starts

    • Implement clear conversation ending logic

    • Return accurate conversation end status

  3. Metrics Selection

    • Choose metrics relevant to your use case

    • Balance binary and continuous metrics

    • Consider including custom metrics

  4. Evaluation Strategy

    • Run evaluations before deployment

    • Test specific conversation paths

    • Validate fixes for known issues

    • Monitor evaluation runs for completion
