NewRead the docs →
Beta · v0.1

Evaluate. Train. Repeat.
Continuously improve your agents.

Rollout closes the loop between testing an agent and actually making it better. Define what “good” looks like, run your agent in a real sandbox, see exactly where it breaks, and turn every passing run into training data — so each version is measurably better than the last.

Sign up free

Live preview

Traces, metrics, datasets, and optimizations
— under one workspace.

app.rollout.work/dashboard

Dashboard

Workspace overview for you@lab.org

Activity

Jun 16 · 15 traces

TracesEvents
JunSepDecMarJun

Metrics · since last login

P95 duration

Tail latency

1.1m

+1.1m (+100%)

Jun 8Jun 15

Traces

Trace volume

15

+15 (+100%)

Jun 8Jun 15

Success rate

Successful traces

100%

+100% (+100%)

Jun 8Jun 15

Errors

Failed traces

0

No change

Jun 8Jun 15

Cost

Observed spend

$0.00

No change

Jun 8Jun 15

Tokens

Token volume

0

No change

Jun 8Jun 15

Full observability

Every step, tool call, and token captured as a readable trace you can replay.

Verifiers

Deterministic checks, LLM judges, or your own function decide pass or fail.

Training data

Every passing rollout is a labeled trajectory — export it as fine-tuning or RL data.

The loop

Four steps, one feedback loop.

01 · Define

Define what “good” looks like.

Write your tasks once and attach the checks that decide pass or fail — a deterministic assertion, an LLM judge, or your own function. Version them like code so every run stays comparable.

Datasets

Task sets with deterministic, versioned exports.

Verifiers

Deterministic checks, LLM judges, or your own function.

Files

Attach reference artifacts to tasks and runs.

datasets.pyrollout
from rollout import Dataset, verifyds = Dataset("support-triage", version="a13d460")@ds.taskdef refund(env):    env.user("Refund order 4421")    return env.run(agent="triage-v3")@verify(refund)def refunded(trace):    # decide pass / fail however you like    return trace.tool("issue_refund").ok
02 · Run

Run your agent where it'll actually work.

Drop your agent into a real sandbox — shell, browser, filesystem — and roll it out across thousands of tasks in parallel. Isolated and reproducible, so what you measure is what your agent really did.

Environments

Compose shell, browser, and filesystem access.

Gallery

Browse and reuse community environments.

Playground

Try one rollout interactively before a batch.

Parallel

Roll out across thousands of tasks at once.

environment.pyrollout
from rollout import Environment, Shell, Browser, FileSystem# compose a sandbox from the resources your agent needsenv = Environment(    "research-box",    resources=[        Shell(image="python:3.13"),        Browser(headless=True),        FileSystem(mount="./workspace"),    ],)# roll out across the whole dataset, 32 at a timeds.rollout(env, agent="researcher", parallelism=32)
03 · Inspect

See exactly where it breaks.

Every step, tool call, and output is captured. Replay any run, diff two versions side by side, and find the one prompt change that broke task 47 — instead of guessing.

Traces

Every step, tool call, and output — captured.

Replay

Step back through any run, exactly as it happened.

Run diff

Compare two versions side by side.

Scores

Verifier results, pass rates, and failures.

inspect.pyrollout
from rollout import Runrun = Run.load("a13d460")          # any past rolloutprint(run.pass_rate, run.score)    # how did it do?# replay the failures, step by stepfor trace in run.failures():    trace.replay()                 # every tool call + output    trace.diff(run.baseline)       # what changed vs. last time
04 · Improve

Turn results into a better agent.

Filter to the runs that passed and export them as fine-tuning data or RL signal. A/B the new version against the old on the same tasks, and let the verifiers confirm it's genuinely better before you ship.

Training export

Export passing rollouts as SFT or RL data.

A/B compare

Pit two versions against the same tasks.

Versioning

Track every prompt, tool, and policy change.

Tools

Define the tools your agents can call.

improve.pyrollout
from rollout import Datasetds = Dataset("support-triage")# export the passing rollouts as fine-tuning datads.latest_run.passing().export("sft.jsonl")# did v4 actually beat v3? let the verifiers decidereport = ds.compare("triage-v3", "triage-v4")report.winner       # "triage-v4"  ·  +12% pass rate

Out of the box

From install to an optimized prompt, in one session.

zsh — rollout
$ pip install mv37-rollout
Successfully installed mv37-rollout 0.1.0
$ rollout login
Authenticated as you@lab.org
$ rollout datasets pull support-triage
Pulled support-triage · 1,000 tasks → harbor
$ rollout verifiers create refund.json
Created verifier refund · pass_threshold 1.0
$ rollout optimize run triage-prompt
GEPA · scoring candidates ............ done (32s)
best 0.91 · baseline 0.79 · +12% on holdout
promotion report → run a13d460

Roadmap.txt

What is shipping next.

Soon

  • Hosted RL training loops — point at a dataset, get a fine-tuned policy back
  • Semantic trace search and run-to-run diff
  • Verifier authoring from natural-language task descriptions

Later

  • Public gallery of community environments and datasets
  • Multi-agent rollouts with shared world state
  • On-prem and air-gapped deployments

From the lab

MV37 is an independent lab working toward AI that builds better AI.
Rollout is our first step: automating as much of the post-training pipeline as we can.

MV37

$ rollout init

Ready to roll out your first agent?

Jump straight in and run your first rollout — no setup call required.