Full observability
Every step, tool call, and token captured as a readable trace you can replay.
Rollout closes the loop between testing an agent and actually making it better. Define what “good” looks like, run your agent in a real sandbox, see exactly where it breaks, and turn every passing run into training data — so each version is measurably better than the last.
Live preview
Dashboard
Workspace overview for you@lab.org
Activity
Jun 16 · 15 traces
Metrics · since last login
P95 duration
Tail latency
1.1m
+1.1m (+100%)
Traces
Trace volume
15
+15 (+100%)
Success rate
Successful traces
100%
+100% (+100%)
Errors
Failed traces
0
No change
Cost
Observed spend
$0.00
No change
Tokens
Token volume
0
No change
Every step, tool call, and token captured as a readable trace you can replay.
Deterministic checks, LLM judges, or your own function decide pass or fail.
Every passing rollout is a labeled trajectory — export it as fine-tuning or RL data.
The loop
Write your tasks once and attach the checks that decide pass or fail — a deterministic assertion, an LLM judge, or your own function. Version them like code so every run stays comparable.
Datasets
Task sets with deterministic, versioned exports.
Verifiers
Deterministic checks, LLM judges, or your own function.
Files
Attach reference artifacts to tasks and runs.
from rollout import Dataset, verifyds = Dataset("support-triage", version="a13d460")@ds.taskdef refund(env): env.user("Refund order 4421") return env.run(agent="triage-v3")@verify(refund)def refunded(trace): # decide pass / fail however you like return trace.tool("issue_refund").okDrop your agent into a real sandbox — shell, browser, filesystem — and roll it out across thousands of tasks in parallel. Isolated and reproducible, so what you measure is what your agent really did.
Environments
Compose shell, browser, and filesystem access.
Gallery
Browse and reuse community environments.
Playground
Try one rollout interactively before a batch.
Parallel
Roll out across thousands of tasks at once.
from rollout import Environment, Shell, Browser, FileSystem# compose a sandbox from the resources your agent needsenv = Environment( "research-box", resources=[ Shell(image="python:3.13"), Browser(headless=True), FileSystem(mount="./workspace"), ],)# roll out across the whole dataset, 32 at a timeds.rollout(env, agent="researcher", parallelism=32)Every step, tool call, and output is captured. Replay any run, diff two versions side by side, and find the one prompt change that broke task 47 — instead of guessing.
Traces
Every step, tool call, and output — captured.
Replay
Step back through any run, exactly as it happened.
Run diff
Compare two versions side by side.
Scores
Verifier results, pass rates, and failures.
from rollout import Runrun = Run.load("a13d460") # any past rolloutprint(run.pass_rate, run.score) # how did it do?# replay the failures, step by stepfor trace in run.failures(): trace.replay() # every tool call + output trace.diff(run.baseline) # what changed vs. last timeFilter to the runs that passed and export them as fine-tuning data or RL signal. A/B the new version against the old on the same tasks, and let the verifiers confirm it's genuinely better before you ship.
Training export
Export passing rollouts as SFT or RL data.
A/B compare
Pit two versions against the same tasks.
Versioning
Track every prompt, tool, and policy change.
Tools
Define the tools your agents can call.
from rollout import Datasetds = Dataset("support-triage")# export the passing rollouts as fine-tuning datads.latest_run.passing().export("sft.jsonl")# did v4 actually beat v3? let the verifiers decidereport = ds.compare("triage-v3", "triage-v4")report.winner # "triage-v4" · +12% pass rateOut of the box
$ pip install mv37-rolloutSuccessfully installed mv37-rollout 0.1.0$ rollout login✓ Authenticated as you@lab.org$ rollout datasets pull support-triage✓ Pulled support-triage · 1,000 tasks → harbor$ rollout verifiers create refund.json✓ Created verifier refund · pass_threshold 1.0$ rollout optimize run triage-promptGEPA · scoring candidates ............ done (32s)best 0.91 · baseline 0.79 · +12% on holdout✓ promotion report → run a13d460
Roadmap.txt
Soon
Later
From the lab
MV37 is an independent lab working toward AI that builds better AI.
Rollout is our first step: automating as much of the post-training pipeline as we can.
$ rollout init
Jump straight in and run your first rollout — no setup call required.