AutoEval

Overview

AutoEval is a system that autonomously evaluates generalist robot policies in the real world 24/7, and its results correspond closely to ground-truth human-run evaluations. AutoEval requires minimal human intervention because it uses robust success classifiers and reset policies fine-tuned from foundation models. We currently provide open access to two AutoEval stations with WidowX robots and four different tasks. Users can submit policies through an online dashboard and get back a detailed evaluation report (success rates, videos, etc.) once the evaluation job is done. See the video below for more details.

Tasks

The currently publicly available tasks are:

Close the drawer
Open the drawer
Put the eggplant in the yellow basket
Put the eggplant in the blue sink

Note: These four tasks will be available until Jan 1, 2026. The tasks for 2026 are TBD.

Instructions for Evaluating Your Policy in our AutoEval Cells

Your policy should take as input a 256x256 RGB image, an 8-dimensional proprioceptive state of the robot (xyz, r_xyz, 0, gripper), where gripper is between 0 (fully closed) and 0.39 (fully open), and a language command as a string. Your policy should output a 7-dimensional end-effector delta action for the WidowX robot. We use blocking control to execute the policy action.
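
The input/output contract above can be checked locally before a policy is exposed to the evaluation server. The following is a minimal sketch and not part of the AutoEval codebase: the function names are hypothetical, and the height x width x channel image layout is an assumption, since only the 256x256 RGB size is specified.

```python
import math
from typing import Sequence, Tuple

IMAGE_SHAPE = (256, 256, 3)  # assumption: height x width x RGB channels
PROPRIO_DIM = 8              # (xyz, r_xyz, 0, gripper)
ACTION_DIM = 7               # end-effector delta action
GRIPPER_OPEN = 0.39          # fully open; 0 is fully closed


def validate_observation(image_shape: Tuple[int, ...],
                         proprio: Sequence[float],
                         instruction: str) -> None:
    """Raise ValueError if an incoming observation violates the contract."""
    if tuple(image_shape) != IMAGE_SHAPE:
        raise ValueError(f"expected image shape {IMAGE_SHAPE}, got {tuple(image_shape)}")
    if len(proprio) != PROPRIO_DIM:
        raise ValueError(f"expected {PROPRIO_DIM} proprio values, got {len(proprio)}")
    gripper = proprio[-1]  # the gripper reading is the last entry
    if not 0.0 <= gripper <= GRIPPER_OPEN:
        raise ValueError(f"gripper value {gripper} outside [0, {GRIPPER_OPEN}]")
    if not isinstance(instruction, str):
        raise ValueError("language command must be a string")


def validate_action(action: Sequence[float]) -> None:
    """Raise ValueError if a policy output is not a finite 7-dimensional action."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} action values, got {len(action)}")
    if not all(math.isfinite(a) for a in action):
        raise ValueError("action contains NaN or infinite values")
```

With a NumPy image, `validate_observation(image.shape, proprio, instruction)` would be called on each query and `validate_action(action)` on each output. Real gripper readings may fall slightly outside the nominal range, so a small tolerance may be appropriate.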

1. Set up a policy server that outputs actions when queried with an observation (256x256 RGB image), proprioceptive state, and language command. Follow this template and fill out the two TODOs to create your own policy server. In short, you need to implement the model-loading logic and the model.call() logic. This should take 5-10 minutes.
   - If you would like to quickly try out AutoEval, you can use our pre-made policy server for OpenVLA to submit a job right away.
2. Keep your policy server up and running (e.g. in a tmux session) with python policy_server.py. This serves your policy on your localhost.
3. Make sure your policy server's IP and port are publicly accessible. After Step 2, you can expose your localhost server to the public internet with ngrok or bore. Both can be installed in about a minute (see the quick installation for bore here); ngrok limits the number of calls for free users, while bore is open-source and completely free.

   With ngrok:
       ngrok http 8000
   With bore.pub:
       bore local 8000 --to bore.pub
   Whichever command you use should also be kept running (e.g. in a tmux session). Note: if you cannot resolve bore.pub's DNS (e.g. if ping bore.pub fails), you can use its IP address directly:
       bore local 8000 --to 159.223.171.199
4. [RECOMMENDED] We provide two resources to help you debug and sanity-check your policy before running the full real-world evaluation:
   - You can use our SIMPLER environments to make sure your policy outputs reasonable actions in simulation. Check out the instructions here.
   - You can compute the action MSE between your policy's outputs and the ground-truth actions on Bridge V2 trajectories. We provide a script for this "MSE sanity check" here. Run it with python test_policy_client_mse.py --ip SERVER_IP --port SERVER_PORT
5. Go to the AutoEval Dashboard at the top of the page, enter the publicly accessible IP and port of your policy server, and choose your eval task and other options. Submit your job!
6. Monitor your job's progress in the dashboard, and watch it execute in real time in the live feed.
7. Once the eval is finished, you can access the evaluation report through the wandb link on the dashboard. The evaluation report (example here) includes success rates, videos, and other detailed metrics on the evaluation job.
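
For intuition about the "MSE sanity check" in the recommended step above, the sketch below computes a mean squared error between predicted and ground-truth actions over one trajectory. It is an illustration only: the function name is hypothetical, and the provided test_policy_client_mse.py script may aggregate the error differently (for example per action dimension).

```python
from typing import Sequence


def action_mse(predicted: Sequence[Sequence[float]],
               ground_truth: Sequence[Sequence[float]]) -> float:
    """Mean squared error over all timesteps and action dimensions."""
    if len(predicted) != len(ground_truth) or len(predicted) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    total = 0.0
    count = 0
    for pred_step, true_step in zip(predicted, ground_truth):
        if len(pred_step) != len(true_step):
            raise ValueError("action dimensions do not match")
        for p, t in zip(pred_step, true_step):
            total += (p - t) ** 2
            count += 1
    return total / count
```

An error that is much larger than the typical magnitude of the ground-truth actions often points to a mismatch in action normalization or dimension ordering rather than a weak policy.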

Please refer to the FAQ section at the bottom of the page if you run into any issues.

Evaluation Report

The evaluation report includes various metrics and visualizations to help you understand your policy's performance (see example here). It is hosted publicly on Weights & Biases. Some important metrics include:

Success rate statistics:

overall_success_rate: average success rate over the entire evaluation job
recent_success_rate: success rate over the most recent 20 episodes in the evaluation job.
episode_success: whether each episode succeeded (1) or failed (0).
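
As an illustration of how the two success rates relate to the per-episode signal, here is a minimal sketch. The function names are hypothetical, and averaging over fewer episodes when the job has not yet reached 20 is an assumption about how the report behaves, not a documented detail.

```python
from typing import Sequence


def overall_success_rate(episode_success: Sequence[int]) -> float:
    """Average of the 0/1 episode outcomes over the whole job."""
    if not episode_success:
        raise ValueError("no episodes recorded")
    return sum(episode_success) / len(episode_success)


def recent_success_rate(episode_success: Sequence[int], window: int = 20) -> float:
    """Average of the 0/1 outcomes over the most recent `window` episodes."""
    if not episode_success:
        raise ValueError("no episodes recorded")
    recent = episode_success[-window:]  # assumption: use all episodes if fewer than `window`
    return sum(recent) / len(recent)
```

For a job whose first 10 episodes fail and whose next 20 succeed, the overall rate is 20/30 while the recent rate is 1.0, which is why the two numbers can diverge over a long job.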

Visualization of evaluation:

video: low resolution video of the trajectory
frames: a film strip of the episode instead of a full video
initial_and_final_frames: periodic snapshots of the initial and final frames of an episode, visualized alongside success signal.

Other metrics

FAQ

Q: Does AutoEval support evaluating policies with action chunking and observation history?
A: Yes, but the policy server needs to keep track of the state itself (e.g. observation history, pending action chunks). We provide an example of such a policy server here.
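
The kind of state such a server has to hold can be sketched as follows. This is not the linked example server: the class name, the `predict_chunk` callable, and the `history_len` and `execute_horizon` parameters are all hypothetical, and how episode boundaries are signalled to the server is left open here.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Sequence

Action = Sequence[float]


class ChunkedPolicyState:
    """Keeps observation history and a queue of not-yet-executed actions."""

    def __init__(self,
                 predict_chunk: Callable[[List[Any], str], Sequence[Action]],
                 history_len: int = 2,
                 execute_horizon: int = 4) -> None:
        self.predict_chunk = predict_chunk        # hypothetical model call
        self.history: Deque[Any] = deque(maxlen=history_len)
        self.pending: Deque[Action] = deque()
        self.execute_horizon = execute_horizon

    def reset(self) -> None:
        """Clear all state; call this at the start of each episode."""
        self.history.clear()
        self.pending.clear()

    def act(self, observation: Any, instruction: str) -> Action:
        """Return one action per query, re-planning only when the queue is empty."""
        self.history.append(observation)
        if not self.pending:
            chunk = self.predict_chunk(list(self.history), instruction)
            self.pending.extend(chunk[:self.execute_horizon])
            if not self.pending:
                raise RuntimeError("policy returned an empty action chunk")
        return self.pending.popleft()
```

Because each query returns a single action, the server only calls the model once every `execute_horizon` queries and replays the queued actions in between.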

Q: Why did I get the "Exception: Failed to connect to action server" error?
A: This error occurs because the AutoEval server cannot connect to the robot action server (through agentlace). Sometimes re-submitting the job to the same robot will fix the issue. If two consecutive jobs fail, the robot is likely down; please contact the authors with your job ID.

Q: Will the motors overheat from continuous evals?
A: We rest the robot for 20 minutes every 6 hours. If your job is not picked up even though there are no jobs before it, the robot might be in a resting period.

Q: Can I evaluate goal-conditioned policies?
A: Yes. Even though the AutoEval server will query your policy with the language command, your policy server can ignore the language command and use the appropriate goal image instead. We provide the goal images for the four tasks in the GitHub repo.
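
One way a goal-conditioned policy server might pick its goal image is to look it up from the incoming language command. The sketch below is illustrative only: the file paths are hypothetical placeholders, and the exact command strings sent by the AutoEval server are assumed to match the task names listed above.

```python
# Hypothetical paths; point these at wherever you store the goal images.
GOAL_IMAGES = {
    "close the drawer": "goal_images/close_drawer.png",
    "open the drawer": "goal_images/open_drawer.png",
    "put the eggplant in the yellow basket": "goal_images/eggplant_basket.png",
    "put the eggplant in the blue sink": "goal_images/eggplant_sink.png",
}


def goal_image_for(instruction: str) -> str:
    """Map a language command to a goal image path, ignoring case and spacing."""
    key = " ".join(instruction.lower().split()).rstrip(".")
    if key not in GOAL_IMAGES:
        raise KeyError(f"no goal image registered for {instruction!r}")
    return GOAL_IMAGES[key]
```

The server would then load the returned image once per task and pass it to the policy in place of the language command.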

Q: Is the evaluation data logged and available?
A: Yes. All evaluation data (e.g. observations, actions, success) is logged at full resolution and available as a Hugging Face dataset. To find the directory for your job, open the wandb eval report and go to logs; the directory is printed at the bottom.

If you encounter problems not described here, please contact the authors or reach out on the Discord channel.

Video

BibTeX

@article{zhou2025autoeval,
    title={AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World},
    author={Zhou, Zhiyuan and Atreya, Pranav and Tan, You Liang and Pertsch, Karl and Levine, Sergey},
    journal={arXiv preprint arXiv:2503.24278},
    year={2025},
}