AUTOEVAL: AUTONOMOUS EVALUATION OF GENERALIST ROBOT MANIPULATION POLICIES IN THE REAL WORLD

UC Berkeley · NVIDIA

Overview

AutoEval is a system that autonomously evaluates generalist robot policies in the real world 24/7, and its evaluation results correspond closely to ground-truth human-run evaluations. AutoEval requires minimal human intervention by using robust success classifiers and reset policies fine-tuned from foundation models. We currently provide open access to two AutoEval stations with WidowX robots and four different tasks. Users can submit policies through an online dashboard and get back a detailed evaluation report (success rates, videos, etc.) once the evaluation job is done. See the video below for more details.

AutoEval Teaser

Tasks

[Task scenes: Drawer Scene, Sink Scene]

The currently publicly available tasks are:

Note: These four tasks will be available until Jan 1, 2026. The task set for 2026 is TBD.

Instructions for Evaluating Your Policy in our AutoEval Cells

Your policy should take as input a 256x256 RGB image, an 8-dimensional proprioceptive state of the robot (xyz, r_xyz, 0, gripper), where the gripper value ranges from 0 (fully closed) to 0.39 (fully open), and a language command as a string. Your policy should output a 7-dimensional end-effector delta action for the WidowX robot. We use blocking control to execute the policy's actions.
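Below is a minimal sketch of this interface in Python. It is illustrative only: the function name predict_action and the zero-action placeholder are not part of AutoEval, and the policy server template linked in Step 1 defines the actual contract.

    import numpy as np

    def predict_action(image: np.ndarray, proprio: np.ndarray, instruction: str) -> np.ndarray:
        """Map a single observation to a single action.

        image:       (256, 256, 3) uint8 RGB frame
        proprio:     (8,) state [x, y, z, r_x, r_y, r_z, 0, gripper],
                     with gripper in [0 (fully closed), 0.39 (fully open)]
        instruction: language command, e.g. "open the drawer"
        returns:     (7,) end-effector delta action for the WidowX
        """
        assert image.shape == (256, 256, 3) and proprio.shape == (8,)
        # Placeholder: a real policy runs model inference here.
        return np.zeros(7, dtype=np.float32)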

  1. Set up a policy server that returns an action when queried with an observation (256 x 256 RGB image), proprioceptive state, and language command. Follow this template and fill in the two TODOs to create your own policy server: the model-loading logic and the model.call() logic. This should take you 5-10 minutes; a minimal sketch of such a server is shown after this list.
    • If you would like to quickly try out AutoEval, you can use our pre-made policy server for OpenVLA to submit a job right away.
  2. Keep your policy server running (e.g. in a tmux session) with python policy_server.py. This serves the policy on your localhost.
  3. Make sure your policy server's IP & port are publicly accessible. After Step 2, you can expose your localhost to a public server with ngrok or bore. Both can be installed in ~1 min (see the quick installation for bore here); ngrok limits the number of calls for free users, while bore is open-source and completely free.

    With ngrok:

    ngrok http 8000

    With bore.pub:

    bore local 8000 --to bore.pub
    Either of these commands should also be kept running persistently (e.g. in a tmux session). Note: if you aren't able to resolve bore.pub's DNS (e.g. if ping bore.pub fails), you can use their actual IP:
    bore local 8000 --to 159.223.171.199
  4. [RECOMMENDED] We provide two resources to help debug and sanity check your policy before doing the full real-world evaluation:
    • You can use our SIMPLER environments and make sure your policy outputs reasonable actions in simulated environments. Check out the instructions here.
    • You can compute the MSE between your policy's output actions and the ground-truth actions on Bridge V2 trajectories. We provide a script for this "MSE sanity check" here. You can run the sanity check with python test_policy_client_mse.py --ip SERVER_IP --port SERVER_PORT
  5. Go to the AutoEval Dashboard at the top of the page, enter the publicly accessible IP & port of your policy server, and choose your eval task and other options. Submit your job!
  6. Monitor your job's progress in the dashboard, and watch it execute in real time under the live feed.
  7. Once the eval is finished, you can access the evaluation report through the wandb link on the dashboard. The evaluation report (example here) includes success rates, videos, and other detailed metrics on the evaluation job.
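For reference, here is a minimal sketch of the kind of policy server Step 1 asks for, written as a plain Flask endpoint. The route name and JSON field names (image, proprio, instruction, action) are assumptions made for illustration; the actual template in the repo defines the exact request/response format and should be used for real submissions.

    import numpy as np
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    # TODO 1 (model loading): load your checkpoint here, e.g.
    # model = MyPolicy.load("path/to/checkpoint")
    model = None

    @app.route("/act", methods=["POST"])
    def act():
        payload = request.get_json()
        image = np.asarray(payload["image"], dtype=np.uint8)        # (256, 256, 3)
        proprio = np.asarray(payload["proprio"], dtype=np.float32)  # (8,)
        instruction = payload["instruction"]                        # str

        # TODO 2 (model call): replace the zero action with real inference,
        # e.g. action = model(image, proprio, instruction)
        action = np.zeros(7, dtype=np.float32)
        return jsonify({"action": action.tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8000)  # match the port you expose in Step 3

Once a server like this is running, the ngrok or bore command from Step 3 makes port 8000 reachable from the AutoEval cells.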

Please refer to the FAQ section at the bottom of the page if you run into any issues.

Evaluation Report

The evaluation report includes various metrics and visualizations to help you understand your policy's performance (see example here). It is hosted publicly on Weights & Biases. Important metrics include success rates, rollout videos, and other detailed statistics, as shown in the example below.

Evaluation Report Example

FAQ

  1. Q: Does AutoEval support evaluating policies with action chunking and observation history?
    A: Yes, but the policy server needs to keep track of this state (e.g. observation history, unexecuted action chunks) itself. We provide an example of such a policy server here; a minimal sketch is also shown after this FAQ.

  2. Q: Why did I get the "Exception: Failed to connect to action server" error?
    A: This error occurs because the AutoEval server cannot connect to the robot action server (through agentlace). Sometimes re-submitting the job to the same robot will fix the issue. If two consecutive jobs fail, the robot is likely down; please contact the authors with your job ID.

  3. Q: Will the motors overheat from continuous evals?
    A: We put the robot to rest for 20 minutes every 6 hours, so if your job doesn't start even though there are no jobs queued ahead of it, the robot may be in its resting period.

  4. Q: Can I evaluate goal-conditioned policies?
    A: Yes. Even though the AutoEval server queries your policy with the language command, your policy server can ignore it and use the appropriate goal image instead. We provide the goal images for the four tasks in the GitHub repo.

  5. Q: Are the evaluation data logged and available?
    A: Yes. All evaluation data (e.g. observations, actions, success labels) are logged at full resolution and available in a Hugging Face dataset. You can find the directory for your job in the wandb eval report under logs; it is printed at the bottom.
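For FAQ 1, the sketch below shows one way to keep the required state on the policy-server side: buffer the predicted action chunk and only re-query the model when the buffer is empty. The model interface (predict_chunk) and the history length are hypothetical; the example server linked above shows a complete version.

    from collections import deque
    import numpy as np

    class ChunkedPolicyWrapper:
        """Wraps a chunk-predicting policy so each call returns a single action."""

        def __init__(self, model, history_len: int = 2):
            self.model = model
            self.action_queue = deque()                    # unexecuted actions from the last chunk
            self.obs_history = deque(maxlen=history_len)   # most recent observations

        def __call__(self, image, proprio, instruction) -> np.ndarray:
            self.obs_history.append((image, proprio))
            if not self.action_queue:
                # Hypothetical model call returning a (chunk_size, 7) array of actions.
                chunk = self.model.predict_chunk(list(self.obs_history), instruction)
                self.action_queue.extend(np.asarray(chunk))
            return self.action_queue.popleft()

        def reset(self):
            # Clear buffered state between episodes so it does not leak across rollouts.
            self.action_queue.clear()
            self.obs_history.clear()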

If you encounter problems not described here, please contact the authors or reach out on the Discord channel.

Video

BibTeX

@article{zhou2025autoeval,
    title={AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World},
    author={Zhou, Zhiyuan and Atreya, Pranav and Tan, You Liang and Pertsch, Karl and Levine, Sergey},
    journal={arXiv preprint arXiv:2503.24278},
    year={2025},
}