Automatic benchmarking of large multimodal models via iterative experiment programming
Evaluating a large multimodal model still requires writing custom experimental code for each research question — a high barrier for rapid iteration. APEx automates the full pipeline: given a question in natural language, it generates and executes experiments through a pre-built tool library, compiles a running report, and iterates until the evidence is sufficient to conclude. The resulting reports reproduce established findings while also enabling novel analyses without writing a single line of evaluation code.
1 University of Trento
2 Fondazione Bruno Kessler (FBK)
01 The problem / 問
Evaluating large multimodal models is inherently a research activity: each new hypothesis — does the model understand spatial reasoning? how does it perform on fine-grained domains? — requires designing a specific experimental protocol, selecting appropriate datasets, writing evaluation code, running the model, and interpreting the results. This process is manual, time-consuming, and rarely reusable, creating a high barrier for rapid iteration and systematic analysis.
APEx (Automatic Programming of Experiments) addresses this bottleneck by automating the full benchmarking pipeline from a natural language research question. The system treats evaluation as an iterative scientific investigation: it formulates experiments, executes them using a pre-built tool library, observes results, decides whether the evidence is sufficient, and continues probing or concludes accordingly. The goal is to allow researchers to pose questions about a model’s capabilities in plain language and receive a structured scientific report in return, without writing a single line of evaluation code.
02 The approach / 法
APEx is built around a loop between an LLM planner and a modular tool library. Given a research question, the planner generates an initial set of experiments — each experiment is a specific invocation of one or more tools from the library (e.g., evaluate model X on dataset Y using metric Z). The tool library encapsulates access to benchmark datasets, model inference wrappers, and standard evaluation metrics, providing a well-defined programmatic interface that the LLM can call without needing to produce arbitrary code.
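The tool-library interface described above can be sketched as a simple registry of typed, documented callables that the planner invokes by name. This is a minimal illustrative sketch, not the paper's actual API: the names `Tool`, `register`, `TOOL_REGISTRY`, and `run_experiment` are all assumptions introduced here for exposition.

```python
# Hypothetical sketch of a planner-facing tool library.
# All names below are illustrative assumptions, not APEx's real interface.
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Tool:
    """One entry in the tool library."""
    name: str
    description: str          # natural-language description read by the LLM planner
    fn: Callable[..., Any]    # the underlying implementation

TOOL_REGISTRY: Dict[str, Tool] = {}

def register(name: str, description: str):
    """Decorator that exposes a function to the planner under a fixed name."""
    def wrap(fn):
        TOOL_REGISTRY[name] = Tool(name, description, fn)
        return fn
    return wrap

@register("evaluate", "Evaluate a model on a dataset with a metric.")
def evaluate(model: str, dataset: str, metric: str) -> dict:
    # Placeholder: a real tool would run inference and score the predictions.
    return {"model": model, "dataset": dataset, "metric": metric, "score": None}

def run_experiment(call: dict) -> Any:
    """Execute one planner-issued invocation, e.g.
    {"tool": "evaluate", "args": {"model": "X", "dataset": "Y", "metric": "Z"}}."""
    tool = TOOL_REGISTRY[call["tool"]]
    return tool.fn(**call["args"])
```

Because every experiment is a structured call against this fixed interface, the planner never has to emit arbitrary code, and adding a new `@register`-ed function immediately widens the space of experiments it can propose.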
After each round of experiments, APEx compiles the results into a running scientific report. The planner reads the current state of this report and decides whether the evidence gathered is sufficient to answer the original question, or whether additional experiments — on different datasets, with different prompts, or under different settings — are warranted. This iterative, report-driven loop continues until the planner judges the investigation complete. The final report is then written in natural language, summarizing findings and conclusions for the user. The framework is designed to be extensible: adding a new dataset or evaluation tool to the library immediately makes it available to future APEx runs.
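The report-driven loop above can be summarized in a few lines. This is a sketch under stated assumptions: `plan_experiments` and `is_sufficient` stand in for the LLM planner's decisions, and `run_experiment` for tool execution; none of these names come from the paper.

```python
# Illustrative sketch of the iterative, report-driven evaluation loop.
# plan_experiments, run_experiment, and is_sufficient are hypothetical
# stand-ins for the LLM planner and tool library described in the text.
def apex_loop(question, plan_experiments, run_experiment, is_sufficient,
              max_rounds=5):
    """Iterate: plan experiments -> execute -> extend report -> check sufficiency."""
    report = []  # running record of (experiment, result) entries
    for _ in range(max_rounds):
        # The planner reads the current report before proposing new experiments.
        experiments = plan_experiments(question, report)
        for exp in experiments:
            report.append((exp, run_experiment(exp)))
        # The planner judges whether the accumulated evidence answers the question.
        if is_sufficient(question, report):
            break
    return report  # summarized into a natural-language report downstream
```

The `max_rounds` cap is a safeguard added here for illustration; the essential control flow is that the stopping decision is made by the planner from the report itself, not by a fixed experiment schedule.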
03 Results / 験
Datasets
We evaluate APEx by tasking it with reproducing the findings of established evaluation studies of large multimodal models (LMMs), which provide a ground-truth reference against which the automatically generated reports can be assessed. APEx is given only the research questions from these studies — not the experimental protocols — and its independently derived conclusions are compared to the published results.
Quantitative results
APEx successfully reproduces the core findings of the target studies, correctly identifying the relative strengths and weaknesses of evaluated LMMs across multiple capability dimensions including visual question answering, captioning quality, and fine-grained recognition. Beyond reproduction, APEx also demonstrates the ability to conduct novel analyses — exploring hypothesis subspaces and ablating factors not examined in the original studies — without any manual protocol design. These results validate the framework’s core premise: that structured tool access and iterative LLM planning are sufficient to automate meaningful scientific benchmarking of large multimodal models.