m&ms

Evaluating Tool-Use Agents for multi-step multi-modal Tasks

1University of Washington, 2Allen Institute for AI

m&ms contains diverse user queries spanning three modalities (text, image, and audio), each paired with a human-verified plan that consists of 1-3 tools across three categories: multi-modal machine learning models, public APIs, and image processing modules.

Abstract

Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning?

To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal ML models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable.

m&ms Dataset

Overview

Overall, m&ms contains a large set of diverse, ecologically valid task queries. Each task is associated with a human-verified and executable plan. Concretely, there are 4,427 raw examples in m&ms, of which 1,565 have been verified as correct by three human annotators. After additional filtering, we select a subset of 882 examples for evaluation. m&ms tasks span a range of difficulty: 70 queries require a single tool, 159 require two tools, and 653 require three tools. In terms of tools, there are 33 unique tools across three categories: 13 multi-modal machine learning models from HuggingFace, 11 image processing modules from VisProg, and 9 free public APIs from RapidAPI. Our final dataset includes 317 representative tool sequences, where each sequence consists of a mix of tools across categories and maps to multiple unique queries. See a summary of the dataset statistics below.


Dataset statistics of m&ms.

You can download the different splits of the m&ms dataset via HuggingFace Datasets.
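Below is a minimal loading sketch using the HuggingFace datasets library. The dataset ID, split name, and field names are assumptions for illustration only; please use the identifiers listed on the HuggingFace dataset page.

# Minimal sketch of loading m&ms with the HuggingFace `datasets` library.
# NOTE: the dataset ID, split name, and field names below are assumptions;
# substitute the ones listed on the HuggingFace dataset page.
from datasets import load_dataset

dataset = load_dataset("zixianma/mnms", split="test_human_verified_filtered")

for example in dataset.select(range(3)):
    print(example["user_request"])  # natural-language task query (field name assumed)
    print(example["plan_str"])      # associated tool-use plan (field name assumed)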

Examples

We present additional examples of query-plan pairs in the m&ms dataset below:
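As a schematic illustration (made up for this page, not drawn from the dataset), a query-plan pair takes roughly the following form, written as a Python literal. The field names and the "<node-k>" argument-reference syntax are assumptions; the dataset itself is authoritative.

# Illustrative (made-up) query-plan pair in the spirit of m&ms examples.
# Field names and the "<node-k>" reference syntax are assumptions.
example = {
    "query": "Summarize the scene in the photo 'beach.jpg' and "
             "read the summary out loud.",
    "plan": [
        {"id": 0, "name": "image captioning",   "args": {"image": "beach.jpg"}},
        {"id": 1, "name": "text summarization", "args": {"text": "<node-0>.text"}},
        {"id": 2, "name": "text to speech",     "args": {"text": "<node-1>.text"}},
    ],
}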

Tools

In this section, you can find more details about the tools and their distributions in m&ms.


A complete list of tools across three categories in the m&ms dataset.


The distribution of tools before (blue) and after (red) data filtering. See more details about the filtering in our paper.

Experimental Results

Overview

With m&m's, we evaluate 6 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing, verification, and execution). We highlight three key findings from our extensive experiments:

  1. All planning agents perform better on tool selection with multi-step planning than with step-by-step planning regardless of the underlying LLMs' sizes;
  2. Verification and execution feedback can help models improve tool invocation by predicting correct argument names and generating executable plans but can lead to worse tool selection due to wrong fixes;
  3. While models perform comparably on tool selection with JSON versus code generation, they produce more executable plans overall with JSON-format generation.
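The results below are reported in terms of tool-F1 (tool selection), argname-F1 (argument-name prediction), and pass rate (roughly, whether the generated plan executes without errors). As a rough, simplified reading of tool-F1, it can be viewed as the F1 score between predicted and ground-truth tool names; the sketch below reflects that simplified reading, not necessarily the paper's exact formulation.

# Simplified sketch of a tool-F1 style metric: F1 between the multiset of
# predicted tool names and the multiset of ground-truth tool names.
# This is a rough reading of the metric, not the paper's exact definition.
from collections import Counter

def tool_f1(predicted_tools, gold_tools):
    pred, gold = Counter(predicted_tools), Counter(gold_tools)
    overlap = sum((pred & gold).values())  # tools matched between the two plans
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)

# Example: the planner picked the right tools but added an extra one.
print(tool_f1(["image captioning", "text summarization", "object detection"],
              ["image captioning", "text summarization"]))  # 0.8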

Planning Strategies

We find that models consistently perform better on tool-F1 and pass rate when instructed to perform multi-step planning instead of step-by-step planning regardless of their sizes.
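To make the two strategies concrete, here is a minimal sketch assuming a generic llm(prompt) -> str completion function and plans expressed as JSON lists of tool calls; both are placeholders for illustration, not part of the m&ms codebase.

# Sketch of the two planning strategies. `llm` is a placeholder callable that
# maps a prompt string to the model's text completion.
import json

def multi_step_plan(llm, query):
    # Multi-step planning: one call, the model emits the full tool sequence at once.
    prompt = f"Write the complete plan for this task as a JSON list of tool calls: {query}"
    return json.loads(llm(prompt))

def step_by_step_plan(llm, query, max_steps=3):
    # Step-by-step planning: one tool per call, conditioned on the plan so far.
    plan = []
    for _ in range(max_steps):
        prompt = (f"Task: {query}\nPlan so far: {json.dumps(plan)}\n"
                  "Emit the next tool call as JSON, or DONE if the plan is complete.")
        step = llm(prompt).strip()
        if step == "DONE":
            break
        plan.append(json.loads(step))
    return plan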

Feedback

External feedback can improve planning agents' performance on argname-F1 and pass rate at a small cost of tool-F1.
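A minimal sketch of folding feedback back into the prompt is shown below. Here llm, verify, and execute are placeholder callables standing in for the underlying LLM, the plan verifier, and the plan executor; the loop structure is our illustration rather than the exact experimental setup.

# Sketch of a plan-refinement loop with verification/execution feedback.
# `verify` and `execute` are placeholders that return an error message string,
# or None if the plan passes.
import json

def plan_with_feedback(llm, verify, execute, query, max_rounds=3):
    prompt = f"Write a plan for this task as a JSON list of tool calls: {query}"
    plan = json.loads(llm(prompt))
    for _ in range(max_rounds):
        errors = verify(plan) or execute(plan)  # e.g. bad argument names, runtime failures
        if not errors:
            break  # plan passed verification and executed successfully
        # Feed the error message back and ask the model to revise its plan.
        prompt = (f"Task: {query}\nPrevious plan: {json.dumps(plan)}\n"
                  f"Feedback: {errors}\nRevise the plan as a JSON list of tool calls.")
        plan = json.loads(llm(prompt))
    return plan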

Plan format

Models perform comparably on tool-F1 with JSON-format and code generation, but much worse on pass rate with code generation.
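To illustrate the contrast, the same made-up plan is shown below in the two formats. The tool names and the code-style function signatures are placeholders rather than the benchmark's exact interface.

# The same illustrative plan in the two formats compared above.
# Tool names and function signatures are placeholders.

# JSON-format plan: a structured list of tool calls (written here as a Python literal).
json_plan = [
    {"id": 0, "name": "image captioning",   "args": {"image": "beach.jpg"}},
    {"id": 1, "name": "text summarization", "args": {"text": "<node-0>.text"}},
]

# Code-format plan: the model writes Python that invokes the tools directly.
code_plan = """
caption = image_captioning(image="beach.jpg")
summary = text_summarization(text=caption)
"""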

Error analysis

We provide examples of common errors observed in step-by-step planning, with verification/execution feedback, and in code generation.

BibTeX

@article{ma2024mms,
  title={m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks},
  author={Zixian Ma and Weikai Huang and Jieyu Zhang and Tanmay Gupta and Ranjay Krishna},
  year={2024},
  journal={arXiv preprint arXiv:2403.11085},
}