Large language models display strong few-shot and instruction-following abilities but still perform poorly on basic operations such as arithmetic and factual lookup, where smaller, specialized systems excel. Toolformer addresses this gap by teaching a language model to call external tools through simple APIs, combining general-purpose reasoning with the precision of dedicated components. Crucially, the model decides for itself which tools to use and how to integrate their outputs, while preserving its underlying language modeling capabilities.
Toolformer is trained to determine which APIs to call, when to call them, what arguments to provide, and how to feed the returned results back into subsequent token prediction. Training proceeds in a self-supervised manner with only a handful of demonstrations required for each API, avoiding the need for large bespoke labeled datasets. The model effectively learns a policy over tool usage as part of standard next-token prediction, so tool calls become a natural extension of its text generation process instead of a separate control mechanism.
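Concretely, the self-supervised signal can be realized as a simple usefulness test: a sampled API call is kept for fine-tuning only if conditioning on the call and its returned result makes the following tokens easier to predict than leaving it out. The sketch below illustrates one plausible form of that filter; `lm_loss`, `format_call`, `keep_api_call`, and the threshold value are illustrative placeholders for this summary, not the released implementation.

```python
# A minimal sketch of the self-supervised filtering idea. `lm_loss(prefix,
# continuation)` is assumed to return the language model's loss on
# `continuation` given `prefix`; the call format, helper names, and threshold
# are illustrative assumptions.

def format_call(api_name, argument, result=None):
    """Render an API call as plain text so it lives in the model's vocabulary."""
    if result is None:
        return f"[{api_name}({argument})]"
    return f"[{api_name}({argument}) -> {result}]"

def keep_api_call(prefix, continuation, api_name, argument, result,
                  lm_loss, threshold=1.0):
    """Keep a sampled call only if inserting the call together with its result
    lowers the loss on the following tokens by at least `threshold`, compared
    with the better of (no call at all, call without its result)."""
    with_result = lm_loss(prefix + format_call(api_name, argument, result) + " ",
                          continuation)
    without_call = lm_loss(prefix, continuation)
    without_result = lm_loss(prefix + format_call(api_name, argument) + " ",
                             continuation)
    return min(without_call, without_result) - with_result >= threshold
```

Calls that pass this filter are spliced back into the original text, and the model is fine-tuned on the augmented corpus, which is how tool use becomes part of ordinary next-token prediction rather than a separate control mechanism.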
The researchers incorporate a diverse set of tools into Toolformer, including a calculator, a question and answer system, a search engine, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks and is often competitive with much larger models, while maintaining its core language modeling performance. The work suggests that carefully integrating external tools via self-supervised learning can allow language models to overcome persistent weaknesses in areas such as computation and factual retrieval, without scaling model size or sacrificing fluency.
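Because every tool is exposed as a text-in, text-out API, each one can be wrapped as a function that takes a string argument and returns a string to splice back into the generated text. The sketch below shows what such wrappers might look like for two of the simpler tools; the function names, parsing, and output formats are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative text-in / text-out wrappers for a calculator and a calendar
# tool. Names and formats are assumptions for this sketch.
import datetime
import operator
import re

def calculator(expression):
    """Evaluate a simple binary arithmetic expression such as '400 / 1400'."""
    ops = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    match = re.fullmatch(
        r"\s*(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*", expression)
    if match is None:
        return ""  # unparseable input yields an empty result
    left, op, right = match.groups()
    try:
        return str(round(ops[op](float(left), float(right)), 2))
    except ZeroDivisionError:
        return ""

def calendar(_argument=""):
    """Return the current date as text."""
    return datetime.date.today().strftime("Today is %A, %B %d, %Y.")

# The returned string is what would be inserted after a call such as
# [Calculator(400 / 1400)] in the generated text.
print(calculator("400 / 1400"))  # prints 0.29
print(calendar())
```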
