
The Evolution of LLM Agents

Large Language Models (LLMs) have evolved beyond mere text generators into intelligent agents capable of multi-step reasoning, decision-making, and interaction. This evolution drives the emergence of LLM-based autonomous systems that can break down complex problems, maintain context over multiple turns, and provide structured outputs. These advancements have led to a growing interest in Tool Learning with Frozen Language Models, where LLMs can extend their reasoning capabilities by dynamically interacting with external tools without altering their core architecture.

The appeal of these systems stems from the reasoning ability of LLMs. Models like GPT-4 and Claude excel at Chain-of-Thought (CoT) prompting, breaking tasks into intermediate steps to reach conclusions. However, LLMs on their own are limited: they cannot execute calculations, access real-time data, or interact with external systems.

Why Tool Learning is Necessary

Tool Learning has emerged as a critical field of research and engineering to overcome these inherent boundaries. Tool Learning is an LLM’s ability to recognize when it needs external help and to invoke the appropriate tool (such as a calculator, web search, weather API, or a specific database) at the right moment during reasoning.

This capability transforms static language models into dynamic agents that can engage with external environments, handle diverse modalities, and complete complex tasks beyond text generation. Rather than hardcoding all functionalities within the model, Tool Learning enables modular, extensible systems where LLMs act as orchestrators—delegating tasks to specialized tools while preserving their core reasoning abilities.

Tool Learning becomes especially powerful when integrated with Chain-of-Thought reasoning. In this setup, the LLM generates step-by-step reasoning and decides contextually when and how to use tools as part of the reasoning chain. However, building such systems introduces unique technical challenges, particularly when models must generalize to unseen tools and operate at scale.

Challenges in Existing Tool Learning Approaches

Tool Learning approaches compared (reconstruction of the comparison chart from www.ajithp.com):

Approach | Examples | Mechanism | Pros | Cons | Unseen Tool Score
Fine-tuning based | ToolLLM, API-Bank | Directly modifies LLM weights through supervised fine-tuning | Efficient tool invocation; precise for seen tools | Cannot use unseen tools; risks degrading core model capabilities; requires extensive labeled data | 0%
In-context learning | HuggingGPT, AgentBench | Few-shot prompts with tool demonstrations in context | No model modification; works with unseen tools | Inefficient with many tools; long context windows with high token usage; degraded performance at scale | Low
Chain-of-Tools (CoTools) | (this work) | Uses LLM hidden states with lightweight modules | Preserves LLM capabilities; works with unseen tools; scales to thousands of tools | Model architecture dependency; complex tool returns | 10.4%

Three primary paradigms have emerged in Tool Learning, each with distinct strengths and limitations:

Fine-Tuning-Based Approaches:

Systems like ToolLLM and API-Bank rely on supervised fine-tuning to teach LLMs how to invoke tools. While these methods can be precise and efficient when using tools encountered during training, they require extensive labeled data and typically fail to generalize to tools outside the training set. Moreover, aggressive fine-tuning risks altering the LLM’s core capabilities, potentially degrading its reasoning fluency or language understanding.

In-Context Learning (ICL) Based Approaches:

Methods like HuggingGPT and AgentBench avoid modifying the model weights by using few-shot prompting. These systems are flexible and can generalize better to new tools but become increasingly inefficient as the number of tools grows. Each prompt must include demonstrations or documentation for relevant tools, leading to lengthy context windows and degraded performance in large-scale settings.

Token Embedding Methods:

ToolkenGPT introduces a clever mechanism that associates tools with special tokens and fine-tunes only those token embeddings. This preserves the frozen LLM and achieves efficient tool invocation. However, it still suffers from a critical drawback: it cannot use unseen tools without retraining and embedding updates.

All these methods reveal a common trade-off: efficiency versus generalization. Currently, no existing method fully enables frozen LLMs to dynamically reason with vast pools of previously unseen tools.

Introducing Chain-of-Tools (CoTools)

Overview of the CoTools method. As in the example in Figure 1, CoTools judges whether to call a tool whenever a new answer token is to be generated; the answer fragment is the answer text generated so far by the foundation model. Image courtesy: Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

Chain-of-Tools (CoTools) introduces a breakthrough in Tool Learning with frozen language models. Its novel architecture allows LLMs to reason step by step while selectively invoking tools, even ones never seen during training. CoTools is specifically designed to integrate with CoT reasoning and to leverage the semantic richness of the hidden states produced by frozen LLMs for tool-related decisions.

Instead of modifying the language model itself, CoTools attaches lightweight modules that interpret the internal representations of the LLM to:

  • Determine whether a tool is needed at each generation step.
  • Select the most relevant tool from a large, dynamically loaded toolset.
  • Inject the tool output directly into the reasoning chain.

This architecture balances flexibility, efficiency, and generalization. It scales to thousands of tools and maintains the foundation model’s original reasoning capabilities, making it well-suited for real-world agents that operate in open-ended environments.

Key Contributions of Chain-of-Tools

The CoTools framework presents several key innovations that significantly advance the state of Tool Learning with frozen language models:

  • A New Fine-Tuning-Based Paradigm: CoTools trains only lightweight judgment and retrieval modules, keeping the base LLM frozen and preserving its general-purpose capabilities.
  • Massive Unseen Tool Generalization: By encoding tools through natural language descriptions and using contrastive learning, CoTools can effectively retrieve and invoke tools never seen during training.
  • SimpleToolQuestions Dataset: The authors introduce a new benchmark dataset containing 1,836 tools, with a split between seen and unseen tools, tailored for evaluating large-scale tool selection under realistic constraints.
  • Superior Performance Across Tasks: CoTools demonstrates state-of-the-art performance on both numerical reasoning (GSM8K-XL, FuncQA) and knowledge-based QA benchmarks (KAMEL, SimpleToolQuestions), outperforming baselines in accuracy and scalability.
  • Interpretable Hidden State Analysis: The framework reveals which dimensions of the LLM’s hidden states are most influential for tool selection, offering insights into how semantic knowledge is encoded within LLMs.

With these contributions, Chain-of-Tools opens a new direction in developing scalable, extensible, and interpretable LLM agents.


Methodology: How Tool Learning Works with Frozen Language Models

The ideal tool-calling procedure, taking the input query “What’s the weather like at my destination tomorrow?” as an example. Image courtesy: Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

The Core Concept Behind Chain-of-Tools

Chain-of-Tools (CoTools) introduces a structured approach to Tool Learning that preserves the integrity of the underlying language model. Instead of modifying the model’s weights, CoTools leverages the rich semantic representations already encoded in the hidden states of frozen LLMs to make intelligent decisions about tool usage in real-time.

The key innovation lies in how CoTools integrates tool selection seamlessly into the token-by-token generation process during Chain-of-Thought (CoT) reasoning. As the model generates each token, CoTools continuously monitors the evolving internal state to determine:

  • Whether a tool is needed at the current generation step
  • Which specific tool from a potentially massive pool is most relevant
  • How to properly invoke it with contextually appropriate parameters

This modular design creates a system that can work with thousands of tools—even ones never seen during training—without requiring modification to the base language model. By maintaining the model’s frozen state, CoTools preserves the original reasoning capabilities while extending functionality through external interfaces.
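To make this loop concrete, here is a minimal Python sketch of the per-token control flow described above. It is illustrative only, not the authors’ implementation: the judge_tool_call, retrieve_tool, and call_tool helpers, as well as the llm.next_token and llm.eos_token interfaces, are hypothetical stand-ins for the components detailed in the following sections.

```python
# Simplified sketch of the CoTools generation loop (illustrative only).
# judge_tool_call, retrieve_tool, and call_tool are hypothetical helpers
# standing in for the Tool Judge, Tool Retriever, and tool-calling step.

def generate_with_tools(llm, query, tools, max_tokens=256, threshold=0.5):
    answer = ""                                          # the growing answer fragment
    for _ in range(max_tokens):
        token, hidden = llm.next_token(query + answer)   # next token + its hidden state
        if judge_tool_call(hidden) > threshold:          # Tool Judge: call a tool here?
            tool = retrieve_tool(hidden, tools)          # Tool Retriever: pick best tool
            result = call_tool(llm, tool, query, answer) # fill parameters, execute
            answer += str(result)                        # inject the result into the CoT
        else:
            answer += token
        if token == llm.eos_token:
            break
    return answer
```

In practice the answer fragment and any injected tool results are fed back through the frozen model for the next token, but the control flow stays the same.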

Chain of Tools Process Flow - www.ajithp.com

Tool Judge: Determining When to Call Tools

The first component in the CoTools pipeline is the Tool Judge. This lightweight binary classifier determines if a tool should be invoked at a specific point in the generation process.

As the LLM generates output token by token, it produces a hidden state h_t ∈ ℝ^d for each token t. The Tool Judge applies a specialized neural network to this hidden state to compute a score:

Score_J = J(h_t) ∈ [0, 1]

where J represents the Tool Judge module, with trainable parameters W_J^gate, W_J^up, and W_J^down that allow it to interpret the hidden state.

This score represents the probability that a tool call is needed at this precise moment in the reasoning chain. If the score exceeds a predefined threshold (typically 0.5), the system proceeds to the tool selection phase. Otherwise, the LLM continues generating the next token through its standard process.

The Tool Judge is trained as a sequence labeling task using binary cross-entropy loss:

L_Judge = L_BCE(Score_J, Label)

Because this module operates solely on the LLM’s hidden state and requires minimal computation, it enables efficient real-time evaluation without disrupting the flow of the model’s reasoning process.
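As a concrete illustration, the following minimal PyTorch sketch implements a Tool Judge of this kind: a small gated network over a single hidden state, trained with binary cross-entropy. The gated-MLP shape is an assumption suggested by the W_J^gate/W_J^up/W_J^down naming; the layer sizes and example labels are likewise illustrative, not the paper’s code.

```python
import torch
import torch.nn as nn

class ToolJudge(nn.Module):
    """Minimal sketch of a Tool Judge: a gated MLP over one hidden state.

    The gate/up/down naming mirrors the W_J^gate, W_J^up, W_J^down parameters
    mentioned above; the exact layer shapes are assumptions, not the paper's code.
    """
    def __init__(self, d_model: int, d_inner: int = 1024):
        super().__init__()
        self.gate = nn.Linear(d_model, d_inner, bias=False)   # W_J^gate
        self.up = nn.Linear(d_model, d_inner, bias=False)     # W_J^up
        self.down = nn.Linear(d_inner, 1, bias=False)         # W_J^down

    def forward(self, h_t: torch.Tensor) -> torch.Tensor:
        # Score_J = J(h_t) in [0, 1]: probability that a tool call is needed now.
        gated = torch.nn.functional.silu(self.gate(h_t)) * self.up(h_t)
        return torch.sigmoid(self.down(gated)).squeeze(-1)

# Training as sequence labeling with binary cross-entropy (L_Judge = L_BCE).
judge = ToolJudge(d_model=4096)
hidden_states = torch.randn(8, 4096)                      # hidden states of 8 answer tokens
labels = torch.tensor([0., 0., 1., 0., 0., 1., 0., 0.])   # 1 = a tool call is needed here
loss = nn.functional.binary_cross_entropy(judge(hidden_states), labels)
loss.backward()
```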

Tool Retriever: Selecting the Right Tool

When the Tool Judge signals that a tool is needed, the Tool Retriever identifies the most contextually relevant tool—even from pools containing hundreds or thousands of options, many of which were never seen during training.

The Tool Retriever consists of two key components:

  1. Query Encoder (E_Q): Processes the hidden state derived from the current context (query + answer fragment generated so far) to produce a query vector V_Q
  2. Tool Encoder (E_T): Processes the natural language description of each tool to generate corresponding tool vectors V_T

Both encoders use residual connections and normalization to preserve the rich semantics of the original hidden states:

E_Q(h) = norm(W_dim ⊗ (h + E_Q'(h)))

where E_Q' applies additional transformations to the hidden state while maintaining its original information through the residual connection. This approach is critical, as it preserves as much information as possible from the original hidden states.

The system computes the similarity between the query vector and each tool vector using a simple dot product:

Score_{Q,T} = V_Q · V_T

The tool with the highest similarity score is selected for invocation. This retrieval mechanism is trained using contrastive learning, which pulls related query-tool pairs together in the embedding space while pushing unrelated pairs apart.

Critically, this architecture enables zero-shot generalization to unseen tools. As long as a new tool can be described in natural language, the retriever can evaluate its relevance without requiring pre-trained embeddings or tool-specific training data. This stands in stark contrast to previous approaches like ToolkenGPT, which require retraining for new tools.
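The sketch below illustrates one way such a retriever could be assembled in PyTorch, with a shared W_dim projection, residual query/tool encoders, dot-product scoring, and an in-batch contrastive loss. The MLP form of E', the output dimension, the temperature, and the in-batch negative sampling are all assumptions made for illustration rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MODEL, D_OUT = 4096, 1024          # sizes are assumptions for illustration

class HiddenStateEncoder(nn.Module):
    """Sketch of a Query/Tool Encoder: E(h) = norm(W_dim @ (h + E'(h))).

    E' is modeled here as a small MLP, and W_dim is a projection shared by the
    query and tool encoders; the exact shapes are assumptions.
    """
    def __init__(self, shared_w_dim: nn.Linear):
        super().__init__()
        self.inner = nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.SiLU(),
                                   nn.Linear(D_MODEL, D_MODEL))   # E'
        self.w_dim = shared_w_dim                                  # shared W_dim

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        residual = h + self.inner(h)        # residual keeps the original hidden-state info
        return F.normalize(self.w_dim(residual), dim=-1)

w_dim = nn.Linear(D_MODEL, D_OUT, bias=False)
query_enc, tool_enc = HiddenStateEncoder(w_dim), HiddenStateEncoder(w_dim)

# Hidden states for a batch of (query context, matching tool description) pairs.
q_hidden = torch.randn(16, D_MODEL)   # from query + answer fragment so far
t_hidden = torch.randn(16, D_MODEL)   # from the paired tool descriptions

v_q, v_t = query_enc(q_hidden), tool_enc(t_hidden)
scores = v_q @ v_t.T                  # Score_{Q,T} = V_Q . V_T for every pair

# In-batch contrastive training: each query should rank its own tool highest.
labels = torch.arange(scores.size(0))
loss = F.cross_entropy(scores / 0.05, labels)   # 0.05 is an assumed temperature
loss.backward()

# Inference: encode every tool description (seen or unseen) once, then argmax.
best_tool_idx = (v_q[:1] @ v_t.T).argmax(dim=-1)
```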

Tool Calling: Executing and Integrating Tool Results

Once the appropriate tool has been selected, CoTools dynamically generates the input parameters using in-context learning (ICL) style prompting. The system tokenizes the query and the answer fragment with a specialized calling prompt that includes both tool documentation and examples of proper usage.

The parameters are inserted into a predefined tool-call template and passed to the tool execution environment. After execution, the tool’s return value is injected directly into the ongoing CoT reasoning, effectively becoming part of the LLM’s response.

For example, when answering “What’s the weather like at my destination tomorrow?”, CoTools might first use a scheduling tool to identify the destination (e.g., “Shanghai”) and then a weather tool to retrieve the forecast (e.g., “sunny”). This result is then seamlessly incorporated into the continuing generation process, resulting in a response like “Tomorrow you will take part in the NLP conference, Shanghai. The weather in Shanghai will be sunny. Have a nice day!”

The tool-calling format is emphasized in the prompt to ensure consistent parsing via regular expressions, allowing CoTools to remain agnostic to the specific type or source of the external tool.
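A hedged sketch of this calling step might look as follows; the prompt wording, the llm.complete and tool.execute interfaces, and the exact call format are assumptions chosen to illustrate the ICL-plus-regex pattern, not the authors’ actual template.

```python
import re

# Hypothetical calling prompt; CoTools' exact template is not reproduced here,
# only the general ICL pattern: tool documentation plus a worked example.
CALL_PROMPT = """Tool documentation:
{doc}

Example:
Question: What's the weather in Paris tomorrow?
Answer so far: Tomorrow you will be in Paris.
Call: WeatherQuery(city="Paris", date="tomorrow")

Question: {query}
Answer so far: {fragment}
Call: """

CALL_RE = re.compile(r"(\w+)\((.*?)\)", re.S)   # parse `ToolName(arg=..., ...)`

def call_tool(llm, tool, query, fragment):
    """Ask the frozen LLM to fill in call parameters, then execute the tool.

    llm.complete and tool.execute are assumed interfaces used for illustration.
    """
    prompt = CALL_PROMPT.format(doc=tool.description, query=query, fragment=fragment)
    completion = llm.complete(prompt, stop="\n")     # e.g. 'WeatherQuery(city="Shanghai", ...)'
    match = CALL_RE.search(completion)
    if match is None:
        return ""                                    # no well-formed call: skip tool use
    _name, args = match.groups()
    result = tool.execute(args)                      # run the external API or function
    return str(result)                               # injected back into the CoT answer
```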

Why Frozen Language Models Remain Untouched

A fundamental design principle of CoTools is maintaining the integrity of the foundation model. Unlike traditional fine-tuning approaches that risk overfitting to specific tools or degrading general capabilities, CoTools augments the LLM externally without modifying its weights.

This design decision offers several strategic advantages over previous approaches:

  1. Preservation of CoT Reasoning: The model’s original step-by-step problem-solving capabilities remain intact and unaffected by tool integration, unlike fine-tuning methods that can degrade these emergent abilities.
  2. Scalability and Modularity: New tools can be added or removed dynamically without requiring any retraining of the base model, addressing a major limitation of token embedding methods like ToolkenGPT.
  3. Generalization to Unseen Tools: The semantic matching approach enables CoTools to interface with entirely new APIs and functions that weren’t available during training, overcoming the closed-set constraint of fine-tuning approaches.
  4. Maintained Knowledge and Fluency: By keeping the LLM frozen, all of its pre-trained knowledge and linguistic fluency is preserved while still extending its capabilities.
  5. Efficient Resource Utilization: Training only lightweight modules—Tool Judge, Query Encoder, and Tool Encoder—requires significantly fewer computational resources than full model fine-tuning.

By training only these specialized modules, CoTools creates an efficient architecture that enhances the model’s functionality without compromising its core strengths. This approach effectively resolves the efficiency-generalization trade-off that has challenged previous tool-learning methods, providing a more sustainable path toward building truly capable AI agents.


Evaluating Tool Learning with the SimpleToolQuestions Dataset

Motivation Behind Creating STQuestions

Existing benchmarks for evaluating Tool Learning capabilities suffer from significant limitations. Most datasets contain only a small number of tools (typically 5-20), feature synthetic or template-based queries, and rarely test generalization to unseen tools. These constraints limit their utility in evaluating true generalization or scalability.

To address this gap, the authors of Chain-of-Tools introduced the SimpleToolQuestions (STQuestions) dataset—a purpose-built benchmark designed to stress-test tool selection under challenging conditions. Specifically, STQuestions simulates a scenario where an agent must choose the correct tool from a massive, heterogeneous tool pool—many of which were never seen during training—and still produce an accurate answer within a CoT reasoning context.

This makes STQuestions a valuable resource for measuring how well a tool-learning framework can scale, generalize, and operate reliably in real-world agent settings where new tools can be added or deprecated at any time. Unlike previous benchmarks that focus on smaller tool sets or controlled environments, STQuestions deliberately creates a more realistic and demanding evaluation environment.

Dataset Design and Construction

The STQuestions dataset builds upon the widely-used SimpleQuestions v2 benchmark, originally developed for knowledge base question answering (KBQA). In its raw form, each SimpleQuestions example includes a natural language query and a knowledge triple (head entity, relationship, tail entity). However, this format is not ideal for tool-learning tasks for two reasons:

  1. It assumes a narrow search space for tool selection, often focusing on a small subset of relations.
  2. The natural questions are too short and vague to support fine-grained tool retrieval across a large tool pool.

To overcome these limitations, the authors reconstructed the dataset using an LLM (ChatGPT). Each original question was rewritten to include more detailed and descriptive language that clearly reflects the semantic intent of the required tool (i.e., the relation). This allowed the authors to expand the retrieval scope and simulate realistic agent queries that require interpreting subtle contextual cues.

For example, a simple original question like “Who directed Titanic?” might be expanded to “I’m researching filmmaking techniques in disaster movies and need to know which visionary director was responsible for creating the iconic 1997 film Titanic.” This richer context provides more semantic information for tool selection while maintaining the core information needed.

Additionally, for every tool (relation), the dataset includes a natural language description that mimics tool documentation—enabling zero-shot matching via semantic similarity rather than hardcoded tokens or fixed embeddings. This is crucial for evaluating Tool Learning with frozen language models, which must rely on understanding descriptions rather than memorizing tool IDs.

The dataset construction process involved several steps:

  1. Extracting relation-entity pairs from the original SimpleQuestions dataset
  2. Generating detailed descriptions for each relation to serve as tool documentation
  3. Rewriting questions to be more descriptive and contextually rich
  4. Creating a deliberate split between seen and unseen tools for testing generalization
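To make the resulting format concrete, here is a hypothetical illustration of what a single STQuestions-style record might look like after these steps; the field names and values are invented for clarity and are not taken from the released dataset.

```python
# Hypothetical illustration of an STQuestions-style record; field names and
# values are invented for clarity and do not come from the released dataset.
example = {
    "question": ("I'm researching filmmaking techniques in disaster movies and "
                 "need to know which visionary director was responsible for "
                 "creating the iconic 1997 film Titanic."),
    "tool": "film.film.directed_by",          # Freebase-style relation acting as the tool
    "tool_description": ("Given a film entity, returns the person or people "
                         "credited as the director of that film."),
    "head_entity": "Titanic (1997 film)",
    "answer": "James Cameron",
    "split": "unseen",                        # whether this tool appears during training
}
```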

Key Features of STQuestions

The final STQuestions dataset contains:

  • 1,836 unique tools, each with a human-readable description that specifies its purpose and functionality
  • A clear split between seen and unseen tools:
    • 999 tools appear during training (seen)
    • 837 tools appear only in the test set (unseen)
  • 13,256 total examples distributed across: 
    • 10,483 training examples utilizing only the 999 seen tools
    • 1,707 development examples utilizing only the 999 seen tools
    • 2,773 test examples split between: 
      • 1,707 examples utilizing the 999 seen tools
      • 1,066 examples utilizing the 837 unseen tools

This deliberate partition enables researchers to separately evaluate:

  • Tool selection accuracy for seen tools (testing memory and representation)
  • Generalization ability to unseen tools (testing semantic understanding and retrieval)

STQuestions spans a wide range of domains, including geography, sports, film, history, science, business, and more, further increasing its realism and diversity. Each tool comes from the Freebase knowledge base, ensuring that the relations represent real-world semantic connections rather than artificial constructs.

Each question requires the model to:

  1. Understand the user’s intent from a detailed prompt
  2. Retrieve the correct tool from a large pool using only its description
  3. Generate or complete the answer using the retrieved tool’s output

Unlike synthetic datasets with contrived tools or fixed templates, STQuestions emphasizes natural language variability, domain diversity, and the practical challenge of large-scale tool retrieval—making it a critical benchmark for evaluating modern Tool Learning systems like Chain-of-Tools.

The dataset’s design specifically challenges models on their ability to handle a massive tool pool that includes both seen and unseen tools, closely simulating real-world scenarios where new tools are constantly being added to an agent’s available toolset. This feature is particularly important for evaluating the zero-shot generalization capabilities that are essential for practical tool-using AI systems.


Experimental Results: Measuring Tool Learning Performance in Frozen LLMs

Benchmarks for Numerical Reasoning and Knowledge Tasks

To rigorously evaluate the Chain-of-Tools (CoTools) framework, the researchers conducted comprehensive testing across four diverse benchmarks spanning both numerical computation and knowledge-based question answering. These benchmarks collectively assess the framework’s ability to invoke tools accurately during reasoning, scale to large tool pools, generalize to unseen tools, and maintain reasoning fluency.

GSM8K-XL

This extension of the popular Grade School Math 8K dataset focuses specifically on arithmetic reasoning capabilities. GSM8K-XL includes just four basic arithmetic tools (addition, subtraction, multiplication, and division) but requires precise multi-step numerical computation to solve complex word problems. With 5,054 training examples, 1,000 development examples, and 568 test examples, this benchmark evaluates how well models can use simple tools in complex reasoning chains.

FuncQA

A synthetic dataset comprising 13 mathematical function tools, FuncQA tests both one-hop (single tool) and multi-hop (multiple tools) question scenarios. The dataset includes 611 training examples, 39 development examples, and 128 test examples (60 one-hop and 68 multi-hop). FuncQA challenges models not just to select the right tool but also to chain multiple tools together effectively—an essential capability for solving complex real-world problems.

KAMEL

Built on knowledge from Wikidata, KAMEL contains 234 relational tools spanning diverse domains. What makes KAMEL particularly valuable is its dual training sets: a gold standard human-curated set (19,000 examples) and a synthetic ChatGPT-generated set (8,095 examples). This allows researchers to evaluate how models perform with both high-quality and potentially noisy training data. The test set contains 500 examples, providing a robust evaluation of tool selection accuracy in structured knowledge environments.

SimpleToolQuestions (STQuestions)

Introduced specifically for this research, STQuestions represents the most challenging benchmark with 1,836 unique tools across diverse domains. Its deliberate split between 999 seen tools and 837 unseen tools (10,483 training examples, 1,707 development examples, and 2,773 test examples) makes it ideal for assessing both large-scale tool retrieval and zero-shot generalization capabilities. This benchmark most closely simulates real-world conditions where new tools are constantly being added to an agent’s toolset.

Together, these benchmarks provide a comprehensive evaluation framework that tests both reasoning types (numerical and factual), tool pool sizes (from just four tools to over 1,800), and tool familiarity (both seen during training and completely novel).

Tool Learning Results on GSM8K-XL and FuncQA

In the arithmetic reasoning domain, CoTools was evaluated against several competitive baseline methods, including zero-shot ChatGPT, various prompting techniques with LLaMA and Mistral models, and the state-of-the-art ToolkenGPT system. The results revealed several important patterns.

GSM8K-XL Performance

On the GSM8K-XL benchmark, CoTools demonstrated significant improvements over existing approaches:

Method | GSM8K-XL Accuracy
ChatGPT (0-shot) | 0.17
Prompting LLaMA | 0.04
CoT Prompting LLaMA | 0.00
ToolkenGPT (LLaMA2) | 0.18
CoTools (LLaMA2) | 0.19
Prompting Mistral | 0.14
CoT Prompting Mistral | 0.10
CoTools (Mistral) | 0.42

The results show that CoTools achieved an accuracy of 0.19 with LLaMA2-7B-Chat and an impressive 0.42 with Mistral-7B-Instruct-v0.2, more than doubling the performance of standard approaches. This dramatic improvement with Mistral demonstrates that CoTools can effectively leverage stronger foundation models, amplifying their inherent capabilities rather than restricting them.

FuncQA Performance

The FuncQA results further validate CoTools’ effectiveness:

Method | One-Hop | Multi-Hop
ChatGPT (0-shot) | 0.55 | 0.09
Prompting LLaMA | 0.05 | 0.00
CoT Prompting LLaMA | 0.00 | 0.00
ToolkenGPT (LLaMA2) | 0.48 | 0.06
CoTools (LLaMA2) | 0.53 | 0.07
Prompting Mistral | 0.17 | 0.04
CoT Prompting Mistral | 0.20 | 0.06
CoTools (Mistral) | 0.63 | 0.07

For one-hop questions, CoTools outperformed ToolkenGPT by five percentage points with LLaMA2 (0.53 vs. 0.48), and with Mistral it reached 0.63, 15 points above ToolkenGPT and far ahead of the Mistral prompting baselines. On the more challenging multi-hop questions, all methods struggled, but CoTools maintained a slight edge.

These results demonstrate that Tool Learning with frozen language models can significantly benefit from the CoTools architecture, particularly when combined with high-quality LLMs. The methodology not only preserves but amplifies the reasoning strengths of the base model, outperforming both token-based approaches like ToolkenGPT and various prompting strategies.

Knowledge-Based Tool Learning with Frozen LLMs

Perhaps the most compelling results emerged from the knowledge-intensive benchmarks, where CoTools demonstrated remarkable advantages in tool selection accuracy and generalization to unseen tools.

KAMEL Results

When evaluated on the KAMEL dataset, the results showed an interesting pattern:

Method | KAMEL (Gold) | KAMEL (Synthetic)
ToolkenGPT (LLaMA2) | 93.4% | 20.6%
CoTools (LLaMA2) | 93.8% | 43.6%

With high-quality human-labeled training data (KAMEL Gold), both methods performed exceptionally well, achieving over 93% accuracy. However, when trained on synthetic data generated by ChatGPT, CoTools dramatically outperformed ToolkenGPT, reaching 43.6% accuracy, more than double ToolkenGPT’s 20.6%.

This striking difference highlights CoTools’ robustness in low-quality or noisy training environments, likely due to its contrastive learning-based retriever, which can better distinguish subtle semantic differences between similar tools even with imperfect training signals.

STQuestions Results

The most revealing test of generalization came from the STQuestions benchmark, with its large-scale tool pool and deliberate split between seen and unseen tools:

Method | Seen Tools | Unseen Tools
ToolkenGPT (LLaMA2) | 23.8% | 0.0%
CoTools (LLaMA2) | 35.1% | 10.4%

CoTools demonstrated a clear advantage in both categories:

  • Seen Tools: CoTools outperformed ToolkenGPT by 11.3 percentage points, showing better discrimination among similar tools.
  • Unseen Tools: While ToolkenGPT completely failed to select any correct unseen tools (0.0% accuracy), CoTools achieved a non-zero accuracy of 10.4% and even higher results when evaluated on top-5 retrievals.

Further analysis revealed that ToolkenGPT exhibited a strong bias toward a small subset of tools it had memorized during training, whereas CoTools showed a more balanced distribution of predictions across both seen and unseen tools. This demonstrates CoTools’ ability to reason over tool descriptions rather than relying solely on memorized patterns.

The combined results across all four benchmarks confirm that CoTools represents a significant advancement in Tool Learning for frozen language models. Its unique architecture balances efficiency and generalization, enabling it to operate effectively even with massive tool pools and previously unseen tools—precisely the conditions found in real-world agent deployments.

Most importantly, these results validate the core hypothesis behind CoTools: that the semantic richness of hidden states in frozen LLMs can be effectively leveraged for tool selection and invocation without modifying the base model’s weights, preserving its reasoning capabilities while extending its functional reach through external tools.


Insights from Tool Learning with Frozen Language Models

Impact of Training Data Quality

One of the most striking advantages of Chain-of-Tools (CoTools) is its remarkable resilience when trained on lower-quality data. This characteristic was clearly demonstrated through experiments with the KAMEL benchmark, which features two distinct training sets:

  • KAMEL(sup): A high-quality, human-curated dataset containing 19,000 examples
  • KAMEL(syn): A synthetic dataset generated by ChatGPT with 8,095 examples, designed to simulate real-world scenarios where manually labeled data is scarce

When trained on the gold-standard human-curated data, both CoTools and ToolkenGPT performed exceptionally well, achieving tool selection accuracy above 93%. However, the systems diverged dramatically when trained on synthetic data:

Method | KAMEL(sup) | KAMEL(syn)
ToolkenGPT | 93.4% | 20.6%
CoTools | 93.8% | 43.6%

CoTools more than doubled the accuracy of ToolkenGPT on synthetic data, maintaining 43.6% accuracy compared to just 20.6%. This substantial difference highlights CoTools’ ability to extract meaningful patterns even from noisy, imperfect training signals.

The robustness stems primarily from CoTools’ contrastive learning approach, which focuses on optimizing the relative positions of query-tool pairs in semantic space rather than memorizing specific patterns. This learning strategy enables CoTools to identify discriminative features even when training data lacks the precision, structure, or completeness typically required by fine-tuning methods.

For real-world applications, this resilience represents a crucial advantage. High-quality, manually annotated tool invocation data is expensive and time-consuming to create, particularly as toolsets expand. CoTools’ ability to maintain reasonable performance even with synthetic data makes it significantly more practical for production environments where perfect training data is rarely available.

Handling a Large Number of Tools

Scalability remains one of the central challenges in Tool Learning, especially as autonomous agents increasingly need to interface with hundreds or thousands of APIs, functions, and services. CoTools addresses this challenge through its semantic vector matching approach, which scales efficiently with toolset size.

The researchers conducted extensive testing on the STQuestions benchmark, systematically varying the number of tools from 200 to the full set of 999 seen tools. The results revealed a clear pattern:

  • ToolkenGPT’s performance degraded significantly as the tool pool expanded, dropping from approximately 38% accuracy with 200 tools to 23.8% with the full set of 999 tools
  • CoTools maintained a much more stable performance, declining only modestly from around 40% to 35.1% as the tool count increased nearly fivefold

Even more impressive was CoTools’ performance when evaluated with top-5 retrieval metrics. In this scenario, the system achieved nearly 80% accuracy even with the complete set of 999 tools, demonstrating that the correct tool was almost always among the top five candidates.

This has profound implications for practical applications. In real-world scenarios, agent systems can employ various reranking, validation, or fallback mechanisms to select from a shortlist of candidate tools. CoTools consistently produces high-quality shortlists, making it particularly suitable for production environments where absolute precision isn’t required on the first attempt.

The architecture’s scalability advantage comes from its approach to tool representation. Rather than encoding each tool as a discrete token (which becomes unwieldy as toolsets grow), CoTools represents tools as points in a continuous semantic space, allowing for efficient nearest-neighbor search even across thousands of candidates.
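A small sketch of the shortlist-then-rerank pattern mentioned above could look like this; the scoring follows the dot-product formulation from the methodology section, while the llm.choose reranking interface and the prompt are illustrative assumptions.

```python
import torch

def shortlist_tools(v_q, tool_vectors, k=5):
    """Return the indices of the top-k tools by dot-product similarity."""
    scores = tool_vectors @ v_q            # Score_{Q,T} for every candidate tool
    return torch.topk(scores, k).indices

def select_tool(llm, v_q, tool_vectors, tools, k=5):
    """Retrieve a top-k shortlist, then let the LLM (or any reranker) pick one.

    The rerank prompt and llm.choose interface are assumptions; in production
    this step could equally be a rule-based validator or a fallback chain.
    """
    candidates = [tools[i] for i in shortlist_tools(v_q, tool_vectors, k)]
    descriptions = "\n".join(f"{i}: {t.description}" for i, t in enumerate(candidates))
    choice = llm.choose(f"Pick the best tool for the current step:\n{descriptions}")
    return candidates[int(choice)]
```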

Generalization to Unseen Tools

Perhaps the most significant breakthrough demonstrated by CoTools is its ability to generalize to completely unseen tools—a capability that has eluded previous Tool Learning approaches. Traditional methods, including ToolkenGPT, rely on fixed representations of tools encountered during training, making them fundamentally incapable of utilizing new tools without retraining.

CoTools overcomes this limitation by treating tools as semantic entities described in natural language. By learning to encode and match tools based on their descriptions rather than memorized identifiers, CoTools can perform zero-shot retrieval—selecting appropriate tools even when encountering them for the first time.

This capability was rigorously tested using the 837 unseen tools in the STQuestions benchmark. The results were definitive:

Method | Seen Tools | Unseen Tools
ToolkenGPT | 23.8% | 0.0%
CoTools | 35.1% | 10.4%

While ToolkenGPT completely failed with unseen tools (0% accuracy), CoTools achieved a top-1 accuracy of 10.4% and a top-5 accuracy of 33.68%. These numbers may seem modest in absolute terms, but they represent an unprecedented capability—selecting the correct tool from 837 previously unseen options based solely on natural language descriptions.

Further analysis of error patterns revealed additional insights. ToolkenGPT exhibited a strong bias toward a small subset of familiar tools, essentially defaulting to what it had memorized during training. In contrast, CoTools distributed its predictions more evenly across both seen and unseen tools, indicating genuine semantic reasoning rather than pattern matching.

This generalization ability has profound implications for real-world deployment. In practice, toolsets are rarely static—new APIs emerge, existing ones change, and custom tools are developed for specific use cases. CoTools’ ability to adapt to these changes without retraining makes it significantly more flexible and maintainable in dynamic environments.

Identifying Key Hidden State Dimensions for Tool Selection

Beyond its practical performance advantages, CoTools offers valuable insights into how frozen language models encode semantic knowledge. A critical component of its architecture is the shared weight matrix W_dim, which helps identify which dimensions of the LLM’s hidden state vectors are most relevant for tool retrieval.

Analyzing the W_dim Matrix

The researchers conducted an innovative analysis by training CoTools with different learning rates for W_dim (0.001, 0.01, and 0.1) while keeping other parameters constant. This revealed that specific dimensions consistently carried more weight in computing semantic similarity between queries and tools. At higher learning rates, these key dimensions became more pronounced, suggesting that only a subset of the LLM’s internal representations is activated for tool-related reasoning.

When visualizing the normalized weight distributions, clear patterns emerged—certain dimensions were consistently amplified across all learning rates, while others were consistently suppressed. This provided evidence for a latent structure within the LLM’s hidden states, where different subspaces specialize in different aspects of semantic understanding.

To validate this hypothesis, the researchers conducted an ablation study using only the top-ranked dimensions (as determined by weight magnitude). The results were remarkable: when using just 1,561 of the 4,096 dimensions (approximately 38%), tool selection accuracy decreased by only 1.4 percentage points, from 93.8% to 92.4%. Top-5 accuracy remained unchanged, indicating that the pruned dimensions contained minimal relevant information for tool selection.
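A minimal sketch of that pruning idea, assuming W_dim is available as a weight matrix of shape (output_dim, hidden_dim) and that a dimension’s importance is measured by its aggregate weight magnitude (both assumptions made for illustration):

```python
import torch

def top_hidden_dims(w_dim: torch.Tensor, keep: int = 1561):
    """Rank hidden-state dimensions by their aggregate weight magnitude in W_dim."""
    importance = w_dim.abs().sum(dim=0)          # one score per input dimension
    return torch.topk(importance, keep).indices  # e.g. 1,561 of 4,096 dimensions

def pruned_similarity(h_query, h_tool, w_dim, keep=1561):
    """Compute the retrieval score using only the most influential dimensions."""
    dims = top_hidden_dims(w_dim, keep)
    v_q = torch.nn.functional.normalize(h_query[dims] @ w_dim[:, dims].T, dim=-1)
    v_t = torch.nn.functional.normalize(h_tool[dims] @ w_dim[:, dims].T, dim=-1)
    return v_q @ v_t
```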

This finding has significant implications for both computational efficiency and interpretability:

  1. Efficient Inference: By focusing only on key dimensions, the system can significantly reduce computational overhead without sacrificing performance
  2. Enhanced Interpretability: Identifying which dimensions encode tool-relevant information provides a window into how LLMs organize semantic knowledge
  3. Targeted Probing: Researchers can now develop more focused probing techniques to understand specific capabilities within the broader LLM architecture

These insights suggest that while LLMs learn holistic representations during pre-training, specific tasks like tool selection activate specialized subspaces within those representations. Understanding these subspaces not only improves current Tool Learning methods but could inform more efficient and interpretable architectures for future language models.


My Perspective: The Strategic Significance of Chain-of-Tools

A Paradigm Shift in Tool Learning

The Chain-of-Tools (CoTools) approach represents a fundamental paradigm shift in how AI systems interact with external tools. This shift is happening during a critical evolutionary period for large language models, which have progressed from simple text generators to sophisticated reasoning agents capable of complex multi-step problem-solving.

Three strategic transformations stand out when examining CoTools in the broader context of AI development:

1. From Static to Dynamic Tool Integration

Traditional approaches to tool learning required either extensive fine-tuning (compromising the model’s core capabilities) or inefficient in-context learning (limiting scalability). CoTools breaks this trade-off by enabling frozen language models to dynamically interface with thousands of tools without modifying the underlying model.

This represents a shift from “hard-coded” tool capabilities to a flexible, adaptable system that can incorporate new tools on the fly. As the AI industry evolves toward more modular architectures where agents can “use different tools to adjust their responses based on the situation” (SuperAnnotate), CoTools provides a technical foundation for truly extensible systems.

2. From Memorization to Comprehension

Perhaps the most significant strategic shift is moving from tool memorization to semantic understanding. Previous methods like ToolkenGPT could only use tools they were explicitly trained on, essentially memorizing specific tool tokens. CoTools instead reasons about tool functionality through natural language descriptions.

This parallels the broader evolution in AI toward deeper comprehension rather than pattern matching. In a landscape where specialized “reasoning models” like OpenAI’s o1 and DeepSeek’s R1 are becoming prominent, CoTools aligns with this trend by leveraging semantic understanding to make context-aware tool selection decisions.

3. From Isolated to Integrated Intelligence

The research demonstrates a shift from viewing LLMs as standalone systems to seeing them as orchestrators within a broader ecosystem of specialized tools. This matches the industry’s movement toward agent-based architectures where LLMs serve as central coordinators.

Recent developments show this direction gaining momentum, with numerous frameworks emerging for “LLM-Based Agents: Tool Use, Planning, and Feedback Learning” (GitHub). CoTools provides a crucial capability for these agent systems: the ability to recognize when specialized external capabilities are needed and to integrate them appropriately.

The Competitive Landscape and Implications

CoTools enters a rapidly evolving competitive landscape where the ability to effectively utilize external tools is becoming a critical differentiator for AI systems. Several strategic implications emerge:

1. Accelerating the Agent Revolution

By solving a fundamental bottleneck in tool learning, CoTools could accelerate the development of autonomous AI agents. Current trends show an increasing focus on agent-based architectures, with LLMOps practices and frameworks like LangChain gaining prominence for building tool-using systems. CoTools provides a more scalable foundation for these approaches.

2. Democratizing Advanced Capabilities

The ability to use frozen models has significant implications for accessibility. As specialized AI systems become more common, with models like DeepSeek-R1 focusing on complex reasoning and mathematical problem-solving, CoTools enables these specialized capabilities to be delivered through tool interfaces rather than requiring everyone to develop custom models.

3. Enabling Multimodal Integration

The research connects to emerging trends in multimodal AI. Recent work on “MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning” (arXiv) shows how similar techniques can help models become “conscious of multimodal input instruction and then select the function-matched tool correctly.” CoTools’ fundamental approach of using semantic representations for tool selection could extend beyond text to incorporate visual and audio tools.

4. Addressing Environmental and Economic Concerns

As the AI industry grapples with the environmental impact of increasingly large models, CoTools offers a more sustainable path forward. Instead of creating increasingly large generalist models, CoTools enables specialized capabilities to be added through external tools without retraining. This aligns with growing concerns about “how sustainable that is and what the long-term environmental impact will be on our energy sources” (HatchWorks AI) as models continue to grow.

Future Implications and Research Directions

Looking ahead, CoTools opens several important research avenues that will likely shape the field:

  1. Extension to Reasoning-First Models: As the industry pivots toward reasoning-specialized models like OpenAI’s o1 and DeepSeek-R1, integrating CoTools with these architectures could create even more powerful reasoning agents.
  2. Cross-Modal Tool Learning: Building on research in multimodal tool selection, extending CoTools to understand and select tools that process images, audio, and video represents a natural evolution.
  3. Recursive Tool Composition: Moving beyond single tool selection to creating toolchains that solve complex problems through multi-step processes.
  4. Interpretability Research: The discoveries around key hidden state dimensions that encode tool-relevant information could lead to broader insights about how language models represent conceptual knowledge.

Chain-of-Tools represents a strategic shift toward more modular, extensible, and efficient AI systems. By enabling frozen language models to dynamically interface with an evolving toolkit of external capabilities, CoTools opens the door to AI systems that can continuously expand their abilities without expensive retraining or compromising their core reasoning processes.


Related Articles 

  1. AI Research Agents with MLGym and MLGym-Bench
    This article examines frameworks for training and evaluating AI research agents, highlighting how LLMs can be enhanced through specialized environments. The MLGym architecture shares conceptual similarities with CoTools in wrapping base LLMs while preserving their core capabilities.
  2. Exploring Agentive AI: Understanding its Applications, Benefits, Challenges, and Future Potential
    This piece provides context on how LLM-based agents operate in broader applications, discussing how generative AI models can be enhanced with tool-using capabilities like those demonstrated in Chain-of-Tools.
  3. Large Concept Model (LCM): Advancing Multilingual and Modality-Agnostic AI Understanding
    While focusing on a different architecture, this article explores an alternative approach to extending LLM capabilities by operating in a higher-dimensional embedding space – conceptually related to how CoTools leverages hidden state representations.
  4. TinyTroupe: AI Persona Simulation
    This project demonstrates another way LLMs can be extended with specialized capabilities, modeling realistic personas for diverse applications – complementary to how CoTools adds tool-using abilities to foundation models.
  5. Self-Rewarding Language Models
    This article explores innovative training methods where models generate and evaluate their own training data, representing another approach to enhancing LLM capabilities while maintaining efficiency.

Conclusion: Scaling Tool Learning with Frozen Language Models

Recap of Chain-of-Tools Effectiveness

Chain-of-Tools (CoTools) represents a pivotal advancement in Tool Learning with frozen language models. Unlike previous approaches that required either extensive fine-tuning or inefficient prompt engineering, CoTools offers a solution that preserves the integrity of the foundation model while dramatically enhancing its ability to interface with external tools.

The key innovation lies in CoTools’ architecture, which leverages the rich semantic representations already encoded within frozen LLMs. By attaching lightweight, trainable components to the model’s hidden states, CoTools enables three critical capabilities:

  1. Strategic tool invocation through the Tool Judge module, which determines when a tool should be called during the reasoning process
  2. Contextual tool selection via the Query and Tool Encoders, which match the current reasoning context to the most appropriate tool from a potentially massive pool
  3. Seamless integration of tool outputs back into the Chain-of-Thought reasoning flow

The empirical results across diverse benchmarks demonstrate CoTools’ effectiveness. On arithmetic tasks like GSM8K-XL and FuncQA, it outperforms both prompting-based methods and existing approaches like ToolkenGPT. Knowledge-intensive benchmarks like KAMEL and STQuestions show remarkable generalization abilities, even handling unseen tools with meaningful accuracy.

Most importantly, these capabilities are achieved without any modification to the language model itself. By keeping the foundation model frozen, CoTools preserves its pre-trained knowledge, linguistic fluency, and general reasoning abilities—while adding the crucial ability to recognize when and how to leverage external tools. This represents a significant step toward building truly capable AI agents that combine the strengths of language models with the precision and utility of specialized tools.

Real-World Applications and Future Potential

The implications of CoTools extend far beyond academic benchmarks. As AI systems increasingly move toward autonomous agent architectures, the ability to dynamically interact with external APIs, services, and knowledge sources becomes essential. CoTools provides a practical foundation for applications such as:

  • Enterprise Agents: Systems that can automatically call internal APIs for operations, HR, finance, or analytics workflows without requiring custom integration for each new tool
  • Customer Support Systems: AI assistants that dynamically query product databases, status tools, or CRM systems to provide accurate, up-to-date information
  • Scientific Assistants: Research tools that can leverage specialized computational resources, simulation engines, or data processing frameworks through natural language interfaces
  • Developer Copilots: Programming assistants that interact with build systems, code repositories, or DevOps platforms via contextually appropriate tool invocation
  • Search and Retrieval Agents: Knowledge workers that seamlessly integrate information from knowledge graphs, retrieval APIs, and web interfaces

What makes CoTools particularly compelling for real-world deployment is its modular extensibility. New tools can be introduced by simply updating the tool description catalog—no retraining, fine-tuning, or prompt engineering is required. This significantly reduces the engineering overhead typically associated with expanding an AI system’s capabilities and allows applications to evolve alongside changing business needs and technical environments.
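As a sketch of what that extensibility could look like in code (the ToolRegistry class, the llm.encode hook, and the example registration are all hypothetical):

```python
import torch

class ToolRegistry:
    """Hypothetical catalog: adding a tool only requires encoding its description."""
    def __init__(self, tool_encoder, llm):
        self.tool_encoder, self.llm = tool_encoder, llm
        self.tools, self.vectors = [], []

    def register(self, name: str, description: str, executor):
        # Encode the natural-language description once; no retraining needed.
        hidden = self.llm.encode(description)          # assumed hidden-state hook
        self.tools.append({"name": name, "description": description, "run": executor})
        self.vectors.append(self.tool_encoder(hidden))

    def retrieve(self, query_vector: torch.Tensor):
        scores = torch.stack(self.vectors) @ query_vector
        return self.tools[int(scores.argmax())]

# Example (hypothetical): expose an internal CRM lookup as a new tool.
# registry.register("crm_lookup", "Return the account owner for a customer ID.",
#                   executor=lambda args: crm_api.lookup(args))
```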

The system’s demonstrated ability to generalize to unseen tools positions it as a practical foundation for agents that can learn to use tools on the fly—a critical requirement for dynamic task automation and effective human-AI collaboration in environments where available tools change frequently.

Addressing Limitations and Future Research in Tool Learning

Despite its significant advances, CoTools opens the door to several important areas of future exploration in Tool Learning with frozen language models:

Limitations

  1. Handling Multi-Return Tools: The current implementation focuses primarily on tools that return a single value. In practice, real-world APIs often return structured, multi-field responses requiring more sophisticated integration into the reasoning process. The researchers propose a Return Value Encoder as a solution, but this remains an untested component that would benefit from systematic evaluation.
  2. Dataset Availability and Coverage: While STQuestions represents a significant contribution to benchmarking tool learning, the field still lacks large-scale, diverse, and truly realistic tool learning datasets. Future research should focus on curating datasets that simulate real business workflows, multimodal tool usage, and decision-making across tools with similar or overlapping descriptions.
  3. Tool Chaining and Multi-Step Invocation: Current experiments primarily evaluate single tool calls per reasoning step. A promising direction involves orchestrating multiple tools within a single Chain-of-Thought trace, requiring planning, intermediate validation, and dynamic state updates as the reasoning progresses.
  4. Cross-Model Generalization: An open question is how well the retriever and judge modules trained with one base model (e.g., LLaMA2) transfer to others (e.g., Mistral, Mixtral, Claude, or GPT-4). Building model-agnostic tool learning layers could accelerate development across model ecosystems and improve efficiency.
  5. Security and Governance of Tool Access: As LLM agents become capable of invoking arbitrary tools, future systems must incorporate robust controls around tool authorization, auditing, rate-limiting, and behavior bounding—especially for sensitive or regulated domains where tool misuse could have significant consequences.

Addressing these open questions will be essential to evolving CoTools from a powerful research prototype into a production-ready tool layer for the next generation of intelligent agents. Nevertheless, the framework represents a significant step forward in creating language models that can effectively reason about when and how to use tools while maintaining their core capabilities—bringing us closer to truly helpful AI assistants that can interact meaningfully with the digital and physical world.


Key Links

Research Paper: Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models

Authors: Mengsong Wu, Tong Zhu, Han Han, Xiang Zhang, Wenbiao Shao, Wenliang Chen

