In agentic AI, tool calling is what turns natural language into executable software actions. Last month, we released FunctionGemma, a specialized version of our Gemma 3 270M model explicitly fine-tuned for function calling. It is designed for developers building fast, cost-effective agents that map user requests to API calls.

Specific applications often require specialist models. In this post, we demonstrate how to fine-tune FunctionGemma to handle tool selection ambiguity: cases where a model must choose between two or more seemingly similar functions. We also introduce the "FunctionGemma Tuning Lab", a demo tool that makes this process accessible without writing a single line of training code.

Why Fine-Tune for Tool Calling?

If FunctionGemma already supports tool calling, why is fine-tuning necessary?

The answer lies in context and policy. A generic model doesn't know your business rules. Common use cases for fine-tuning include:

Resolving Selection Ambiguity: If a user asks, "What is the travel policy?", a base model might default to a Google search. An enterprise model, however, should search the internal knowledge base.

Ultra-Specialization: You can train a model to master niche tasks or proprietary formats not found in public data, such as handling domain-specific mobile actions (e.g., controlling device features) or parsing internal APIs to construct highly complex regulatory reports.

Model Distillation: You can use a large model to generate synthetic training data, then fine-tune a smaller, faster model to run that specific workflow efficiently.

The Case Study: Internal Docs vs. Google Search

Let's look at a practical example from the technical guide on fine-tuning FunctionGemma using the Hugging Face TRL library.

The Challenge

The goal was to train a model to distinguish between two specific tools:

search_knowledge_base (Internal documents)

search_google (Public information)

When asked "What are the best practices for writing a simple recursive function in Python?", a generic model defaults to Google. However, for a query like "What is the reimbursement limit for travel meals?", the model needs to know that this is an internal policy question.
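To make the two-tool setup concrete, here is a minimal sketch of how the tools might be declared in the JSON-schema style commonly used for function declarations. Only the tool names come from this post; the descriptions, the `query` parameter, and the `make_search_tool` helper are illustrative assumptions, not the schema used in the guide or the dataset.

```python
# Hypothetical tool declarations. Only the two tool names come from the post;
# the descriptions and the `query` parameter are assumptions for illustration.

def make_search_tool(name: str, description: str) -> dict:
    """Build a JSON-schema-style declaration for a single-argument search tool."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query text.",
                    }
                },
                "required": ["query"],
            },
        },
    }


TOOLS = [
    make_search_tool(
        "search_knowledge_base",
        "Search internal company documents, such as policies and procedures.",
    ),
    make_search_tool(
        "search_google",
        "Search the public web for general information.",
    ),
]
```

Because both tools take the same single argument, the only signal that separates them is the description and whatever the model has learned about which questions are internal. That is the ambiguity fine-tuning is meant to resolve.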

The Solution

To evaluate performance, we used the bebechien/SimpleToolCalling dataset, which contains sample conversations requiring a choice between two tools: search_knowledge_base and search_google.

This dataset is split into training and testing sets. We keep the test set separate so we can evaluate the model on "unseen" data, ensuring it learns the underlying routing logic rather than just memorizing specific examples.
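As a rough illustration of holding out a test set, the sketch below splits a small list of examples using only the Python standard library. The `split_examples` helper, the record layout, the fixed seed, and the choice to shuffle before splitting are all assumptions made for this example; apart from the two queries quoted earlier, the records are made up and do not come from the dataset.

```python
import random


def split_examples(examples, test_fraction=0.5, seed=42):
    """Shuffle a copy of `examples`, then split it into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records: the first two queries are quoted in this post, the rest are made up.
examples = [
    {"query": "What is the reimbursement limit for travel meals?",
     "tool": "search_knowledge_base"},
    {"query": "What are the best practices for writing a simple recursive function in Python?",
     "tool": "search_google"},
    {"query": "How do I request a new laptop?",
     "tool": "search_knowledge_base"},
    {"query": "What is the capital of Australia?",
     "tool": "search_google"},
]

train_set, test_set = split_examples(examples)
```

The test examples never appear in the training list, so accuracy measured on them reflects how well the routing logic generalizes rather than how well specific queries were memorized.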

When we evaluated the base FunctionGemma model using a 50/50 split between training and testing, the results were suboptimal. The base model either chose the wrong tool or offered to "discuss" the policy rather than executing the function call.
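One simple way to quantify this kind of failure is tool-selection accuracy: the fraction of test examples where the model called the expected tool. The sketch below is a hypothetical scoring helper, not the evaluation code from the guide. It treats a text-only reply (no function call) as `None`, which counts as a miss. The values in the usage example are toy inputs chosen to show both failure modes, not measured results.

```python
def tool_selection_accuracy(predicted, expected):
    """Return the fraction of examples where the predicted tool matches the expected one.

    A prediction of None means the model replied in text without calling any tool,
    so it can never match an expected tool name.
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


# Toy inputs for illustration only (not results from the post).
expected = ["search_knowledge_base", "search_google",
            "search_knowledge_base", "search_google"]
predicted = ["search_google",   # wrong tool chosen
             "search_google",   # correct
             None,              # model offered to "discuss" instead of calling a tool
             "search_google"]   # correct

accuracy = tool_selection_accuracy(predicted, expected)
```

Scoring both failure modes as misses keeps the metric focused on the behavior that matters for an agent: whether the right function call was actually emitted.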

⚠️ A Critical Note on Data Distribution

When preparing your dataset, how you split your data is just as important as the data itself.