tool calling eval #256
base: main
Conversation
…xpected to trigger its tool call.
…ests LlamaStack models against various tools and generates detailed performance plots, including per-tool analysis.
…uation. These tools enable structured metric tracking to CSV and dynamic visualization of performance through generated plots.
Added newline to end of queries.json
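For context, the CSV metric tracking and per-tool plotting those commits describe could look like the minimal sketch below. The (model, tool, correct) column layout, file names, and function names are assumptions for illustration, not this PR's actual code.

```python
# Hypothetical sketch of CSV metric tracking plus per-tool plotting.
# Column layout, file names, and function names are assumptions,
# not this PR's actual code.
import csv

import matplotlib.pyplot as plt


def append_metric(path: str, model: str, tool: str, correct: bool) -> None:
    """Append one eval result as a (model, tool, correct) CSV row."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([model, tool, int(correct)])


def plot_per_tool(path: str) -> None:
    """Bar-plot tool-selection accuracy per tool from the CSV log."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    with open(path, newline="") as f:
        for _model, tool, correct in csv.reader(f):
            totals[tool] = totals.get(tool, 0) + 1
            hits[tool] = hits.get(tool, 0) + int(correct)
    tools = sorted(totals)
    plt.bar(tools, [hits[t] / totals[t] for t in tools])
    plt.ylabel("tool-selection accuracy")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig("per_tool_accuracy.png")
```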
Force-pushed from 3785414 to 564c494
tests/tool_calling/src/tools.py
Outdated
:description: Adds two numbers.
:use_case: Use when the user wants to find the sum, total, or combined value of two numbers.
I'm not sure if you need the :description:, :use_case:, or :returns: tags (at least according to the client tool decorator docstring). Have you found this to be superior to a plain docstring function description?
At first I added the :description: and :use_case: tags because I thought they would be needed; I hadn't looked at the docs. But yeah, looking at client_tool.py, it dumps the return information on purpose, so :returns: won't be needed. From limited experimentation, :description: and :use_case: seem to slightly increase trivial tool accuracy.
experiment.pdf
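For reference, a tool written in the style the thread settles on (standard :param: fields plus the extra :description: and :use_case: tags, with :returns: omitted) might look like the sketch below; the function itself is hypothetical, not taken from the PR.

```python
# Hypothetical tool in the docstring style under discussion: :description:
# and :use_case: tags added, :returns: omitted since client_tool.py already
# includes return information. Not code from this PR.
def add_numbers(a: float, b: float) -> float:
    """
    Adds two numbers.

    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum, total, or
        combined value of two numbers.
    :param a: The first number to add.
    :param b: The second number to add.
    """
    return a + b
```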
Force-pushed from ad5f2bd to 3f3142a
Force-pushed from 3f3142a to e9dd497
Description
Used Gemini to create 40 simple functions that will be passed as tools. Created a queries.json which has 15 queries per function, used as an "eval set" to determine tool-selection accuracy. The code is similar to tests/eval_tests/tests.py, but after running eval.py it shows successful calls per tool rather than overall (see the sketch below).
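A minimal sketch of that per-tool tally, assuming queries.json maps each tool name to its list of queries; run_query() is a hypothetical placeholder for the model call, not this PR's actual code.

```python
# Hypothetical sketch of the per-tool accuracy tally described above.
# The queries.json layout (tool name -> list of queries) and run_query()
# are assumptions, not this PR's actual code.
import json
from collections import defaultdict


def run_query(query: str) -> str:
    """Placeholder: send the query to the model, return the tool it called."""
    raise NotImplementedError


def per_tool_accuracy(path: str = "queries.json") -> dict[str, float]:
    with open(path) as f:
        queries = json.load(f)  # e.g. {"add_numbers": ["What is 2 + 3?", ...]}
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for tool, tool_queries in queries.items():
        for q in tool_queries:
            totals[tool] += 1
            if run_query(q) == tool:  # correct tool was selected
                hits[tool] += 1
    # Report accuracy per tool, not a single overall number.
    return {tool: hits[tool] / totals[tool] for tool in totals}
```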
How Has This Been Tested?
Merge criteria: