tool calling eval #256
base: main
Conversation
…xpected to trigger its tool call.
…ests LlamaStack models against various tools and generates detailed performance plots, including per-tool analysis.
…uation. These tools enable structured metric tracking to CSV and dynamic visualization of performance through generated plots.
Added newline to end of queries.json
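For context, the CSV metric tracking and per-tool plotting those commits describe could look like the minimal sketch below. The (model, tool, correct) column layout, file names, and function names are assumptions for illustration, not this PR's actual code.

```python
# Hypothetical sketch of CSV metric tracking plus per-tool plotting.
# Column layout, file names, and function names are assumptions,
# not this PR's actual code.
import csv

import matplotlib.pyplot as plt


def append_metric(path: str, model: str, tool: str, correct: bool) -> None:
    """Append one eval result as a (model, tool, correct) CSV row."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([model, tool, int(correct)])


def plot_per_tool(path: str) -> None:
    """Bar-plot tool-selection accuracy per tool from the CSV log."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    with open(path, newline="") as f:
        for _model, tool, correct in csv.reader(f):
            totals[tool] = totals.get(tool, 0) + 1
            hits[tool] = hits.get(tool, 0) + int(correct)
    tools = sorted(totals)
    plt.bar(tools, [hits[t] / totals[t] for t in tools])
    plt.ylabel("tool-selection accuracy")
    plt.xticks(rotation=90)
    plt.tight_layout()
    plt.savefig("per_tool_accuracy.png")
```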
Force-pushed from 3785414 to 564c494
tests/tool_calling/src/tools.py
Outdated
:description: Adds two numbers.
:use_case: Use when the user wants to find the sum, total, or combined value of two numbers.
I'm not sure if you need the :description:, :use_case:, or :returns: tags (at least according to the client tool decorator docstring). Have you found this to be superior to a plain docstring function description?
At first I added the :description: and :use_case: tags because I thought they would be needed; I hadn't looked at the docs. But yeah, looking at client_tool.py, it dumps the return information on purpose, so :returns: won't be needed. From limited experimentation, :description: and :use_case: seem to slightly increase trivial tool accuracy.
experiment.pdf
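For reference, a tool written in the style the thread settles on (standard :param: fields plus the extra :description: and :use_case: tags, with :returns: omitted) might look like the sketch below; the function itself is hypothetical, not taken from the PR.

```python
# Hypothetical tool in the docstring style under discussion: :description:
# and :use_case: tags added, :returns: omitted since client_tool.py already
# includes return information. Not code from this PR.
def add_numbers(a: float, b: float) -> float:
    """
    Adds two numbers.

    :description: Adds two numbers.
    :use_case: Use when the user wants to find the sum, total, or
        combined value of two numbers.
    :param a: The first number to add.
    :param b: The second number to add.
    """
    return a + b
```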
Force-pushed from ad5f2bd to 3f3142a
Force-pushed from 3f3142a to e9dd497
Description
Used Gemini to create 40 simple functions that will be passed as tools. Created a queries.json which has 15 queries per function, used as an "eval set" to determine tool-selection accuracy. The code is similar to tests/eval_tests/tests.py, but after running eval.py it shows successful calls per tool rather than overall (see the sketch below).
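A minimal sketch of that per-tool tally, assuming queries.json maps each tool name to its list of queries; run_query() is a hypothetical placeholder for the model call, not this PR's actual code.

```python
# Hypothetical sketch of the per-tool accuracy tally described above.
# The queries.json layout (tool name -> list of queries) and run_query()
# are assumptions, not this PR's actual code.
import json
from collections import defaultdict


def run_query(query: str) -> str:
    """Placeholder: send the query to the model, return the tool it called."""
    raise NotImplementedError


def per_tool_accuracy(path: str = "queries.json") -> dict[str, float]:
    with open(path) as f:
        queries = json.load(f)  # e.g. {"add_numbers": ["What is 2 + 3?", ...]}
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for tool, tool_queries in queries.items():
        for q in tool_queries:
            totals[tool] += 1
            if run_query(q) == tool:  # correct tool was selected
                hits[tool] += 1
    # Report accuracy per tool, not a single overall number.
    return {tool: hits[tool] / totals[tool] for tool in totals}
```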
How Has This Been Tested?
Merge criteria: