Evaluation workflow for GitHub Actions #2350
Conversation
Add evaluation workflow
New branch for eval
New PR for eval
-AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE", "en-us")
-AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER", "lexicon")
+AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE") or "en-us"
+AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER") or "lexicon"
This was the fix for the weird search error
@@ -420,7 +420,7 @@ async def setup_clients():
 OPENAI_HOST = os.getenv("OPENAI_HOST", "azure")
 OPENAI_CHATGPT_MODEL = os.environ["AZURE_OPENAI_CHATGPT_MODEL"]
 OPENAI_EMB_MODEL = os.getenv("AZURE_OPENAI_EMB_MODEL_NAME", "text-embedding-ada-002")
-OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
+OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
This also results in a bad error if the env variable is an empty string
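For reference, a minimal sketch of the difference (the empty-string value below only illustrates what a CI pipeline can export; it's not from this repo's config):

```python
import os

# Simulate a pipeline that exports the variable as an empty string.
os.environ["AZURE_OPENAI_EMB_DIMENSIONS"] = ""

# getenv's default only applies when the variable is unset, so the empty
# string passes through and int("") raises ValueError.
try:
    dimensions = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
except ValueError as err:
    print(f"old pattern fails: {err}")

# `or` treats the empty string as falsy and falls back to the default.
dimensions = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
print(dimensions)  # 1536
```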
Copilot reviewed 7 out of 15 changed files in this pull request and generated no comments.
Files not reviewed (8)
- evals/evaluate_config.json: Language not supported
- evals/results/baseline/config.json: Language not supported
- evals/results/baseline/evaluate_parameters.json: Language not supported
- evals/results/baseline/summary.json: Language not supported
- evals/results/gpt-4o-mini/config.json: Language not supported
- evals/results/gpt-4o-mini/evaluate_parameters.json: Language not supported
- evals/results/gpt-4o-mini/summary.json: Language not supported
- .github/workflows/azure-dev.yml: Evaluated as low risk
Comments suppressed due to low confidence (2)
evals/evaluate.py:24
- [nitpick] The error message could be more descriptive. Suggest changing to: 'Received a None response, unable to compute any_citation metric. Defaulting to -1.'
logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")
evals/evaluate.py:17
- Ensure that the new AnyCitationMetric class and its behavior are covered by tests.
class AnyCitationMetric(BaseMetric):
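Picking up Copilot's suggestion, here is a rough pytest sketch of what that coverage could look like; the import path, the `evaluator()` entry point, and the `any_citation` key are assumptions inferred from the snippets in this PR, not the class's actual API:

```python
# Hypothetical tests; adapt the import and evaluator construction to the
# real AnyCitationMetric interface in evals/evaluate.py.
from evals.evaluate import AnyCitationMetric


def test_any_citation_true_for_cited_response():
    evaluator = AnyCitationMetric.evaluator()  # assumed entry point
    result = evaluator(response="Benefits are in [employee_handbook.pdf#page=3].")
    assert result == {"any_citation": True}


def test_any_citation_false_without_citation():
    evaluator = AnyCitationMetric.evaluator()
    result = evaluator(response="No citation in this answer.")
    assert result == {"any_citation": False}


def test_any_citation_minus_one_for_none_response():
    evaluator = AnyCitationMetric.evaluator()
    result = evaluator(response=None)
    assert result == {"any_citation": -1}
```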
Example PR using the /evaluate workflow:
docs/evaluation.md (outdated)
* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `eval_config.json`.
* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `eval_config.json`.
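For illustration, a run that overrides both options might look like `python evals/evaluate.py --targeturl=http://localhost:50505 --resultsdir=evals/results/experiment1`; the exact flag syntax is an assumption here, so check docs/evaluation.md for the authoritative invocation.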
🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions. |
tokens allocated to deployment also affect this, right?
Yes! I'll add a note.
if response is None:
    logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")
    return {cls.METRIC_NAME: -1}
return {cls.METRIC_NAME: bool(re.search(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))}
nice regex 👍
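For anyone skimming, a small illustration of what that pattern accepts (these example strings are mine, not from the PR's test data):

```python
import re

# Citation pattern from this PR: a bracketed filename with a 3-4 character
# extension and an optional #page anchor, e.g. [handbook.pdf#page=2].
CITATION_RE = r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]"

assert re.search(CITATION_RE, "Benefits are listed in [employee_handbook.pdf#page=3].")
assert re.search(CITATION_RE, "See [notes.txt] for details.")
assert not re.search(CITATION_RE, "This answer cites nothing.")
```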
* Add evaluation workflow
* Trying to trigger workflow
* Remove conditional
* Update workflow
* Add back old python file
* New branch for eval
* Fix uv
* Remove python tests for now
* New PR for eval
* Add debug
* Add workflow dispatch
* Add workflow dispatch
* Remove comment for now
* Add workflow push
* Add checkout
* Try azd env new first
* Try refresh
* Add env config
* Fix the action vars
* Fix local server start
* Fix app run
* logs pos
* Run app directly
* nohup
* Log more
* Logger calls
* Fix log calls
* Remove empty string values
* Ask less questions
* Evaluate all questions
* Base on comment
* Base on comment
* Revert unneeded changes
* Add note, link eval docs in more places, link to videos
Purpose
This PR introduces a new workflow for evaluating the RAG answer flow, triggered by issue comments and configured with extensive Azure environment variables. The developer must first run `azd pipeline config` before the workflow will work.

Does this introduce a breaking change?
When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.
Does this require changes to learn.microsoft.com docs?
This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial, check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.
Type of change
Code quality checklist
See CONTRIBUTING.md for more details.
* The current tests all pass (`python -m pytest`).
* I ran `python -m pytest --cov` to verify 100% coverage of added lines.
* I ran `python -m mypy` to check for type errors.
* I ran `ruff` and `black` manually on my code.