Evaluation workflow for GitHub Actions #2350


Merged 40 commits into Azure-Samples:main from formain on Feb 11, 2025

Conversation

pamelafox
Collaborator

@pamelafox pamelafox commented Feb 11, 2025

Purpose

This PR introduces a new workflow for evaluating the RAG answer flow, triggered by issue comments and configured with extensive Azure environment variables. The developer must first run azd pipeline config before the workflow will work.

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[X] Yes - The evaluation tutorial
[ ] No

Type of change

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works.
  • I ran python -m pytest --cov to verify 100% coverage of added lines.
  • I ran python -m mypy to check for type errors.
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

-AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE", "en-us")
-AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER", "lexicon")
+AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE") or "en-us"
+AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER") or "lexicon"
Collaborator Author

This was the fix for the weird search error

@@ -420,7 +420,7 @@ async def setup_clients():
 OPENAI_HOST = os.getenv("OPENAI_HOST", "azure")
 OPENAI_CHATGPT_MODEL = os.environ["AZURE_OPENAI_CHATGPT_MODEL"]
 OPENAI_EMB_MODEL = os.getenv("AZURE_OPENAI_EMB_MODEL_NAME", "text-embedding-ada-002")
-OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
+OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
Collaborator Author

This also results in a confusing error if the env variable is set to an empty string: os.getenv only falls back to the default when the variable is unset, so the empty string is passed to int() and raises a ValueError.
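A minimal sketch of the failure mode and the fix from the diff above (the empty-string assignment is a hypothetical setup for illustration): `os.getenv(key, default)` only uses the default when the variable is unset, while the `or` form treats an empty string as falsy and applies the fallback.

```python
import os

# Hypothetical setup: the variable exists but is set to an empty string,
# as can happen with azd-populated environment files.
os.environ["AZURE_OPENAI_EMB_DIMENSIONS"] = ""

# Default-argument form: "" is returned because the variable IS set,
# and int("") raises ValueError ("invalid literal for int()").
try:
    dimensions = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
except ValueError:
    dimensions = None

# `or` form: "" is falsy, so the fallback value is used instead.
dimensions_fixed = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
print(dimensions_fixed)  # 1536
```

The same reasoning applies to the AZURE_SEARCH_QUERY_LANGUAGE and AZURE_SEARCH_QUERY_SPELLER settings changed earlier in this PR: an empty string passed through to the search client instead of the intended default.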

@pamelafox pamelafox changed the title WIP: Evaluation workflow for GitHub Actions Evaluation workflow for GitHub Actions Feb 11, 2025
Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 7 out of 15 changed files in this pull request and generated no comments.

Files not reviewed (8)
  • evals/evaluate_config.json: Language not supported
  • evals/results/baseline/config.json: Language not supported
  • evals/results/baseline/evaluate_parameters.json: Language not supported
  • evals/results/baseline/summary.json: Language not supported
  • evals/results/gpt-4o-mini/config.json: Language not supported
  • evals/results/gpt-4o-mini/evaluate_parameters.json: Language not supported
  • evals/results/gpt-4o-mini/summary.json: Language not supported
  • .github/workflows/azure-dev.yml: Evaluated as low risk
Comments suppressed due to low confidence (2)

evals/evaluate.py:24

  • [nitpick] The error message could be more descriptive. Suggest changing to: 'Received a None response, unable to compute any_citation metric. Defaulting to -1.'
logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")

evals/evaluate.py:17

  • Ensure that the new AnyCitationMetric class and its behavior are covered by tests.
class AnyCitationMetric(BaseMetric):

@pamelafox
Collaborator Author

Example PR using the /evaluate workflow:
pamelafox#5

* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `eval_config.json`.
* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `eval_config.json`.

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions.
Collaborator

tokens allocated to deployment also affect this, right?

Collaborator Author

Yes! I'll add a note.

if response is None:
    logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")
    return {cls.METRIC_NAME: -1}
return {cls.METRIC_NAME: bool(re.search(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))}
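A small standalone demonstration of the citation regex from the snippet above (the `has_citation` helper and the sample strings are hypothetical, added here only to show what the pattern matches): a bracketed filename with a 3–4 character extension and an optional `#page=N` anchor.

```python
import re

# Pattern copied from the any_citation metric above:
# [filename.ext] or [filename.ext#page=N]
CITATION_PATTERN = r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]"

def has_citation(response: str) -> bool:
    # True when the response contains at least one bracketed file citation.
    return bool(re.search(CITATION_PATTERN, response))

print(has_citation("See [benefits.pdf#page=3] for details."))  # True
print(has_citation("Covered in [handbook.docx]."))             # True
print(has_citation("No citation in this answer."))             # False
```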
Collaborator

nice regex 👍

@pamelafox pamelafox merged commit e873ba9 into Azure-Samples:main Feb 11, 2025
15 checks passed
@pamelafox pamelafox deleted the formain branch February 11, 2025 08:16
dfl-aeb pushed a commit to dfl-aeb/azure-search-openai-demo that referenced this pull request Feb 19, 2025
* Add evaluation workflow

* Trying to trigger workflow

* Remove conditional

* Update workflow

* Add back old python file

* New branch for eval

* Fix uv

* Remove python tests for now

* New PR for eval

* Add debug

* Add workflow dispatch

* Add workflow dispatch

* Remove comment for now

* Add workflow push

* Add checkout

* Try azd env new first

* Try refresh

* Add env config

* Fix the action vars

* Fix local server start

* Fix app run

* logs pos

* Run app directly

* nohup

* Log more

* Logger calls

* Fix log calls

* Remove empty string values

* Ask less questions

* Evaluate all questions

* Base on comment

* Base on comment

* Revert unneeded changes

* Add note, link eval docs in more places, link to videos