Evaluation workflow for GitHub Actions #2350


Merged 40 commits into Azure-Samples:main from formain on Feb 11, 2025

Conversation

pamelafox
Collaborator

@pamelafox pamelafox commented Feb 11, 2025

Purpose

This PR introduces a new workflow for evaluating the RAG answer flow, triggered by issue comments and configured with extensive Azure environment variables. The developer must first run azd pipeline config before the workflow will work.

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial
which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[X] Yes - The evaluation tutorial
[ ] No

Type of change

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works.
  • I ran python -m pytest --cov to verify 100% coverage of added lines.
  • I ran python -m mypy to check for type errors.
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

-AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE", "en-us")
-AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER", "lexicon")
+AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE") or "en-us"
+AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER") or "lexicon"
Collaborator Author

This was the fix for the weird search error

@@ -420,7 +420,7 @@ async def setup_clients():
 OPENAI_HOST = os.getenv("OPENAI_HOST", "azure")
 OPENAI_CHATGPT_MODEL = os.environ["AZURE_OPENAI_CHATGPT_MODEL"]
 OPENAI_EMB_MODEL = os.getenv("AZURE_OPENAI_EMB_MODEL_NAME", "text-embedding-ada-002")
-OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
+OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
Collaborator Author

This also results in a confusing error if the env variable is set to an empty string: os.getenv only falls back to the default when the variable is unset, so the empty string is passed to int() and raises a ValueError.
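A minimal sketch of the failure mode and the fix from the diff above (the empty-string assignment is a hypothetical setup for illustration): `os.getenv(key, default)` only uses the default when the variable is unset, while the `or` form treats an empty string as falsy and applies the fallback.

```python
import os

# Hypothetical setup: the variable exists but is set to an empty string,
# as can happen with azd-populated environment files.
os.environ["AZURE_OPENAI_EMB_DIMENSIONS"] = ""

# Default-argument form: "" is returned because the variable IS set,
# and int("") raises ValueError ("invalid literal for int()").
try:
    dimensions = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
except ValueError:
    dimensions = None

# `or` form: "" is falsy, so the fallback value is used instead.
dimensions_fixed = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
print(dimensions_fixed)  # 1536
```

The same reasoning applies to the AZURE_SEARCH_QUERY_LANGUAGE and AZURE_SEARCH_QUERY_SPELLER settings changed earlier in this PR: an empty string passed through to the search client instead of the intended default.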

@pamelafox pamelafox changed the title WIP: Evaluation workflow for GitHub Actions Evaluation workflow for GitHub Actions Feb 11, 2025
Contributor

@Copilot Copilot AI left a comment

Copilot reviewed 7 out of 15 changed files in this pull request and generated no comments.

Files not reviewed (8)
  • evals/evaluate_config.json: Language not supported
  • evals/results/baseline/config.json: Language not supported
  • evals/results/baseline/evaluate_parameters.json: Language not supported
  • evals/results/baseline/summary.json: Language not supported
  • evals/results/gpt-4o-mini/config.json: Language not supported
  • evals/results/gpt-4o-mini/evaluate_parameters.json: Language not supported
  • evals/results/gpt-4o-mini/summary.json: Language not supported
  • .github/workflows/azure-dev.yml: Evaluated as low risk
Comments suppressed due to low confidence (2)

evals/evaluate.py:24

  • [nitpick] The error message could be more descriptive. Suggest changing to: 'Received a None response, unable to compute any_citation metric. Defaulting to -1.'
logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")

evals/evaluate.py:17

  • Ensure that the new AnyCitationMetric class and its behavior are covered by tests.
class AnyCitationMetric(BaseMetric):

@pamelafox
Collaborator Author

Example PR using the /evaluate workflow:
pamelafox#5

* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `eval_config.json`.
* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `eval_config.json`.

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions.
Collaborator

tokens allocated to deployment also affect this, right?

Collaborator Author

Yes! I'll add a note.

if response is None:
    logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")
    return {cls.METRIC_NAME: -1}
return {cls.METRIC_NAME: bool(re.search(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))}
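A small standalone demonstration of the citation regex from the snippet above (the `has_citation` helper and the sample strings are hypothetical, added here only to show what the pattern matches): a bracketed filename with a 3–4 character extension and an optional `#page=N` anchor.

```python
import re

# Pattern copied from the any_citation metric above:
# [filename.ext] or [filename.ext#page=N]
CITATION_PATTERN = r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]"

def has_citation(response: str) -> bool:
    # True when the response contains at least one bracketed file citation.
    return bool(re.search(CITATION_PATTERN, response))

print(has_citation("See [benefits.pdf#page=3] for details."))  # True
print(has_citation("Covered in [handbook.docx]."))             # True
print(has_citation("No citation in this answer."))             # False
```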
Collaborator

nice regex 👍

@pamelafox pamelafox merged commit e873ba9 into Azure-Samples:main Feb 11, 2025
15 checks passed
@pamelafox pamelafox deleted the formain branch February 11, 2025 08:16
dfl-aeb pushed a commit to dfl-aeb/azure-search-openai-demo that referenced this pull request Feb 19, 2025
* Add evaluation workflow

* Trying to trigger workflow

* Remove conditional

* Update workflow

* Add back old python file

* New branch for eval

* Fix uv

* Remove python tests for now

* New PR for eval

* Add debug

* Add workflow dispatch

* Add workflow dispatch

* Remove comment for now

* Add workflow push

* Add checkout

* Try azd env new first

* Try refresh

* Add env config

* Fix the action vars

* Fix local server start

* Fix app run

* logs pos

* Run app directly

* nohup

* Log more

* Logger calls

* Fix log calls

* Remove empty string values

* Ask less questions

* Evaluate all questions

* Base on comment

* Base on comment

* Revert unneeded changes

* Add note, link eval docs in more places, link to videos