Evaluation workflow for GitHub Actions #2350


Merged Feb 11, 2025 · 40 commits (changes shown from 38 commits)

Commits:
d593dc2
Add evaluation workflow
pamelafox Feb 10, 2025
a7bbd8b
Trying to trigger workflow
pamelafox Feb 10, 2025
d7bdddd
Remove conditional
pamelafox Feb 10, 2025
a8b2beb
Update workflow
pamelafox Feb 10, 2025
9fbf046
Add back old python file
pamelafox Feb 10, 2025
bfb1b9d
Merge pull request #1 from pamelafox/evalsci
pamelafox Feb 10, 2025
f28283c
New branch for eval
pamelafox Feb 10, 2025
f6dd98b
Fix uv
pamelafox Feb 10, 2025
0cac252
Remove python tests for now
pamelafox Feb 10, 2025
49b66e5
Merge pull request #2 from pamelafox/evalsci
pamelafox Feb 10, 2025
e564b24
New PR for eval
pamelafox Feb 10, 2025
a2c8469
Add debug
pamelafox Feb 10, 2025
9916bfc
Add workflow dispatch
pamelafox Feb 10, 2025
d7cc6fa
Merge pull request #3 from pamelafox/evalsci
pamelafox Feb 10, 2025
934129c
Add workflow dispatch
pamelafox Feb 10, 2025
68f9abe
Remove comment for now
pamelafox Feb 10, 2025
7c95d88
Add workflow push
pamelafox Feb 10, 2025
7b022b8
Add checkout
pamelafox Feb 10, 2025
f932ef9
Try azd env new first
pamelafox Feb 10, 2025
550ee3f
Try refresh
pamelafox Feb 10, 2025
feb7a00
Add env config
pamelafox Feb 10, 2025
36121f6
Fix the action vars
pamelafox Feb 10, 2025
1a3e00e
Fix local server start
pamelafox Feb 10, 2025
1050b50
Fix app run
pamelafox Feb 10, 2025
d07c263
logs pos
pamelafox Feb 10, 2025
f11813f
Run app directly
pamelafox Feb 10, 2025
a076539
nohup
pamelafox Feb 10, 2025
182c310
Log more
pamelafox Feb 10, 2025
13b3f78
Logger calls
pamelafox Feb 10, 2025
340a411
Fix log calls
pamelafox Feb 10, 2025
86bd5eb
Remove empty string values
pamelafox Feb 10, 2025
c51afed
Ask less questions
pamelafox Feb 10, 2025
a197f3c
Evaluate all questions
pamelafox Feb 10, 2025
062e9b8
Base on comment
pamelafox Feb 10, 2025
c4861fe
Base on comment
pamelafox Feb 10, 2025
d7b105d
Merge pull request #4 from pamelafox/evalsci
pamelafox Feb 10, 2025
f4a7334
Revert unneeded changes
pamelafox Feb 11, 2025
8d28207
Merge branch 'main' into formain
pamelafox Feb 11, 2025
c7dae8e
Add note, link eval docs in more places, link to videos
pamelafox Feb 11, 2025
b4f30eb
Merge branch 'formain' of https://github.com/pamelafox/azure-search-o…
pamelafox Feb 11, 2025
6 changes: 6 additions & 0 deletions .github/workflows/azure-dev.yml
@@ -69,6 +69,12 @@ jobs:
AZURE_OPENAI_GPT4V_DEPLOYMENT_CAPACITY: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT_CAPACITY }}
AZURE_OPENAI_GPT4V_DEPLOYMENT_VERSION: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT_VERSION }}
AZURE_OPENAI_GPT4V_DEPLOYMENT_SKU: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT_SKU }}
USE_EVAL: ${{ vars.USE_EVAL }}
AZURE_OPENAI_EVAL_MODEL: ${{ vars.AZURE_OPENAI_EVAL_MODEL }}
AZURE_OPENAI_EVAL_MODEL_VERSION: ${{ vars.AZURE_OPENAI_EVAL_MODEL_VERSION }}
AZURE_OPENAI_EVAL_DEPLOYMENT: ${{ vars.AZURE_OPENAI_EVAL_DEPLOYMENT }}
AZURE_OPENAI_EVAL_DEPLOYMENT_SKU: ${{ vars.AZURE_OPENAI_EVAL_DEPLOYMENT_SKU }}
AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY: ${{ vars.AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY }}
AZURE_OPENAI_DISABLE_KEYS: ${{ vars.AZURE_OPENAI_DISABLE_KEYS }}
OPENAI_HOST: ${{ vars.OPENAI_HOST }}
OPENAI_API_KEY: ${{ vars.OPENAI_API_KEY }}
243 changes: 243 additions & 0 deletions .github/workflows/evaluate.yaml
@@ -0,0 +1,243 @@
name: Evaluate RAG answer flow

on:
issue_comment:
types: [created]

# Set up permissions for deploying with secretless Azure federated credentials
# https://learn.microsoft.com/azure/developer/github/connect-from-azure?tabs=azure-portal%2Clinux#set-up-azure-login-with-openid-connect-authentication
permissions:
id-token: write
contents: read
issues: write
pull-requests: write

jobs:
evaluate:
if: |
contains('["OWNER", "CONTRIBUTOR", "COLLABORATOR", "MEMBER"]', github.event.comment.author_association) &&
github.event.issue.pull_request &&
github.event.comment.body == '/evaluate'
runs-on: ubuntu-latest
env:
# azd required
AZURE_CLIENT_ID: ${{ vars.AZURE_CLIENT_ID }}
AZURE_TENANT_ID: ${{ vars.AZURE_TENANT_ID }}
AZURE_SUBSCRIPTION_ID: ${{ vars.AZURE_SUBSCRIPTION_ID }}
AZURE_ENV_NAME: ${{ vars.AZURE_ENV_NAME }}
AZURE_LOCATION: ${{ vars.AZURE_LOCATION }}
# project specific
AZURE_OPENAI_SERVICE: ${{ vars.AZURE_OPENAI_SERVICE }}
AZURE_OPENAI_LOCATION: ${{ vars.AZURE_OPENAI_LOCATION }}
AZURE_OPENAI_API_VERSION: ${{ vars.AZURE_OPENAI_API_VERSION }}
AZURE_OPENAI_RESOURCE_GROUP: ${{ vars.AZURE_OPENAI_RESOURCE_GROUP }}
AZURE_DOCUMENTINTELLIGENCE_SERVICE: ${{ vars.AZURE_DOCUMENTINTELLIGENCE_SERVICE }}
AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP: ${{ vars.AZURE_DOCUMENTINTELLIGENCE_RESOURCE_GROUP }}
AZURE_DOCUMENTINTELLIGENCE_SKU: ${{ vars.AZURE_DOCUMENTINTELLIGENCE_SKU }}
AZURE_DOCUMENTINTELLIGENCE_LOCATION: ${{ vars.AZURE_DOCUMENTINTELLIGENCE_LOCATION }}
AZURE_COMPUTER_VISION_SERVICE: ${{ vars.AZURE_COMPUTER_VISION_SERVICE }}
AZURE_COMPUTER_VISION_RESOURCE_GROUP: ${{ vars.AZURE_COMPUTER_VISION_RESOURCE_GROUP }}
AZURE_COMPUTER_VISION_LOCATION: ${{ vars.AZURE_COMPUTER_VISION_LOCATION }}
AZURE_COMPUTER_VISION_SKU: ${{ vars.AZURE_COMPUTER_VISION_SKU }}
AZURE_SEARCH_INDEX: ${{ vars.AZURE_SEARCH_INDEX }}
AZURE_SEARCH_SERVICE: ${{ vars.AZURE_SEARCH_SERVICE }}
AZURE_SEARCH_SERVICE_RESOURCE_GROUP: ${{ vars.AZURE_SEARCH_SERVICE_RESOURCE_GROUP }}
AZURE_SEARCH_SERVICE_LOCATION: ${{ vars.AZURE_SEARCH_SERVICE_LOCATION }}
AZURE_SEARCH_SERVICE_SKU: ${{ vars.AZURE_SEARCH_SERVICE_SKU }}
AZURE_SEARCH_QUERY_LANGUAGE: ${{ vars.AZURE_SEARCH_QUERY_LANGUAGE }}
AZURE_SEARCH_QUERY_SPELLER: ${{ vars.AZURE_SEARCH_QUERY_SPELLER }}
AZURE_SEARCH_SEMANTIC_RANKER: ${{ vars.AZURE_SEARCH_SEMANTIC_RANKER }}
AZURE_STORAGE_ACCOUNT: ${{ vars.AZURE_STORAGE_ACCOUNT }}
AZURE_STORAGE_RESOURCE_GROUP: ${{ vars.AZURE_STORAGE_RESOURCE_GROUP }}
AZURE_STORAGE_SKU: ${{ vars.AZURE_STORAGE_SKU }}
AZURE_APP_SERVICE_PLAN: ${{ vars.AZURE_APP_SERVICE_PLAN }}
AZURE_APP_SERVICE_SKU: ${{ vars.AZURE_APP_SERVICE_SKU }}
AZURE_APP_SERVICE: ${{ vars.AZURE_APP_SERVICE }}
AZURE_OPENAI_CHATGPT_MODEL: ${{ vars.AZURE_OPENAI_CHATGPT_MODEL }}
AZURE_OPENAI_CHATGPT_DEPLOYMENT: ${{ vars.AZURE_OPENAI_CHATGPT_DEPLOYMENT }}
AZURE_OPENAI_CHATGPT_DEPLOYMENT_CAPACITY: ${{ vars.AZURE_OPENAI_CHATGPT_DEPLOYMENT_CAPACITY }}
AZURE_OPENAI_CHATGPT_DEPLOYMENT_VERSION: ${{ vars.AZURE_OPENAI_CHATGPT_DEPLOYMENT_VERSION }}
AZURE_OPENAI_EMB_MODEL_NAME: ${{ vars.AZURE_OPENAI_EMB_MODEL_NAME }}
AZURE_OPENAI_EMB_DEPLOYMENT: ${{ vars.AZURE_OPENAI_EMB_DEPLOYMENT }}
AZURE_OPENAI_EMB_DEPLOYMENT_CAPACITY: ${{ vars.AZURE_OPENAI_EMB_DEPLOYMENT_CAPACITY }}
AZURE_OPENAI_EMB_DEPLOYMENT_VERSION: ${{ vars.AZURE_OPENAI_EMB_DEPLOYMENT_VERSION }}
AZURE_OPENAI_EMB_DIMENSIONS: ${{ vars.AZURE_OPENAI_EMB_DIMENSIONS }}
AZURE_OPENAI_GPT4V_MODEL: ${{ vars.AZURE_OPENAI_GPT4V_MODEL }}
AZURE_OPENAI_GPT4V_DEPLOYMENT: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT }}
AZURE_OPENAI_GPT4V_DEPLOYMENT_CAPACITY: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT_CAPACITY }}
AZURE_OPENAI_GPT4V_DEPLOYMENT_VERSION: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT_VERSION }}
AZURE_OPENAI_GPT4V_DEPLOYMENT_SKU: ${{ vars.AZURE_OPENAI_GPT4V_DEPLOYMENT_SKU }}
USE_EVAL: ${{ vars.USE_EVAL }}
AZURE_OPENAI_EVAL_MODEL: ${{ vars.AZURE_OPENAI_EVAL_MODEL }}
AZURE_OPENAI_EVAL_MODEL_VERSION: ${{ vars.AZURE_OPENAI_EVAL_MODEL_VERSION }}
AZURE_OPENAI_EVAL_DEPLOYMENT: ${{ vars.AZURE_OPENAI_EVAL_DEPLOYMENT }}
AZURE_OPENAI_EVAL_DEPLOYMENT_SKU: ${{ vars.AZURE_OPENAI_EVAL_DEPLOYMENT_SKU }}
AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY: ${{ vars.AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY }}
AZURE_OPENAI_DISABLE_KEYS: ${{ vars.AZURE_OPENAI_DISABLE_KEYS }}
OPENAI_HOST: ${{ vars.OPENAI_HOST }}
OPENAI_API_KEY: ${{ vars.OPENAI_API_KEY }}
OPENAI_ORGANIZATION: ${{ vars.OPENAI_ORGANIZATION }}
AZURE_USE_APPLICATION_INSIGHTS: ${{ vars.AZURE_USE_APPLICATION_INSIGHTS }}
AZURE_APPLICATION_INSIGHTS: ${{ vars.AZURE_APPLICATION_INSIGHTS }}
AZURE_APPLICATION_INSIGHTS_DASHBOARD: ${{ vars.AZURE_APPLICATION_INSIGHTS_DASHBOARD }}
AZURE_LOG_ANALYTICS: ${{ vars.AZURE_LOG_ANALYTICS }}
USE_VECTORS: ${{ vars.USE_VECTORS }}
USE_GPT4V: ${{ vars.USE_GPT4V }}
AZURE_VISION_ENDPOINT: ${{ vars.AZURE_VISION_ENDPOINT }}
VISION_SECRET_NAME: ${{ vars.VISION_SECRET_NAME }}
ENABLE_LANGUAGE_PICKER: ${{ vars.ENABLE_LANGUAGE_PICKER }}
USE_SPEECH_INPUT_BROWSER: ${{ vars.USE_SPEECH_INPUT_BROWSER }}
USE_SPEECH_OUTPUT_BROWSER: ${{ vars.USE_SPEECH_OUTPUT_BROWSER }}
USE_SPEECH_OUTPUT_AZURE: ${{ vars.USE_SPEECH_OUTPUT_AZURE }}
AZURE_SPEECH_SERVICE: ${{ vars.AZURE_SPEECH_SERVICE }}
AZURE_SPEECH_SERVICE_RESOURCE_GROUP: ${{ vars.AZURE_SPEECH_RESOURCE_GROUP }}
AZURE_SPEECH_SERVICE_LOCATION: ${{ vars.AZURE_SPEECH_SERVICE_LOCATION }}
AZURE_SPEECH_SERVICE_SKU: ${{ vars.AZURE_SPEECH_SERVICE_SKU }}
AZURE_SPEECH_SERVICE_VOICE: ${{ vars.AZURE_SPEECH_SERVICE_VOICE }}
AZURE_KEY_VAULT_NAME: ${{ vars.AZURE_KEY_VAULT_NAME }}
AZURE_USE_AUTHENTICATION: ${{ vars.AZURE_USE_AUTHENTICATION }}
AZURE_ENFORCE_ACCESS_CONTROL: ${{ vars.AZURE_ENFORCE_ACCESS_CONTROL }}
AZURE_ENABLE_GLOBAL_DOCUMENT_ACCESS: ${{ vars.AZURE_ENABLE_GLOBAL_DOCUMENT_ACCESS }}
AZURE_ENABLE_UNAUTHENTICATED_ACCESS: ${{ vars.AZURE_ENABLE_UNAUTHENTICATED_ACCESS }}
AZURE_AUTH_TENANT_ID: ${{ vars.AZURE_AUTH_TENANT_ID }}
AZURE_SERVER_APP_ID: ${{ vars.AZURE_SERVER_APP_ID }}
AZURE_CLIENT_APP_ID: ${{ vars.AZURE_CLIENT_APP_ID }}
ALLOWED_ORIGIN: ${{ vars.ALLOWED_ORIGIN }}
AZURE_ADLS_GEN2_STORAGE_ACCOUNT: ${{ vars.AZURE_ADLS_GEN2_STORAGE_ACCOUNT }}
AZURE_ADLS_GEN2_FILESYSTEM_PATH: ${{ vars.AZURE_ADLS_GEN2_FILESYSTEM_PATH }}
AZURE_ADLS_GEN2_FILESYSTEM: ${{ vars.AZURE_ADLS_GEN2_FILESYSTEM }}
DEPLOYMENT_TARGET: ${{ vars.DEPLOYMENT_TARGET }}
AZURE_CONTAINER_APPS_WORKLOAD_PROFILE: ${{ vars.AZURE_CONTAINER_APPS_WORKLOAD_PROFILE }}
USE_CHAT_HISTORY_BROWSER: ${{ vars.USE_CHAT_HISTORY_BROWSER }}
USE_MEDIA_DESCRIBER_AZURE_CU: ${{ vars.USE_MEDIA_DESCRIBER_AZURE_CU }}
steps:

- name: Comment on pull request
uses: actions/github-script@v7
with:
script: |
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: "Starting evaluation! Check the Actions tab for progress, or wait for a comment with the results."
})

- name: Checkout pull request
uses: actions/checkout@v4
with:
ref: refs/pull/${{ github.event.issue.number }}/head

- name: Install uv
uses: astral-sh/setup-uv@v5
with:
enable-cache: true
version: "0.4.20"
cache-dependency-glob: "requirements**.txt"
python-version: "3.11"

- name: Setup node
uses: actions/setup-node@v4
with:
node-version: 18

- name: Install azd
uses: Azure/setup-azd@…

- name: Login to Azure with az CLI
uses: azure/login@v2
with:
client-id: ${{ env.AZURE_CLIENT_ID }}
tenant-id: ${{ env.AZURE_TENANT_ID }}
subscription-id: ${{ env.AZURE_SUBSCRIPTION_ID }}

- name: Set az account
uses: azure/CLI@v2
with:
inlineScript: |
az account set --subscription ${{env.AZURE_SUBSCRIPTION_ID}}

- name: Log in to Azure with azd (Federated Credentials)
if: ${{ env.AZURE_CLIENT_ID != '' }}
run: |
azd auth login `
--client-id "$Env:AZURE_CLIENT_ID" `
--federated-credential-provider "github" `
--tenant-id "$Env:AZURE_TENANT_ID"
shell: pwsh

- name: Refresh azd environment variables
run: |
azd env refresh -e $AZURE_ENV_NAME --no-prompt
env:
AZD_INITIAL_ENVIRONMENT_CONFIG: ${{ secrets.AZD_INITIAL_ENVIRONMENT_CONFIG }}

- name: Build frontend
run: |
cd ./app/frontend
npm install
npm run build

- name: Install dependencies
run: |
uv pip install -r requirements-dev.txt

- name: Run local server in background
run: |
cd app/backend
RUNNER_TRACKING_ID="" && (nohup python3 -m quart --app main:app run --port 50505 > serverlogs.out 2> serverlogs.err &)
cd ../..

- name: Install evaluate dependencies
run: |
uv pip install -r evals/requirements.txt

- name: Evaluate local RAG flow
run: |
python evals/evaluate.py --targeturl=http://127.0.0.1:50505/chat --resultsdir=evals/results/pr${{ github.event.issue.number }}

- name: Upload eval results as build artifact
if: ${{ success() }}
uses: actions/upload-artifact@v4
with:
name: eval_result
path: ./evals/results/pr${{ github.event.issue.number }}

- name: Upload server logs as build artifact
uses: actions/upload-artifact@v4
with:
name: server_logs
path: ./app/backend/serverlogs.out

- name: Upload server error logs as build artifact
uses: actions/upload-artifact@v4
with:
name: server_error_logs
path: ./app/backend/serverlogs.err

- name: Summarize results
if: ${{ success() }}
run: |
echo "## Evaluation results" >> eval-summary.md
python -m evaltools summary evals/results --output=markdown >> eval-summary.md
echo "## Answer differences across runs" >> run-diff.md
python -m evaltools diff evals/results/baseline evals/results/pr${{ github.event.issue.number }} --output=markdown >> run-diff.md
cat eval-summary.md >> $GITHUB_STEP_SUMMARY
cat run-diff.md >> $GITHUB_STEP_SUMMARY

- name: Comment on pull request
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const summaryPath = "eval-summary.md";
const summary = fs.readFileSync(summaryPath, 'utf8');
const runId = process.env.GITHUB_RUN_ID;
const repo = process.env.GITHUB_REPOSITORY;
const actionsUrl = `https://github.com/${repo}/actions/runs/${runId}`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: `${summary}\n\n[Check the workflow run for more details](${actionsUrl}).`
})
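The `if:` gate at the top of this workflow combines three checks: the commenter has a trusted association with the repo, the comment is on a pull request, and the comment body is exactly `/evaluate`. A rough Python sketch of the same logic (names are illustrative — GitHub Actions evaluates its own expression syntax, not Python):

```python
def should_evaluate(author_association: str, is_pull_request: bool, comment_body: str) -> bool:
    # Mirrors the workflow's contains(...) membership check plus the
    # pull-request and exact-command conditions.
    allowed = {"OWNER", "CONTRIBUTOR", "COLLABORATOR", "MEMBER"}
    return author_association in allowed and is_pull_request and comment_body == "/evaluate"

print(should_evaluate("COLLABORATOR", True, "/evaluate"))  # True
print(should_evaluate("NONE", True, "/evaluate"))          # False: untrusted commenter
print(should_evaluate("OWNER", True, "/evaluate please"))  # False: must be the exact command
```

Note the exact-match comparison: a comment that merely *contains* `/evaluate` will not trigger the run.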
6 changes: 3 additions & 3 deletions app/backend/app.py
@@ -420,7 +420,7 @@ async def setup_clients():
OPENAI_HOST = os.getenv("OPENAI_HOST", "azure")
OPENAI_CHATGPT_MODEL = os.environ["AZURE_OPENAI_CHATGPT_MODEL"]
OPENAI_EMB_MODEL = os.getenv("AZURE_OPENAI_EMB_MODEL_NAME", "text-embedding-ada-002")
OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
OPENAI_EMB_DIMENSIONS = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
pamelafox (Collaborator, Author) commented:
This also results in a bad error if the env variable is an empty string
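The distinction matters because pipelines can set a variable to an empty string rather than leaving it unset, and `os.getenv`'s two-argument default only covers the unset case. A minimal sketch of the failure and the fix:

```python
import os

# Simulate a variable that is *set but empty*, as CI pipelines can produce.
os.environ["AZURE_OPENAI_EMB_DIMENSIONS"] = ""

# The default only applies when the variable is unset, so the empty
# string reaches int() and raises ValueError.
try:
    int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS", 1536))
except ValueError:
    print("default did not apply: int('') raised ValueError")

# `or` treats the empty string as falsy, so the fallback is used instead.
dims = int(os.getenv("AZURE_OPENAI_EMB_DIMENSIONS") or 1536)
print(dims)  # 1536
```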

# Used with Azure OpenAI deployments
AZURE_OPENAI_SERVICE = os.getenv("AZURE_OPENAI_SERVICE")
AZURE_OPENAI_GPT4V_DEPLOYMENT = os.environ.get("AZURE_OPENAI_GPT4V_DEPLOYMENT")
@@ -450,8 +450,8 @@ async def setup_clients():
KB_FIELDS_CONTENT = os.getenv("KB_FIELDS_CONTENT", "content")
KB_FIELDS_SOURCEPAGE = os.getenv("KB_FIELDS_SOURCEPAGE", "sourcepage")

AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE", "en-us")
AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER", "lexicon")
AZURE_SEARCH_QUERY_LANGUAGE = os.getenv("AZURE_SEARCH_QUERY_LANGUAGE") or "en-us"
AZURE_SEARCH_QUERY_SPELLER = os.getenv("AZURE_SEARCH_QUERY_SPELLER") or "lexicon"
pamelafox (Collaborator, Author) commented:
This was the fix for the weird search error

AZURE_SEARCH_SEMANTIC_RANKER = os.getenv("AZURE_SEARCH_SEMANTIC_RANKER", "free").lower()

AZURE_SPEECH_SERVICE_ID = os.getenv("AZURE_SPEECH_SERVICE_ID")
6 changes: 6 additions & 0 deletions azure.yaml
@@ -73,6 +73,12 @@ pipeline:
- AZURE_OPENAI_EMB_DEPLOYMENT_VERSION
- AZURE_OPENAI_EMB_DEPLOYMENT_SKU
- AZURE_OPENAI_EMB_DIMENSIONS
- USE_EVAL
- AZURE_OPENAI_EVAL_MODEL
- AZURE_OPENAI_EVAL_MODEL_VERSION
- AZURE_OPENAI_EVAL_DEPLOYMENT
- AZURE_OPENAI_EVAL_DEPLOYMENT_SKU
- AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY
- AZURE_OPENAI_GPT4V_MODEL
- AZURE_OPENAI_GPT4V_DEPLOYMENT
- AZURE_OPENAI_GPT4V_DEPLOYMENT_CAPACITY
16 changes: 14 additions & 2 deletions docs/evaluation.md
@@ -81,7 +81,13 @@ Run the evaluation script by running the following command:
python evals/evaluate.py
```

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions. You can specify `--numquestions` argument for a test run on a subset of the questions.
The options are:

* `numquestions`: The number of questions to evaluate. By default, this is all questions in the ground truth data.
* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `eval_config.json`.
* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `eval_config.json`.

🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions.
Collaborator commented:
tokens allocated to deployment also affect this, right?

pamelafox (Collaborator, Author) replied:
Yes! I'll add a note.


## Review the evaluation results

@@ -93,12 +93,18 @@
You can see a summary of results across all evaluation runs by running the following command:
python -m evaltools summary evals/results
```

Compare answers across runs by running the following command:
Compare answers to the ground truth by running the following command:

```bash
python -m evaltools diff evals/results/baseline/
```

Compare answers across two runs by running the following command:

```bash
python -m evaltools diff evals/results/baseline/ evals/results/SECONDRUNHERE
```

## Run bulk evaluation on a PR

To run the evaluation on the changes in a PR, add a `/evaluate` comment to the PR. This triggers the evaluation workflow, which runs the evaluation against the PR changes and posts the results back to the PR.
24 changes: 24 additions & 0 deletions evals/evaluate.py
@@ -14,6 +14,28 @@
logger = logging.getLogger("ragapp")


class AnyCitationMetric(BaseMetric):
METRIC_NAME = "any_citation"

@classmethod
def evaluator_fn(cls, **kwargs):
def any_citation(*, response, **kwargs):
if response is None:
logger.warning("Received response of None, can't compute any_citation metric. Setting to -1.")
return {cls.METRIC_NAME: -1}
return {cls.METRIC_NAME: bool(re.search(r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]", response))}
Collaborator commented:
nice regex 👍


return any_citation

@classmethod
def get_aggregate_stats(cls, df):
df = df[df[cls.METRIC_NAME] != -1]
return {
"total": int(df[cls.METRIC_NAME].sum()),
"rate": round(df[cls.METRIC_NAME].mean(), 2),
}
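The citation pattern above looks for a bracketed filename with a 3–4 character extension and an optional `#page=` anchor. A quick check of its behavior (the sample strings are illustrative, not from the repo's ground truth data):

```python
import re

# Same pattern as in AnyCitationMetric above.
CITATION_RE = r"\[([^\]]+)\.\w{3,4}(#page=\d+)*\]"

assert re.search(CITATION_RE, "See [Benefit_Options.pdf#page=2] for details.")
assert re.search(CITATION_RE, "Covered in [employee_handbook.pdf].")
assert not re.search(CITATION_RE, "No sources were cited in this answer.")
print("citation regex behaves as expected")
```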


class CitationsMatchedMetric(BaseMetric):
METRIC_NAME = "citations_matched"

@@ -80,6 +102,8 @@ def get_azure_credential():
openai_config = get_openai_config()

register_metric(CitationsMatchedMetric)
register_metric(AnyCitationMetric)

run_evaluate_from_config(
working_dir=Path(__file__).parent,
config_path="evaluate_config.json",
4 changes: 2 additions & 2 deletions evals/evaluate_config.json
@@ -1,7 +1,7 @@
{
"testdata_path": "ground_truth.jsonl",
"results_dir": "results/experiment<TIMESTAMP>",
"requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched"],
"results_dir": "results/gpt-4o-mini",
"requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
"target_url": "http://localhost:50505/chat",
"target_parameters": {
"overrides": {
4 changes: 2 additions & 2 deletions evals/results/baseline/config.json
@@ -1,7 +1,7 @@
{
"testdata_path": "ground_truth.jsonl",
"results_dir": "results/experiment<TIMESTAMP>",
"requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched"],
"results_dir": "results/gpt-4o-mini",
"requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency", "citations_matched", "any_citation"],
"target_url": "http://localhost:50505/chat",
"target_parameters": {
"overrides": {