
Commit 5b9336f

Merge pull request #18 from changliu2/main

fixed typos in README and simulation notebook for clarity

2 parents e96516e + 8ccc946

9 files changed (+286, -863 lines)

README.md

Lines changed: 9 additions & 9 deletions
@@ -54,12 +54,12 @@ Click on "Settings" from the left menu of Azure AI Studio, scroll down to "Conne
 
 Once you set up those parameters, run:
 
-```bash
-# Note: make sure you run this command from the src/ directory so that your .env is written to the correct location (src/)
-cd src
-python provisioning/provision.py --export-env .env
+```bash
+# Note: make sure you run this command from the src/ directory so that your .env is written to the correct location (src/)
+cd src
+python provisioning/provision.py --export-env .env
 
-```
+```
 
 
 The script will check whether the resources you specified exist, otherwise it will create them. It will then construct a .env for you that references the provisioned or referenced resources, including your keys. Once the provisioning is complete, you'll be ready to move to step 3.
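For orientation, here is a minimal, hypothetical sketch of how the generated src/.env can be consumed. The three variable names mirror the ${env:...} references in the evaluator prompty later in this commit; load_dotenv comes from the python-dotenv package, and the real .env likely contains additional keys not shown here.

```python
# Hypothetical sketch: read the .env that `provision.py --export-env .env` writes into src/.
# Only variables also referenced by the .prompty file in this commit are shown; the real
# file likely contains more entries.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv(".env")  # run from src/, where the provisioning script wrote the file

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
api_version = os.environ["AZURE_OPENAI_API_VERSION"]
eval_deployment = os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"]
print(f"Evaluating with {eval_deployment} at {endpoint} (API version {api_version})")
```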

@@ -73,9 +73,9 @@ This step uses vector search with Azure OpenAI embeddings (e.g., ada-002) to enc
 
 - Cognitive Services OpenAI Contributor
 - Cognitive Services Contributor
-- (optionally if you need quota view) Cognitive Services Usages Reader
+- (optionally if you need AOAI quota view) Cognitive Services Usages Reader
 
-Follow instructions on https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignment in your Azure OpenAI resource.
+Follow instructions on https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignment in your Azure OpenAI resource. Note that Cognitive Services Usages Reader needs to be set at the subscription level.
 
 Next, run the following script designed to streamline index creation. It builds the search index locally, and publishes it to your AI Studio project in the cloud.

@@ -140,11 +140,11 @@ python -m evaluation.evaluate --evaluation-name <evaluation_name>
 Examples:
 This command generates evaluations on a much larger test set and generates some built-in quality metrics such as groundedness and relevance, as well as a custom evaluator called "friendliness". Learn more about our built-in quality metrics [here](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#generation-quality-metrics).
 ``` bash
-python -m evaluation.evaluate  --evaluation-name quality_evals_contoso_retail  --dataset-path=./evaluation/ContosoTestBuild.jsonl
+python -m evaluation.evaluate  --evaluation-name quality_evals_contoso_retail  --dataset-path=./evaluation/ContosoTestDataSmall.jsonl
 ```
 This command generates one single custom evaluator called "Completeness" on a much larger test set.
 ``` bash
-python -m evaluation.evaluate_completeness  --evaluation-name completeness_evals_contoso_retail  --dataset-path=./evaluation/evaluation_dataset.jsonl --cot
+python -m evaluation.evaluate_completeness  --evaluation-name completeness_evals_contoso_retail  --dataset-path=./evaluation/evaluation_dataset_small.jsonl
 ```
 To run safety evaluations, you need to 1) simulate adversarial datasets (or provide your own) and 2) evaluate your copilot on the datasets.

Lines changed: 93 additions & 93 deletions
@@ -1,94 +1,94 @@
---
name: Reasonableness
description: Evaluates reasonableness score for QA scenario
model:
  api: chat
  configuration:
    type: azure_openai
    azure_deployment: ${env:AZURE_OPENAI_EVALUATION_DEPLOYMENT}
    api_version: ${env:AZURE_OPENAI_API_VERSION}
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
  parameters:
    temperature: 0.0
    max_tokens: 100
    top_p: 1.0
    presence_penalty: 0
    frequency_penalty: 0
    seed: 0
    response_format:
      type: text

inputs:
  question:
    type: string
  answer:
    type: string
  truth:
    type: string
---
system:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
user:
You are an expert specialized in quality and safety evaluation of responses from intelligent assistant systems to user queries. Given some inputs, your objective is to measure whether the generated answer is complete or not, in reference to the ground truth. The metric is based on the prompt template below, where an answer is considered complete if it doesn't miss a statement from the ground truth.

Use the following steps to respond to inputs.

Step 1: Extract all statements from TRUTH. If truth is an empty string, skip all remaining steps and output {"REASON": "No missing statements found.", "SCORE": 5}.

Step 2: Extract all statements from ANSWER.

Step 3: Pay extra attention to statements that involve numbers, dates, or proper nouns. Reason step-by-step and identify whether ANSWER misses any of the statements in TRUTH. Output those missing statements in REASON.

Step 4: Rate the completeness of ANSWER between one to five stars using the following scale:

One star: ANSWER is missing all of the statements in TRUTH.

Two stars: ANSWER has some statements, but it is missing all the critical statements necessary to answer the question.

Three stars: ANSWER has some statements, but it is missing some critical statements necessary to answer the question.

Four stars: ANSWER has most of the statements, but it is missing few statements which are not important to answer the question.

Five stars: ANSWER has all of the statements in the TRUTH.

Please assign a rating between 1 and 5 based on the completeness the response. Output the rating in SCORE.

Independent Examples:
## Example Task #1 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Khaki", "TRUTH": "Khaki"}
## Example Task #1 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #2 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Red", "TRUTH": "Khaki"}
## Example Task #2 Output:
{"REASON": "missing statements: \n1. Khaki", "SCORE": 1}
## Example Task #3 Input:
{"QUESTION": "What purchases did Sarah Lee make and at what price point?", "ANSWER": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'RainGuard Hiking Jacket: $$110']", "TRUTH": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'TrekMaster Camping Chair: $$100', 'SkyView 2-Person Tent: $$200', 'RainGuard Hiking Jacket: $$110', 'CompactCook Camping Stove: $$60']"}
## Example Task #3 Output:
{"REASON": "missing statements: \n1. 'TrekMaster Camping Chair: $$100'\n2.'SkyView 2-Person Tent: $$200'\n3. 'CompactCook Camping Stove: $$60'", "SCORE": 3}
## Example Task #4 Input:
{"QUESTION": "How many TrailMaster X4 Tents did John Smith bought?", "ANSWER": "1", "TRUTH": "2"}
## Example Task #4 Output:
{"REASON": "missing statements: \n1. 2 tents were purchased by John Smith.", "SCORE": 1}
## Example Task #5 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are mostly water-proof except for rare, extreme weather conditions like hurricanes."}
## Example Task #5 Output:
{"REASON": "missing statements: \n1. Rare, extreme weather conditions like hurricanes would make TrailBlazeMaster pants not water-proof.", "SCORE": 4}
## Example Task #6 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are slightly water-proof."}
## Example Task #6 Output:
{"REASON": "missing statements: \n1. TrailBlazeMaster pants are only slightly water-proof.", "SCORE": 2}
## Example Task #7 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot assist with that.", "TRUTH": "Sorry, I cannot answer any questions unrelated to sports gear."}
## Example Task #7 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #8 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot provide answers unrelated to sports/gear", "TRUTH": ""}
## Example Task #8 Output:
{"REASON": "No missing statements found.", "SCORE": 5}

## Actual Task Input:
{"QUESTION": {{question}}, "ANSWER": {{answer}}, "TRUTH": {{truth}}}
Reminder: The return values for each task should be an integer between 1 and 5. Do not repeat TRUTH, ANSWER or QUESTION.
## Actual Task Output:
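As an aside (not part of the commit), here is a minimal sketch of how an evaluator .prompty like the one above can be loaded and called via promptflow's load_flow, mirroring the pattern CompletenessEvaluator uses in the file below. The import paths, the AzureOpenAIModelConfiguration fields, the api_key variable name, and the prompty path are assumptions, not something this page pins down.

```python
# Sketch only: load an evaluator .prompty and score a single QA pair with it.
# Environment variable names match the ${env:...} references in the prompty above;
# promptflow import paths and the model-configuration fields are assumptions.
import os

from promptflow.client import load_flow
from promptflow.core import AzureOpenAIModelConfiguration

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"],
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),  # key variable name is an assumption
)

# Same pattern as CompletenessEvaluator below: wrap the config and load the prompty as a flow.
prompty_path = "path/to/your_evaluator.prompty"  # hypothetical path to a file like the one above
flow = load_flow(source=prompty_path, model={"configuration": model_config})

result = flow(
    question="What color does TrailBlaze Hiking Pants come in?",
    answer="Red",
    truth="Khaki",
)
print(result)  # expected to carry a REASON and a SCORE between 1 and 5
```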

src/custom_evaluators/completeness.py

Lines changed: 3 additions & 2 deletions
@@ -13,7 +13,7 @@
 
 
 class CompletenessEvaluator:
-    def __init__(self, model_config: AzureOpenAIModelConfiguration, prompty_filename: str = "completeness.prompty"):
+    def __init__(self, model_config: AzureOpenAIModelConfiguration):
         """
         Initialize an evaluator configured for a specific Azure OpenAI model.
 
@@ -37,7 +37,8 @@ def __init__(self, model_config: AzureOpenAIModelConfiguration, prompty_filename
 
         prompty_model_config = {"configuration": model_config}
         current_dir = os.path.dirname(__file__)
-        prompty_path = os.path.join(current_dir, prompty_filename)
+        prompty_path = os.path.join(current_dir, "completeness.prompty")
+        assert os.path.exists(prompty_path), f"Please specify a valid prompty file for completeness metric! The following path does not exist:\n{prompty_path}"
         self._flow = load_flow(source=prompty_path, model=prompty_model_config)
 
     def __call__(self, *, question: str, answer: str, truth: str, **kwargs):
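For context, a hedged usage sketch of the evaluator after this change: the prompty filename is now fixed to completeness.prompty next to the module, so only a model configuration is passed to the constructor. The import paths and the api_key variable name are assumptions; the keyword arguments follow the __call__ signature shown above.

```python
# Sketch: calling CompletenessEvaluator after this commit (no prompty_filename argument).
# Import paths are assumptions; keyword arguments mirror the __call__ signature above.
import os

from promptflow.core import AzureOpenAIModelConfiguration
from custom_evaluators.completeness import CompletenessEvaluator  # assumes src/ is on sys.path

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"],
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),  # variable name is an assumption
)

# The constructor now always loads completeness.prompty from the module directory
# and asserts that the file exists before building the flow.
evaluator = CompletenessEvaluator(model_config)

result = evaluator(
    question="What purchases did Sarah Lee make and at what price point?",
    answer="['TrailMaster X4 Tent: $250']",
    truth="['TrailMaster X4 Tent: $250', 'CozyNights Sleeping Bag: $100']",
)
print(result)
```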
