
fixed typos in README and simulation notebook for clarity #18


Merged (4 commits) on Jun 28, 2024
Changes from 2 commits
README.md: 16 changes (8 additions, 8 deletions)
@@ -54,12 +54,12 @@ Click on "Settings" from the left menu of Azure AI Studio, scroll down to "Conne

Once you set up those parameters, run:

```bash
# Note: make sure you run this command from the src/ directory so that your .env is written to the correct location (src/)
cd src
python provisioning/provision.py --export-env .env
```

```bash
# Note: make sure you run this command from the src/ directory so that your .env is written to the correct location (src/)
cd src
python provisioning/provision.py --export-env .env
```

The script will check whether the resources you specified exist; if they don't, it will create them. It will then construct a .env file for you that references the provisioned or referenced resources, including your keys. Once provisioning is complete, you'll be ready to move on to step 3.
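For reference, the generated src/.env is a flat list of KEY=value pairs. A minimal sketch, assuming only the variables referenced by the evaluation prompties in this PR (your provisioned file will contain more entries, including keys and search settings):

```bash
# Illustrative sketch of src/.env after provisioning; values are placeholders, not real credentials.
AZURE_OPENAI_ENDPOINT="https://<your-aoai-resource>.openai.azure.com/"
AZURE_OPENAI_API_VERSION="<api-version>"
AZURE_OPENAI_EVALUATION_DEPLOYMENT="<your-evaluation-deployment-name>"
```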

@@ -73,9 +73,9 @@ This step uses vector search with Azure OpenAI embeddings (e.g., ada-002) to enc

- Cognitive Services OpenAI Contributor
- Cognitive Services Contributor
- (optionally if you need quota view) Cognitive Services Usages Reader
- (optionally if you need AOAI quota view) Cognitive Services Usages Reader

Follow instructions on https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignment in your Azure OpenAI resource.
Follow instructions on https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignment in your Azure OpenAI resource. Note that Cognitive Services Usages Reader needs to be set at the subscription level.
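If you prefer the Azure CLI to the portal for that subscription-level assignment, a minimal sketch looks like the following (the principal and subscription IDs are placeholders, not values from this repo):

```bash
# Grant the quota-view role at subscription scope; replace the placeholders with your own IDs.
az role assignment create \
  --assignee "<your-principal-id>" \
  --role "Cognitive Services Usages Reader" \
  --scope "/subscriptions/<your-subscription-id>"
```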

Next, run the following script designed to streamline index creation. It builds the search index locally, and publishes it to your AI Studio project in the cloud.

@@ -144,7 +144,7 @@ python -m evaluation.evaluate  --evaluation-name quality_evals_contoso_retail
```
This command runs a single custom evaluator called "Completeness" on a much larger test set.
``` bash
python -m evaluation.evaluate_completeness  --evaluation-name completeness_evals_contoso_retail  --dataset-path=./evaluation/evaluation_dataset.jsonl --cot
python -m evaluation.evaluate_completeness  --evaluation-name completeness_evals_contoso_retail  --dataset-path=./evaluation/evaluation_dataset.jsonl
```
To run safety evaluations, you need to 1) simulate adversarial datasets (or provide your own) and 2) evaluate your copilot on the datasets.
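As a hedged sketch of that two-step flow, the commands might look like the following; the module names (evaluation.simulate, evaluation.evaluate_safety) are placeholders for illustration, not files confirmed to exist in this repo:

```bash
# Step 1 (placeholder module name): simulate an adversarial dataset.
python -m evaluation.simulate --output-path=./evaluation/adversarial_dataset.jsonl

# Step 2 (placeholder module name): evaluate the copilot on that dataset.
python -m evaluation.evaluate_safety --dataset-path=./evaluation/adversarial_dataset.jsonl
```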

@@ -1,94 +1,94 @@
---
name: Reasonableness
description: Evaluates reasonableness score for QA scenario
model:
api: chat
configuration:
type: azure_openai
azure_deployment: ${env:AZURE_OPENAI_EVALUATION_DEPLOYMENT}
api_version: ${env:AZURE_OPENAI_API_VERSION}
azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
parameters:
temperature: 0.0
max_tokens: 100
top_p: 1.0
presence_penalty: 0
frequency_penalty: 0
seed: 0
response_format:
type: text
inputs:
question:
type: string
answer:
type: string
truth:
type: string
---
system:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
user:
You are an expert specialized in quality and safety evaluation of responses from intelligent assistant systems to user queries. Given some inputs, your objective is to measure whether the generated answer is complete or not, in reference to the ground truth. The metric is based on the prompt template below, where an answer is considered complete if it doesn't miss a statement from the ground truth.
Use the following steps to respond to inputs.
Step 1: Extract all statements from TRUTH. If truth is an empty string, skip all remaining steps and output {"REASON": "No missing statements found.", "SCORE": 5}.
Step 2: Extract all statements from ANSWER.
Step 3: Pay extra attention to statements that involve numbers, dates, or proper nouns. Reason step-by-step and identify whether ANSWER misses any of the statements in TRUTH. Output those missing statements in REASON.
Step 4: Rate the completeness of ANSWER between one to five stars using the following scale:
One star: ANSWER is missing all of the statements in TRUTH.
Two stars: ANSWER has some statements, but it is missing all the critical statements necessary to answer the question.
Three stars: ANSWER has some statements, but it is missing some critical statements necessary to answer the question.
Four stars: ANSWER has most of the statements, but it is missing few statements which are not important to answer the question.
Five stars: ANSWER has all of the statements in the TRUTH.
Please assign a rating between 1 and 5 based on the completeness of the response. Output the rating in SCORE.
Independent Examples:
## Example Task #1 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Khaki", "TRUTH": "Khaki"}
## Example Task #1 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #2 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Red", "TRUTH": "Khaki"}
## Example Task #2 Output:
{"REASON": "missing statements: \n1. Khaki", "SCORE": 1}
## Example Task #3 Input:
{"QUESTION": "What purchases did Sarah Lee make and at what price point?", "ANSWER": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'RainGuard Hiking Jacket: $$110']", "TRUTH": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'TrekMaster Camping Chair: $$100', 'SkyView 2-Person Tent: $$200', 'RainGuard Hiking Jacket: $$110', 'CompactCook Camping Stove: $$60']"}
## Example Task #3 Output:
{"REASON": "missing statements: \n1. 'TrekMaster Camping Chair: $$100'\n2.'SkyView 2-Person Tent: $$200'\n3. 'CompactCook Camping Stove: $$60'", "SCORE": 3}
## Example Task #4 Input:
{"QUESTION": "How many TrailMaster X4 Tents did John Smith bought?", "ANSWER": "1", "TRUTH": "2"}
## Example Task #4 Output:
{"REASON": "missing statements: \n1. 2 tents were purchased by John Smith.", "SCORE": 1}
## Example Task #5 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are mostly water-proof except for rare, extreme weather conditions like hurricanes."}
## Example Task #5 Output:
{"REASON": "missing statements: \n1. Rare, extreme weather conditions like hurricanes would make TrailBlazeMaster pants not water-proof.", "SCORE": 4}
## Example Task #6 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are slightly water-proof."}
## Example Task #6 Output:
{"REASON": "missing statements: \n1. TrailBlazeMaster pants are only slightly water-proof.", "SCORE": 2}
## Example Task #7 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot assist with that.", "TRUTH": "Sorry, I cannot answer any questions unrelated to sports gear."}
## Example Task #7 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #8 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot provide answers unrelated to sports/gear", "TRUTH": ""}
## Example Task #8 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Actual Task Input:
{"QUESTION": {{question}}, "ANSWER": {{answer}}, "TRUTH": {{truth}}}
Reminder: The return values for each task should be an integer between 1 and 5. Do not repeat TRUTH, ANSWER or QUESTION.
---
name: Reasonableness
description: Evaluates reasonableness score for QA scenario
model:
api: chat
configuration:
type: azure_openai
azure_deployment: ${env:AZURE_OPENAI_EVALUATION_DEPLOYMENT}
api_version: ${env:AZURE_OPENAI_API_VERSION}
azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
parameters:
temperature: 0.0
max_tokens: 100
top_p: 1.0
presence_penalty: 0
frequency_penalty: 0
seed: 0
response_format:
type: text


inputs:
question:
type: string
answer:
type: string
truth:
type: string
---
system:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
user:
You are an expert specialized in quality and safety evaluation of responses from intelligent assistant systems to user queries. Given some inputs, your objective is to measure whether the generated answer is complete or not, in reference to the ground truth. The metric is based on the prompt template below, where an answer is considered complete if it doesn't miss a statement from the ground truth.

Use the following steps to respond to inputs.

Step 1: Extract all statements from TRUTH. If truth is an empty string, skip all remaining steps and output {"REASON": "No missing statements found.", "SCORE": 5}.

Step 2: Extract all statements from ANSWER.

Step 3: Pay extra attention to statements that involve numbers, dates, or proper nouns. Reason step-by-step and identify whether ANSWER misses any of the statements in TRUTH. Output those missing statements in REASON.

Step 4: Rate the completeness of ANSWER between one to five stars using the following scale:

One star: ANSWER is missing all of the statements in TRUTH.

Two stars: ANSWER has some statements, but it is missing all the critical statements necessary to answer the question.

Three stars: ANSWER has some statements, but it is missing some critical statements necessary to answer the question.

Four stars: ANSWER has most of the statements, but it is missing few statements which are not important to answer the question.

Five stars: ANSWER has all of the statements in the TRUTH.

Please assign a rating between 1 and 5 based on the completeness of the response. Output the rating in SCORE.

Independent Examples:
## Example Task #1 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Khaki", "TRUTH": "Khaki"}
## Example Task #1 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #2 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Red", "TRUTH": "Khaki"}
## Example Task #2 Output:
{"REASON": "missing statements: \n1. Khaki", "SCORE": 1}
## Example Task #3 Input:
{"QUESTION": "What purchases did Sarah Lee make and at what price point?", "ANSWER": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'RainGuard Hiking Jacket: $$110']", "TRUTH": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'TrekMaster Camping Chair: $$100', 'SkyView 2-Person Tent: $$200', 'RainGuard Hiking Jacket: $$110', 'CompactCook Camping Stove: $$60']"}
## Example Task #3 Output:
{"REASON": "missing statements: \n1. 'TrekMaster Camping Chair: $$100'\n2.'SkyView 2-Person Tent: $$200'\n3. 'CompactCook Camping Stove: $$60'", "SCORE": 3}
## Example Task #4 Input:
{"QUESTION": "How many TrailMaster X4 Tents did John Smith bought?", "ANSWER": "1", "TRUTH": "2"}
## Example Task #4 Output:
{"REASON": "missing statements: \n1. 2 tents were purchased by John Smith.", "SCORE": 1}
## Example Task #5 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are mostly water-proof except for rare, extreme weather conditions like hurricanes."}
## Example Task #5 Output:
{"REASON": "missing statements: \n1. Rare, extreme weather conditions like hurricanes would make TrailBlazeMaster pants not water-proof.", "SCORE": 4}
## Example Task #6 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are slightly water-proof."}
## Example Task #6 Output:
{"REASON": "missing statements: \n1. TrailBlazeMaster pants are only slightly water-proof.", "SCORE": 2}
## Example Task #7 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot assist with that.", "TRUTH": "Sorry, I cannot answer any questions unrelated to sports gear."}
## Example Task #7 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #8 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot provide answers unrelated to sports/gear", "TRUTH": ""}
## Example Task #8 Output:
{"REASON": "No missing statements found.", "SCORE": 5}

## Actual Task Input:
{"QUESTION": {{question}}, "ANSWER": {{answer}}, "TRUTH": {{truth}}}
Reminder: The return values for each task should be an integer between 1 and 5. Do not repeat TRUTH, ANSWER or QUESTION.
## Actual Task Output:
src/custom_evaluators/completeness.py: 5 changes (3 additions, 2 deletions)
@@ -13,7 +13,7 @@


class CompletenessEvaluator:
def __init__(self, model_config: AzureOpenAIModelConfiguration, prompty_filename: str = "completeness.prompty"):
def __init__(self, model_config: AzureOpenAIModelConfiguration):
"""
Initialize an evaluator configured for a specific Azure OpenAI model.

@@ -37,7 +37,8 @@ def __init__(self, model_config: AzureOpenAIModelConfiguration, prompty_filename

prompty_model_config = {"configuration": model_config}
current_dir = os.path.dirname(__file__)
prompty_path = os.path.join(current_dir, prompty_filename)
prompty_path = os.path.join(current_dir, "completeness.prompty")
assert os.path.exists(prompty_path), f"Please specify a valid prompty file for completeness metric! The following path does not exist:\n{prompty_path}"
self._flow = load_flow(source=prompty_path, model=prompty_model_config)

def __call__(self, *, question: str, answer: str, truth: str, **kwargs):
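For context on how the changed constructor is used, here is a minimal usage sketch, not code from this repo: it assumes the promptflow-style import path for AzureOpenAIModelConfiguration and that you run it from src/ with the environment variables above set.

```python
# Illustrative sketch only: invoking CompletenessEvaluator after this change.
import os

from promptflow.core import AzureOpenAIModelConfiguration  # import path assumed; adjust to your promptflow version

from custom_evaluators.completeness import CompletenessEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"],
)

# The evaluator now always loads completeness.prompty from its own directory,
# so there is no prompty_filename argument to pass.
evaluator = CompletenessEvaluator(model_config)

result = evaluator(
    question="What color does TrailBlaze Hiking Pants come in?",
    answer="Khaki",
    truth="Khaki",
)
print(result)
```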