
fixed typos in README and simulation notebook for clarity #18


Merged (4 commits) on Jun 28, 2024
Changes from 2 commits
README.md: 16 changes (8 additions, 8 deletions)
@@ -54,12 +54,12 @@ Click on "Settings" from the left menu of Azure AI Studio, scroll down to "Conne

Once you set up those parameters, run:

```bash
# Note: make sure you run this command from the src/ directory so that your .env is written to the correct location (src/)
cd src
python provisioning/provision.py --export-env .env
```

```bash
# Note: make sure you run this command from the src/ directory so that your .env is written to the correct location (src/)
cd src
python provisioning/provision.py --export-env .env
```

The script will check whether the resources you specified exist; if they don't, it will create them. It will then construct a .env file for you that references the provisioned or referenced resources, including your keys. Once provisioning is complete, you'll be ready to move on to step 3.
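For reference, the generated src/.env is a flat list of KEY=value pairs. A minimal sketch, assuming only the variables referenced by the evaluation prompties in this PR (your provisioned file will contain more entries, including keys and search settings):

```bash
# Illustrative sketch of src/.env after provisioning; values are placeholders, not real credentials.
AZURE_OPENAI_ENDPOINT="https://<your-aoai-resource>.openai.azure.com/"
AZURE_OPENAI_API_VERSION="<api-version>"
AZURE_OPENAI_EVALUATION_DEPLOYMENT="<your-evaluation-deployment-name>"
```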

@@ -73,9 +73,9 @@ This step uses vector search with Azure OpenAI embeddings (e.g., ada-002) to enc

- Cognitive Services OpenAI Contributor
- Cognitive Services Contributor
- (optionally if you need quota view) Cognitive Services Usages Reader
- (optionally if you need AOAI quota view) Cognitive Services Usages Reader

Follow instructions on https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignment in your Azure OpenAI resource.
Follow instructions on https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignment in your Azure OpenAI resource. Note that Cognitive Services Usages Reader needs to be set at the subscription level.
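If you prefer the Azure CLI to the portal for that subscription-level assignment, a minimal sketch looks like the following (the principal and subscription IDs are placeholders, not values from this repo):

```bash
# Grant the quota-view role at subscription scope; replace the placeholders with your own IDs.
az role assignment create \
  --assignee "<your-principal-id>" \
  --role "Cognitive Services Usages Reader" \
  --scope "/subscriptions/<your-subscription-id>"
```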

Next, run the following script designed to streamline index creation. It builds the search index locally, and publishes it to your AI Studio project in the cloud.

@@ -144,7 +144,7 @@ python -m evaluation.evaluate  --evaluation-name quality_evals_contoso_retail
```
This command runs a single custom evaluator called "Completeness" on a much larger test set.
``` bash
python -m evaluation.evaluate_completeness  --evaluation-name completeness_evals_contoso_retail  --dataset-path=./evaluation/evaluation_dataset.jsonl --cot
python -m evaluation.evaluate_completeness  --evaluation-name completeness_evals_contoso_retail  --dataset-path=./evaluation/evaluation_dataset.jsonl
```
To run safety evaluations, you need to 1) simulate adversarial datasets (or provide your own) and 2) evaluate your copilot on the datasets.
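As a hedged sketch of that two-step flow, the commands might look like the following; the module names (evaluation.simulate, evaluation.evaluate_safety) are placeholders for illustration, not files confirmed to exist in this repo:

```bash
# Step 1 (placeholder module name): simulate an adversarial dataset.
python -m evaluation.simulate --output-path=./evaluation/adversarial_dataset.jsonl

# Step 2 (placeholder module name): evaluate the copilot on that dataset.
python -m evaluation.evaluate_safety --dataset-path=./evaluation/adversarial_dataset.jsonl
```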

@@ -1,94 +1,94 @@
---
name: Reasonableness
description: Evaluates reasonableness score for QA scenario
model:
api: chat
configuration:
type: azure_openai
azure_deployment: ${env:AZURE_OPENAI_EVALUATION_DEPLOYMENT}
api_version: ${env:AZURE_OPENAI_API_VERSION}
azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
parameters:
temperature: 0.0
max_tokens: 100
top_p: 1.0
presence_penalty: 0
frequency_penalty: 0
seed: 0
response_format:
type: text
inputs:
question:
type: string
answer:
type: string
truth:
type: string
---
system:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
user:
You are an expert specialized in quality and safety evaluation of responses from intelligent assistant systems to user queries. Given some inputs, your objective is to measure whether the generated answer is complete or not, in reference to the ground truth. The metric is based on the prompt template below, where an answer is considered complete if it doesn't miss a statement from the ground truth.
Use the following steps to respond to inputs.
Step 1: Extract all statements from TRUTH. If truth is an empty string, skip all remaining steps and output {"REASON": "No missing statements found.", "SCORE": 5}.
Step 2: Extract all statements from ANSWER.
Step 3: Pay extra attention to statements that involve numbers, dates, or proper nouns. Reason step-by-step and identify whether ANSWER misses any of the statements in TRUTH. Output those missing statements in REASON.
Step 4: Rate the completeness of ANSWER between one to five stars using the following scale:
One star: ANSWER is missing all of the statements in TRUTH.
Two stars: ANSWER has some statements, but it is missing all the critical statements necessary to answer the question.
Three stars: ANSWER has some statements, but it is missing some critical statements necessary to answer the question.
Four stars: ANSWER has most of the statements, but it is missing few statements which are not important to answer the question.
Five stars: ANSWER has all of the statements in the TRUTH.
Please assign a rating between 1 and 5 based on the completeness of the response. Output the rating in SCORE.
Independent Examples:
## Example Task #1 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Khaki", "TRUTH": "Khaki"}
## Example Task #1 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #2 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Red", "TRUTH": "Khaki"}
## Example Task #2 Output:
{"REASON": "missing statements: \n1. Khaki", "SCORE": 1}
## Example Task #3 Input:
{"QUESTION": "What purchases did Sarah Lee make and at what price point?", "ANSWER": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'RainGuard Hiking Jacket: $$110']", "TRUTH": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'TrekMaster Camping Chair: $$100', 'SkyView 2-Person Tent: $$200', 'RainGuard Hiking Jacket: $$110', 'CompactCook Camping Stove: $$60']"}
## Example Task #3 Output:
{"REASON": "missing statements: \n1. 'TrekMaster Camping Chair: $$100'\n2.'SkyView 2-Person Tent: $$200'\n3. 'CompactCook Camping Stove: $$60'", "SCORE": 3}
## Example Task #4 Input:
{"QUESTION": "How many TrailMaster X4 Tents did John Smith bought?", "ANSWER": "1", "TRUTH": "2"}
## Example Task #4 Output:
{"REASON": "missing statements: \n1. 2 tents were purchased by John Smith.", "SCORE": 1}
## Example Task #5 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are mostly water-proof except for rare, extreme weather conditions like hurricanes."}
## Example Task #5 Output:
{"REASON": "missing statements: \n1. Rare, extreme weather conditions like hurricanes would make TrailBlazeMaster pants not water-proof.", "SCORE": 4}
## Example Task #6 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are slightly water-proof."}
## Example Task #6 Output:
{"REASON": "missing statements: \n1. TrailBlazeMaster pants are only slightly water-proof.", "SCORE": 2}
## Example Task #7 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot assist with that.", "TRUTH": "Sorry, I cannot answer any questions unrelated to sports gear."}
## Example Task #7 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #8 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot provide answers unrelated to sports/gear", "TRUTH": ""}
## Example Task #8 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Actual Task Input:
{"QUESTION": {{question}}, "ANSWER": {{answer}}, "TRUTH": {{truth}}}
Reminder: The return values for each task should be an integer between 1 and 5. Do not repeat TRUTH, ANSWER or QUESTION.
---
name: Reasonableness
description: Evaluates reasonableness score for QA scenario
model:
api: chat
configuration:
type: azure_openai
azure_deployment: ${env:AZURE_OPENAI_EVALUATION_DEPLOYMENT}
api_version: ${env:AZURE_OPENAI_API_VERSION}
azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
parameters:
temperature: 0.0
max_tokens: 100
top_p: 1.0
presence_penalty: 0
frequency_penalty: 0
seed: 0
response_format:
type: text


inputs:
question:
type: string
answer:
type: string
truth:
type: string
---
system:
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
user:
You are an expert specialized in quality and safety evaluation of responses from intelligent assistant systems to user queries. Given some inputs, your objective is to measure whether the generated answer is complete or not, in reference to the ground truth. The metric is based on the prompt template below, where an answer is considered complete if it doesn't miss a statement from the ground truth.

Use the following steps to respond to inputs.

Step 1: Extract all statements from TRUTH. If truth is an empty string, skip all remaining steps and output {"REASON": "No missing statements found.", "SCORE": 5}.

Step 2: Extract all statements from ANSWER.

Step 3: Pay extra attention to statements that involve numbers, dates, or proper nouns. Reason step-by-step and identify whether ANSWER misses any of the statements in TRUTH. Output those missing statements in REASON.

Step 4: Rate the completeness of ANSWER between one to five stars using the following scale:

One star: ANSWER is missing all of the statements in TRUTH.

Two stars: ANSWER has some statements, but it is missing all the critical statements necessary to answer the question.

Three stars: ANSWER has some statements, but it is missing some critical statements necessary to answer the question.

Four stars: ANSWER has most of the statements, but it is missing few statements which are not important to answer the question.

Five stars: ANSWER has all of the statements in the TRUTH.

Please assign a rating between 1 and 5 based on the completeness of the response. Output the rating in SCORE.

Independent Examples:
## Example Task #1 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Khaki", "TRUTH": "Khaki"}
## Example Task #1 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #2 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Red", "TRUTH": "Khaki"}
## Example Task #2 Output:
{"REASON": "missing statements: \n1. Khaki", "SCORE": 1}
## Example Task #3 Input:
{"QUESTION": "What purchases did Sarah Lee make and at what price point?", "ANSWER": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'RainGuard Hiking Jacket: $$110']", "TRUTH": "['TrailMaster X4 Tent: $$250', 'CozyNights Sleeping Bag: $$100', 'TrailBlaze Hiking Pants: $$75', 'TrekMaster Camping Chair: $$100', 'SkyView 2-Person Tent: $$200', 'RainGuard Hiking Jacket: $$110', 'CompactCook Camping Stove: $$60']"}
## Example Task #3 Output:
{"REASON": "missing statements: \n1. 'TrekMaster Camping Chair: $$100'\n2.'SkyView 2-Person Tent: $$200'\n3. 'CompactCook Camping Stove: $$60'", "SCORE": 3}
## Example Task #4 Input:
{"QUESTION": "How many TrailMaster X4 Tents did John Smith bought?", "ANSWER": "1", "TRUTH": "2"}
## Example Task #4 Output:
{"REASON": "missing statements: \n1. 2 tents were purchased by John Smith.", "SCORE": 1}
## Example Task #5 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are mostly water-proof except for rare, extreme weather conditions like hurricanes."}
## Example Task #5 Output:
{"REASON": "missing statements: \n1. Rare, extreme weather conditions like hurricanes would make TrailBlazeMaster pants not water-proof.", "SCORE": 4}
## Example Task #6 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are slightly water-proof."}
## Example Task #6 Output:
{"REASON": "missing statements: \n1. TrailBlazeMaster pants are only slightly water-proof.", "SCORE": 2}
## Example Task #7 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot assist with that.", "TRUTH": "Sorry, I cannot answer any questions unrelated to sports gear."}
## Example Task #7 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #8 Input:
{"QUESTION": "Is a Belgium a country?", "ANSWER": "Sorry I cannot provide answers unrelated to sports/gear", "TRUTH": ""}
## Example Task #8 Output:
{"REASON": "No missing statements found.", "SCORE": 5}

## Actual Task Input:
{"QUESTION": {{question}}, "ANSWER": {{answer}}, "TRUTH": {{truth}}}
Reminder: The return values for each task should be an integer between 1 and 5. Do not repeat TRUTH, ANSWER or QUESTION.
## Actual Task Output:
src/custom_evaluators/completeness.py: 5 changes (3 additions, 2 deletions)
@@ -13,7 +13,7 @@


class CompletenessEvaluator:
def __init__(self, model_config: AzureOpenAIModelConfiguration, prompty_filename: str = "completeness.prompty"):
def __init__(self, model_config: AzureOpenAIModelConfiguration):
"""
Initialize an evaluator configured for a specific Azure OpenAI model.

@@ -37,7 +37,8 @@ def __init__(self, model_config: AzureOpenAIModelConfiguration, prompty_filename

prompty_model_config = {"configuration": model_config}
current_dir = os.path.dirname(__file__)
prompty_path = os.path.join(current_dir, prompty_filename)
prompty_path = os.path.join(current_dir, "completeness.prompty")
assert os.path.exists(prompty_path), f"Please specify a valid prompty file for completeness metric! The following path does not exist:\n{prompty_path}"
self._flow = load_flow(source=prompty_path, model=prompty_model_config)

def __call__(self, *, question: str, answer: str, truth: str, **kwargs):
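For context on how the changed constructor is used, here is a minimal usage sketch, not code from this repo: it assumes the promptflow-style import path for AzureOpenAIModelConfiguration and that you run it from src/ with the environment variables above set.

```python
# Illustrative sketch only: invoking CompletenessEvaluator after this change.
import os

from promptflow.core import AzureOpenAIModelConfiguration  # import path assumed; adjust to your promptflow version

from custom_evaluators.completeness import CompletenessEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["AZURE_OPENAI_EVALUATION_DEPLOYMENT"],
)

# The evaluator now always loads completeness.prompty from its own directory,
# so there is no prompty_filename argument to pass.
evaluator = CompletenessEvaluator(model_config)

result = evaluator(
    question="What color does TrailBlaze Hiking Pants come in?",
    answer="Khaki",
    truth="Khaki",
)
print(result)
```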