The script checks whether the resources you specified exist and creates them if they don't. It then constructs a .env file for you that references the provisioned or referenced resources, including your keys. Once provisioning is complete, you'll be ready to move to step 3.
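To make the shape of the generated .env concrete, here is a minimal sketch of a loader that reads `KEY=VALUE` lines into the process environment. This is an illustration only: the actual variable names written by the provisioning script are not shown here (the names in the comment are hypothetical), and a real project might use a library such as python-dotenv instead.

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: copies KEY=VALUE lines into os.environ.

    Sketch only -- variable names in the example below are
    illustrative, not the actual keys the provisioning script writes.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks and comments
            key, _, value = line.partition("=")
            # Don't clobber values already set in the environment.
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Hypothetical .env contents:
#   AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com/"
#   AZURE_OPENAI_API_KEY="<key>"
```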
This step uses vector search with Azure OpenAI embeddings (e.g., ada-002). You will need the following role assignments:

- Cognitive Services OpenAI Contributor
- Cognitive Services Contributor
- (optionally, if you need AOAI quota view) Cognitive Services Usages Reader

Follow the instructions at https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/role-based-access-control to add role assignments in your Azure OpenAI resource. Note that Cognitive Services Usages Reader needs to be set at the subscription level.
Next, run the following script, which streamlines index creation: it builds the search index locally and publishes it to your AI Studio project in the cloud.
This command generates evaluations on a much larger test set, producing built-in quality metrics such as groundedness and relevance as well as a custom evaluator called "friendliness". Learn more about the built-in quality metrics [here](https://learn.microsoft.com/en-us/azure/ai-studio/concepts/evaluation-metrics-built-in?tabs=warning#generation-quality-metrics).
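Custom evaluators like "friendliness" are typically plain callables that take the fields of one test row and return a dict of metric scores. The sketch below illustrates that calling convention only; the class name is hypothetical, and a trivial keyword heuristic stands in for the LLM prompt the real evaluator would use.

```python
class FriendlinessEvaluator:
    """Toy stand-in for a prompt-based custom evaluator.

    The real evaluator asks an LLM to grade friendliness; here a
    simple keyword heuristic makes the sketch runnable. Evaluators
    are callables: one test row in, a dict of metric scores out.
    """

    FRIENDLY_MARKERS = ("please", "thank", "happy to help", "glad")

    def __call__(self, *, response: str) -> dict:
        text = response.lower()
        hits = sum(marker in text for marker in self.FRIENDLY_MARKERS)
        # Map 0..4 marker hits onto a 1-5 scale.
        return {"friendliness": 1 + min(hits, 4)}

evaluator = FriendlinessEvaluator()
result = evaluator(response="Happy to help! Thank you for asking.")
# → {"friendliness": 3}
```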
You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.

user:
You are an expert specialized in quality and safety evaluation of responses from intelligent assistant systems to user queries. Given some inputs, your objective is to measure whether the generated answer is complete or not, with reference to the ground truth. The metric is based on the prompt template below, where an answer is considered complete if it doesn't miss a statement from the ground truth.

Use the following steps to respond to inputs.

Step 1: Extract all statements from TRUTH. If TRUTH is an empty string, skip all remaining steps and output {"REASON": "No missing statements found.", "SCORE": 5}.

Step 2: Extract all statements from ANSWER.

Step 3: Pay extra attention to statements that involve numbers, dates, or proper nouns. Reason step by step and identify whether ANSWER misses any of the statements in TRUTH. Output those missing statements in REASON.

Step 4: Rate the completeness of ANSWER between one and five stars using the following scale:

One star: ANSWER is missing all of the statements in TRUTH.

Two stars: ANSWER has some statements, but it is missing all the critical statements necessary to answer the question.

Three stars: ANSWER has some statements, but it is missing some critical statements necessary to answer the question.

Four stars: ANSWER has most of the statements, but it is missing a few statements which are not important to answer the question.

Five stars: ANSWER has all of the statements in TRUTH.

Please assign a rating between 1 and 5 based on the completeness of the response. Output the rating in SCORE.

Independent Examples:
## Example Task #1 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Khaki", "TRUTH": "Khaki"}
## Example Task #1 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #2 Input:
{"QUESTION": "What color does TrailBlaze Hiking Pants come in?", "ANSWER": "Red", "TRUTH": "Khaki"}
## Example Task #4 Input:
{"QUESTION": "How many TrailMaster X4 Tents did John Smith buy?", "ANSWER": "1", "TRUTH": "2"}
## Example Task #4 Output:
{"REASON": "missing statements: \n1. 2 tents were purchased by John Smith.", "SCORE": 1}
## Example Task #5 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are mostly water-proof except for rare, extreme weather conditions like hurricanes."}
## Example Task #5 Output:
{"REASON": "missing statements: \n1. Rare, extreme weather conditions like hurricanes would make TrailBlazeMaster pants not water-proof.", "SCORE": 4}
## Example Task #6 Input:
{"QUESTION": "How water-proof are TrailBlazeMaster pants?", "ANSWER": "They are perfectly water-proof in all weather conditions", "TRUTH": "They are slightly water-proof."}
## Example Task #6 Output:
{"REASON": "missing statements: \n1. TrailBlazeMaster pants are only slightly water-proof.", "SCORE": 2}
## Example Task #7 Input:
{"QUESTION": "Is Belgium a country?", "ANSWER": "Sorry, I cannot assist with that.", "TRUTH": "Sorry, I cannot answer any questions unrelated to sports gear."}
## Example Task #7 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
## Example Task #8 Input:
{"QUESTION": "Is Belgium a country?", "ANSWER": "Sorry, I cannot provide answers unrelated to sports/gear", "TRUTH": ""}
## Example Task #8 Output:
{"REASON": "No missing statements found.", "SCORE": 5}
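The grading procedure the prompt describes can be approximated in plain code. This sketch naively splits on periods to get "statements" and applies the empty-TRUTH short-circuit and the 1-5 star scale; it is an illustration of the steps, not the real evaluator, which relies on the LLM to extract and compare statements semantically. The score mapping from the fraction of missing statements is an assumption.

```python
def completeness(answer: str, truth: str) -> dict:
    """Deterministic approximation of the completeness metric.

    Statement extraction and matching are naive stand-ins for the
    LLM's semantic judgment in the actual prompt-based evaluator.
    """
    # Step 1: empty TRUTH short-circuits to a perfect score.
    truth_statements = [s.strip() for s in truth.split(".") if s.strip()]
    if not truth_statements:
        return {"REASON": "No missing statements found.", "SCORE": 5}

    # Steps 2-3: find TRUTH statements absent from ANSWER.
    answer_lower = answer.lower()
    missing = [s for s in truth_statements if s.lower() not in answer_lower]
    if not missing:
        return {"REASON": "No missing statements found.", "SCORE": 5}

    # Step 4: map the fraction of missing statements onto 1-5 stars
    # (the thresholds here are an illustrative assumption).
    frac = len(missing) / len(truth_statements)
    score = 1 if frac == 1 else 2 if frac > 0.5 else 3 if frac > 0.25 else 4
    return {"REASON": "missing statements: " + "; ".join(missing), "SCORE": score}
```

For instance, `completeness("Red", "Khaki")` scores 1, mirroring Example Task #2's fully missing answer, while an empty TRUTH scores 5 as in Example Task #8.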