Commit 5f03aae

thekaranacharya, zsimjee, and irgolic authored
Create PII Filter validator (#395)
* Add PII filter: v0
* Update PII filter
* Update class docstring, add more entities to filter, bugfix and update some comments
* Move logic to helper function, update DEV_REQUIREMENTS, add integration tests
* Fix linting
* Write code according to Python 3.9
* Add mocks for AnalyzerEngine and AnonymizerEngine
* Change | to Union
* Add notebook example demo for PIIFilter
* Add package imports
* Init pydantic_utils v2
* Adjust tests for pydantic2
* pydantic: Allow Dict/List field types (fix #319)
* test_validators: Fix SimilarToList validator test
* CI: handle pydantic v1 and v2 separately
* list => List
* parsing_utils: Type ignore
* pydantic_utils/v2: safe BareModel
* fix Makefile for poetry
* fix ci cache for pydantic versions
* Remove setup.py
* Update pyproject and poetry
* Linting fixes
* Remove types remain intact changes
* Strong type results to covariant Sequence, as suggested by Pyright
* Fix linting issues
* Change casting
* Add else condition for pii_entities to avoid unbound errors for entities_to_filter

---------

Co-authored-by: zsimjee <[email protected]>
Co-authored-by: Rafael Irgolic <[email protected]>
1 parent 91c5a1a commit 5f03aae

6 files changed: +1272 / -5 lines changed

docs/examples/check_for_pii.ipynb

Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Check whether an LLM response contains PII (Personally Identifiable Information)\n",
8+
"\n",
9+
"**Using the `PIIFilter` validator**\n",
10+
"\n",
11+
"This is a simple check that looks for the presence of a few common PII patterns.\n",
12+
"It is not intended to be a comprehensive check for PII; rather, it is a quick check that can be used to filter out responses that are likely to contain PII. It uses the Microsoft Presidio library to detect PII.\n"
13+
]
14+
},
15+
{
16+
"cell_type": "code",
17+
"execution_count": 1,
18+
"metadata": {},
19+
"outputs": [
20+
{
21+
"name": "stdout",
22+
"output_type": "stream",
23+
"text": [
24+
"\n",
25+
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
26+
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
27+
"\n",
28+
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n",
29+
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
30+
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
31+
"You can now load the package via spacy.load('en_core_web_lg')\n"
32+
]
33+
}
34+
],
35+
"source": [
36+
"# Install the necessary packages\n",
37+
"! pip install presidio-analyzer presidio-anonymizer -q\n",
38+
"! python -m spacy download en_core_web_lg -q"
39+
]
40+
},
41+
{
42+
"cell_type": "code",
43+
"execution_count": 2,
44+
"metadata": {},
45+
"outputs": [],
46+
"source": [
47+
"# Import the guardrails package\n",
48+
"import guardrails as gd\n",
49+
"from guardrails.validators import PIIFilter\n",
50+
"from rich import print"
51+
]
52+
},
53+
{
54+
"cell_type": "code",
55+
"execution_count": 3,
56+
"metadata": {},
57+
"outputs": [
58+
{
59+
"name": "stderr",
60+
"output_type": "stream",
61+
"text": [
62+
"/Users/karanacharya/guardrails-ai/guardrails/guardrails/rail.py:115: UserWarning: Prompt must be provided during __call__.\n",
63+
" warnings.warn(\"Prompt must be provided during __call__.\")\n"
64+
]
65+
}
66+
],
67+
"source": [
68+
"# Create Guard object with this validator\n",
69+
"# One can select either the predefined set of PII or SPI (Sensitive Personal Information) entities by passing \"pii\" or \"spi\" to the `pii_entities` argument.\n",
70+
"# The entity set can be passed either during initialization or later through the `metadata` argument of the `parse` method.\n",
71+
"\n",
72+
"# One can also pass in a list of entities supported by Presidio to the `pii_entities` argument.\n",
73+
"guard = gd.Guard.from_string(\n",
74+
" validators=[PIIFilter(pii_entities=\"pii\", on_fail=\"fix\")],\n",
75+
" description=\"testmeout\",\n",
76+
")"
77+
]
78+
},
79+
{
80+
"cell_type": "code",
81+
"execution_count": 4,
82+
"metadata": {},
83+
"outputs": [
84+
{
85+
"data": {
86+
"text/html": [
87+
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My email address is <span style=\"font-weight: bold\">&lt;</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">EMAIL_ADDRESS</span><span style=\"color: #000000; text-decoration-color: #000000\">&gt;, and my phone number is &lt;PHONE_NUMBER</span><span style=\"font-weight: bold\">&gt;</span>\n",
88+
"</pre>\n"
89+
],
90+
"text/plain": [
91+
"My email address is \u001b[1m<\u001b[0m\u001b[1;95mEMAIL_ADDRESS\u001b[0m\u001b[39m>, and my phone number is <PHONE_NUMBER\u001b[0m\u001b[1m>\u001b[0m\n"
92+
]
93+
},
94+
"metadata": {},
95+
"output_type": "display_data"
96+
}
97+
],
98+
"source": [
99+
"# Parse the text\n",
100+
"text = \"My email address is [email protected], and my phone number is 1234567890\"\n",
101+
"output = guard.parse(\n",
102+
" llm_output=text,\n",
103+
")\n",
104+
"\n",
105+
"# Print the output\n",
106+
"print(output)"
107+
]
108+
},
109+
{
110+
"cell_type": "markdown",
111+
"metadata": {},
112+
"source": [
113+
"Here, both EMAIL_ADDRESS and PHONE_NUMBER are detected as PII.\n"
114+
]
115+
},
116+
{
117+
"cell_type": "code",
118+
"execution_count": 5,
119+
"metadata": {},
120+
"outputs": [
121+
{
122+
"data": {
123+
"text/html": [
124+
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My email address is <span style=\"font-weight: bold\">&lt;</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">EMAIL_ADDRESS</span><span style=\"font-weight: bold\">&gt;</span>, and my phone number is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1234567890</span>\n",
125+
"</pre>\n"
126+
],
127+
"text/plain": [
128+
"My email address is \u001b[1m<\u001b[0m\u001b[1;95mEMAIL_ADDRESS\u001b[0m\u001b[1m>\u001b[0m, and my phone number is \u001b[1;36m1234567890\u001b[0m\n"
129+
]
130+
},
131+
"metadata": {},
132+
"output_type": "display_data"
133+
}
134+
],
135+
"source": [
136+
"# Now pass the entities through metadata, using the same guard object\n",
137+
"# This will take precedence over the entities passed in during initialization\n",
138+
"output = guard.parse(\n",
139+
" llm_output=text,\n",
140+
" metadata={\"pii_entities\": [\"EMAIL_ADDRESS\"]},\n",
141+
")\n",
142+
"\n",
143+
"# Print the output\n",
144+
"print(output)"
145+
]
146+
},
147+
{
148+
"cell_type": "markdown",
149+
"metadata": {},
150+
"source": [
151+
"As you can see, only EMAIL_ADDRESS is masked this time; PHONE_NUMBER is left untouched because it is not in the `pii_entities` list passed through the metadata.\n"
152+
]
153+
},
154+
{
155+
"cell_type": "code",
156+
"execution_count": 6,
157+
"metadata": {},
158+
"outputs": [],
159+
"source": [
160+
"# Let's try with SPI entities\n",
161+
"# Create a new guard object\n",
162+
"guard = gd.Guard.from_string(\n",
163+
" validators=[PIIFilter(pii_entities=\"spi\", on_fail=\"fix\")],\n",
164+
" description=\"testmeout\",\n",
165+
")"
166+
]
167+
},
168+
{
169+
"cell_type": "code",
170+
"execution_count": 7,
171+
"metadata": {},
172+
"outputs": [
173+
{
174+
"data": {
175+
"text/html": [
176+
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My email address is [email protected], and my account number is <span style=\"font-weight: bold\">&lt;</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">US_BANK_NUMBER</span><span style=\"font-weight: bold\">&gt;</span>.\n",
177+
"</pre>\n"
178+
],
179+
"text/plain": [
180+
"My email address is [email protected], and my account number is \u001b[1m<\u001b[0m\u001b[1;95mUS_BANK_NUMBER\u001b[0m\u001b[1m>\u001b[0m.\n"
181+
]
182+
},
183+
"metadata": {},
184+
"output_type": "display_data"
185+
}
186+
],
187+
"source": [
188+
"# Parse text\n",
189+
"text = \"My email address is [email protected], and my account number is 1234789012367654.\"\n",
190+
"\n",
191+
"output = guard.parse(\n",
192+
" llm_output=text,\n",
193+
")\n",
194+
"\n",
195+
"# Print the output\n",
196+
"print(output)"
197+
]
198+
},
199+
{
200+
"cell_type": "markdown",
201+
"metadata": {},
202+
"source": [
203+
"Here, only the US_BANK_NUMBER is detected as PII, as specified in the \"spi\" entities. Refer to the documentation for more information on the \"pii\" and \"spi\" entities. You can also pass in any [Presidio-supported entities](https://microsoft.github.io/presidio/supported_entities/) through the metadata.\n"
204+
]
205+
},
206+
{
207+
"cell_type": "code",
208+
"execution_count": 8,
209+
"metadata": {},
210+
"outputs": [
211+
{
212+
"data": {
213+
"text/html": [
214+
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My ITIN is <span style=\"font-weight: bold\">&lt;</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">US_ITIN</span><span style=\"color: #000000; text-decoration-color: #000000\">&gt; and my driver's license number is &lt;US_DRIVER_LICENSE</span><span style=\"font-weight: bold\">&gt;</span>\n",
215+
"</pre>\n"
216+
],
217+
"text/plain": [
218+
"My ITIN is \u001b[1m<\u001b[0m\u001b[1;95mUS_ITIN\u001b[0m\u001b[39m> and my driver's license number is <US_DRIVER_LICENSE\u001b[0m\u001b[1m>\u001b[0m\n"
219+
]
220+
},
221+
"metadata": {},
222+
"output_type": "display_data"
223+
}
224+
],
225+
"source": [
226+
"# Another example\n",
227+
"text = \"My ITIN is 923756789 and my driver's license number is 87651239\"\n",
228+
"\n",
229+
"output = guard.parse(\n",
230+
" llm_output=text,\n",
231+
" metadata={\"pii_entities\": [\"US_ITIN\", \"US_DRIVER_LICENSE\"]},\n",
232+
")\n",
233+
"\n",
234+
"# Print the output\n",
235+
"print(output)"
236+
]
237+
},
238+
{
239+
"cell_type": "markdown",
240+
"metadata": {},
241+
"source": [
242+
"#### In this way, any PII entity that you want to check for can be passed in through the metadata and masked by Guardrails in your LLM outputs. Of course, as with the other examples, you can integrate this into your own code and workflows through the complete Guard execution.\n"
243+
]
244+
}
245+
],
246+
"metadata": {
247+
"kernelspec": {
248+
"display_name": "guard-venv",
249+
"language": "python",
250+
"name": "python3"
251+
},
252+
"language_info": {
253+
"codemirror_mode": {
254+
"name": "ipython",
255+
"version": 3
256+
},
257+
"file_extension": ".py",
258+
"mimetype": "text/x-python",
259+
"name": "python",
260+
"nbconvert_exporter": "python",
261+
"pygments_lexer": "ipython3",
262+
"version": "3.11.6"
263+
}
264+
},
265+
"nbformat": 4,
266+
"nbformat_minor": 2
267+
}
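
The notebook above shows two behaviours: detected entities are replaced by `<ENTITY_TYPE>` placeholders, and a `pii_entities` list passed at parse time narrows which entity types are masked. The sketch below mimics that behaviour with plain regular expressions so it runs without Guardrails or Presidio installed. It is only an illustration, not the `PIIFilter` implementation: the `PATTERNS` table, the `mask_entities` helper, and the sample address are all hypothetical, and Presidio's real recognizers are far more thorough than these two patterns.

```python
import re
from typing import Dict, List, Optional

# Hypothetical, deliberately simple patterns (assumption: these are NOT
# Presidio's recognizers, just stand-ins for two of its entity names).
PATTERNS: Dict[str, re.Pattern] = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
}


def mask_entities(text: str, entities: Optional[List[str]] = None) -> str:
    """Replace every match of the selected entity types with <ENTITY_TYPE>."""
    # With no explicit list, mask every entity type we know about.
    selected = entities if entities is not None else list(PATTERNS)
    for name in selected:
        text = PATTERNS[name].sub(f"<{name}>", text)
    return text


sample = "My email address is jane@example.com, and my phone number is 1234567890"

# Default: both entity types are masked.
print(mask_entities(sample))

# A narrower list restricts masking to the email address only.
print(mask_entities(sample, entities=["EMAIL_ADDRESS"]))
```

Passing the entity list as an optional argument mirrors, loosely, how the notebook overrides the initialization-time setting with `metadata={"pii_entities": [...]}` on a per-call basis.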
