Create PII Filter validator #395
Merged
Changes from 11 commits
32 commits
3f09305 Add PII filter: v0 (thekaranacharya)
1972f2d Update PII filter (thekaranacharya)
604683e Update class docstring, add more entities to filter, bugfix and updat… (thekaranacharya)
7f9ef72 Update files (thekaranacharya)
0913a21 Move logic to helper function, update DEV_REQUIREMENTS, add integrati… (thekaranacharya)
1a5c0a5 Fix linting (thekaranacharya)
e18e05a Write code according to Python 3.9 (thekaranacharya)
59363d6 Add mocks for AnalyzerEngine and AnonymizerEngine (thekaranacharya)
cdacd19 Change | to Union (thekaranacharya)
93219f8 Add notebook example demo for PIIFilter (thekaranacharya)
51e3a32 Add package imports (thekaranacharya)
588a597 Merge branch 'main' into karan/pii (thekaranacharya)
273883f Merge branch 'main' into karan/pii (thekaranacharya)
484a335 merge main in (zsimjee)
8bffe25 Init pydantic_utils v2 (irgolic)
9d3746c Adjust tests for pydantic2 (irgolic)
69e8779 pydantic: Allow Dict/List field types (fix #319) (irgolic)
b2a0c6b test_validators: Fix SimilarToList validator test (irgolic)
62d7775 CI: handle pydantic v1 and v2 separately (irgolic)
3a3dd91 list => List (irgolic)
2277216 parsing_utils: Type ignore (irgolic)
c6c519a pydantic_utils/v2: safe BareModel (irgolic)
b006955 fix Makefile for poetry (irgolic)
7be1f1c fix ci cache for pydantic versions (irgolic)
d557192 Remove setup.py (thekaranacharya)
52db68c Update pyproject and poetry (thekaranacharya)
00d1b0f Linting fixes (thekaranacharya)
8016808 Remove types remain intact changes (thekaranacharya)
a0b604a Strong type results to covaraiant Sequence, as suggested by Pyright (thekaranacharya)
52d9f07 Fix linting issues (thekaranacharya)
ce25f45 Change casting (thekaranacharya)
117ab5a Add else condition for pii_entities to avoid unbound errors for entit… (thekaranacharya)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,267 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Check whether an LLM response contains PII (Personally Identifiable Information)\n", | ||
"\n", | ||
"**Using the `PIIFilter` validator**\n", | ||
"\n", | ||
"This is a simple check that looks for the presence of a few common PII patterns.\n", | ||
"It is not intended to be a comprehensive check for PII, but rather a quick check that can be used to filter out responses that are likely to contain PII. It uses the Microsoft Presidio library to check for PII.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\n", | ||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n", | ||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", | ||
"\n", | ||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m23.2.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m23.3\u001b[0m\n", | ||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", | ||
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", | ||
"You can now load the package via spacy.load('en_core_web_lg')\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Install the necessary packages\n", | ||
"! pip install presidio-analyzer presidio-anonymizer -q\n", | ||
"! python -m spacy download en_core_web_lg -q" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Import the guardrails package\n", | ||
"import guardrails as gd\n", | ||
"from guardrails.validators import PIIFilter\n", | ||
"from rich import print" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"/Users/karanacharya/guardrails-ai/guardrails/guardrails/rail.py:115: UserWarning: Prompt must be provided during __call__.\n", | ||
" warnings.warn(\"Prompt must be provided during __call__.\")\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Create Guard object with this validator\n", | ||
"# You can select a predefined set of PII or SPI (Sensitive Personal Information) entities by passing \"pii\" or \"spi\" to the `pii_entities` argument.\n", | ||
"# It can be passed either during initialization or later through the metadata argument of the parse method.\n", | ||
"\n", | ||
"# One can also pass in a list of entities supported by Presidio to the `pii_entities` argument.\n", | ||
"guard = gd.Guard.from_string(\n", | ||
" validators=[PIIFilter(pii_entities=\"pii\", on_fail=\"fix\")],\n", | ||
" description=\"testmeout\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My email address is <span style=\"font-weight: bold\"><</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">EMAIL_ADDRESS</span><span style=\"color: #000000; text-decoration-color: #000000\">>, and my phone number is <PHONE_NUMBER</span><span style=\"font-weight: bold\">></span>\n", | ||
"</pre>\n" | ||
], | ||
"text/plain": [ | ||
"My email address is \u001b[1m<\u001b[0m\u001b[1;95mEMAIL_ADDRESS\u001b[0m\u001b[39m>, and my phone number is <PHONE_NUMBER\u001b[0m\u001b[1m>\u001b[0m\n" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"# Parse the text\n", | ||
"text = \"My email address is [email protected], and my phone number is 1234567890\"\n", | ||
"output = guard.parse(\n", | ||
" llm_output=text,\n", | ||
")\n", | ||
"\n", | ||
"# Print the output\n", | ||
"print(output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Here, both EMAIL_ADDRESS and PHONE_NUMBER are detected as PII.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My email address is <span style=\"font-weight: bold\"><</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">EMAIL_ADDRESS</span><span style=\"font-weight: bold\">></span>, and my phone number is <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1234567890</span>\n", | ||
"</pre>\n" | ||
], | ||
"text/plain": [ | ||
"My email address is \u001b[1m<\u001b[0m\u001b[1;95mEMAIL_ADDRESS\u001b[0m\u001b[1m>\u001b[0m, and my phone number is \u001b[1;36m1234567890\u001b[0m\n" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"# Let's test passing the entities through metadata for the same guard object\n", | ||
"# This will take precedence over the entities passed in during initialization\n", | ||
"output = guard.parse(\n", | ||
" llm_output=text,\n", | ||
" metadata={\"pii_entities\": [\"EMAIL_ADDRESS\"]},\n", | ||
")\n", | ||
"\n", | ||
"# Print the output\n", | ||
"print(output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As you can see here, only EMAIL_ADDRESS is detected as PII, and the PHONE_NUMBER is not detected as PII.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Let's try with SPI entities\n", | ||
"# Create a new guard object\n", | ||
"guard = gd.Guard.from_string(\n", | ||
" validators=[PIIFilter(pii_entities=\"spi\", on_fail=\"fix\")],\n", | ||
" description=\"testmeout\",\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My email address is [email protected], and my account number is <span style=\"font-weight: bold\"><</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">US_BANK_NUMBER</span><span style=\"font-weight: bold\">></span>.\n", | ||
"</pre>\n" | ||
], | ||
"text/plain": [ | ||
"My email address is [email protected], and my account number is \u001b[1m<\u001b[0m\u001b[1;95mUS_BANK_NUMBER\u001b[0m\u001b[1m>\u001b[0m.\n" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"# Parse text\n", | ||
"text = \"My email address is [email protected], and my account number is 1234789012367654.\"\n", | ||
"\n", | ||
"output = guard.parse(\n", | ||
" llm_output=text,\n", | ||
")\n", | ||
"\n", | ||
"# Print the output\n", | ||
"print(output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Here, only US_BANK_NUMBER is detected, as specified by the \"spi\" entities; the email address is left unmasked. Refer to the documentation for more information on the \"pii\" and \"spi\" entities. You can also pass any [Presidio-supported entities](https://microsoft.github.io/presidio/supported_entities/) through the metadata.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/html": [ | ||
"<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">My ITIN is <span style=\"font-weight: bold\"><</span><span style=\"color: #ff00ff; text-decoration-color: #ff00ff; font-weight: bold\">US_ITIN</span><span style=\"color: #000000; text-decoration-color: #000000\">> and my driver's license number is <US_DRIVER_LICENSE</span><span style=\"font-weight: bold\">></span>\n", | ||
"</pre>\n" | ||
], | ||
"text/plain": [ | ||
"My ITIN is \u001b[1m<\u001b[0m\u001b[1;95mUS_ITIN\u001b[0m\u001b[39m> and my driver's license number is <US_DRIVER_LICENSE\u001b[0m\u001b[1m>\u001b[0m\n" | ||
] | ||
}, | ||
"metadata": {}, | ||
"output_type": "display_data" | ||
} | ||
], | ||
"source": [ | ||
"# Another example\n", | ||
"text = \"My ITIN is 923756789 and my driver's license number is 87651239\"\n", | ||
"\n", | ||
"output = guard.parse(\n", | ||
" llm_output=text,\n", | ||
" metadata={\"pii_entities\": [\"US_ITIN\", \"US_DRIVER_LICENSE\"]},\n", | ||
")\n", | ||
"\n", | ||
"# Print the output\n", | ||
"print(output)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### In this way, any PII entity that you want to check for can be passed in through the metadata and masked by Guardrails in your LLM outputs. Of course, as with the other examples, you can integrate this into your own code and workflows through the complete Guard execution.\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "guard-venv", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
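
The notebook above shows two behaviours: entities given through `metadata` override the ones given at initialization, and each detected span is replaced by an `<ENTITY_TYPE>` placeholder. The following is a minimal, standard-library-only sketch of that flow, not the validator's actual implementation. The names `ENTITY_SETS`, `PATTERNS`, `resolve_entities` and `mask` are hypothetical, the regexes are toy stand-ins for Presidio's recognizers, the shortcut sets list only the entities visible in the notebook outputs, and the sample email address is a placeholder.

```python
import re
from typing import Dict, List, Optional, Sequence, Union

# Assumption: illustrative shortcut sets. Only the memberships visible in the
# notebook outputs are included, not the validator's full lists.
ENTITY_SETS: Dict[str, List[str]] = {
    "pii": ["EMAIL_ADDRESS", "PHONE_NUMBER"],
    "spi": ["US_BANK_NUMBER"],
}

# Assumption: toy regexes standing in for Presidio's recognizers.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE_NUMBER": re.compile(r"\b\d{10}\b"),
    "US_BANK_NUMBER": re.compile(r"\b\d{16}\b"),
}


def resolve_entities(
    init_entities: Optional[Union[str, Sequence[str]]],
    metadata: Optional[dict] = None,
) -> List[str]:
    """Pick the entity list, letting metadata override the init-time value."""
    chosen = (metadata or {}).get("pii_entities", init_entities)
    if chosen is None:
        raise ValueError("pii_entities must be given at init or in metadata")
    if isinstance(chosen, str):
        # A string is treated as the name of a predefined set.
        if chosen not in ENTITY_SETS:
            raise ValueError(f"unknown entity set: {chosen!r}")
        return list(ENTITY_SETS[chosen])
    return list(chosen)


def mask(text: str, entities: Sequence[str]) -> str:
    """Replace every match of each requested entity with <ENTITY_NAME>."""
    for entity in entities:
        pattern = PATTERNS.get(entity)
        if pattern is None:
            continue  # this toy sketch has no recognizer for the entity
        text = pattern.sub(f"<{entity}>", text)
    return text


text = "My email address is jane.doe@example.com, and my phone number is 1234567890"

# Init-time "pii" set: both the email address and the phone number are masked.
print(mask(text, resolve_entities("pii")))

# Metadata override: only the email address is masked.
print(mask(text, resolve_entities("pii", {"pii_entities": ["EMAIL_ADDRESS"]})))
```

Resolving the entity list on every call, rather than once at construction, is what lets a single guard object be reused with different entity lists, as the notebook does with `guard.parse(..., metadata=...)`.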