Commit 6298034

[text analytics] add back PII endpoint (Azure#12673)
1 parent 7823466 commit 6298034

82 files changed (+7454, -35 lines changed)

sdk/textanalytics/azure-ai-textanalytics/CHANGELOG.md

+2
Original file line number | Diff line number | Diff line change
@@ -2,8 +2,10 @@
22

33
## 5.0.1 (Unreleased)
44

5+
**New features**
56
- We are now targeting the service's v3.1-preview.1 API as the default. If you would like to still use version v3.0 of the service,
67
pass in `v3.0` to the kwarg `api_version` when creating your TextAnalyticsClient
8+
- We have added an API `recognize_pii_entities` which returns entities containing personal information for a batch of documents. Only available for API version v3.1-preview.1 and up.
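
As a rough illustration of the first note above, the sketch below pins a client to v3.0 by passing the `api_version` keyword. The helper name `make_v3_0_client` and the placeholder endpoint and key are assumptions made for illustration; only the `api_version` keyword and the `v3.0` value come from the changelog. The SDK import is deferred so the snippet can be loaded even where `azure-ai-textanalytics` is not installed.

```python
def make_v3_0_client(endpoint, api_key, api_version="v3.0"):
    """Build a TextAnalyticsClient pinned to an explicit service API version.

    Hypothetical helper: only the `api_version` keyword is taken from the changelog.
    """
    # Imported lazily so this module loads without the SDK installed.
    from azure.core.credentials import AzureKeyCredential
    from azure.ai.textanalytics import TextAnalyticsClient

    return TextAnalyticsClient(
        endpoint, AzureKeyCredential(api_key), api_version=api_version
    )
```

A call would look like `make_v3_0_client("https://<region>.api.cognitive.microsoft.com/", "<api_key>")`. According to this commit, `recognize_pii_entities` raises `NotImplementedError` on a client pinned this way.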
79

810
## 5.0.0 (2020-07-27)
911

sdk/textanalytics/azure-ai-textanalytics/README.md

+35
@@ -4,6 +4,7 @@ Text Analytics is a cloud-based service that provides advanced natural language
44
* Sentiment Analysis
55
* Named Entity Recognition
66
* Linked Entity Recognition
7+
* Personally Identifiable Information (PII) Entity Recognition
78
* Language Detection
89
* Key Phrase Extraction
910

@@ -184,6 +185,7 @@ The following section provides several code snippets covering some of the most c
184185
* [Analyze Sentiment](#analyze-sentiment "Analyze sentiment")
185186
* [Recognize Entities](#recognize-entities "Recognize entities")
186187
* [Recognize Linked Entities](#recognize-linked-entities "Recognize linked entities")
188+
* [Recognize PII Entities](#recognize-pii-entities "Recognize pii entities")
187189
* [Extract Key Phrases](#extract-key-phrases "Extract key phrases")
188190
* [Detect Language](#detect-language "Detect language")
189191

@@ -290,6 +292,35 @@ The returned response is a heterogeneous list of result and error objects: list[
290292
Please refer to the service documentation for a conceptual discussion of [entity linking][linked_entity_recognition]
291293
and [supported types][linked_entities_categories].
292294

295+
### Recognize PII entities
296+
[recognize_pii_entities][recognize_pii_entities] recognizes and categorizes Personally Identifiable Information (PII) entities in its input text, such as
297+
Social Security Numbers, bank account information, credit card numbers, and more. This endpoint is only available for v3.1-preview.1 and up.
298+
299+
```python
300+
from azure.core.credentials import AzureKeyCredential
301+
from azure.ai.textanalytics import TextAnalyticsClient, ApiVersion
302+
303+
credential = AzureKeyCredential("<api_key>")
304+
endpoint="https://<region>.api.cognitive.microsoft.com/"
305+
306+
text_analytics_client = TextAnalyticsClient(endpoint, credential)
307+
308+
documents = [
309+
"The employee's SSN is 859-98-0987.",
310+
"The employee's phone number is 555-555-5555."
311+
]
312+
response = text_analytics_client.recognize_pii_entities(documents, language="en")
313+
result = [doc for doc in response if not doc.is_error]
314+
for doc in result:
315+
for entity in doc.entities:
316+
print("Entity: \t", entity.text, "\tCategory: \t", entity.category,
317+
"\tConfidence Score: \t", entity.confidence_score)
318+
```
319+
320+
The returned response is a heterogeneous list of result and error objects: list[[RecognizePiiEntitiesResult][recognize_pii_entities_result], [DocumentError][document_error]]
321+
322+
Please refer to the service documentation for [supported PII entity types][pii_entity_categories].
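
Because the response mixes result and error objects, it can help to report the failures rather than silently filter them out. The following is a minimal sketch that runs without the service: the `SimpleNamespace` objects are stand-ins that only mimic the attributes used here (`is_error`, `id`, `entities`, `error.code`, `error.message`), and the sample values are invented for illustration.

```python
from types import SimpleNamespace


def split_results(response):
    """Partition a heterogeneous response into (successes, errors)."""
    successes = [doc for doc in response if not doc.is_error]
    errors = [doc for doc in response if doc.is_error]
    return successes, errors


# Stand-in objects, not real SDK types; values are illustrative only.
response = [
    SimpleNamespace(
        id="0",
        is_error=False,
        entities=[SimpleNamespace(text="859-98-0987", category="SSN", confidence_score=0.65)],
    ),
    SimpleNamespace(
        id="1",
        is_error=True,
        error=SimpleNamespace(code="InvalidDocument", message="Document text is empty."),
    ),
]

successes, errors = split_results(response)
for doc in errors:
    print("Document {} failed: {} - {}".format(doc.id, doc.error.code, doc.error.message))
```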
323+
293324
### Extract key phrases
294325
[extract_key_phrases][extract_key_phrases] determines the main talking points in its input text. For example, for the input text "The food was delicious and there were wonderful staff", the API returns: "food" and "wonderful staff".
295326

@@ -412,6 +443,7 @@ Authenticate the client with a Cognitive Services/Text Analytics API key or a to
412443
In a batch of documents:
413444
* Analyze sentiment: [sample_analyze_sentiment.py][analyze_sentiment_sample] ([async version][analyze_sentiment_sample_async])
414445
* Recognize entities: [sample_recognize_entities.py][recognize_entities_sample] ([async version][recognize_entities_sample_async])
446+
* Recognize personally identifiable information: [sample_recognize_pii_entities.py](https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/textanalytics/azure-ai-textanalytics/samples/sample_recognize_pii_entities.py) ([async version](https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/textanalytics/azure-ai-textanalytics/samples/async_samples/sample_recognize_pii_entities_async.py))
415447
* Recognize linked entities: [sample_recognize_linked_entities.py][recognize_linked_entities_sample] ([async version][recognize_linked_entities_sample_async])
416448
* Extract key phrases: [sample_extract_key_phrases.py][extract_key_phrases_sample] ([async version][extract_key_phrases_sample_async])
417449
* Detect language: [sample_detect_language.py][detect_language_sample] ([async version][detect_language_sample_async])
@@ -458,6 +490,7 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
458490
[document_error]: https://aka.ms/azsdk-python-textanalytics-documenterror
459491
[detect_language_result]: https://aka.ms/azsdk-python-textanalytics-detectlanguageresult
460492
[recognize_entities_result]: https://aka.ms/azsdk-python-textanalytics-recognizeentitiesresult
493+
[recognize_pii_entities_result]: https://aka.ms/azsdk-python-textanalytics-recognizepiientitiesresult
461494
[recognize_linked_entities_result]: https://aka.ms/azsdk-python-textanalytics-recognizelinkedentitiesresult
462495
[analyze_sentiment_result]: https://aka.ms/azsdk-python-textanalytics-analyzesentimentresult
463496
[extract_key_phrases_result]: https://aka.ms/azsdk-python-textanalytics-extractkeyphrasesresult
@@ -467,6 +500,7 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
467500

468501
[analyze_sentiment]: https://aka.ms/azsdk-python-textanalytics-analyzesentiment
469502
[recognize_entities]: https://aka.ms/azsdk-python-textanalytics-recognizeentities
503+
[recognize_pii_entities]: https://aka.ms/azsdk-python-textanalytics-recognizepiientities
470504
[recognize_linked_entities]: https://aka.ms/azsdk-python-textanalytics-recognizelinkedentities
471505
[extract_key_phrases]: https://aka.ms/azsdk-python-textanalytics-extractkeyphrases
472506
[detect_language]: https://aka.ms/azsdk-python-textanalytics-detectlanguage
@@ -477,6 +511,7 @@ This project has adopted the [Microsoft Open Source Code of Conduct][code_of_con
477511
[key_phrase_extraction]: https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-keyword-extraction
478512
[linked_entities_categories]: https://docs.microsoft.com/azure/cognitive-services/text-analytics/named-entity-types?tabs=general
479513
[linked_entity_recognition]: https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking
514+
[pii_entity_categories]: https://docs.microsoft.com/azure/cognitive-services/text-analytics/named-entity-types?tabs=personal
480515
[named_entity_recognition]: https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-how-to-entity-linking
481516
[named_entity_categories]: https://docs.microsoft.com/azure/cognitive-services/text-analytics/named-entity-types?tabs=general
482517

sdk/textanalytics/azure-ai-textanalytics/azure/ai/textanalytics/__init__.py

+6 -2
@@ -25,7 +25,9 @@
2525
LinkedEntityMatch,
2626
TextDocumentBatchStatistics,
2727
SentenceSentiment,
28-
SentimentConfidenceScores
28+
SentimentConfidenceScores,
29+
RecognizePiiEntitiesResult,
30+
PiiEntity
2931
)
3032

3133
__all__ = [
@@ -48,7 +50,9 @@
4850
'LinkedEntityMatch',
4951
'TextDocumentBatchStatistics',
5052
'SentenceSentiment',
51-
'SentimentConfidenceScores'
53+
'SentimentConfidenceScores',
54+
'RecognizePiiEntitiesResult',
55+
'PiiEntity',
5256
]
5357

5458
__version__ = VERSION

sdk/textanalytics/azure-ai-textanalytics/azure/ai/textanalytics/_models.py

+73 -6
@@ -102,7 +102,7 @@ class RecognizeEntitiesResult(DictMixin):
102102
:vartype entities:
103103
list[~azure.ai.textanalytics.CategorizedEntity]
104104
:ivar warnings: Warnings encountered while processing document. Results will still be returned
105-
if there are warnings, but they may not be fully accurate.
105+
if there are warnings, but they may not be fully accurate.
106106
:vartype warnings: list[~azure.ai.textanalytics.TextAnalyticsWarning]
107107
:ivar statistics: If show_stats=true was specified in the request this
108108
field will contain information about the document payload.
@@ -124,6 +124,40 @@ def __repr__(self):
124124
.format(self.id, repr(self.entities), repr(self.warnings), repr(self.statistics), self.is_error)[:1024]
125125

126126

127+
class RecognizePiiEntitiesResult(DictMixin):
128+
"""RecognizePiiEntitiesResult is a result object which contains
129+
the recognized Personally Identifiable Information (PII) entities
130+
from a particular document.
131+
132+
:ivar str id: Unique, non-empty document identifier that matches the
133+
document id that was passed in with the request. If not specified
134+
in the request, an id is assigned for the document.
135+
:ivar entities: Recognized PII entities in the document.
136+
:vartype entities:
137+
list[~azure.ai.textanalytics.PiiEntity]
138+
:ivar warnings: Warnings encountered while processing document. Results will still be returned
139+
if there are warnings, but they may not be fully accurate.
140+
:vartype warnings: list[~azure.ai.textanalytics.TextAnalyticsWarning]
141+
:ivar statistics: If show_stats=true was specified in the request this
142+
field will contain information about the document payload.
143+
:vartype statistics:
144+
~azure.ai.textanalytics.TextDocumentStatistics
145+
:ivar bool is_error: Boolean check for error item when iterating over list of
146+
results. Always False for an instance of a RecognizePiiEntitiesResult.
147+
"""
148+
149+
def __init__(self, **kwargs):
150+
self.id = kwargs.get("id", None)
151+
self.entities = kwargs.get("entities", None)
152+
self.warnings = kwargs.get("warnings", [])
153+
self.statistics = kwargs.get("statistics", None)
154+
self.is_error = False
155+
156+
def __repr__(self):
157+
return "RecognizePiiEntitiesResult(id={}, entities={}, warnings={}, statistics={}, is_error={})" \
158+
.format(self.id, repr(self.entities), repr(self.warnings), repr(self.statistics), self.is_error)[:1024]
159+
160+
127161
class DetectLanguageResult(DictMixin):
128162
"""DetectLanguageResult is a result object which contains
129163
the detected language of a particular document.
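
The `RecognizePiiEntitiesResult` class added in the hunk above reads its fields from keyword arguments and caps its `__repr__` at 1024 characters. A minimal sketch of that pattern follows; `ResultSketch` is a hypothetical stand-in that omits the `DictMixin` base and the `statistics` field.

```python
class ResultSketch(object):
    """Hypothetical stand-in for a per-document result model."""

    def __init__(self, **kwargs):
        self.id = kwargs.get("id", None)
        self.entities = kwargs.get("entities", None)
        # The [] default is evaluated on every call, so instances never share a list.
        self.warnings = kwargs.get("warnings", [])
        # Result objects always report is_error=False; error objects report True.
        self.is_error = False

    def __repr__(self):
        # Slice the formatted string so a very long entity list cannot flood logs.
        return "ResultSketch(id={}, entities={}, warnings={}, is_error={})".format(
            self.id, repr(self.entities), repr(self.warnings), self.is_error
        )[:1024]


big = ResultSketch(id="1", entities=["x" * 50] * 100)
print(len(repr(big)))
```

The slice trades completeness for bounded output: a truncated repr may end mid-value, so it is meant for debugging rather than for parsing.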
@@ -135,7 +169,7 @@ class DetectLanguageResult(DictMixin):
135169
:ivar primary_language: The primary language detected in the document.
136170
:vartype primary_language: ~azure.ai.textanalytics.DetectedLanguage
137171
:ivar warnings: Warnings encountered while processing document. Results will still be returned
138-
if there are warnings, but they may not be fully accurate.
172+
if there are warnings, but they may not be fully accurate.
139173
:vartype warnings: list[~azure.ai.textanalytics.TextAnalyticsWarning]
140174
:ivar statistics: If show_stats=true was specified in the request this
141175
field will contain information about the document payload.
@@ -193,6 +227,39 @@ def __repr__(self):
193227
self.text, self.category, self.subcategory, self.confidence_score
194228
)[:1024]
195229

230+
class PiiEntity(DictMixin):
231+
"""PiiEntity contains information about a Personally Identifiable
232+
Information (PII) entity found in text.
233+
234+
:ivar str text: Entity text as appears in the request.
235+
:ivar str category: Entity category, such as Financial Account
236+
Identification/Social Security Number/Phone Number, etc.
237+
:ivar str subcategory: Entity subcategory, such as Credit Card/EU
238+
Phone number/ABA Routing Numbers, etc.
239+
:ivar float confidence_score: Confidence score between 0 and 1 of the extracted
240+
entity.
241+
"""
242+
243+
def __init__(self, **kwargs):
244+
self.text = kwargs.get('text', None)
245+
self.category = kwargs.get('category', None)
246+
self.subcategory = kwargs.get('subcategory', None)
247+
self.confidence_score = kwargs.get('confidence_score', None)
248+
249+
@classmethod
250+
def _from_generated(cls, entity):
251+
return cls(
252+
text=entity.text,
253+
category=entity.category,
254+
subcategory=entity.subcategory,
255+
confidence_score=entity.confidence_score,
256+
)
257+
258+
def __repr__(self):
259+
return "PiiEntity(text={}, category={}, subcategory={}, confidence_score={})".format(
260+
self.text, self.category, self.subcategory, self.confidence_score
261+
)[:1024]
262+
196263

197264
class TextAnalyticsError(DictMixin):
198265
"""TextAnalyticsError contains the error code, message, and
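
The `_from_generated` classmethod on `PiiEntity` copies selected fields from a generated wire model into the public model. The sketch below shows the idea with stand-ins: `PiiEntitySketch` and the `generated` object are hypothetical, and the extra `offset` field is an assumption used only to show that fields not copied explicitly do not leak into the public object.

```python
from types import SimpleNamespace


class PiiEntitySketch(object):
    """Hypothetical public model exposing only the four documented fields."""

    def __init__(self, **kwargs):
        self.text = kwargs.get("text", None)
        self.category = kwargs.get("category", None)
        self.subcategory = kwargs.get("subcategory", None)
        self.confidence_score = kwargs.get("confidence_score", None)

    @classmethod
    def _from_generated(cls, entity):
        # Copy fields one by one so the public surface stays explicit.
        return cls(
            text=entity.text,
            category=entity.category,
            subcategory=entity.subcategory,
            confidence_score=entity.confidence_score,
        )


# Stand-in for a generated model; "offset" is an assumed extra wire field.
generated = SimpleNamespace(
    text="555-555-5555",
    category="Phone Number",
    subcategory=None,
    confidence_score=0.8,
    offset=31,
)
entity = PiiEntitySketch._from_generated(generated)
print(entity.text, entity.category, entity.confidence_score)
```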
@@ -274,7 +341,7 @@ class ExtractKeyPhrasesResult(DictMixin):
274341
in the input document.
275342
:vartype key_phrases: list[str]
276343
:ivar warnings: Warnings encountered while processing document. Results will still be returned
277-
if there are warnings, but they may not be fully accurate.
344+
if there are warnings, but they may not be fully accurate.
278345
:vartype warnings: list[~azure.ai.textanalytics.TextAnalyticsWarning]
279346
:ivar statistics: If show_stats=true was specified in the request this
280347
field will contain information about the document payload.
@@ -308,7 +375,7 @@ class RecognizeLinkedEntitiesResult(DictMixin):
308375
:vartype entities:
309376
list[~azure.ai.textanalytics.LinkedEntity]
310377
:ivar warnings: Warnings encountered while processing document. Results will still be returned
311-
if there are warnings, but they may not be fully accurate.
378+
if there are warnings, but they may not be fully accurate.
312379
:vartype warnings: list[~azure.ai.textanalytics.TextAnalyticsWarning]
313380
:ivar statistics: If show_stats=true was specified in the request this
314381
field will contain information about the document payload.
@@ -344,7 +411,7 @@ class AnalyzeSentimentResult(DictMixin):
344411
'neutral', 'negative', 'mixed'
345412
:vartype sentiment: str
346413
:ivar warnings: Warnings encountered while processing document. Results will still be returned
347-
if there are warnings, but they may not be fully accurate.
414+
if there are warnings, but they may not be fully accurate.
348415
:vartype warnings: list[~azure.ai.textanalytics.TextAnalyticsWarning]
349416
:ivar statistics: If show_stats=true was specified in the request this
350417
field will contain information about the document payload.
@@ -429,7 +496,7 @@ def __init__(self, **kwargs):
429496
def __getattr__(self, attr):
430497
result_set = set()
431498
result_set.update(
432-
RecognizeEntitiesResult().keys()
499+
RecognizeEntitiesResult().keys() + RecognizePiiEntitiesResult().keys()
433500
+ DetectLanguageResult().keys() + RecognizeLinkedEntitiesResult().keys()
434501
+ AnalyzeSentimentResult().keys() + ExtractKeyPhrasesResult().keys()
435502
)
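
The hunk above only shows how the set of result attribute names is assembled inside `__getattr__`; what the method does with that set is not visible in this diff. One plausible use, sketched below under that assumption, is to raise a more helpful `AttributeError` when code reads a result-only attribute on an error object. All class names here are hypothetical stand-ins.

```python
class SketchMixin(object):
    """Hypothetical stand-in for a dict-like base class."""

    def keys(self):
        # Return a list (not a dict view) so key lists can be joined with "+".
        return list(self.__dict__.keys())


class EntitiesResultSketch(SketchMixin):
    def __init__(self, **kwargs):
        self.id = kwargs.get("id", None)
        self.entities = kwargs.get("entities", None)
        self.is_error = False


class LanguageResultSketch(SketchMixin):
    def __init__(self, **kwargs):
        self.id = kwargs.get("id", None)
        self.primary_language = kwargs.get("primary_language", None)
        self.is_error = False


class ErrorSketch(SketchMixin):
    def __init__(self, **kwargs):
        self.id = kwargs.get("id", None)
        self.error = kwargs.get("error", None)
        self.is_error = True

    def __getattr__(self, attr):
        # Python calls __getattr__ only after normal attribute lookup fails.
        result_attrs = set(EntitiesResultSketch().keys() + LanguageResultSketch().keys())
        result_attrs -= set(ErrorSketch().keys())
        if attr in result_attrs:
            raise AttributeError(
                "'ErrorSketch' object has no attribute '{}'. "
                "Document {} could not be processed: {}".format(attr, self.id, self.error)
            )
        raise AttributeError("'ErrorSketch' object has no attribute '{}'".format(attr))


err = ErrorSketch(id="3", error="InvalidDocument")
try:
    err.entities
except AttributeError as exc:
    print(exc)
```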

sdk/textanalytics/azure-ai-textanalytics/azure/ai/textanalytics/_response_handlers.py

+12 -1
@@ -24,7 +24,9 @@
2424
DocumentError,
2525
SentimentConfidenceScores,
2626
TextAnalyticsError,
27-
TextAnalyticsWarning
27+
TextAnalyticsWarning,
28+
RecognizePiiEntitiesResult,
29+
PiiEntity,
2830
)
2931

3032
def _get_too_many_documents_error(obj):
@@ -162,3 +164,12 @@ def sentiment_result(sentiment):
162164
confidence_scores=SentimentConfidenceScores._from_generated(sentiment.confidence_scores), # pylint: disable=protected-access
163165
sentences=[SentenceSentiment._from_generated(s) for s in sentiment.sentences], # pylint: disable=protected-access
164166
)
167+
168+
@prepare_result
169+
def pii_entities_result(entity):
170+
return RecognizePiiEntitiesResult(
171+
id=entity.id,
172+
entities=[PiiEntity._from_generated(e) for e in entity.entities], # pylint: disable=protected-access
173+
warnings=[TextAnalyticsWarning._from_generated(w) for w in entity.warnings], # pylint: disable=protected-access
174+
statistics=TextDocumentStatistics._from_generated(entity.statistics), # pylint: disable=protected-access
175+
)
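
The `@prepare_result` decorator used above is not part of this diff, so its behavior cannot be confirmed from here. The sketch below is one guess at the general shape of such a decorator: it converts each successful document with the wrapped function, wraps failures as error objects, and returns everything in the order of the original request. The names, the `(request_ids, service_response)` signature, and the stand-in objects are all assumptions.

```python
from types import SimpleNamespace


def prepare_result_sketch(convert):
    """Hypothetical decorator: convert successes, keep errors, restore input order."""

    def wrapper(request_ids, service_response):
        by_id = {}
        for doc in service_response.documents:
            by_id[doc.id] = convert(doc)
        for err in service_response.errors:
            by_id[err.id] = SimpleNamespace(id=err.id, error=err.error, is_error=True)
        # Services may return successes and errors separately; re-interleave them.
        return [by_id[doc_id] for doc_id in request_ids]

    return wrapper


@prepare_result_sketch
def pii_result_sketch(doc):
    return SimpleNamespace(id=doc.id, entities=list(doc.entities), is_error=False)


service_response = SimpleNamespace(
    documents=[SimpleNamespace(id="2", entities=["a"]), SimpleNamespace(id="0", entities=[])],
    errors=[SimpleNamespace(id="1", error="InvalidDocument")],
)
ordered = pii_result_sketch(["0", "1", "2"], service_response)
print([(item.id, item.is_error) for item in ordered])
```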

sdk/textanalytics/azure-ai-textanalytics/azure/ai/textanalytics/_text_analytics_client.py

+74 -1
@@ -22,7 +22,8 @@
2222
linked_entities_result,
2323
key_phrases_result,
2424
sentiment_result,
25-
language_result
25+
language_result,
26+
pii_entities_result
2627
)
2728

2829
if TYPE_CHECKING:
@@ -36,6 +37,7 @@
3637
ExtractKeyPhrasesResult,
3738
AnalyzeSentimentResult,
3839
DocumentError,
40+
RecognizePiiEntitiesResult,
3941
)
4042

4143

@@ -222,6 +224,77 @@ def recognize_entities( # type: ignore
222224
except HttpResponseError as error:
223225
process_batch_error(error)
224226

227+
@distributed_trace
228+
def recognize_pii_entities( # type: ignore
229+
self,
230+
documents, # type: Union[List[str], List[TextDocumentInput], List[Dict[str, str]]]
231+
**kwargs # type: Any
232+
):
233+
# type: (...) -> List[Union[RecognizePiiEntitiesResult, DocumentError]]
234+
"""Recognize entities containing personal information for a batch of documents.
235+
236+
Returns a list of personal information entities ("SSN",
237+
"Bank Account", etc) in the document. For the list of supported entity types,
238+
check https://aka.ms/tanerpii
239+
240+
See https://docs.microsoft.com/azure/cognitive-services/text-analytics/overview#data-limits
241+
for document length limits, maximum batch size, and supported text encoding.
242+
243+
:param documents: The set of documents to process as part of this batch.
244+
If you wish to specify the ID and language on a per-item basis you must
245+
use as input a list[:class:`~azure.ai.textanalytics.TextDocumentInput`] or a list of
246+
dict representations of :class:`~azure.ai.textanalytics.TextDocumentInput`, like
247+
`{"id": "1", "language": "en", "text": "hello world"}`.
248+
:type documents:
249+
list[str] or list[~azure.ai.textanalytics.TextDocumentInput] or
250+
list[dict[str, str]]
251+
:keyword str language: The 2 letter ISO 639-1 representation of language for the
252+
entire batch. For example, use "en" for English; "es" for Spanish etc.
253+
If not set, uses "en" for English as default. Per-document language will
254+
take precedence over whole batch language. See https://aka.ms/talangs for
255+
supported languages in Text Analytics API.
256+
:keyword str model_version: This value indicates which model will
257+
be used for scoring, e.g. "latest", "2019-10-01". If a model-version
258+
is not specified, the API will default to the latest, non-preview version.
259+
:keyword bool show_stats: If set to true, response will contain document level statistics.
260+
:return: The combined list of :class:`~azure.ai.textanalytics.RecognizePiiEntitiesResult`
261+
and :class:`~azure.ai.textanalytics.DocumentError` in the order the original documents
262+
were passed in.
263+
:rtype: list[~azure.ai.textanalytics.RecognizePiiEntitiesResult,
264+
~azure.ai.textanalytics.DocumentError]
265+
:raises ~azure.core.exceptions.HttpResponseError or TypeError or ValueError or NotImplementedError:
266+
267+
.. admonition:: Example:
268+
269+
.. literalinclude:: ../samples/sample_recognize_pii_entities.py
270+
:start-after: [START batch_recognize_pii_entities]
271+
:end-before: [END batch_recognize_pii_entities]
272+
:language: python
273+
:dedent: 8
274+
:caption: Recognize personally identifiable information entities in a batch of documents.
275+
"""
276+
language_arg = kwargs.pop("language", None)
277+
language = language_arg if language_arg is not None else self._default_language
278+
docs = _validate_batch_input(documents, "language", language)
279+
model_version = kwargs.pop("model_version", None)
280+
show_stats = kwargs.pop("show_stats", False)
281+
try:
282+
return self._client.entities_recognition_pii(
283+
documents=docs,
284+
model_version=model_version,
285+
show_stats=show_stats,
286+
cls=kwargs.pop("cls", pii_entities_result),
287+
**kwargs
288+
)
289+
except AttributeError as error:
290+
if "'TextAnalyticsClient' object has no attribute 'entities_recognition_pii'" in str(error):
291+
raise NotImplementedError(
292+
"'recognize_pii_entities' endpoint is only available for API version v3.1-preview.1 and up"
293+
)
294+
raise error
295+
except HttpResponseError as error:
296+
process_batch_error(error)
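
The method above detects a missing operation by catching `AttributeError` and matching on its message text. As an alternative technique (not what the commit does), the sketch below checks for the operation with `getattr` before calling it, which avoids mistaking an `AttributeError` raised inside the call for a missing endpoint. The classes are hypothetical stand-ins for the generated clients.

```python
class GeneratedV30(object):
    """Stand-in for a generated client that lacks the PII operation."""

    def entities_recognition_general(self, documents):
        return ["general:" + doc for doc in documents]


class GeneratedV31(GeneratedV30):
    """Stand-in for a generated client that includes the PII operation."""

    def entities_recognition_pii(self, documents):
        return ["pii:" + doc for doc in documents]


class ClientSketch(object):
    def __init__(self, generated):
        self._client = generated

    def recognize_pii_entities(self, documents):
        # Look the operation up first instead of parsing an error message later.
        operation = getattr(self._client, "entities_recognition_pii", None)
        if operation is None:
            raise NotImplementedError(
                "'recognize_pii_entities' endpoint is only available for "
                "API version v3.1-preview.1 and up"
            )
        return operation(documents)


print(ClientSketch(GeneratedV31()).recognize_pii_entities(["doc"]))
try:
    ClientSketch(GeneratedV30()).recognize_pii_entities(["doc"])
except NotImplementedError as exc:
    print(exc)
```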
297+
225298
@distributed_trace
226299
def recognize_linked_entities( # type: ignore
227300
self,
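
The `recognize_pii_entities` docstring states that documents may be plain strings or dicts, and that a per-document language takes precedence over the batch-level `language`. The sketch below illustrates that precedence rule only; `normalize_documents` is a hypothetical helper, not the SDK's `_validate_batch_input`, and assigning the list index as the id is an assumption.

```python
def normalize_documents(documents, default_language="en"):
    """Hypothetical normalizer returning {"id", "language", "text"} dicts."""
    normalized = []
    for index, doc in enumerate(documents):
        if isinstance(doc, str):
            # Plain strings get an assumed id (their index) and the batch language.
            normalized.append({"id": str(index), "language": default_language, "text": doc})
        elif isinstance(doc, dict):
            item = dict(doc)  # copy so the caller's dict is not mutated
            # An explicit per-document language wins over the batch default.
            item.setdefault("language", default_language)
            normalized.append(item)
        else:
            raise TypeError("Each document must be a str or a dict")
    return normalized


batch = normalize_documents(
    ["The employee's SSN is 859-98-0987.", {"id": "7", "language": "es", "text": "hola"}],
    default_language="en",
)
print(batch)
```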
