title | titleSuffix | description | author | manager | ms.service | ms.topic | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|
Read model OCR data extraction - Document Intelligence |
Azure AI services |
Extract print and handwritten text from scanned and digital documents with Document Intelligence's Read OCR model. |
laujan |
nitinme |
azure-ai-document-intelligence |
conceptual |
10/07/2024 |
lajanuar |
::: moniker range="doc-intel-4.0.0"
[!INCLUDE preview-version-notice]
This content applies to: v4.0 (preview) | Previous versions:
v3.1 (GA)
v3.0 (GA)
Note
For extracting text from external images like labels, street signs, and posters, use the Azure AI Image Analysis v4.0 Read feature optimized for general, non-document images with a performance-enhanced synchronous API that makes it easier to embed OCR in your user experience scenarios.
The Document Intelligence Read Optical Character Recognition (OCR) model runs at a higher resolution than Azure AI Vision Read and extracts print and handwritten text from PDF documents and scanned images. It also supports extracting text from Microsoft Word, Excel, PowerPoint, and HTML documents. It detects paragraphs, text lines, words, locations, and languages. The Read model is the underlying OCR engine for other Document Intelligence prebuilt models like Layout, General Document, Invoice, Receipt, Identity (ID) document, Health insurance card, and W2, in addition to custom models.
Optical Character Recognition (OCR) for documents is optimized for large text-heavy documents in multiple file formats and global languages. It includes features like higher-resolution scanning of document images for better handling of smaller and dense text; paragraph detection; and fillable form management. OCR capabilities also include advanced scenarios like single character boxes and accurate extraction of key fields commonly found in invoices, receipts, and other prebuilt scenarios.
Document Intelligence v4.0 (2024-07-31-preview) supports the following tools, applications, and libraries:
Feature | Resources | Model ID |
---|---|---|
Read OCR model | • Document Intelligence Studio • REST API • C# SDK • Python SDK • Java SDK • JavaScript SDK |
prebuilt-read |
[!INCLUDE input requirements]
Try extracting text from forms and documents using the Document Intelligence Studio. You need the following assets:
- An Azure subscription. You can create one for free.
- A Document Intelligence instance in the Azure portal. You can use the free pricing tier (`F0`) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.

:::image type="content" source="../media/containers/keys-and-endpoint.png" alt-text="Screenshot of keys and endpoint location in the Azure portal.":::
Note
Currently, Document Intelligence Studio doesn't support Microsoft Word, Excel, PowerPoint, and HTML file formats.
Sample document processed with Document Intelligence Studio
:::image type="content" source="../media/studio/form-recognizer-studio-read-v3p2-updated.png" alt-text="Screenshot of Read processing in Document Intelligence Studio.":::
- On the Document Intelligence Studio home page, select Read.
- You can analyze the sample document or upload your own files.
- Select the Run analysis button and, if necessary, configure the Analyze options:
:::image type="content" source="../media/studio/run-analysis-analyze-options.png" alt-text="Screenshot of Run analysis and Analyze options buttons in the Document Intelligence Studio.":::
[!div class="nextstepaction"] Try Document Intelligence Studio.
See our Language Support—document analysis models page for a complete list of supported languages.
Note
Microsoft Word and HTML files are supported in v4.0. Compared with PDF and images, the following features aren't supported:
- Page objects don't include angle, width, height, or unit.
- Detected objects don't include a bounding polygon or bounding region.
- Page range (`pages`) isn't supported as a parameter.
- There's no `lines` object.
The searchable PDF capability enables you to convert an analog PDF, such as scanned-image PDF files, to a PDF with embedded text. The embedded text enables deep text search within the PDF's extracted content by overlaying the detected text entities on top of the image files.
Important
- Currently, only the Read OCR model (`prebuilt-read`) supports the searchable PDF capability. When you use this feature, specify the `modelId` as `prebuilt-read`; other model types return an error in this preview version.
- Searchable PDF is included with the 2024-07-31-preview `prebuilt-read` model at no additional cost for generating a searchable PDF output.
To use searchable PDF, make a `POST` request using the `Analyze` operation and specify the output format as `pdf`:
POST /documentModels/prebuilt-read:analyze?output=pdf
{...}
202
Poll for completion of the `Analyze` operation. Once the operation is complete, issue a `GET` request to retrieve the PDF format of the `Analyze` operation results.
Upon successful completion, the PDF can be retrieved and downloaded as `application/pdf`. This operation allows direct downloading of the embedded text form of PDF instead of Base64-encoded JSON.
// Monitor the operation until completion.
GET /documentModels/prebuilt-read/analyzeResults/{resultId}
200
{...}
// Upon successful completion, retrieve the PDF as application/pdf.
GET /documentModels/prebuilt-read/analyzeResults/{resultId}/pdf
200 OK
Content-Type: application/pdf
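The request sequence above can be sketched in Python. This is a minimal illustration rather than an official client: the helper names (`analyze_url`, `result_url`, `pdf_url`, `wait_for_result`) are hypothetical, and the `documentintelligence` base path, the `api-version` query parameter, and the `status` values being polled are assumptions to verify against the REST API reference. The HTTP call is injected as a function so the flow can be exercised without a network connection.

```python
import time
from typing import Callable, Dict

API_VERSION = "2024-07-31-preview"  # preview version named in this article
MODEL_ID = "prebuilt-read"          # only model that accepts output=pdf in this preview


def _base(endpoint: str) -> str:
    # Assumption: the service is rooted at <endpoint>/documentintelligence.
    return f"{endpoint.rstrip('/')}/documentintelligence/documentModels/{MODEL_ID}"


def analyze_url(endpoint: str) -> str:
    """URL for the POST that starts analysis and requests PDF output."""
    return f"{_base(endpoint)}:analyze?api-version={API_VERSION}&output=pdf"


def result_url(endpoint: str, result_id: str) -> str:
    """URL for the GET that reports the operation status."""
    return f"{_base(endpoint)}/analyzeResults/{result_id}?api-version={API_VERSION}"


def pdf_url(endpoint: str, result_id: str) -> str:
    """URL for the GET that returns the searchable PDF (application/pdf)."""
    return f"{_base(endpoint)}/analyzeResults/{result_id}/pdf?api-version={API_VERSION}"


def wait_for_result(get_json: Callable[[str], Dict], url: str,
                    interval: float = 1.0, max_polls: int = 60) -> Dict:
    """Poll `url` with the caller-supplied `get_json` until a terminal status."""
    for _ in range(max_polls):
        body = get_json(url)
        status = body.get("status")
        if status == "succeeded":   # assumed terminal success value
            return body
        if status == "failed":      # assumed terminal failure value
            raise RuntimeError(f"Analyze operation failed: {body}")
        time.sleep(interval)
    raise TimeoutError("Analyze operation did not finish in time")
```

In a real client, `get_json` would wrap an authenticated HTTP GET (for example, one that sends the resource key in the `Ocp-Apim-Subscription-Key` header), and the result ID would typically be taken from the `Operation-Location` header of the `202` response. Once `wait_for_result` returns, a GET on `pdf_url(...)` yields the PDF bytes.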
The pages collection is a list of pages within the document. Each page is represented sequentially within the document and includes the orientation angle indicating if the page is rotated and the width and height (dimensions in pixels). The page units in the model output are computed as shown:
File format | Computed page unit | Total pages |
---|---|---|
Images (JPEG/JPG, PNG, BMP, HEIF) | Each image = 1 page unit | Total images |
PDF | Each page in the PDF = 1 page unit | Total pages in the PDF |
TIFF | Each image in the TIFF = 1 page unit | Total images in the TIFF |
Word (DOCX) | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
Excel (XLSX) | Each worksheet = 1 page unit, embedded or linked images not supported | Total worksheets |
PowerPoint (PPTX) | Each slide = 1 page unit, embedded or linked images not supported | Total slides |
HTML | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
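The table can be expressed as a small estimator. The function below is a hypothetical helper, not part of any SDK. It assumes that a partial block of characters in a Word or HTML file rounds up to a full page unit, which the table implies but doesn't state outright.

```python
import math

CHARS_PER_PAGE_UNIT = 3000  # from the table: up to 3,000 characters = 1 page unit

# Formats where each counted item (image, page, worksheet, slide) is one page unit.
ONE_UNIT_PER_ITEM = {"jpeg", "jpg", "png", "bmp", "heif", "tiff", "pdf", "xlsx", "pptx"}
# Formats counted by characters.
CHARACTER_BASED = {"docx", "html"}


def page_units(file_format: str, count: int) -> int:
    """Estimate page units for `count` items of the given format.

    `count` is the number of images (image formats, TIFF), pages (PDF),
    worksheets (XLSX), slides (PPTX), or characters (DOCX, HTML).
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    fmt = file_format.lower()
    if fmt in CHARACTER_BASED:
        # Assumption: a partial block of characters counts as a full unit.
        return math.ceil(count / CHARS_PER_PAGE_UNIT)
    if fmt in ONE_UNIT_PER_ITEM:
        return count
    raise ValueError(f"unsupported format: {file_format}")
```

For example, a 7,500-character DOCX file comes out at three page units under this assumption, while a 12-page PDF is 12 page units.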
# Analyze pages.
for page in result.pages:
    print(f"----Analyzing document from page #{page.page_number}----")
    print(f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}")
[!div class="nextstepaction"] View samples on GitHub.
"pages": [
{
"pageNumber": 1,
"angle": 0,
"width": 915,
"height": 1190,
"unit": "pixel",
"words": [],
"lines": [],
"spans": []
}
]
For large multi-page PDF documents, use the `pages` query parameter to indicate specific page numbers or page ranges for text extraction.
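As an illustration, the following sketch builds a value for the `pages` parameter from a list of page numbers. The function name `pages_param` is hypothetical, and the comma-separated range syntax (such as `1-3,5,7-9`) is an assumption about the accepted format, so confirm it against the REST API reference before relying on it.

```python
from typing import Iterable


def pages_param(page_numbers: Iterable[int]) -> str:
    """Collapse 1-based page numbers into a string such as '1-3,5,7-9'."""
    pages = sorted(set(page_numbers))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    parts = []
    i = 0
    while i < len(pages):
        start = end = pages[i]
        # Extend the current run while the next page is consecutive.
        while i + 1 < len(pages) and pages[i + 1] == end + 1:
            i += 1
            end = pages[i]
        parts.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(parts)
```

The resulting string would be passed as the `pages` query parameter on the analyze request.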
The Read OCR model in Document Intelligence extracts all identified blocks of text in the `paragraphs` collection as a top-level object under `analyzeResults`. Each entry in this collection represents a text block and includes the extracted text as `content` and the bounding `polygon` coordinates. The `span` information points to the text fragment within the top-level `content` property that contains the full text from the document.
"paragraphs": [
{
"spans": [],
"boundingRegions": [],
"content": "While healthcare is still in the early stages of its Al journey, we are seeing pharmaceutical and other life sciences organizations making major investments in Al and related technologies.\" TOM LAWRY | National Director for Al, Health and Life Sciences | Microsoft"
}
]
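To show how spans relate to the top-level `content` string, here's a minimal sketch that resolves a list of spans back to text. The helper name `text_for_spans` is hypothetical. It assumes that each span carries `offset` and `length` fields, and that offsets line up with Python string indexes (code points), which depends on the string index type used for the request.

```python
from typing import Dict, List


def text_for_spans(content: str, spans: List[Dict[str, int]]) -> str:
    """Join the slices of `content` addressed by each span's offset and length."""
    return "".join(
        content[span["offset"]: span["offset"] + span["length"]]
        for span in spans
    )


# Illustrative data only; real values come from the analyze result.
content = "Title\nFirst paragraph.\nSecond paragraph."
paragraph = {"spans": [{"offset": 6, "length": 16}]}
print(text_for_spans(content, paragraph["spans"]))
```

The same lookup applies to any element that carries spans, including lines, words, and styles.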
The Read OCR model extracts print and handwritten style text as `lines` and `words`. The model outputs bounding `polygon` coordinates and `confidence` for the extracted words. The `styles` collection includes any handwritten style for lines, if detected, along with the spans pointing to the associated text. This feature applies to supported handwritten languages.
For Microsoft Word, Excel, PowerPoint, and HTML, Document Intelligence Read model v3.1 and later versions extract all embedded text as is. Text is extracted as words and paragraphs. Embedded images aren't supported.
# Analyze lines.
if page.lines:
    for line_idx, line in enumerate(page.lines):
        # get_words is a helper function that isn't shown in this snippet.
        words = get_words(page, line)
        print(
            f"...Line # {line_idx} has {len(words)} words and text '{line.content}' within bounding polygon '{line.polygon}'"
        )

        # Analyze words.
        for word in words:
            print(f"......Word '{word.content}' has a confidence of {word.confidence}")
[!div class="nextstepaction"] View samples on GitHub.
"words": [
{
"content": "While",
"polygon": [],
"confidence": 0.997,
"span": {}
},
],
"lines": [
{
"content": "While healthcare is still in the early stages of its Al journey, we",
"polygon": [],
"spans": []
}
]
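Because words and lines are separate collections, one way to relate them is to compare spans. The sketch below works on the JSON shape shown above. The function name `words_in_line` is hypothetical, and the containment rule (a word belongs to a line when its span lies entirely inside one of the line's spans) is an assumption, not documented service behavior.

```python
from typing import Dict, List


def words_in_line(page: Dict, line: Dict) -> List[Dict]:
    """Return the words of `page` whose span lies inside one of the line's spans."""

    def inside(word_span: Dict, line_span: Dict) -> bool:
        start = word_span["offset"]
        end = start + word_span["length"]
        line_start = line_span["offset"]
        line_end = line_start + line_span["length"]
        return start >= line_start and end <= line_end

    return [
        word
        for word in page.get("words", [])
        if any(inside(word["span"], span) for span in line.get("spans", []))
    ]


# Illustrative data only; real values come from the analyze result.
page = {
    "words": [
        {"content": "While", "confidence": 0.997, "span": {"offset": 0, "length": 5}},
        {"content": "healthcare", "confidence": 0.99, "span": {"offset": 6, "length": 10}},
        {"content": "Next", "confidence": 0.98, "span": {"offset": 17, "length": 4}},
    ]
}
line = {"content": "While healthcare", "spans": [{"offset": 0, "length": 16}]}
print([word["content"] for word in words_in_line(page, line)])
```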
The response classifies whether each text line is handwritten and includes a confidence score. For more information, see handwritten language support. The following example shows a JSON snippet.
"styles": [
{
"confidence": 0.95,
"spans": [
{
"offset": 509,
"length": 24
}
],
"isHandwritten": true
}
]
If you enabled the font/style add-on capability, you also get the font/style result as part of the `styles` object.
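As a sketch of how the `styles` collection might be consumed, the following function collects the text of every span marked as handwritten. The name `handwritten_text` and the `min_confidence` threshold of 0.5 are hypothetical choices for illustration. The code assumes the result is available as a dictionary with top-level `content` and `styles` keys, and that span offsets align with Python string indexes.

```python
from typing import Dict, List


def handwritten_text(result: Dict, min_confidence: float = 0.5) -> List[str]:
    """Return the text fragments that the styles collection marks as handwritten."""
    content = result.get("content", "")
    fragments = []
    for style in result.get("styles", []):
        is_handwritten = style.get("isHandwritten", False)
        confident = style.get("confidence", 0.0) >= min_confidence
        if is_handwritten and confident:
            for span in style.get("spans", []):
                start = span["offset"]
                fragments.append(content[start:start + span["length"]])
    return fragments


# Illustrative data only; real values come from the analyze result.
result = {
    "content": "Typed text. Signed by Ana",
    "styles": [
        {"confidence": 0.95, "isHandwritten": True,
         "spans": [{"offset": 12, "length": 13}]},
    ],
}
print(handwritten_text(result))
```

Raising `min_confidence` trades recall for precision. A suitable value depends on the documents being processed.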
Complete a Document Intelligence quickstart:
[!div class="checklist"]
Explore our REST API:
[!div class="nextstepaction"] Document Intelligence API v4.0
Find more samples on GitHub:
[!div class="nextstepaction"] Read model.
::: moniker-end
::: moniker range="doc-intel-3.1.0"
This content applies to: v3.1 (GA) | Latest version:
v4.0 (preview) | Previous versions:
v3.0
::: moniker-end
::: moniker range="doc-intel-3.0.0"
This content applies to: v3.0 (GA) | Latest versions:
v4.0 (preview)
v3.1
::: moniker-end
::: moniker range="<=doc-intel-3.1.0"
Note
For extracting text from external images like labels, street signs, and posters, use the Azure AI Image Analysis v4.0 Read feature optimized for general, non-document images with a performance-enhanced synchronous API that makes it easier to embed OCR in your user experience scenarios.
The Document Intelligence Read Optical Character Recognition (OCR) model runs at a higher resolution than Azure AI Vision Read and extracts print and handwritten text from PDF documents and scanned images. It also supports extracting text from Microsoft Word, Excel, PowerPoint, and HTML documents. It detects paragraphs, text lines, words, locations, and languages. The Read model is the underlying OCR engine for other Document Intelligence prebuilt models like Layout, General Document, Invoice, Receipt, Identity (ID) document, Health insurance card, and W2, in addition to custom models.
Optical Character Recognition (OCR) for documents is optimized for large text-heavy documents in multiple file formats and global languages. It includes features like higher-resolution scanning of document images for better handling of smaller and dense text; paragraph detection; and fillable form management. OCR capabilities also include advanced scenarios like single character boxes and accurate extraction of key fields commonly found in invoices, receipts, and other prebuilt scenarios.
::: moniker-end
::: moniker range="doc-intel-3.1.0"
Document Intelligence v3.1 supports the following tools, applications, and libraries:
Feature | Resources | Model ID |
---|---|---|
Read OCR model | • Document Intelligence Studio • REST API • C# SDK • Python SDK • Java SDK • JavaScript SDK |
prebuilt-read |
::: moniker-end
::: moniker range="doc-intel-3.0.0"
Document Intelligence v3.0 supports the following tools, applications, and libraries:
Feature | Resources | Model ID |
---|---|---|
Read OCR model | • Document Intelligence Studio • REST API • C# SDK • Python SDK • Java SDK • JavaScript SDK |
prebuilt-read |
::: moniker-end
::: moniker range="<= doc-intel-3.1.0"
[!INCLUDE input requirements]
Try extracting text from forms and documents using the Document Intelligence Studio. You need the following assets:
- An Azure subscription. You can create one for free.
- A Document Intelligence instance in the Azure portal. You can use the free pricing tier (`F0`) to try the service. After your resource deploys, select Go to resource to get your key and endpoint.
:::image type="content" source="../media/containers/keys-and-endpoint.png" alt-text="Screenshot of keys and endpoint location in the Azure portal.":::
Note
Currently, Document Intelligence Studio doesn't support Microsoft Word, Excel, PowerPoint, and HTML file formats.
Sample document processed with Document Intelligence Studio
:::image type="content" source="../media/studio/form-recognizer-studio-read-v3p2-updated.png" alt-text="Screenshot of Read processing in Document Intelligence Studio.":::
- On the Document Intelligence Studio home page, select Read.
- You can analyze the sample document or upload your own files.
- Select the Run analysis button and, if necessary, configure the Analyze options:
:::image type="content" source="../media/studio/run-analysis-analyze-options.png" alt-text="Screenshot of Run analysis and Analyze options buttons in the Document Intelligence Studio.":::
[!div class="nextstepaction"] Try Document Intelligence Studio.
See our Language Support—document analysis models page for a complete list of supported languages.
Note
Microsoft Word and HTML files are supported in v3.1 and later versions. Compared with PDF and images, the following features aren't supported:
- Page objects don't include angle, width, height, or unit.
- Detected objects don't include a bounding polygon or bounding region.
- Page range (`pages`) isn't supported as a parameter.
- There's no `lines` object.
The searchable PDF capability enables you to convert an analog PDF, such as scanned-image PDF files, to a PDF with embedded text. The embedded text enables deep text search within the PDF's extracted content by overlaying the detected text entities on top of the image files.
Important
- Currently, only the Read OCR model (`prebuilt-read`) supports the searchable PDF capability. When you use this feature, specify the `modelId` as `prebuilt-read`; other model types return an error in this preview version.
- Searchable PDF is included with the 2024-07-31-preview `prebuilt-read` model at no additional cost for generating a searchable PDF output.
- Searchable PDF currently supports only PDF files as input. Support for other file types, such as image files, will be available later.
To use searchable PDF, make a `POST` request using the `Analyze` operation and specify the output format as `pdf`:
POST /documentModels/prebuilt-read:analyze?output=pdf
{...}
202
Poll for completion of the `Analyze` operation. Once the operation is complete, issue a `GET` request to retrieve the PDF format of the `Analyze` operation results.
Upon successful completion, the PDF can be retrieved and downloaded as `application/pdf`. This operation allows direct downloading of the embedded text form of PDF instead of Base64-encoded JSON.
// Monitor the operation until completion.
GET /documentModels/prebuilt-read/analyzeResults/{resultId}
200
{...}
// Upon successful completion, retrieve the PDF as application/pdf.
GET /documentModels/prebuilt-read/analyzeResults/{resultId}/pdf
200 OK
Content-Type: application/pdf
The pages collection is a list of pages within the document. Each page is represented sequentially within the document and includes the orientation angle indicating if the page is rotated and the width and height (dimensions in pixels). The page units in the model output are computed as shown:
File format | Computed page unit | Total pages |
---|---|---|
Images (JPEG/JPG, PNG, BMP, HEIF) | Each image = 1 page unit | Total images |
PDF | Each page in the PDF = 1 page unit | Total pages in the PDF |
TIFF | Each image in the TIFF = 1 page unit | Total images in the TIFF |
Word (DOCX) | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
Excel (XLSX) | Each worksheet = 1 page unit, embedded or linked images not supported | Total worksheets |
PowerPoint (PPTX) | Each slide = 1 page unit, embedded or linked images not supported | Total slides |
HTML | Up to 3,000 characters = 1 page unit, embedded or linked images not supported | Total pages of up to 3,000 characters each |
::: moniker-end
::: moniker range="doc-intel-2.1.0 || doc-intel-3.0.0"
"pages": [
{
"pageNumber": 1,
"angle": 0,
"width": 915,
"height": 1190,
"unit": "pixel",
"words": [],
"lines": [],
"spans": []
}
]
::: moniker-end
::: moniker range="doc-intel-3.1.0"
# Analyze pages.
for page in result.pages:
    print(f"----Analyzing document from page #{page.page_number}----")
    print(
        f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}"
    )
[!div class="nextstepaction"] View samples on GitHub.
"pages": [
{
"pageNumber": 1,
"angle": 0,
"width": 915,
"height": 1190,
"unit": "pixel",
"words": [],
"lines": [],
"spans": []
}
]
::: moniker-end
::: moniker range="<=doc-intel-3.1.0"
For large multi-page PDF documents, use the `pages` query parameter to indicate specific page numbers or page ranges for text extraction.
The Read OCR model in Document Intelligence extracts all identified blocks of text in the `paragraphs` collection as a top-level object under `analyzeResults`. Each entry in this collection represents a text block and includes the extracted text as `content` and the bounding `polygon` coordinates. The `span` information points to the text fragment within the top-level `content` property that contains the full text from the document.
"paragraphs": [
{
"spans": [],
"boundingRegions": [],
"content": "While healthcare is still in the early stages of its Al journey, we are seeing pharmaceutical and other life sciences organizations making major investments in Al and related technologies.\" TOM LAWRY | National Director for Al, Health and Life Sciences | Microsoft"
}
]
The Read OCR model extracts print and handwritten style text as `lines` and `words`. The model outputs bounding `polygon` coordinates and `confidence` for the extracted words. The `styles` collection includes any handwritten style for lines, if detected, along with the spans pointing to the associated text. This feature applies to supported handwritten languages.
For Microsoft Word, Excel, PowerPoint, and HTML, Document Intelligence Read model v3.1 and later versions extract all embedded text as is. Text is extracted as words and paragraphs. Embedded images aren't supported.
::: moniker-end
::: moniker range="doc-intel-2.1.0 || doc-intel-3.0.0"
"words": [
{
"content": "While",
"polygon": [],
"confidence": 0.997,
"span": {}
},
],
"lines": [
{
"content": "While healthcare is still in the early stages of its Al journey, we",
"polygon": [],
"spans": []
}
]
::: moniker-end
::: moniker range="doc-intel-3.1.0"
# Analyze lines.
for line_idx, line in enumerate(page.lines):
    words = line.get_words()
    print(
        f"...Line # {line_idx} has {len(words)} words and text '{line.content}' within bounding polygon '{format_polygon(line.polygon)}'"
    )

    # Analyze words.
    for word in words:
        print(
            f"......Word '{word.content}' has a confidence of {word.confidence}"
        )
[!div class="nextstepaction"] View samples on GitHub.
"words": [
{
"content": "While",
"polygon": [],
"confidence": 0.997,
"span": {}
},
],
"lines": [
{
"content": "While healthcare is still in the early stages of its Al journey, we",
"polygon": [],
"spans": []
}
]
::: moniker-end
::: moniker range="<=doc-intel-3.1.0"
The response classifies whether each text line is handwritten and includes a confidence score. For more information, see handwritten language support. The following example shows a JSON snippet.
"styles": [
{
"confidence": 0.95,
"spans": [
{
"offset": 509,
"length": 24
}
],
"isHandwritten": true
}
]
If you enabled the font/style add-on capability, you also get the font/style result as part of the `styles` object.
Complete a Document Intelligence quickstart:
[!div class="checklist"]
Explore our REST API:
[!div class="nextstepaction"] Document Intelligence API v4.0
::: moniker-end
::: moniker range="doc-intel-3.1.0"
Find more samples on GitHub:
[!div class="nextstepaction"] Read model.
::: moniker-end