Regex Extractor
Document Extraction using Regular Expressions
Regex Extractor uses regular expressions (regex) to identify and extract the required fields from documents. It is useful when a suitable skill model is available but the required field is not. It does, however, require the user to be familiar with regular expressions. Regex Extractor uses the PCRE regex engine. The user is free to choose how to model the regex for a field, but the common practice is to match the pattern of the value alone, or of both the key and the value, and then keep only the value part.
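As an illustration of the two common modeling approaches, here is a minimal Python sketch. The sample text, field names and patterns are assumptions made up for this example, not part of the product. Python's built-in `re` module is not the PCRE engine, but these simple patterns use syntax that both engines share.

```python
import re

# Hypothetical OCR text; the layout and field names are illustrative only.
text = "Invoice No: 1042\nDate: 12/03/2024\nTotal: 1,250.50\n"

# Approach 1: match only the pattern of the value.
date_match = re.search(r"\b\d{2}/\d{2}/\d{4}\b", text)
date_value = date_match.group(0)

# Approach 2: match the key and the value together, then keep only
# the value part by placing it in a capture group.
total_match = re.search(r"Total\s*:\s*([\d,]+\.\d{2})", text)
total_value = total_match.group(1)

print(date_value)
print(total_value)
```

Matching the key together with the value is usually the safer choice when the document contains several values with the same shape, such as more than one date or amount.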
A document extraction workflow with Regex Extractor is created in two steps:
1. Create a regex extractor definition that specifies the field names and their types. The regular expressions that identify and extract each field are configured here.
2. Create a Document AI client with Document AI credentials in the Studio Workflow and run the Run Extractor activity.
The following example walks through creating a document extraction workflow using a Regex Extractor.
Creating a Document Extraction workflow begins by creating the Document Extractor Definition file. The definition file contains information about the type of extraction, the fields that need to be extracted and how each field is identified and extracted.
The Extractor Definition is created in the Create Extractor Window, which is launched from the Add Item Window. Click the Add Item button in the project pane.
In the Add Item Window, click AI Skills -> Document AI -> New Document Extractor.
Users can either load and edit an existing definition or create a new one. To create a definition, select an Extractor Type from the available types.
Click Create -> Regex.
In the Configure Client section, set the Document AI Endpoint and API Key, then click Apply.
In the Configure Fields section, add new fields by clicking the add button. Let's add two fields, Total and Date, with the output formats Number and Date respectively. The field definition differs for each type of extractor.
Launch the Configure Extractor Window by clicking the Launch Extractor Trainer button.
In the Files section, click the Browse button to select the folder containing the sample files.
Check the Enable Text Preprocessing checkbox to include text preprocessing for the document.
In the Text Preprocessing section, click Apply OCR. This creates a Text View tab that contains the OCR text of the selected document.
Add preprocessors to remove or replace unwanted lines, text or characters from the document. The extraction is applied to the preprocessed document. In this example we will add a Remove Lines preprocessor.
Click the Add dropdown button in the Text Preprocessing section and select Line -> Remove Lines. This adds a Remove Lines preprocessor to the preprocessors list. Select the With line numbers option; in this example, the line with line number 1 is removed. Make sure the preprocessor's checkbox is checked so that it is selected for preprocessing, then click the Apply button. This applies all selected preprocessors to the currently selected document.
Applying the preprocessing creates a Preprocessed tab that contains the text after preprocessing.
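To make the effect of a line-number-based Remove Lines step concrete, the following Python sketch drops line 1 from a hypothetical OCR text. The function name `remove_lines` and the sample text are assumptions for illustration; this is not the product's implementation.

```python
from typing import Iterable


def remove_lines(text: str, line_numbers: Iterable[int]) -> str:
    """Return text without the given 1-based line numbers (illustrative helper)."""
    to_remove = set(line_numbers)
    kept = [
        line
        for number, line in enumerate(text.splitlines(), start=1)
        if number not in to_remove
    ]
    return "\n".join(kept)


# Hypothetical OCR text whose first line is an unwanted header.
ocr_text = "ACME Corp\nDate: 12/03/2024\nTotal: 1,250.50"

# Remove the line with line number 1, as in the example above.
preprocessed = remove_lines(ocr_text, [1])
print(preprocessed)
```

Any regex applied afterwards sees only the remaining lines, which is why removing noisy headers or footers first can simplify the patterns.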
In the Extractors section, add the regex that identifies and extracts each field. Select the Ignore case regex option for the Date field to make the matching case-insensitive. Click Apply to test the regex on the current document. Click Apply All to apply the extraction to all documents.
The Results section shows the result of the extraction.
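The sketch below approximates, in plain Python, what extracting the Total and Date fields with an ignore-case option on Date might look like. The patterns, the sample text, the day/month/year date layout and the `extract_fields` helper are all assumptions for illustration; the actual extractor runs on PCRE and performs its own output formatting.

```python
import re
from datetime import date, datetime

# Illustrative patterns; each capture group holds only the value part.
DATE_PATTERN = r"date\s*:\s*(\d{2}/\d{2}/\d{4})"
TOTAL_PATTERN = r"Total\s*:\s*([\d,]+\.\d{2})"


def extract_fields(text: str) -> dict:
    """Extract Date and Total from text, returning None for missing fields."""
    result = {"Date": None, "Total": None}

    # re.IGNORECASE plays the role of the Ignore case option on the Date field.
    date_match = re.search(DATE_PATTERN, text, flags=re.IGNORECASE)
    if date_match:
        # Assumes a day/month/year layout in the source document.
        result["Date"] = datetime.strptime(date_match.group(1), "%d/%m/%Y").date()

    total_match = re.search(TOTAL_PATTERN, text)
    if total_match:
        # Strip thousands separators before converting to a number.
        result["Total"] = float(total_match.group(1).replace(",", ""))

    return result


# Hypothetical preprocessed text with an upper-case key.
fields = extract_fields("DATE: 12/03/2024\nTotal: 1,250.50")
print(fields)
```

Without the ignore-case flag, the lower-case `date` in the pattern would not match the upper-case `DATE` key in this sample, and the Date field would come back empty.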
Click Save Changes to save the definition in a definition file.
Three activities are required to create the extraction workflow: Create Document AI Client, Preprocess Document and Run Extractor. This example also uses the Show Validation Window activity to view the extracted data.
Add the Create Document AI Client activity to the workflow. Click the Configure button to launch the Create Document AI Client Window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze. The Extraction Type should be selected according to the created extractor definition. Since the created extractor is a Regex Extractor, it needs text extraction, so set the Extraction Type to Text.
Save the configuration.
Add the Preprocess Document activity to the workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.
Add the Run Extractor activity to the workflow. Assign the DocAIClient variable to the Document AI Client property. Assign the ProcessedDocument variable from the Preprocess Document activity to the Processed Document property. Set the Extractor Definition property to the path of the created Extractor Definition file.
Finally, add the Show Validation Window activity and assign the ExtractionResult variable from the Run Extractor activity to the Extraction Result property. Run the workflow. The extraction is applied to the selected file and the results are displayed in the Validation Window.