Regex Extractor
Document Extraction using Regular Expressions
Regex Extractor uses regular expressions (regex) to identify and extract the required fields from documents. It is useful when a suitable skill model is available but does not include the required field. It does, however, require the user to be familiar with regular expressions. Regex Extractor uses the PCRE regex engine. The user is free to choose how to model the regex for extracting a field, but the common practice is to match the pattern of the value alone, or of both the key and the value, and then take only the value part.
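The key-and-value practice described above can be sketched as follows. This is a minimal illustration in Python, whose `re` module is not the PCRE engine the extractor uses, although a simple pattern like this one behaves the same in both. The sample text and the pattern are assumptions made for illustration, not taken from the product.

```python
import re

# Hypothetical OCR text; not taken from a real document.
text = "Invoice No: 1042\nTotal: 1,250.00\nDate: 12/03/2024"

# Match the key ("Total") together with its value, but capture only the
# value in group 1 so that the key is not part of the extracted result.
pattern = re.compile(r"Total\s*:\s*([\d,]+\.\d{2})")

match = pattern.search(text)
total = match.group(1) if match else None
print(total)  # 1,250.00
```

Anchoring on the key keeps the pattern from matching other amounts in the document, while the capture group returns only the value part.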
A Document Extraction Workflow with Regex Extractor is created in two steps:
1. Create a regex extractor definition that specifies the field names and their types. The regular expressions used to identify and extract each field are configured here.
2. Create a Document AI client with Document AI credentials in the Studio Workflow and run the activity.
The following example explains creation of a document extraction workflow using a Regex Extractor.
Creating a Document Extraction workflow begins by creating the Document Extractor Definition file. The definition file contains information about the type of extraction, the fields that need to be extracted and how each field is identified and extracted.
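To make the contents of a definition concrete, the sketch below models one as plain Python data. The class names, attribute names and patterns are hypothetical and chosen only for illustration; the actual definition file format used by the product is not described here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FieldDefinition:
    """One field to extract (hypothetical structure)."""
    name: str
    output_format: str        # e.g. "Number" or "Date"
    regex: str                # pattern that captures the value in group 1
    ignore_case: bool = False


@dataclass
class ExtractorDefinition:
    """Type of extraction plus the fields it covers (hypothetical structure)."""
    extractor_type: str
    fields: List[FieldDefinition] = field(default_factory=list)


definition = ExtractorDefinition(
    extractor_type="Regex",
    fields=[
        FieldDefinition("Total", "Number", r"Total\s*:\s*([\d,]+\.\d{2})"),
        FieldDefinition("Date", "Date", r"Date\s*:\s*(\d{2}/\d{2}/\d{4})",
                        ignore_case=True),
    ],
)
```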
The Extractor Definition is created in the Create Extractor Window, which is launched from the Add Item Window. Click the Add Item button in the project pane.
In the Add Item Window, click AI Skills -> Document AI -> New Document Extractor.
Users can either load and edit an existing definition or create a new one. To create a new definition, select an Extractor Type from the available types.
Click Create -> Regex.
In the Configure Client section, set the Document AI Endpoint and API Key, then click Apply.
In the Configure Fields section, add new fields by clicking the Add button. Let's add two fields, Total and Date, with the output formats Number and Date respectively. The field definition differs for each type of extractor.
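The output formats matter because a regex always yields a string. The sketch below shows one way the raw strings for Total and Date could be turned into typed values. The helper names, the thousands-separator handling and the day/month/year layout are assumptions; the conversion rules the product applies for the Number and Date formats are not stated here.

```python
from datetime import date, datetime
from decimal import Decimal


def to_number(raw: str) -> Decimal:
    """Strip thousands separators (assumed to be commas) and parse a decimal."""
    return Decimal(raw.replace(",", "").strip())


def to_date(raw: str, fmt: str = "%d/%m/%Y") -> date:
    """Parse a date string using an assumed day/month/year layout."""
    return datetime.strptime(raw.strip(), fmt).date()


print(to_number("1,250.00"))  # 1250.00
print(to_date("12/03/2024"))  # 2024-03-12
```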
In the Files section, click the Browse button to select the folder containing sample files.
Check the Enable Text Preprocessing checkbox to include text preprocessing for the document.
In the Text Preprocessing section, click Apply OCR. This creates a Text View tab containing the OCR text of the selected document.
Add preprocessors to remove or replace unwanted lines, text or characters from the document. The extraction is applied to the preprocessed document. In this example we add a Remove Lines preprocessor.
Applying the preprocessing creates a Preprocessed tab containing the text after preprocessing.
In the Extractors section, add the regex used to identify and extract each field. Select the Ignore case regex option for the Date field to make the matching case-insensitive. Click Apply to test the regex on the current document. Click Apply All to apply the extraction to all documents.
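The effect of the Ignore case option, and the difference between testing on one document and on all of them, can be sketched in Python as below. The document names, their texts and the date pattern are hypothetical, and Python's `re.IGNORECASE` flag stands in for the extractor's Ignore case option.

```python
import re

# Hypothetical OCR texts standing in for three sample documents.
documents = {
    "a.pdf": "DATE: 12/03/2024\nTotal: 100.00",
    "b.pdf": "date: 01/11/2023\nTotal: 87.50",
    "c.pdf": "Total: 10.00",
}

# Case-insensitive match on the key; group 1 captures only the value.
date_pattern = re.compile(r"date\s*:\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE)


def extract_date(text):
    """Return the captured date string, or None when the key is absent."""
    match = date_pattern.search(text)
    return match.group(1) if match else None


# Comparable to Apply: test the regex on the current document only.
current = extract_date(documents["a.pdf"])

# Comparable to Apply All: run the same extraction over every document.
results = {name: extract_date(text) for name, text in documents.items()}
print(current)
print(results)
```

Running the pattern over every sample is a quick way to spot documents, such as `c.pdf` here, where the field is missing or the key is written differently.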
The Results section shows the result of the extraction.
Click Save Changes to save the definition in a definition file.
Save the configuration.
Launch the Extractor Trainer by clicking the Launch Extractor Trainer button.
Click the Add dropdown button in the Text Preprocessing section and select Line -> Remove Lines. This adds a Remove Lines preprocessor to the preprocessors list. Select the With line numbers option. In this example, the preprocessor removes line number 1. Make sure the preprocessor's checkbox is checked so that it is selected for preprocessing, then click the Apply button. This applies all selected preprocessors to the currently selected document.
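What a Remove Lines preprocessor with line numbers does to the OCR text can be sketched as follows. The function name and the sample text are hypothetical, and the sketch assumes line numbers are 1-based, so that line number 1 is the first line of the document.

```python
def remove_lines(text: str, line_numbers: set) -> str:
    """Drop the lines whose 1-based line number is in line_numbers."""
    kept = [
        line
        for number, line in enumerate(text.splitlines(), start=1)
        if number not in line_numbers
    ]
    return "\n".join(kept)


# Hypothetical OCR text whose first line is an unwanted page header.
ocr_text = "ACME Corp - Page 1\nTotal: 1,250.00\nDate: 12/03/2024"

preprocessed = remove_lines(ocr_text, {1})
print(preprocessed)
```

Removing noise such as headers before extraction means the field regexes only need to handle the lines that carry the values.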
Three activities are required for creating the Extraction Workflow: Create Document AI Client, Preprocess Document, and Run Extractor. This example also uses an additional activity to view the extracted data.
Add the Create Document AI Client activity to the Workflow. Click the Configure button to launch the client configuration window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze. The Extraction Type should be selected according to the created extractor definition. Since the created extractor is a Regex Extractor, it needs text extraction, so set the Extraction Type to Text.
Add the Preprocess Document activity to the Workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.
Add the Run Extractor activity to the Workflow. In the Document AI Client property, assign the DocAIClient variable. In the Processed Document property, assign the output variable from the Preprocess Document activity. In the Extractor Definition property, assign the path to the created Extractor Definition file.
Finally, add the activity for viewing the extracted data and assign the output variable from the Run Extractor activity to its Extraction Result property. Run the workflow. The extraction is applied to the selected file and the results are displayed.