Regex Extractor

Document Extraction using Regular Expressions

Regex Extractor uses regular expressions (regex) to identify and extract the required fields from documents. It is useful when a suitable skill model is available but does not cover a required field. It does require the user to be familiar with regular expressions. Regex Extractor uses the PCRE regex engine. The user is free to choose how to model the regex for a field, but the common practice is to match the pattern of the value alone, or of both the key and the value, and then take only the value part.
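The key-and-value practice described above can be sketched outside the product. The following is a minimal illustration in Python, not the extractor's own implementation: Python's re module is not PCRE, although the constructs used here behave the same in both. The sample text, the "Total:" label and the helper name extract_value are assumptions made for the example.

```python
import re

# Sample OCR text; the layout and the "Total:" label are assumptions.
text = "Invoice No: 1042\nTotal: 1,250.00\nThank you"

# Match both the key ("Total:") and the value, but capture only the value.
key_value_pattern = re.compile(r"Total:\s*([\d,]+\.\d{2})")

def extract_value(pattern, document_text):
    """Return the first captured group of the first match, or None if nothing matches."""
    match = pattern.search(document_text)
    return match.group(1) if match else None

total = extract_value(key_value_pattern, text)
print(total)
```

Anchoring the pattern on the key makes it less likely that an unrelated amount elsewhere in the document is picked up, while the capture group keeps the key out of the extracted value.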

Creating a Document Extraction Workflow using Regex Extractor

A Document Extraction Workflow with Regex Extractor is created in two steps.

  1. Creating a regex extractor definition to specify the field names and their types. Here the regular expressions to identify and extract each field are configured.

  2. Creating a Document AI client with Document AI credentials in the Studio Workflow and running the Run Extractor activity.

The following example walks through the creation of a document extraction workflow using a Regex Extractor.

Creating the Extractor Definition

Launching the Create Extractor Window

Creating a Document Extraction workflow begins with creating the Document Extractor Definition file. The definition file contains information about the type of extraction, the fields that need to be extracted, and how each field is identified and extracted.

The Extractor Definition is created in the Create Extractor Window, which is launched from the Add Item Window. Click the Add Item button in the project pane.

In the Add Item Window, click AI Skills -> Document AI -> New Document Extractor.

Creating the definition

Users can either load and edit an existing definition or create a new one. To create a definition, the user selects an Extractor Type from the available types.

Click on Create -> Regex.

In the Configure Client section, set the Document AI Endpoint and API Key, then click Apply.

In the Configure Fields section, add new fields by clicking the Add button. Let’s add two fields, Total and Date, with the output formats Number and Date respectively. The field definition differs for each type of extractor.

Launch the Configure Extractor Window by clicking the Launch Extractor Trainer button.
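To illustrate what the Number and Date output formats imply for the two example fields, here is a minimal Python sketch that converts raw extracted strings into typed values. It does not reflect how the product performs the conversion. The comma as thousands separator, the day/month/year date format, and the names to_number, to_date and fields are all assumptions.

```python
from datetime import date, datetime
from decimal import Decimal

def to_number(raw: str) -> Decimal:
    """Convert a string such as '1,250.00' to a Decimal (assumes ',' separates thousands)."""
    return Decimal(raw.replace(",", "").strip())

def to_date(raw: str, fmt: str = "%d/%m/%Y") -> date:
    """Parse a date string; the day/month/year format is an assumption."""
    return datetime.strptime(raw.strip(), fmt).date()

# Hypothetical field list mirroring the two example fields and their output formats.
fields = {"Total": to_number, "Date": to_date}

# Raw strings as a regex might return them.
raw_values = {"Total": "1,250.00", "Date": "14/03/2024"}

typed = {name: convert(raw_values[name]) for name, convert in fields.items()}
print(typed)
```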

Configuring the Sample Files

In the Files section, click the Browse button to select the folder containing the sample files.

Preprocessing the document

Check the Enable Text Preprocessing checkbox to include text preprocessing for the document.

In the Text Preprocessing section, click Apply OCR. This creates a Text View tab containing the OCR text of the selected document.

Add preprocessors to remove or replace unwanted lines, text or characters from the document. The extraction is applied to the preprocessed document. In this example we will add a Remove Lines preprocessor.

Click the Add dropdown button in the Text Preprocessing section and select Line -> Remove Lines. This adds a Remove Lines preprocessor to the preprocessors list. Select the With line numbers option; in this example, the preprocessor removes line number 1. Make sure the preprocessor's checkbox is checked so that it is selected for preprocessing, then click the Apply button. This applies all selected preprocessors to the currently selected document.

Applying the preprocessing creates a Preprocessed tab containing the text after preprocessing.
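The effect of removing lines by line number can be sketched in a few lines of Python. This is only an illustration of the idea, assuming 1-based line numbers as in the example above. The function name remove_lines and the sample OCR text are hypothetical.

```python
def remove_lines(text: str, line_numbers: set[int]) -> str:
    """Drop the lines whose 1-based line number is in line_numbers."""
    kept = [
        line
        for number, line in enumerate(text.splitlines(), start=1)
        if number not in line_numbers
    ]
    return "\n".join(kept)

# Hypothetical OCR text whose first line is an unwanted header.
ocr_text = "ACME STORES LTD\nDate: 14/03/2024\nTotal: 1,250.00"

preprocessed = remove_lines(ocr_text, {1})
print(preprocessed)
```

Because extraction runs on the preprocessed text, removing noisy header or footer lines first lets the field regexes stay simpler.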

Configuring the Extractor

In the Extractors section, add the regex used to identify and extract each field. Select the Ignore case regex option for the Date field to make its matching case-insensitive. Click Apply to test the regex on the current document. Click Apply All to apply the extraction to all documents.

The Results section shows the result of the extraction.
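As a rough illustration of per-field regexes with an ignore-case switch, and of applying them to every sample document, consider the Python sketch below. It uses Python's re module rather than PCRE. The configuration layout, the patterns, the file names and the function apply_extractors are assumptions and do not reflect the definition file format.

```python
import re

# Hypothetical per-field configuration: a pattern plus an ignore-case switch.
extractors = {
    "Total": {"pattern": r"Total:\s*([\d,]+\.\d{2})", "ignore_case": False},
    "Date": {"pattern": r"date:\s*(\d{2}/\d{2}/\d{4})", "ignore_case": True},
}

def apply_extractors(document_text):
    """Run every field's regex on one document and return {field: value or None}."""
    result = {}
    for field, config in extractors.items():
        flags = re.IGNORECASE if config["ignore_case"] else 0
        match = re.search(config["pattern"], document_text, flags)
        result[field] = match.group(1) if match else None
    return result

# Hypothetical sample documents (already OCR'd and preprocessed).
documents = {
    "invoice_a.txt": "DATE: 14/03/2024\nTotal: 1,250.00",
    "invoice_b.txt": "Date: 02/01/2024\ntotal: 80.50",
}

# Counterpart of "Apply All": run the same extractors over every sample document.
results = {name: apply_extractors(text) for name, text in documents.items()}
for name, fields in results.items():
    print(name, fields)
```

In this sketch the Date pattern matches both "DATE:" and "Date:" because it ignores case, while the case-sensitive Total pattern misses the lowercase "total:" in the second document. Testing against several sample files is what surfaces this kind of gap.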

Click Save Changes to save the definition in a definition file.

Creating the Workflow

Three activities are required for creating the Extraction Workflow: Create Document AI Client, Preprocess Document and Run Extractor. This example also uses the Show Validation Window activity to view the extracted data.

Configuring Create Document AI Client activity

Add the Create Document AI Client activity to the Workflow. Click the Configure button to launch the Create Document AI Client Window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze. The Extraction Type must match the created extractor definition: since the created extractor is a Regex Extractor, it needs text extraction, so set the Extraction Type to Text.

Save the configuration.

Configuring Preprocess Document activity

Add the Preprocess Document activity to the Workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.

Configuring Run Extractor activity

Add the Run Extractor activity to the Workflow. Assign the DocAIClient variable to the Document AI Client property. Assign the ProcessedDocument variable from the Preprocess Document activity to the Processed Document property. Set the Extractor Definition property to the path of the created Extractor Definition file.

Configuring Show Validation Window activity

Finally, add the Show Validation Window activity and assign the ExtractionResult variable from the Run Extractor activity to the Extraction Result property. Run the workflow. The extraction is applied to the selected file and the results are displayed in the Validation Window.
