Form Extractor

Document Extraction using key-value pairs

Form Extractor uses machine learning models to identify and extract key-value pairs from documents. Use Form Extractor instead of Skill Extractor when the available skill models are not sufficient for extracting the required fields. Form Extractor requires additional configuration to map the keys identified in a document to the user-defined fields whose values are to be extracted. This is necessary because the key for the same field varies slightly from document to document. For example, the field Invoice Number can appear as Invoice Number in one document and as Invoice No. in another. The user therefore maps the keys in the document to the field by using keywords. In this example, the user can use Invoice, Number and No as the keywords for mapping.
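The mapping idea can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function name map_field, the sample key-value pairs, and the rule that any one keyword is enough (compared as a case-insensitive substring) are all assumptions made for the example.

```python
def map_field(pairs, field, keywords):
    """Return {field: value} for the first key containing any keyword, else {}."""
    for key, value in pairs.items():
        lowered = key.lower()
        # Assumption: one matching keyword is enough, compared case-insensitively.
        if any(word.lower() in lowered for word in keywords):
            return {field: value}
    return {}


# Illustrative key-value pairs, as they might be read from two different invoices.
doc_a = {"Customer": "Acme", "Invoice Number": "INV-001"}
doc_b = {"Customer": "Globex", "Invoice No.": "INV-002"}

keywords = ["Invoice", "Number", "No"]
print(map_field(doc_a, "InvoiceNumber", keywords))
print(map_field(doc_b, "InvoiceNumber", keywords))
```

Both documents end up with a value under the same field name even though their keys differ, which is the purpose of the keyword configuration.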

Creating a Document Extraction Workflow using Form Extractor

A Document Extraction Workflow with Form Extractor is created in two steps.

  1. Creating a Form Extractor definition to specify the field names and their types. The keywords that map the keys in the document to the fields are also configured in the definition.

  2. Creating a Document AI client with Document AI credentials in the Studio Workflow and running the Run Extractor activity with the Form definition.

Creating the Extractor Definition

Launching the Create Extractor Window

The Extractor Definition is created in the Create Extractor Window, which is launched from the Add Item Window. To open the Add Item Window, click the Add Item button in the project pane.

In the Add Item Window, click AI Skills -> Document AI -> New Document Extractor.

Creating the Definition

Users can either load and edit an existing definition or create a new one. Let’s create a new Form Extractor definition.

Click on Create -> Form.

In the Configure Client section, set the Document AI Endpoint and API Key, then click Apply.

In the Configure Fields section, add new fields by clicking the Add button. Let’s add two fields, InvoiceTotal and InvoiceDate, with the types Text and Date respectively.

Launch the Configure Extractor Window by clicking on the Launch Extractor Trainer button.

Adding Sample Files

In the Files section, click on the Browse button to select the folder containing sample files.

Configuring the Regex Extractor

In the Extractors section, add the keywords that identify each field. Click the Add button to add new keywords to InvoiceTotal. Add the keyword total and keep the toggle button in the Included state. This matches all the keys in the document that contain the word total and includes them for mapping to the InvoiceTotal field.

Add another keyword entry, sub,due, and set the toggle to Excluded. This matches keys in the document that contain the words sub and due, and excludes them from mapping to the field. In the options, check the Match Any option so that just one of the keywords, sub or due, is enough to exclude a key.

Click the Add button to add new keywords to InvoiceDate. Add the keyword date and keep the toggle button in the Included state. This matches all the keys in the document that contain the word date and includes them for mapping to the InvoiceDate field.

Add another keyword, due, and set the toggle to Excluded. This matches any key in the document that contains the word due and excludes that key from mapping.

For all the keywords across both fields, check the Ignore Case option so that matching is case insensitive. Click the Apply All button to apply the extraction to all the documents.
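The combined effect of the Included and Excluded keywords, Match Any and Ignore Case can be sketched as follows. This is a hedged illustration only: the KeywordRule class, the key_maps_to_field function and the assumption that all keywords must match when Match Any is unchecked are hypothetical and are not taken from the product.

```python
import re
from dataclasses import dataclass


@dataclass
class KeywordRule:
    keywords: list
    included: bool = True     # True = Included, False = Excluded
    match_any: bool = False   # assumed default: every keyword must appear
    ignore_case: bool = True

    def hits(self, key: str) -> bool:
        """Check whether the key contains the keywords of this rule."""
        flags = re.IGNORECASE if self.ignore_case else 0
        found = [re.search(re.escape(k), key, flags) is not None
                 for k in self.keywords]
        return any(found) if self.match_any else all(found)


def key_maps_to_field(key: str, rules: list) -> bool:
    """A key maps to a field if an Included rule hits and no Excluded rule hits."""
    included = any(r.hits(key) for r in rules if r.included)
    excluded = any(r.hits(key) for r in rules if not r.included)
    return included and not excluded


# Rules mirroring the configuration described above.
invoice_total = [
    KeywordRule(["total"]),
    KeywordRule(["sub", "due"], included=False, match_any=True),
]
invoice_date = [
    KeywordRule(["date"]),
    KeywordRule(["due"], included=False),
]

for key in ["Total", "Subtotal", "Total Due", "Invoice Date", "Due Date"]:
    print(key, key_maps_to_field(key, invoice_total),
          key_maps_to_field(key, invoice_date))
```

Under these assumptions, Total maps to InvoiceTotal while Subtotal and Total Due are excluded, and Invoice Date maps to InvoiceDate while Due Date is excluded.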

The Results section will show the result of the extraction.

For the first document, the InvoiceDate output is 12/07 instead of 12/07/20121. Recall that the type of InvoiceDate was set to Date, which ensures that the output is a valid date.
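To see why a value such as 12/07/20121 cannot be returned as a date, consider the minimal check below. It is a sketch under stated assumptions: the list of accepted formats and the function is_valid_date are hypothetical, and how the product itself validates or trims a Date field is not described here.

```python
from datetime import datetime

# Assumed date formats for illustration; the product's accepted formats may differ.
DATE_FORMATS = ("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d")


def is_valid_date(text: str) -> bool:
    """Return True if the text parses completely with one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


print(is_valid_date("12/07/2021"))
print(is_valid_date("12/07/20121"))
```

A five-digit year leaves unparsed characters behind, so the second value is rejected by every assumed format.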

Click Save Changes to save the definition as a definition file.

Creating the Document Extraction Workflow

Three activities are required to create the Extraction Workflow: Create Document AI Client, Preprocess Document and Run Extractor. This example also uses the Show Validation Window activity to view the extracted data.

Configuring Create Document AI Client Activity

Add the Create Document AI Client activity to the Workflow. Click the Configure button to launch the Create Document AI Client Window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze. The Extraction Type must match the extractor definition that was created, so set the Extraction Type to Form.

Save the configuration.

Configuring Preprocess Document Activity

Add the Preprocess Document activity to the Workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.

Configuring Run Extractor Activity

Add the Run Extractor activity to the Workflow. In the Document AI Client property, assign the DocAIClient variable. In the Processed Document property, assign the ProcessedDocument variable from the Preprocess Document activity. In the Extractor Definition property, assign the path to the Extractor Definition file that was created.

Configuring Show Validation Window Activity

Finally, add the Show Validation Window activity and assign the ExtractionResult variable from the Run Extractor activity to the Extraction Result property. Run the workflow. The extraction is applied to the selected file and the results are displayed in the Validation Window.
