Form Extractor
Document Extraction using key-value pairs
Form Extractor uses machine learning models to identify and extract key-value pairs from documents. Use Form Extractor instead of Skill Extractor when the available skill models are not sufficient for extracting the required fields. Form Extractor requires additional configuration to map the keys identified in a document to the user-defined fields whose values are extracted. This is because the key for the same field varies slightly from document to document. For example, the field Invoice Number can appear as Invoice Number in one document and as Invoice No. in another. The user therefore maps the keys in the document to the field using keywords. In this example, the user can use Invoice, Number and No as the keywords for mapping.
Creating a Document Extraction Workflow using Form Extractor
A Document Extraction Workflow with Form Extractor is created in two steps.
Creating a Form Extractor definition to specify the field names and their types. Keywords to map the keys in the document are configured in the definition.
Creating a Document AI client with Document AI credentials in the Studio Workflow and running the Run Extractor activity with the Form definition.
Creating the Extractor Definition
Launching the Create Extractor Window
The Extractor Definition is created in the Create Extractor Window, which is launched from the Add Item Window. Click the Add Item button in the project pane.
In the Add Item Window, click AI Skills -> Document AI -> New Document Extractor.
Creating the Definition
Users can either load and edit an existing definition or create a new one. Let’s create a new Form Extractor definition.
Click Create -> Form.
In the Configure Client section, set the Document AI Endpoint and API Key, then click Apply.
In the Configure Fields section, add new fields by clicking the add button. Let’s add two fields, InvoiceTotal and InvoiceDate, with types Text and Date respectively.
Launch the Configure Extractor Window by clicking the Launch Extractor Trainer button.
Adding Sample Files
In the Files section, click the Browse button to select the folder containing the sample files.
Configuring the Regex Extractor
In the Extractors section, add the keywords that identify the fields. Click the Add button to add new keywords to InvoiceTotal. Add the keyword total and keep the toggle in the Included state. This matches every key in the document that contains the word total and includes it for mapping to the InvoiceTotal field.
Add another keyword entry, sub,due, and set the toggle to Excluded. This matches any key in the document that contains the word sub or due and excludes that key from mapping to the field. In the options, check the Match Any option. This ensures that only one of the keywords, sub or due, is required to exclude the match.
Click the Add button to add new keywords to InvoiceDate. Add the keyword date and keep the toggle in the Included state. This matches every key in the document that contains the word date and includes it for mapping to the InvoiceDate field.
Add another keyword, due, and set the toggle to Excluded. This matches any key in the document that contains the word due and excludes that key from mapping to the field.
For all the keywords across both fields, check the Ignore Case option so that matching is case-insensitive. Click the Apply All button to apply the extraction to all the documents.
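The include/exclude behavior configured above can be sketched in Python. This is a minimal model under stated assumptions, not the extractor's actual implementation: keywords are matched as substrings of the key, a rule without Match Any requires all of its keywords, and a key maps to a field when at least one Included rule matches and no Excluded rule matches. The names `KeywordRule` and `key_maps_to_field` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KeywordRule:
    """One keyword entry of a field (hypothetical model of the UI options)."""
    keywords: list[str]        # e.g. ["sub", "due"] for the entry "sub,due"
    included: bool = True      # False models the Excluded toggle state
    match_any: bool = False    # True: one keyword suffices; False: all are required
    ignore_case: bool = True   # models the Ignore Case option

    def matches(self, key: str) -> bool:
        # Assumption: a keyword matches when it occurs anywhere in the key.
        text = key.lower() if self.ignore_case else key
        words = [k.lower() if self.ignore_case else k for k in self.keywords]
        hits = [w in text for w in words]
        return any(hits) if self.match_any else all(hits)


def key_maps_to_field(key: str, rules: list[KeywordRule]) -> bool:
    """True if an Included rule matches the key and no Excluded rule does."""
    included = any(r.matches(key) for r in rules if r.included)
    excluded = any(r.matches(key) for r in rules if not r.included)
    return included and not excluded


# Rules mirroring the configuration described in this section.
invoice_total_rules = [
    KeywordRule(["total"]),
    KeywordRule(["sub", "due"], included=False, match_any=True),
]
invoice_date_rules = [
    KeywordRule(["date"]),
    KeywordRule(["due"], included=False),
]
```

Under these assumptions, illustrative keys such as Subtotal or Total Due (not taken from the sample files) are rejected for InvoiceTotal, and Due Date is rejected for InvoiceDate, while Invoice Total and Invoice Date are accepted.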
The Results section shows the result of the extraction.
For the first document, the InvoiceDate output is 12/07 instead of 12/07/20121. Remember that the type of InvoiceDate was set to Date, which ensures that the output is a valid date.
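The effect of the Date type can be illustrated with a strict date check. The sketch below only shows why a raw value such as 12/07/20121 is not a valid date; it does not reproduce how the extractor arrives at 12/07. The helper name `parse_valid_date` and the `%m/%d/%Y` format are assumptions made for illustration.

```python
from datetime import datetime
from typing import Optional


def parse_valid_date(text: str, fmt: str = "%m/%d/%Y") -> Optional[datetime]:
    """Return a datetime if text is a valid date in the assumed format, else None."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        # Raised for malformed input, e.g. a five-digit year.
        return None


raw_value = "12/07/20121"
is_valid = parse_valid_date(raw_value) is not None  # False: the year has five digits
```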
Click Save Changes to save the definition as a definition file.
Creating the Document Extraction Workflow
Three activities are required to create the Extraction Workflow: Create Document AI Client, Preprocess Document and Run Extractor. This example also uses the Show Validation Window activity to view the extracted data.
Configuring Create Document AI Client Activity
Add the Create Document AI Client activity to the Workflow. Click the Configure button to launch the Create Document AI Client Window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze.
The Extraction Type should match the created extractor definition. Set the Extraction Type to Form.
Save the configuration.
Configuring Preprocess Document Activity
Add the Preprocess Document activity to the Workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.
Configuring Run Extractor Activity
Add the Run Extractor activity to the Workflow. Assign the DocAIClient variable to the Document AI Client property. Assign the ProcessedDocument variable from the Preprocess Document activity to the Processed Document property. Assign the path of the created Extractor Definition file to the Extractor Definition property.
Configuring Show Validation Window Activity
Finally, add the Show Validation Window activity and assign the ExtractionResult variable from the Run Extractor activity to the Extraction Result property. Run the workflow. The extraction is applied to the selected file and the results are displayed in the ValidationWindow.
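The way variables pass between the activities can be summarized as a small data-flow model. The activities are configured in Studio and are not called from Python; every class and function below is a hypothetical stand-in that only mirrors which output feeds which property. The endpoint, API key and file paths are placeholders.

```python
from dataclasses import dataclass


@dataclass
class DocAIClient:            # output of Create Document AI Client
    endpoint: str
    api_key: str
    provider: str
    extraction_type: str


@dataclass
class ProcessedDocument:      # output of Preprocess Document
    client: DocAIClient
    input_file: str


@dataclass
class ExtractionResult:       # output of Run Extractor
    client: DocAIClient
    document: ProcessedDocument
    definition_path: str


def create_document_ai_client(endpoint: str, api_key: str) -> DocAIClient:
    # Provider and Extraction Type as set in this walkthrough.
    return DocAIClient(endpoint, api_key, provider="Visualyze", extraction_type="Form")


def preprocess_document(client: DocAIClient, input_file: str) -> ProcessedDocument:
    return ProcessedDocument(client, input_file)


def run_extractor(client: DocAIClient, document: ProcessedDocument,
                  definition_path: str) -> ExtractionResult:
    return ExtractionResult(client, document, definition_path)


# Wiring in the same order as the workflow (all values are placeholders).
doc_ai_client = create_document_ai_client("<document-ai-endpoint>", "<api-key>")
processed_document = preprocess_document(doc_ai_client, "<path-to-pdf-or-image>")
extraction_result = run_extractor(doc_ai_client, processed_document,
                                  "<path-to-definition-file>")
```

In the workflow itself, `extraction_result` corresponds to the ExtractionResult variable that is passed to the Show Validation Window activity.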