# Regex Extractor

Regex Extractor uses regular expressions (regex) to identify and extract the required fields from documents. It is useful when a suitable skill model is available but the required field is not. It does require the user to be familiar with regular expressions. Regex Extractor uses the PCRE regex engine. The user is free to choose how to model the regex for extracting a field, but the common practice is to match the pattern of the value alone, or of both the key and the value, and then take only the value part.
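The two common modelling styles can be sketched as follows. This is a minimal illustration in Python, whose `re` module is not the PCRE engine used by Regex Extractor, although the simple patterns shown here behave the same way in both. The sample text, the patterns and the variable names are assumptions made for the example, not values taken from the product.

```python
import re

# Hypothetical OCR text, invented for this illustration.
text = "Invoice No: 1042\nDate: 12/03/2021\nTotal: 1,234.50\n"

# Style 1: match only the pattern of the value.
value_only_pattern = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
date_match = value_only_pattern.search(text)
date_value = date_match.group(0) if date_match else None

# Style 2: match the key and the value together,
# then keep only the value by reading the capture group.
key_value_pattern = re.compile(r"Total\s*:\s*([\d,]+\.\d{2})")
total_match = key_value_pattern.search(text)
total = total_match.group(1) if total_match else None

print(date_value)
print(total)
```

Matching the key as well as the value is usually the safer choice when the same value pattern (for example, a number with two decimals) can appear several times in a document.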

## Creating a Document Extraction Workflow using Regex Extractor

A Document Extraction Workflow with Regex Extractor is created in two steps.

1. Create a regex extractor definition that specifies the field names and their types. The regular expressions used to identify and extract each field are configured here.
2. Create a Document AI client with Document AI credentials in the Studio Workflow and run the [`Run Extractor`](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/run-extractor) activity.

The following example walks through creating a document extraction workflow using a Regex Extractor.

### Creating the Extractor Definition

#### Launching the Create Extractor Window

Creating a Document Extraction workflow begins by creating the Document Extractor Definition file. The definition file contains information about the type of extraction, the fields that need to be extracted and how each field is identified and extracted.

The Extractor Definition is created using the [Create Extractor Window](https://docs.visualyze.ai/getting-started/rpa-studio/document-ai/document-extractor/create-extractor-window). The Create Extractor Window is launched from the Add Item Window. Click the `Add Item` button in the project pane.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FyoGFUs3E0WvMmMzgW92c%2Fimage.png?alt=media\&token=6f64f380-724b-4fe4-9b40-d14fc964f438)

In the Add Item Window, click `AI Skills` -> `Document AI` -> `New Document Extractor`.

#### Creating the definition

Users can either load and edit an existing definition or create a new one. To create a definition, select an Extractor Type from the available types.

Click on `Create` -> `Regex`.

In the `Configure Client` section, set the Document AI Endpoint and API Key, then click `Apply`.

<div align="left"><img src="https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2F21UjqWuGmD5bEbvy5Mkz%2Fimage.png?alt=media&#x26;token=5aafee16-bff2-4916-8d48-2abe904dcc5d" alt=""></div>

In the `Configure Fields` section, add new fields by clicking the add button. Let’s add two fields, `Total` and `Date`, with the output formats `Number` and `Date` respectively. The field definition differs for each type of extractor.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FGiRu8Wy88GJ3nHAmP2GV%2Fimage.png?alt=media\&token=6d5ddc51-33ef-45f3-9737-532ebaf479d0)
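To make the role of the output format concrete, the sketch below converts raw matched strings into typed values for the `Total` and `Date` fields. It is only an illustration of the idea. The converter functions, the `fields` mapping, the sample strings and the day/month/year date layout are all assumptions, and the sketch does not reflect the product's definition file format or its actual conversion rules.

```python
from datetime import date, datetime
from decimal import Decimal


def to_number(raw: str) -> Decimal:
    """Convert a matched string such as '1,234.50' to a number (hypothetical helper)."""
    return Decimal(raw.replace(",", "").strip())


def to_date(raw: str, fmt: str = "%d/%m/%Y") -> date:
    """Convert a matched string to a date; the day/month/year layout is an assumption."""
    return datetime.strptime(raw.strip(), fmt).date()


# Hypothetical mapping of field name to output-format converter.
fields = {"Total": to_number, "Date": to_date}

# Hypothetical raw strings, as a regex might capture them.
raw_values = {"Total": "1,234.50", "Date": "12/03/2021"}

typed = {name: fields[name](raw) for name, raw in raw_values.items()}
print(typed)
```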

Launch the [Configure Extractor Window](https://docs.visualyze.ai/getting-started/rpa-studio/document-ai/document-extractor/configure-extractor-window) by clicking on the `Launch Extractor Trainer` button.

#### Configuring the Sample Files

In the `Files` section, click on the `Browse` button to select the folder containing sample files.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FIuqc2sYlx2kfxTZZ7gug%2Fimage.png?alt=media\&token=4818621e-c9c4-43ef-af07-1e066d95f39f)

#### Preprocessing the document

Check the `Enable Text Preprocessing` checkbox to include text preprocessing for the document.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FT2xMvdRXrONrO1LeRQ18%2Fimage.png?alt=media\&token=d479fba3-b92d-4a50-bae1-9606417894c0)

In the `Text Preprocessing` section, click on `Apply OCR`. This creates a `Text View` tab that contains the OCR text of the selected document.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FWwB2sJxEPbT1SZxKHxvv%2Fimage.png?alt=media\&token=97c2bc61-e4c2-46cd-8999-8d59842fa067)

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FlzhXvdQAQdfNPmECKbqn%2Fimage.png?alt=media\&token=bf6e8226-dc18-46ab-81af-8e9a58244d2f)

Add preprocessors to remove or replace unwanted lines, text or characters from the document. The extraction is applied to the preprocessed document. In this example we will add a `Remove Lines` preprocessor.

Click the `Add` dropdown button in the [`Text Preprocessing`](https://docs.visualyze.ai/getting-started/rpa-studio/document-ai/text-preprocessing) section and select `Line` -> `Remove Lines`. This adds a `Remove Lines` preprocessor to the preprocessors list. Select the `With line numbers` option. In this example, the preprocessor removes the line with line number `1`. Make sure the preprocessor's checkbox is checked so that it is included in preprocessing, then click the `Apply` button. This applies all selected preprocessors to the currently selected document.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FrtryXmkvLLJ0JhtP79Vg%2Fimage.png?alt=media\&token=404e9109-142d-4949-ba4a-270be60acd1e)

Applying the preprocessing creates a `Preprocessed` tab that shows the text after preprocessing.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FhpktTAwNx4WQWldj53Er%2Fimage.png?alt=media\&token=5dd0a732-8e26-4bd2-b4cd-e5fa448843fc)
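The effect of a `Remove Lines` preprocessor configured with line numbers can be approximated with the short sketch below. The function name, the sample text and the assumption that line numbers start at 1 are illustrative only; the product's own implementation is not shown in this guide.

```python
def remove_lines(text: str, line_numbers: set[int]) -> str:
    """Drop the lines whose number is in line_numbers (numbering assumed to start at 1)."""
    kept = [
        line
        for number, line in enumerate(text.splitlines(), start=1)
        if number not in line_numbers
    ]
    return "\n".join(kept)


# Hypothetical OCR text whose first line is an unwanted header.
ocr_text = "ACME STORE\nDate: 12/03/2021\nTotal: 1,234.50"

# Remove line number 1, as in the example above.
preprocessed = remove_lines(ocr_text, {1})
print(preprocessed)
```

Removing noisy lines before extraction keeps the regex simpler, because the patterns no longer need to guard against accidental matches in headers or footers.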

#### Configuring the Extractor

In the `Extractors` section, add the regex used to identify and extract each field. Select the `Ignore case` regex option for the `Date` field to make its matching case-insensitive. Click `Apply` to test the regex on the current document. Click `Apply All` to apply the extraction to all documents.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FIxaRhLW9EsedU9tdFNYE%2Fimage.png?alt=media\&token=635335f3-1769-481d-a259-8d0a34f91740)

The `Results` section will show the result of the extraction.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2F3HGjhwB5rsOlX9cDVrkr%2Fimage.png?alt=media\&token=ce2ae1e7-f9dc-46fa-a34e-d203c00508ea)
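The effect of the `Ignore case` option on the `Date` field can be sketched as below. The example uses Python's `re` module rather than the PCRE engine, and both the pattern and the sample text are hypothetical and are not the ones shown in the screenshots. In both engines, the inline modifier `(?i)` at the start of a pattern is another way to request case-insensitive matching.

```python
import re

# Hypothetical preprocessed text in which the key is written in upper case.
preprocessed_text = "DATE: 12/03/2021\nTotal: 1,234.50"

# Hypothetical pattern: match the key and the value, capture only the value.
date_pattern = r"Date\s*:\s*(\d{2}/\d{2}/\d{4})"

# Without the option, "Date" does not match "DATE".
case_sensitive = re.search(date_pattern, preprocessed_text)

# With case-insensitive matching, the key matches and the value is captured.
case_insensitive = re.search(date_pattern, preprocessed_text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive.group(1))
```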

Click `Save Changes` to save the definition in a definition file.

### Creating the Workflow

Three activities are required to create the Extraction Workflow: [Create Document AI Client](https://docs.visualyze.ai/rpa-studio/document-ai/create-document-ai-client), [Preprocess Document](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/preprocess-document) and [Run Extractor](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/run-extractor). This example also uses the [Show Validation Window](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/show-validation-window) activity to view the extracted data.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FuajXDPXkoS93bM3KSO5p%2Fimage.png?alt=media\&token=e51dc6e1-1590-4122-ab37-ca788883fe0e)

#### Configuring Create Document AI Client activity

Add the [`Create Document AI Client`](https://docs.visualyze.ai/rpa-studio/document-ai/create-document-ai-client) activity to the Workflow. Click on the Configure button to launch the [`Create Document AI Client Window`](https://docs.visualyze.ai/getting-started/rpa-studio/editor-windows/create-document-ai-client-window). In the `Client Authorization` section, provide the Document AI Endpoint and API Key. In the `Available Services` section, set the provider to `Visualyze`. The `Extraction Type` should be selected according to the created extractor definition. Since the created extractor is a `Regex Extractor`, it needs text extraction, so set the `Extraction Type` to `Text`.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FJjlc1tAzoGmTqEmkxMLL%2Fimage.png?alt=media\&token=4ead5b7c-eb3b-4ad8-a59d-ddc2dfd6c10e)

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2F1grRsnhmIG163SCwIKaP%2Fimage.png?alt=media\&token=4a6a48e2-b6a6-4517-9cb0-0cb29c02fc6b)

Save the configuration.

#### Configuring Preprocess Document activity

Add the [`Preprocess Document`](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/preprocess-document) activity to the Workflow. Assign the `DocAIClient` variable from the `Create Document AI Client` activity to the `Document AI Client` property. Set the `Input File` property to the path of a PDF or image file.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2F0M3qaVdDTRdqA0w9eQGa%2Fimage.png?alt=media\&token=a16d5cb9-4763-4804-8685-a82905bdbb08)

#### Configuring Run Extractor activity

Add the [`Run Extractor`](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/run-extractor) activity to the Workflow. Assign the `DocAIClient` variable to the `Document AI Client` property. Assign the [`ProcessedDocument`](https://docs.visualyze.ai/getting-started/variables/activity-variables#processeddocument) variable from the `Preprocess Document` activity to the `Processed Document` property. Set the `Extractor Definition` property to the path of the created Extractor Definition file.

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FBSQZGE3wm3QY6bI7BkIo%2Fimage.png?alt=media\&token=93d1f1de-569d-4f53-bb29-1119dc0fc20b)

#### Configuring Show Validation Window activity

Finally, add the [`Show Validation Window`](https://docs.visualyze.ai/rpa-studio/document-ai/tasks/show-validation-window) activity and assign the [`ExtractionResult`](https://docs.visualyze.ai/getting-started/variables/activity-variables#extractionresult) variable from the `Run Extractor` activity to the `Extraction Result` property. Run the workflow. The extraction is applied to the selected file and the results are displayed in the [`Validation Window`](https://docs.visualyze.ai/getting-started/rpa-studio/document-ai/document-validation/validation-window).

![](https://1935494318-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-M927xRZSEM57y2sBcm3%2Fuploads%2FqOMbTwA8hyaAOrEZsYyM%2Fimage.png?alt=media\&token=97b3f8e0-854c-410d-8ad7-766159ee5a35)
