Keyword Classifier

Document Classification using keywords in text.

The Keyword Classifier uses user-specified keywords to classify a document. It works well when every document in a class contains at least one keyword that distinguishes it from the other classes. The classifier also supports regular expressions, so unique patterns can be used in place of unique keywords; the regex engine is PCRE. A user can also mark a word or pattern as excluded for a class, meaning a document that contains it is never assigned to that class. The Keyword Classifier works on text documents. For image and PDF documents, OCR is applied before classification.
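The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the `Keyword` dataclass, the `classify` function, the first-match-wins ordering, the whole-word matching of plain words, and the sample keywords are all hypothetical, and Python's `re` module stands in for PCRE (the two differ in some syntax).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Keyword:
    pattern: str            # a plain word or a regular expression
    is_regex: bool = False
    included: bool = True   # False: the keyword must never appear in the class

    def matches(self, text: str) -> bool:
        # Plain words are escaped and matched as whole words (an assumption).
        expr = self.pattern if self.is_regex else r"\b" + re.escape(self.pattern) + r"\b"
        return re.search(expr, text) is not None


def classify(text: str, classes: dict[str, list[Keyword]]) -> Optional[str]:
    """Return the first class with an included match and no excluded match."""
    for label, keywords in classes.items():
        if any(k.matches(text) for k in keywords if not k.included):
            continue  # an excluded keyword rules this class out
        if any(k.matches(text) for k in keywords if k.included):
            return label
    return None


# Hypothetical keyword sets for two classes.
classes = {
    "Report": [
        Keyword(r"(Annual|Quarterly) Summary", is_regex=True),
        Keyword("Invoice", included=False),
    ],
    "Invoice": [Keyword("Invoice")],
}

print(classify("Invoice No. 1042, Quarterly Summary attached", classes))  # Invoice
print(classify("Quarterly Summary for Q3", classes))                      # Report
```

In the first call the excluded word rules out Report even though the Report pattern also matches, which is the behaviour the exclusion option is meant to provide.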

Creating a Document Classification Workflow using the Keyword Classifier

The workflow is created in two stages:

  1. Creating the Classifier Definition which defines the class labels and how each document is classified.

  2. Creating a Document AI client with Document AI credentials in the workflow and running the Classify Document activity.

The following example walks through creating the classification workflow.

Creating the Classifier Definition

Launching the Create Classifier Window

The Keyword Classifier Definition contains the labels and keywords for each class. The definition is created in the Create Classifier Window, which is launched from the Add Item Window. Click the ‘Add Item’ button in the project pane.

In the Add Item Window, click AI Skills -> Document AI -> New Document Classifier.

Creating the Definition

Users can either load and edit an existing definition or create a new one. To create a new definition, select a Classifier Type from the available types.

Click on Create -> Keyword.

In the ‘Configure Client’ section, set the Document AI Endpoint and API Key, then click Apply.

In the Configure Classes section, add classes by clicking the Add button. Add two classes: Invoice and Report.

Configuring with sample files

Click the Sample Files button to select the sample folder for a class. Select a sample folder for the Invoice class. Samples for the Report class will be added later.

Launch the Configure Classifier Window by clicking on the Launch Classifier Trainer button.

The Files section shows the sample files for each class. Add samples to the Report class by clicking the Browse button in the Report label.

Preprocessing the document

Check the Enable Text Preprocessing checkbox to enable text preprocessing.

In the Text Preprocessing section, click Apply OCR. This creates a Text View tab that contains the OCR text of the selected document.

Add preprocessors to remove or replace unwanted lines, text, or characters from the document. Classification is performed only after preprocessing has been applied. In this example we will add a Remove Lines preprocessor.

Click the ‘Add’ dropdown button in the Text Preprocessing section and select Line -> Remove Lines. This adds a Remove Lines preprocessor to the preprocessors list. Select the With line numbers option; in this example, the line with line number ‘1’ is removed. Make sure the preprocessor's checkbox is checked so that it is selected for preprocessing. Then click the Apply button. This applies all selected preprocessors to the currently selected document.

Applying the preprocessing creates a Preprocessed tab that shows the text after preprocessing.
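As a rough picture of what a Remove Lines step with line numbers does to the OCR text, the following Python sketch drops lines by their 1-based number. The function name `remove_lines` and the sample text are hypothetical; the product's preprocessor may handle blank lines or line endings differently.

```python
def remove_lines(text: str, line_numbers: set[int]) -> str:
    """Drop the lines whose 1-based number is in line_numbers."""
    kept = [
        line
        for number, line in enumerate(text.splitlines(), start=1)
        if number not in line_numbers
    ]
    return "\n".join(kept)


# Hypothetical OCR output whose first line is a letterhead we want to drop.
ocr_text = "ACME Corp letterhead\nInvoice No. 1042\nTotal: 250.00"
preprocessed = remove_lines(ocr_text, {1})
print(preprocessed)
```

Here the letterhead on line 1 is removed and the remaining two lines are passed on unchanged, which corresponds to the text that would appear in the Preprocessed tab.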

Configuring the Keyword Classifier

For each class, keywords must be added so that they can be compared against the document. In the Classes section, click the Add dropdown button on each class and add a Regex or Word keyword. For the Invoice class, add a Word keyword. For the Report class, add a Regex keyword and a Word keyword. Select the Ignore case Regex Option on the Regex keyword to make the matching case-insensitive. On the Word keyword of the Report class, uncheck the Included toggle to exclude the keyword. This makes sure that a document containing the word Invoice is never classified as a Report. Click Apply to test the classification on the current document. Click Apply All to apply the classification to all documents.
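The effect of the Ignore case option can be illustrated with Python's `re` module, used here as a stand-in for PCRE. The pattern and the sample text are hypothetical; in PCRE itself, case-insensitive matching corresponds to the `i` option or an inline `(?i)`.

```python
import re

# Hypothetical Regex keyword for the Report class.
pattern = r"quarterly\s+report"
text = "QUARTERLY REPORT 2023"

case_sensitive = re.search(pattern, text)
ignore_case = re.search(pattern, text, flags=re.IGNORECASE)

print(case_sensitive is not None)  # False: the case differs
print(ignore_case is not None)     # True: case is ignored
```

Without the option, a keyword only matches text written in exactly the same case, which is fragile for OCR output where headings are often capitalized.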

The Results section shows the result of the classification.

Click Save Changes to save the definition in a definition file.

Creating the Document Classification Workflow

The document classification workflow needs three activities: Create Document AI Client, Preprocess Document, and Classify Document.

Configuring Create Document AI Client Activity

Add the Create Document AI Client activity to the workflow. Click the Configure button to launch the Create Document AI Client Window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze. The Extraction Type must match the classifier definition that was created. The Keyword Classifier requires text extraction, so set the Extraction Type to Text.

Save the configuration.

Configuring Preprocess Document Activity

Add the Preprocess Document activity to the workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.

Configuring Classify Document Activity

Add the Classify Document activity to the workflow. In the Document AI Client property, assign the DocAIClient variable. In the Processed Document property, assign the ProcessedDocument variable from the Preprocess Document activity. In the Classifier Definition property, assign the path to the Classifier Definition file created earlier.

Run the workflow. If it executes successfully, a Classification Result variable is created.
