Keyword Classifier
Document Classification using keywords in text.
The Keyword Classifier uses user-specified keywords to classify a document. It works well when every document in a class contains at least one unique keyword that distinguishes it from the other classes. The classifier also supports regular expressions, so unique patterns can be used in place of unique keywords. PCRE is the regex engine used by the Keyword Classifier. A user can also specify a word or pattern that must never appear in documents of a class. The Keyword Classifier works on text documents; for image and PDF documents, OCR is applied before classification.
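The matching logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ClassRule` and `classify` are hypothetical names, not part of the product, and the rule that the first matching class wins is an assumption, since the product's behaviour when several classes match is not described here. Python's `re` module stands in for PCRE, and the two engines differ in some syntax.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClassRule:
    """Hypothetical container for one class's keywords (not the product's format)."""
    label: str
    include: List[str] = field(default_factory=list)  # patterns that indicate the class
    exclude: List[str] = field(default_factory=list)  # patterns that rule the class out


def classify(text: str, rules: List[ClassRule]) -> Optional[str]:
    """Return the label of the first rule that matches, or None if none match.

    Assumption: rules are checked in order and the first match wins.
    """
    for rule in rules:
        # An excluded pattern disqualifies the class outright.
        if any(re.search(pattern, text) for pattern in rule.exclude):
            continue
        # Otherwise a single included pattern is enough to select the class.
        if any(re.search(pattern, text) for pattern in rule.include):
            return rule.label
    return None
```

For example, a rule with `include=[r"Summary"]` and `exclude=[r"Invoice"]` would match a document containing "Annual Summary" but not one containing "Invoice Summary".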
The workflow is created in two stages:
Creating the Classifier Definition, which defines the class labels and how each document is classified.
Creating a Document AI client with Document AI credentials in the workflow and running the Classify Document activity.
The following example walks through creating the classification workflow.
The Keyword Classifier Definition contains the labels and keywords for each class. The definition is created using the Create Classifier Window, which is launched from the Add Item Window. Click the ‘Add Item’ button in the project pane.
In the Add Item Window, click AI Skills -> Document AI -> New Document Classifier.
Users can either load and edit an existing definition or create a new one. To create a new definition, select a Classifier Type from the available types.
Click Create -> Keyword.
In the ‘Configure Client’ section, set the Document AI Endpoint and API Key, then click Apply.
In the Configure Classes section, add classes by clicking the Add button. Add two classes, Invoice and Report.
Click the Sample Files button to select the sample folder for a class. Select the sample folder for the Invoice class. Samples for the Report class will be selected later.
Launch the Configure Classifier Window by clicking the Launch Classifier Trainer button.
The Files section shows the sample files for each class. Add samples to the Report class by clicking the Browse button on the Report label.
Check the Enable Text Preprocessing checkbox to enable text preprocessing.
In the Text Preprocessing section, click Apply OCR. This creates a Text View tab containing the OCR text of the selected document.
Add preprocessors to remove or replace unwanted lines, text, or characters from the document. Classification is performed only after preprocessing has been applied. In this example we will add a Remove Lines preprocessor.
Click the ‘Add’ dropdown button in the Text Preprocessing section and select Line -> Remove Lines. This adds a Remove Lines preprocessor to the preprocessors list. Select the With line numbers option. This will remove the line with line number ‘1’. Make sure the preprocessor's checkbox is checked so that it is selected for preprocessing. Then click the Apply button. This applies all selected preprocessors to the currently selected document.
Applying the preprocessing creates a Preprocessed tab containing the text after preprocessing.
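The effect of a Remove Lines preprocessor with line numbers can be approximated as follows. This is a minimal sketch, not the product's implementation: `remove_lines` is a hypothetical helper, and 1-based line numbering is assumed because the example above refers to line number ‘1’.

```python
from typing import Iterable


def remove_lines(text: str, line_numbers: Iterable[int]) -> str:
    """Drop the given lines from text (assumption: line numbers are 1-based)."""
    to_remove = set(line_numbers)
    kept = [
        line
        for number, line in enumerate(text.splitlines(), start=1)
        if number not in to_remove
    ]
    return "\n".join(kept)


ocr_text = "Scanned by ACME Scanner\nQuarterly Report\nTotal revenue: 100"
preprocessed = remove_lines(ocr_text, [1])
```

Here `preprocessed` holds the last two lines only, so a scanner banner on the first line can no longer influence keyword matching.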
For each class, keywords must be added so that they can be compared against the document. In the Classes section, click the Add dropdown button on each class and add a Regex or Word keyword. For the Invoice class, add a Word keyword. For the Report class, add a Regex keyword and a Word keyword. Select the Ignore case regex option on the Regex keyword to make the matching case-insensitive. Uncheck the Included toggle on the Report class's Word keyword to exclude it. This ensures that a document containing the word Invoice is never classified as a Report. Click Apply to test the classification on the current document. Click Apply All to apply the classification to all documents.
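The difference between a Word keyword and a Regex keyword with Ignore case can be illustrated with Python's `re` module. This is a sketch under assumptions: `word_matches` and `regex_matches` are hypothetical helpers, treating a Word keyword as a literal, case-sensitive, whole-word match is a guess rather than documented behaviour, and Python's regex syntax is not identical to PCRE.

```python
import re


def word_matches(word: str, text: str) -> bool:
    """Literal whole-word match (assumed semantics of a Word keyword)."""
    # re.escape makes characters such as '.' or '+' match literally.
    return re.search(r"\b" + re.escape(word) + r"\b", text) is not None


def regex_matches(pattern: str, text: str, ignore_case: bool = False) -> bool:
    """Pattern match, optionally case-insensitive like the Ignore case option."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, text, flags) is not None


heading = "QUARTERLY REPORT 2023"
case_sensitive = regex_matches(r"quarterly\s+report", heading)
case_insensitive = regex_matches(r"quarterly\s+report", heading, ignore_case=True)
```

With these definitions, `case_sensitive` is `False` and `case_insensitive` is `True`, which is why the Ignore case option matters when OCR output varies in capitalization.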
The Results section shows the result of the classification.
Click Save Changes to save the definition to a definition file.
The document classification workflow needs three activities: Create Document AI Client, Preprocess Document, and Classify Document.
Add the Create Document AI Client activity to the workflow. Click the Configure button to launch the Create Document AI Client window. In the Client Authorization section, provide the Document AI Endpoint and API Key. In the Available Services section, set the provider to Visualyze. The Extraction Type should be selected according to the created classifier definition. The Keyword Classifier requires text extraction, so set the Extraction Type to Text.
Save the configuration.
Add the Preprocess Document activity to the workflow. Assign the DocAIClient variable from the Create Document AI Client activity to the Document AI Client property. Set the Input File property to the path of a PDF or image file.
Add the Classify Document activity to the workflow. In the Document AI Client property, assign the DocAIClient variable. In the Processed Document property, assign the ProcessedDocument variable from the Preprocess Document activity. In the Classifier Definition property, assign the path to the created Classifier Definition file.
Run the workflow. If the workflow executes successfully, a Classification Result variable is created.