Text Preprocessing

Remove unwanted content from text

Text preprocessing removes unwanted characters, words, or lines from a document before extraction or classification is performed. Although it is not a required step, adding text preprocessors can improve the accuracy of the task. Text preprocessors can be added only in Regex Extractor and Keyword Classifier.

Preprocessors are applied in the order in which they appear in the preprocessors list. If there are two preprocessors, the second one receives the output of the first.
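The chaining behaviour described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not the product's implementation; `apply_preprocessors`, `usd_to_symbol`, and `remove_dollar_signs` are hypothetical names invented for this example.

```python
from typing import Callable, List

# Assumption for this sketch: a preprocessor is any function from text to text.
Preprocessor = Callable[[str], str]


def apply_preprocessors(text: str, preprocessors: List[Preprocessor]) -> str:
    """Run the preprocessors in list order; each one receives the previous output."""
    for preprocess in preprocessors:
        text = preprocess(text)
    return text


# Two made-up preprocessors, used only to show that order matters.
def usd_to_symbol(text: str) -> str:
    return text.replace("USD", "$")


def remove_dollar_signs(text: str) -> str:
    return text.replace("$", "")
```

With the input "USD 5", running `usd_to_symbol` first and `remove_dollar_signs` second leaves " 5", because the second step sees the "$" produced by the first. Reversing the list leaves "$ 5", because there is no "$" to remove when the removal step runs.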

Line

  1. Remove Lines: Removes the specified lines.

    • Contains: Removes the lines that contain the specified words.

    • Starts With: Removes the lines that start with the specified word.

    • Ends With: Removes the lines that end with the specified word.

    • With line numbers: Removes all the lines with the specified line numbers. For example, to remove the first, third, and ninth lines, specify the input as 1,3,9.
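The four Remove Lines options can be approximated as follows. This is a minimal sketch under stated assumptions: matching is case-sensitive, lines are compared exactly as they appear (no trimming), and line numbers are 1-based. The product may behave differently on any of these points, and all function names are hypothetical.

```python
from typing import Iterable


def remove_lines_containing(text: str, words: Iterable[str]) -> str:
    """Drop every line that contains at least one of the given words."""
    words = list(words)
    kept = [line for line in text.splitlines()
            if not any(word in line for word in words)]
    return "\n".join(kept)


def remove_lines_starting_with(text: str, word: str) -> str:
    """Drop every line that starts with the given word."""
    return "\n".join(line for line in text.splitlines()
                     if not line.startswith(word))


def remove_lines_ending_with(text: str, word: str) -> str:
    """Drop every line that ends with the given word."""
    return "\n".join(line for line in text.splitlines()
                     if not line.endswith(word))


def remove_lines_by_number(text: str, spec: str) -> str:
    """Drop lines by 1-based number, given a comma-separated spec such as '1,3,9'."""
    numbers = {int(part) for part in spec.split(",") if part.strip()}
    kept = [line for index, line in enumerate(text.splitlines(), start=1)
            if index not in numbers]
    return "\n".join(kept)
```

For a four-line document, `remove_lines_by_number(document, "1,3")` keeps only the second and fourth lines.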

Text

  1. Replace Text: Replaces the specified text, or the text matched by the regular expression, with the given replacement text.

  2. Remove Text: Removes the specified text, or the text matched by the regular expression.

  3. Remove from list: Removes the words listed in the selected text file.
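The three Text preprocessors can be sketched with Python's standard `re` module. The function names, the `use_regex` flag, the one-word-per-line file format, and whole-word matching for the word list are all assumptions made for illustration; the source does not specify them.

```python
import re
from typing import Iterable, List


def replace_text(text: str, pattern: str, replacement: str,
                 use_regex: bool = False) -> str:
    """Replace literal text, or regex matches when use_regex is True."""
    if use_regex:
        return re.sub(pattern, replacement, text)
    return text.replace(pattern, replacement)


def remove_text(text: str, pattern: str, use_regex: bool = False) -> str:
    """Removing text is the same as replacing it with an empty string."""
    return replace_text(text, pattern, "", use_regex)


def load_word_list(path: str) -> List[str]:
    """Assumed file format: one word per line, blank lines ignored."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def remove_from_list(text: str, words: Iterable[str]) -> str:
    """Remove every listed word, matching whole words only (an assumption)."""
    words = [word for word in words if word]
    if not words:
        return text
    # Try longer words first so a short word never pre-empts a longer one.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.sub(rf"\b(?:{alternation})\b", "", text)
```

Note that in literal mode a pattern such as "." matches only a real full stop, while in regex mode the same pattern matches any character, so the two modes can give very different results for the same input.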

Character

  1. Remove Character: Removes all the specified characters from the document text. For example, to remove all ‘$’, ‘5’, and ‘#’ characters, specify the input as ‘$#5’.
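Character removal treats the input as a set of individual characters rather than as a single string to match. A minimal Python sketch of that behaviour, using a hypothetical function name, could look like this:

```python
def remove_characters(text: str, characters: str) -> str:
    """Delete every occurrence of each character in `characters` from `text`."""
    # str.maketrans with a third argument maps each listed character to None.
    return text.translate(str.maketrans("", "", characters))
```

With the input ‘$#5’, the text "A$5#B" becomes "AB". The order of the characters in the input does not matter, and every ‘5’ is removed, including those inside larger numbers such as 25.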
