Text data requirements

This page describes the requirements for training models on text data.

Classification

Data requirements

  • At least 20, and no more than 1,000,000, training documents.

  • At least 2, and no more than 5,000, unique category labels.

  • You must apply each label to at least 10 documents. For multi-label classification, you can apply one or more labels to a document (a validation sketch follows this list).
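
These limits can be checked programmatically before you upload a dataset. The following is a minimal sketch that assumes the dataset is held in memory as a list of (document_text, labels) pairs; the function name and data shape are illustrative and are not part of the Robot AI API.

  # Illustrative check of the limits above; assumes the dataset is a list of
  # (document_text, labels) pairs. Not a Robot AI API.
  from collections import Counter

  def check_classification_dataset(examples):
      if not 20 <= len(examples) <= 1_000_000:
          raise ValueError("Need between 20 and 1,000,000 training documents.")

      label_counts = Counter(label for _, labels in examples for label in labels)
      if not 2 <= len(label_counts) <= 5_000:
          raise ValueError("Need between 2 and 5,000 unique category labels.")

      rare = [label for label, count in label_counts.items() if count < 10]
      if rare:
          raise ValueError(f"Each label needs at least 10 documents; too few for: {rare}")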

Best practices for text data used to train AutoML models

  • Use training data that is as varied as the data on which predictions will be made. Include documents of different lengths, documents authored by different people, documents that use different wording or style, and so on.

  • When using multi-label classification, apply all relevant labels to each document. For example, if you are labeling documents that provide details about pharmaceuticals, you might have a document that is labeled with Dosage and Side Effects.

  • Use documents that can be easily categorized by a human reader. AutoML models can't generally predict category labels that humans can't assign. So, if a human can't be trained to assign a label by reading a document, your model likely can't be trained to do it either.

  • Provide as many training documents per label as possible. You can improve the confidence scores from your model by using more examples per label. Train a model using 50 examples per label and evaluate the results. Add more examples and retrain until you meet your accuracy targets, which might require hundreds or even 1,000 examples per label.

  • The model works best when there are at most 100 times more documents for the most common label than for the least common label. We recommend removing very low-frequency labels.

  • If you have a large number of documents that don't currently match your labels, filter them out so that your model doesn't skew predictions toward an out-of-domain label. For example, you could use a filtering model that predicts whether a document fits within the current set of labels or is out of domain. After filtering, a second model would classify only the in-domain documents (see the sketch after this list).
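
As an illustration of the last two points, the sketch below includes a helper that measures label imbalance and a two-stage prediction flow in which a filtering model screens out out-of-domain documents before a topic classifier runs. The function names and model objects are hypothetical placeholders, not Robot AI APIs.

  from collections import Counter

  # Ratio between the most and least common label across the dataset;
  # the guidance above suggests keeping this at or below 100.
  def label_imbalance_ratio(examples):
      counts = Counter(label for _, labels in examples for label in labels)
      return max(counts.values()) / min(counts.values())

  # Two-stage flow: a filter model decides whether a document is in domain,
  # and only in-domain documents reach the label classifier.
  def predict_label(document, filter_model, topic_model):
      if filter_model.predict(document) == "out_of_domain":
          return None  # skip documents that don't match any current label
      return topic_model.predict(document)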

Entity extraction

Entity extraction training data consists of documents annotated with labels that mark the types of entities you want your model to identify. For example, you might create an entity extraction model to identify specialized terminology in legal documents or patents. Annotations specify the locations of the entities that you're labeling and the labels themselves.

If you're annotating structured or semi-structured documents, such as invoices or contracts, for a dataset used to train AutoML models, Robot AI can consider an annotation's position on the page as a factor contributing to its proper label. For example, a real estate contract has both an acceptance date and a closing date. Robot AI can learn to distinguish between the entities based on the spatial position of the annotation.
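
As an illustration of what an annotation can look like, here is a minimal sketch of one annotated document using character offsets. The field names and structure are assumptions made for the example, not Robot AI's actual import schema.

  # One hypothetical training document with a single annotated entity.
  example_document = {
      "text": "The closing date for the property is June 30, 2024.",
      "annotations": [
          {
              "label": "closing_date",
              "start_offset": 37,  # text[37:50] == "June 30, 2024"
              "end_offset": 50,
          },
      ],
  }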

Data requirements

  • At least 50, and no more than 100,000, training documents.

  • At least 1, and no more than 100, unique labels to annotate entities that you want to extract.

  • You can use a label to annotate between 1 and 10 words. Label names can be between 2 and 30 characters (a validation sketch follows this list).
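
As with classification, these limits can be checked before upload. The sketch below assumes documents shaped like the annotation example earlier on this page; the function name is illustrative and not part of the Robot AI API.

  # Illustrative checks of the limits above; not a Robot AI API.
  def check_extraction_dataset(documents):
      if not 50 <= len(documents) <= 100_000:
          raise ValueError("Need between 50 and 100,000 training documents.")

      labels = set()
      for doc in documents:
          for ann in doc["annotations"]:
              labels.add(ann["label"])
              if not 2 <= len(ann["label"]) <= 30:
                  raise ValueError(f"Label name must be 2-30 characters: {ann['label']}")
              span = doc["text"][ann["start_offset"]:ann["end_offset"]]
              if not 1 <= len(span.split()) <= 10:
                  raise ValueError(f"An annotation must cover 1-10 words: {span!r}")

      if not 1 <= len(labels) <= 100:
          raise ValueError("Need between 1 and 100 unique labels.")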

Best practices for text data used to train AutoML models for entity extraction

  • Use each label at least 200 times in your training dataset (a counting sketch follows this list).

  • Annotate every occurrence of entities that you want your model to identify.
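
To see which labels fall short of that recommendation, you can count label usage across all annotations. The sketch below reuses the hypothetical document shape from the earlier examples and is not part of the Robot AI API.

  from collections import Counter

  # Count how often each label is used across all annotations and report
  # labels that appear fewer than the recommended 200 times.
  def underused_labels(documents, minimum=200):
      counts = Counter(ann["label"] for doc in documents for ann in doc["annotations"])
      return {label: count for label, count in counts.items() if count < minimum}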
