Extract Table From PDF

Extract tables from a PDF.

This activity extracts tables from a user-specified PDF and returns them in the selected output format. The output of the activity is a list of tables.

Input

  • Filename: Object argument, required. The full path of the input PDF file.

  • Output Format: The output format of the parsed tables.

    • DataTable (default) - Extracts each table from the PDF as a DataTable.

    • JSON - Extracts each table from the PDF as JSON.

    • CSV - Extracts each table from the PDF as CSV.

  • Configuration:

    Specifies the selector that contains the user-selected area from which the table is parsed.

  • Extraction Method:

    Specifies the method for detecting cells in the user-specified PDF document. The available options are:

    • Lattice - Uses gridlines to identify cells in the given PDF document. This method cannot be applied to scanned documents.

    • Stream - Uses whitespace to detect cells in the given PDF document.

    • Custom - Uses user-specified regular expressions to extract data from each PDF page. Configure the regular expressions by launching the Build DataTable Window with the configure table button. Each page yields one row of the extracted table.

  • Page Range: String argument, required. Specifies the page number or range of page numbers to process.

The page range can be specified as follows:

"0" - All pages

"1" - Page 1

"1-5" - Pages 1 to 5

"1,4,6" - Pages 1, 4, and 6

"1,3,5-9,12" - Pages 1, 3, 5 to 9, and 12

"^1" - Last page of the given PDF

"^2" - Second-to-last page of the given PDF

"1,^1" - Page 1 and the last page of the given PDF

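The Page Range syntax above can be illustrated with a small Python sketch. This is not the activity's own implementation; the function name resolve_page_range and its page_count parameter are hypothetical, and the handling of out-of-range pages and of "^" inside a range (for example "2-^1") is an assumption, since the page does not specify either.

```python
def resolve_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a Page Range string into a list of 1-based page numbers."""
    spec = spec.strip()
    if spec == "0":
        # "0" selects every page of the document.
        return list(range(1, page_count + 1))

    def to_page(token: str) -> int:
        token = token.strip()
        if token.startswith("^"):
            # "^n" counts from the end: "^1" is the last page.
            page = page_count - int(token[1:]) + 1
        else:
            page = int(token)
        # Assumption: pages outside the document are rejected.
        if not 1 <= page <= page_count:
            raise ValueError(f"page {token!r} is outside 1..{page_count}")
        return page

    pages = []
    for part in spec.split(","):
        if "-" in part:
            # "a-b" is an inclusive range of pages.
            start, end = (to_page(t) for t in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range {part!r}")
            pages.extend(range(start, end + 1))
        else:
            pages.append(to_page(part))
    return pages
```

Under these assumptions, resolve_page_range("1,3,5-9,12", 12) gives pages 1, 3, 5, 6, 7, 8, 9, and 12, and resolve_page_range("1,^1", 10) gives pages 1 and 10.
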
Output

  • Output:

    Saves the output as a list of tables in the specified variable. The list contains DataTable, CSV, or JSON items, according to the selected Output Format.
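
The Custom extraction method described under Input produces one row per page from user-specified regular expressions. The Python sketch below shows that idea only in outline: extract_rows, the column names, the patterns, and the sample page text are all hypothetical, and the sketch assumes that the first capture group (or the whole match, if the pattern has no group) becomes the cell value.

```python
import re


def extract_rows(page_texts, column_patterns):
    """Build one row per page by applying one regex per column."""
    rows = []
    for text in page_texts:
        row = {}
        for column, pattern in column_patterns.items():
            match = re.search(pattern, text)
            if match is None:
                # Assumption: a column with no match is left empty.
                row[column] = None
            elif match.groups():
                # Use the first capture group when the pattern has one.
                row[column] = match.group(1)
            else:
                row[column] = match.group(0)
        rows.append(row)
    return rows


# Hypothetical page text and patterns, for illustration only.
pages = [
    "Invoice No: 1001\nTotal: 250.00",
    "Invoice No: 1002\nTotal: 99.50",
]
patterns = {
    "Invoice": r"Invoice No:\s*(\d+)",
    "Total": r"Total:\s*([\d.]+)",
}
rows = extract_rows(pages, patterns)
```

Here each of the two sample pages contributes one row with an Invoice and a Total value, which mirrors the "one row per page" behavior of the Custom method.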
