Extract Table From PDF
Extract tables from a PDF.
Last updated
This activity extracts tables from a user-specified PDF and returns them in the selected output format. The output of the activity is a list of tables.
Filename: Required
The full path of the input PDF file.
Output Format:
Specifies the output format of the parsed tables. The available options are:
DataTable (default)
- Extracts each table from the PDF as a DataTable.
JSON
- Extracts each table from the PDF as JSON.
CSV
- Extracts each table from the PDF as CSV.
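To make the difference between the two text-based output formats concrete, the sketch below serializes one small table as CSV and as JSON using only the Python standard library. The sample table, the function names to_csv and to_json, and the JSON layout (a list of row objects keyed by the header row) are assumptions for illustration; the activity's actual serialization may differ.

```python
import csv
import io
import json

# Hypothetical extracted table: the first row is treated as the header.
table = [
    ["Item", "Qty"],
    ["Widget", "2"],
    ["Gadget", "5"],
]

def to_csv(rows):
    """Serialize rows as CSV text, one line per row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Serialize rows as a JSON array of objects keyed by the header row."""
    header, *body = rows
    return json.dumps([dict(zip(header, row)) for row in body])

print(to_csv(table))
print(to_json(table))
```

A DataTable output has no direct standard-library equivalent in Python, so it is left out of this sketch.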
Configuration:
Specifies the selector for the user-selected area of the page from which the table is parsed.
Extraction Method:
Specifies the method for detecting cells in the user-specified PDF document. The available options are:
Lattice
- Uses gridlines to identify cells in the PDF document. This method cannot be applied to scanned documents.
Stream
- Uses whitespace to detect cells in the PDF document.
Custom
- Uses user-specified regular expressions to extract data from each PDF page. Configure the regular expressions in the Build DataTable window, which is opened with the configure table button. Each page yields one row of data in the extracted table.
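The Stream and Custom ideas described above can be sketched in a few lines of Python. This is an illustration of the general techniques, not the activity's implementation: the function names stream_cells and custom_rows, the two-space threshold for a cell boundary, the one-capture-group-per-column convention, and the sample page text are all assumptions. Lattice is omitted because it depends on gridline geometry rather than on text alone.

```python
import re

def stream_cells(line):
    """Stream-style split: treat runs of two or more whitespace
    characters as cell boundaries (threshold is an assumption)."""
    return re.split(r"\s{2,}", line.strip())

def custom_rows(pages, patterns):
    """Custom-style extraction: one row per page, one regex per column.

    Each pattern is expected to contain one capture group; a column
    whose pattern does not match is filled with an empty string.
    """
    rows = []
    for text in pages:
        row = {}
        for column, pattern in patterns.items():
            match = re.search(pattern, text)
            row[column] = match.group(1) if match else ""
        rows.append(row)
    return rows

# Hypothetical page text and column patterns.
pages = [
    "Invoice No: 1001\nTotal: 250.00",
    "Invoice No: 1002\nTotal: 99.50",
]
patterns = {
    "invoice": r"Invoice No:\s*(\d+)",
    "total": r"Total:\s*([\d.]+)",
}

print(stream_cells("Widget    2    9.99"))
print(custom_rows(pages, patterns))
```

With two pages of input, custom_rows returns two rows, which mirrors the one-row-per-page behavior of the Custom method.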
Page Range: Required
Specifies the page number, or the range of page numbers, to be processed.
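As a rough illustration of how a page range can be expanded into individual page numbers, the sketch below parses a text specification in Python. The syntax it accepts (a single number such as "3", a hyphenated range such as "2-5", and comma-separated combinations) is an assumption, as is the function name parse_page_range; the syntax the activity actually accepts is not stated here.

```python
def parse_page_range(spec):
    """Expand a spec such as "3" or "2-5,8" into a sorted list of
    1-based page numbers (assumed syntax, for illustration only)."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)

print(parse_page_range("3"))
print(parse_page_range("2-5,8"))
```

Rejecting reversed or non-positive ranges up front keeps an invalid value from silently producing an empty page list.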
Output:
Saves the output as a list of tables in the specified variable. Each item in the list is a DataTable, CSV, or JSON table, depending on the selected Output Format.