Extract Table From PDF
Extract tables from a PDF.
This activity extracts tables from a user-specified PDF and returns them as a list of tables in the selected output format.
Input
Filename:
Object Argument
Required
The full path of the input PDF file.
Output Format:
The output format of the parsed tables.
DataTable (default)
- Extracts the table from the PDF in DataTable format.
JSON
- Extracts the table from the PDF in JSON format.
CSV
- Extracts the table from the PDF in CSV format.
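To make the difference between the text-based output formats concrete, the following Python sketch serializes one small table as CSV and as JSON. It is only an illustration: the sample data, the helper names `to_csv` and `to_json`, and the JSON layout (a list of objects keyed by column header) are assumptions, not the activity's documented serialization.

```python
import csv
import io
import json

# Hypothetical sample table: one header row and two data rows.
header = ["Item", "Qty"]
rows = [["Widget", "2"], ["Gadget", "5"]]


def to_csv(header, rows):
    """Serialize the table as CSV text, one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(header, rows):
    """Serialize the table as a JSON array of objects (assumed layout)."""
    return json.dumps([dict(zip(header, row)) for row in rows])


csv_text = to_csv(header, rows)
json_text = to_json(header, rows)
```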
Configuration:
Specifies the selector containing the user-selected area from which the table is parsed.
Extraction Method:
Specifies the method for detecting cells in the user-specified PDF document. The available options are:
Lattice
- Uses gridlines to identify cells in the given PDF document. This method cannot be applied to scanned documents.
Stream
- Uses whitespace to detect cells in the given PDF document.
Custom
- Uses user-specified regular expressions to extract data from each PDF page. Configure the regular expressions by launching the Build DataTable Window using the configure table button. Each page yields one row of the extracted table.
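The Custom method's "one row per page" behavior can be sketched in Python as below. This is a minimal illustration, not the activity's implementation: the function `rows_from_pages`, the sample page text, and the column patterns are all hypothetical, and the sketch assumes the page text has already been extracted from the PDF.

```python
import re


def rows_from_pages(page_texts, column_patterns):
    """Build one row per page by applying one regex per column."""
    rows = []
    for text in page_texts:
        row = {}
        for column, pattern in column_patterns.items():
            match = re.search(pattern, text)
            if match is None:
                # No match on this page: leave the cell empty.
                row[column] = None
            elif match.re.groups:
                # Pattern has a capture group: use the first one.
                row[column] = match.group(1)
            else:
                # No capture group: use the whole match.
                row[column] = match.group(0)
        rows.append(row)
    return rows


# Hypothetical text of two PDF pages.
pages = [
    "Invoice No: INV-001\nTotal: 120.50",
    "Invoice No: INV-002\nTotal: 89.00",
]
# Hypothetical column definitions.
patterns = {
    "invoice": r"Invoice No:\s*(\S+)",
    "total": r"Total:\s*([\d.]+)",
}
rows = rows_from_pages(pages, patterns)
```

Each dictionary in `rows` corresponds to one page, so a two-page input produces a two-row table.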
Page Range:
String Argument
Required
Specifies the page number or range of page numbers to process.
The page range can be specified as follows:
"0" - All pages
"1" - Page 1
"1-5" - Pages 1 to 5
"1,4,6" - Pages 1, 4 and 6
"1,3,5-9,12" - Pages 1, 3, 5 to 9 and 12
"^1" - Last page of the given PDF
"^2" - Second-to-last page of the given PDF
"1,^1" - Page 1 and the last page of the given PDF
Output
Output:
Saves the output as a list of tables in the specified variable. The list contains DataTable, CSV, or JSON items, depending on the selected Output Format.