tabula-py

tabula-py is a simple Python wrapper of tabula-java, which can read tables from PDF files. It lets you read tables from a PDF and convert them into a pandas DataFrame.

Requirements

Java
- Confirmed working with Java 7, 8
pandas

Usage

Install

pip install tabula-py

Example

tabula-py enables you to extract tables from a PDF into a DataFrame or JSON. It can also extract tables from a PDF and save them to a file as CSV, TSV or JSON.

import tabula

# Read pdf into DataFrame (any of the options listed under Options can be passed as keyword arguments)
df = tabula.read_pdf("test.pdf", pages="all")

# Read remote pdf into DataFrame
df2 = tabula.read_pdf("https://github.com/tabulapdf/tabula-java/raw/master/src/test/resources/technology/tabula/arabic.pdf")

# convert PDF into CSV
tabula.convert_into("test.pdf", "output.csv", output_format="csv")

See example notebook
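
Since the output file can be CSV, TSV or JSON, one way to avoid repeating the format is to derive it from the output file's extension. The following is a minimal sketch, not part of tabula-py: `format_for` and `convert` are hypothetical helpers, and it assumes that `"tsv"` and `"json"` are spelled analogously to the `"csv"` value used above.

```python
import os

# Formats that the text above says tabula-py can write.
SUPPORTED_FORMATS = {".csv": "csv", ".tsv": "tsv", ".json": "json"}


def format_for(output_path):
    """Return the output_format string matching the file extension (hypothetical helper)."""
    ext = os.path.splitext(output_path)[1].lower()
    try:
        return SUPPORTED_FORMATS[ext]
    except KeyError:
        raise ValueError("unsupported output extension: %r" % ext)


def convert(pdf_path, output_path):
    """Convert the tables in pdf_path, choosing the format from output_path."""
    import tabula  # imported lazily: needs tabula-py and Java installed

    tabula.convert_into(pdf_path, output_path, output_format=format_for(output_path))


# Example (requires Java and a real PDF):
# convert("test.pdf", "output.tsv")
```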

Options

pages (str, int, list of int, optional):
- An optional value specifying the pages to extract from. It accepts a str, an int, or a list of int.
- Example: 1, '1-2,3', 'all' or [1,2]. Default is 1
guess (bool, optional):
- Guess the portion of the page to analyze per page.
area (list of float, optional):
- Portion of the page to analyze, given as (top, left, bottom, right).
- Example: [269.875, 12.75, 790.5, 561]. Default is the entire page
spreadsheet (bool, optional):
- Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
nospreadsheet (bool, optional):
- Force PDF not to be extracted using spreadsheet-style extraction (if there are no ruling lines separating each cell)
password (str, optional):
- Password to decrypt document. Default is empty
silent (bool, optional):
- Suppress all stderr output.
columns (list, optional):
- X coordinates of column boundaries.
- Example: [10.1, 20.2, 30.3]
format (str, optional):
- Format for output file or extracted object. (CSV, TSV, JSON)
output_path (str, optional):
- Output file path. Its file format depends on the format option.
- Same as --outfile option of tabula-java.
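
The options above are passed to `tabula.read_pdf` as keyword arguments. As a rough sketch of how several of them might be combined, the hypothetical helper `extraction_options` below collects `pages`, `area`, `spreadsheet` and `password` into a dict and checks that the area is ordered as (top, left, bottom, right). Neither helper is part of tabula-py, and the validation is an assumption added for illustration.

```python
def extraction_options(pages=1, area=None, spreadsheet=False, password=None):
    """Assemble keyword arguments for tabula.read_pdf (hypothetical helper)."""
    opts = {"pages": pages}
    if area is not None:
        top, left, bottom, right = area
        # The area is (top, left, bottom, right), so bottom/right must be larger.
        if bottom <= top or right <= left:
            raise ValueError("area must be (top, left, bottom, right)")
        opts["area"] = [top, left, bottom, right]
    if spreadsheet:
        opts["spreadsheet"] = True
    if password:
        opts["password"] = password
    return opts


def read_tables(pdf_path, **kwargs):
    """Read tables from pdf_path using the options assembled above."""
    import tabula  # imported lazily: needs tabula-py and Java installed

    return tabula.read_pdf(pdf_path, **extraction_options(**kwargs))


# Example (requires Java and a real PDF):
# df = read_tables("test.pdf", pages="1-2,3", area=[269.875, 12.75, 790.5, 561])
```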

FAQ

Can I use option `xxx`?

Yes. You can pass it through the options argument as follows. The format is the same as the command line of tabula-java.

tabula.read_pdf(file_path, options="--columns 10.1,20.2,30.3")
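
If the column boundaries are already available as numbers, the option string can be built rather than typed by hand. This is a small sketch only: `columns_option` is a hypothetical helper, and the result is meant to be passed as the `options` argument shown above.

```python
def columns_option(boundaries):
    """Format column x coordinates as a tabula-java style --columns option."""
    return "--columns " + ",".join(str(x) for x in boundaries)


options = columns_option([10.1, 20.2, 30.3])
# options is now "--columns 10.1,20.2,30.3"
```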

How can I ignore useless area?

In short, you can extract only the table by combining the area and spreadsheet options.

In [4]: tabula.read_pdf('./table.pdf', spreadsheet=True, area=(337.29, 226.49, 472.85, 384.91))
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Out[4]:
  Unnamed: 0 Col2 Col3 Col4 Col5
0          A    B   12    R    G
1        NaN    R    T   23    H
2          B    B   33    R    A
3          C    T   99    E    M
4          D    I   12   34    M
5          E    I    I    W   90
6        NaN    1    2    W    h
7        NaN    4    3    E    H
8          F    E   E4    R    4

How to use area option

The tabula-java wiki explains how to specify the area: https://github.com/tabulapdf/tabula-java/wiki/Using-the-command-line-tabula-extractor-tool#grab-coordinates-of-the-table-you-want

For example, using macOS's Preview, I got the area information of this PDF:

java -jar ./target/tabula-0.9.0-jar-with-dependencies.jar -p all -a $y1,$x1,$y2,$x2 -o $csvfile $filename

Here, note the left, top, height, and width parameters of the selection and calculate the coordinates as follows:

y1 = top
x1 = left
y2 = top + height
x2 = left + width

I confirmed with tabula-java:

java -jar ./tabula/tabula-0.9.1-jar-with-dependencies.jar -g -r -a "337.29,226.49,472.85,384.91" table.pdf

Without the -r option (same as --spreadsheet), it does not work properly.
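
The conversion from a left/top/width/height selection to the y1,x1,y2,x2 order described above can be sketched in Python. The function names here are hypothetical and not part of tabula-py; the arithmetic simply follows the four formulas given earlier.

```python
def selection_to_area(left, top, width, height):
    """Convert a left/top/width/height selection to (top, left, bottom, right)."""
    y1 = top
    x1 = left
    y2 = top + height
    x2 = left + width
    return (y1, x1, y2, x2)


def area_argument(area):
    """Format an area tuple as the comma-separated value used with -a."""
    return ",".join(str(v) for v in area)


area = selection_to_area(left=20, top=10, width=40, height=30)
# area == (10, 20, 40, 60); area_argument(area) == "10,20,40,60"
```

The resulting tuple follows the same (top, left, bottom, right) order as the area option of tabula-py.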
