pdfbox-users mailing list archives

From Manuel Aristarán <man...@jazzido.com>
Subject Re: How to logically read text from a PDF table?
Date Tue, 18 Jul 2017 15:31:44 GMT
Hi Dane,

As you might know, there's no such thing as a table in a PDF file. The only
way to extract them is to try to reconstruct the tabular arrangement from
the characters' positions, ruling lines, and so on. I'm one of the
maintainers of Tabula [1], which is a tool based on PDFBox that implements
a number of algorithms to attempt that. We have a GUI tool [2], and a Java
library [3]. Both are open source (MIT license).
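
To give a rough idea of what "reconstructing from positions" means, here is
a minimal sketch in plain Java. It is not Tabula's algorithm and it does not
call PDFBox; the Chunk class, the toGrid method and the tolerance values are
hypothetical names made up for this illustration. It assumes you already have
pieces of text with their x/y coordinates (for instance collected from a
PDFTextStripper subclass), and it simply groups nearby y values into rows and
nearby left edges into columns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TableSketch {

    /** A positioned piece of text. Hypothetical type for this sketch only. */
    static final class Chunk {
        final String text;
        final double x; // left edge of the text
        final double y; // vertical position, measured from the top of the page

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Sorts the values and starts a new band whenever the gap to the previous
     * value exceeds tol. Returns the first (smallest) value of each band.
     */
    static List<Double> bands(List<Double> values, double tol) {
        List<Double> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.naturalOrder());
        List<Double> starts = new ArrayList<>();
        double previous = 0;
        for (double v : sorted) {
            if (starts.isEmpty() || v - previous > tol) {
                starts.add(v);
            }
            previous = v;
        }
        return starts;
    }

    /** Index of the band containing v: the last band whose start is <= v. */
    static int indexOf(List<Double> starts, double v) {
        int i = 0;
        while (i + 1 < starts.size() && starts.get(i + 1) <= v) {
            i++;
        }
        return i;
    }

    /** Builds a rows-by-columns grid of cell text; empty cells are "". */
    static List<List<String>> toGrid(List<Chunk> chunks, double rowTol, double colTol) {
        List<Double> xs = new ArrayList<>();
        List<Double> ys = new ArrayList<>();
        for (Chunk c : chunks) {
            xs.add(c.x);
            ys.add(c.y);
        }
        List<Double> rows = bands(ys, rowTol);
        List<Double> cols = bands(xs, colTol);

        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            List<String> row = new ArrayList<>();
            for (int c = 0; c < cols.size(); c++) {
                row.add("");
            }
            grid.add(row);
        }

        // Visit chunks roughly in reading order so text sharing a cell is
        // appended top-to-bottom, then left-to-right.
        List<Chunk> ordered = new ArrayList<>(chunks);
        ordered.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        for (Chunk c : ordered) {
            List<String> row = grid.get(indexOf(rows, c.y));
            int col = indexOf(cols, c.x);
            String old = row.get(col);
            row.set(col, old.isEmpty() ? c.text : old + " " + c.text);
        }
        return grid;
    }
}
```

Calling TableSketch.toGrid(chunks, 2.0, 5.0) returns one list per row. This
only works on left-aligned cells with clear gaps; right-aligned numbers,
spanning cells and ruling lines all need more careful handling, which is the
kind of thing the tools mentioned above try to deal with.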

Best,

[1] http://tabula.technology
[2] https://github.com/tabulapdf/tabula
[3] https://github.com/tabulapdf/tabula-java

--
Manuel Aristarán
jazzido.com



On Tue, Jul 18, 2017 at 9:28 AM, Dane Bezuidenhout <
dane.bezuidenhout@sprinthive.com> wrote:

> The examples available are clear on constructing a table, but there is
> little info on reading a table. I've investigated a few solutions to this,
> but feel that they are "hacky" in that they rely on establishing column and
> row regions to read text from.
>
> Surely there is a canonical way to traverse the PDDocument table elements
> and access table cells with reference to row and columns?
>
> Any advice would be appreciated.
>
>
> Dane Bezuidenhout
> SprintHive <https://sprinthive.com/>
>
> M: +27 82 562 7850
>
>
> vCard <http://www.sprinthive.com/files/dane.vcf>
>
