pdfbox-users mailing list archives

From Dane Bezuidenhout <dane.bezuidenh...@sprinthive.com>
Subject Re: How to logically read text from a PDF table?
Date Tue, 18 Jul 2017 15:35:51 GMT
Hi Manuel,

Thank you for the fast response, I will investigate Tabula.

Regards,

Dane

Dane Bezuidenhout
SprintHive <https://sprinthive.com/>

M: +27 82 562 7850


vCard <http://www.sprinthive.com/files/dane.vcf>

On Tue, Jul 18, 2017 at 5:31 PM, Manuel Aristarán <manuel@jazzido.com>
wrote:

> Hi Dane,
>
> As you might know, there's no such thing as tables in PDF files. The only
> way to extract them is to try to reconstruct the tabular arrangement from
> the characters' positions, ruling lines, and so on. I'm one of the
> maintainers of Tabula [1], which is a tool based on PDFBox that implements
> a number of algorithms to attempt that. We have a GUI tool [2], and a Java
> library [3]. Both are open source (MIT license).
>
> Best,
>
> [1] http://tabula.technology
> [2] https://github.com/tabulapdf/tabula
> [3] https://github.com/tabulapdf/tabula-java
>
> --
> Manuel Aristarán
> jazzido.com
>
>
>
> On Tue, Jul 18, 2017 at 9:28 AM, Dane Bezuidenhout <
> dane.bezuidenhout@sprinthive.com> wrote:
>
> > The examples available are clear on constructing a table, but there is
> > little info on reading a table. I've investigated a few solutions to this,
> > but feel that they are "hacky" in that they rely on establishing column
> > and row regions to read text from.
> >
> > Surely there is a canonical way to traverse the PDDocument table elements
> > and access table cells with reference to row and columns?
> >
> > Any advice would be appreciated.
> >
> >
> > Dane Bezuidenhout
> > SprintHive <https://sprinthive.com/>
> >
> > M: +27 82 562 7850
> >
> >
> > vCard <http://www.sprinthive.com/files/dane.vcf>
> >
>
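
To illustrate Manuel's point that a table has to be reconstructed from the
characters' positions, here is a minimal sketch in plain Java. It is not
Tabula's algorithm and it does not call PDFBox: TextChunk, TableSketch, and
the tolerance values are hypothetical names chosen for this example. It
assumes the text has already been extracted as chunks with a left edge x and
a vertical position y that grows down the page, and it ignores ruling lines,
merged cells, and wrapped cell text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical holder for one piece of extracted text and its page position.
class TextChunk {
    final String text;
    final double x; // left edge of the chunk
    final double y; // vertical position, assumed to grow down the page

    TextChunk(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

// Hypothetical helper: guesses rows and columns purely from coordinates.
class TableSketch {

    // Sorts the coordinates and keeps one representative per cluster; a value
    // starts a new cluster when it lies more than `tolerance` past the last one kept.
    static List<Double> clusterStarts(List<Double> values, double tolerance) {
        List<Double> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        List<Double> starts = new ArrayList<>();
        for (double v : sorted) {
            if (starts.isEmpty() || v - starts.get(starts.size() - 1) > tolerance) {
                starts.add(v);
            }
        }
        return starts;
    }

    // Returns the index of the cluster representative closest to `value`.
    static int nearestIndex(List<Double> starts, double value) {
        int best = 0;
        for (int i = 1; i < starts.size(); i++) {
            if (Math.abs(starts.get(i) - value) < Math.abs(starts.get(best) - value)) {
                best = i;
            }
        }
        return best;
    }

    // Builds a row-by-column grid; chunks that fall into the same cell are
    // joined with a space in left-to-right order.
    static String[][] toGrid(List<TextChunk> chunks, double xTolerance, double yTolerance) {
        List<Double> xs = new ArrayList<>();
        List<Double> ys = new ArrayList<>();
        for (TextChunk c : chunks) {
            xs.add(c.x);
            ys.add(c.y);
        }
        List<Double> colStarts = clusterStarts(xs, xTolerance);
        List<Double> rowStarts = clusterStarts(ys, yTolerance);

        String[][] grid = new String[rowStarts.size()][colStarts.size()];
        for (String[] row : grid) {
            Arrays.fill(row, "");
        }

        List<TextChunk> ordered = new ArrayList<>(chunks);
        ordered.sort((a, b) -> Double.compare(a.x, b.x));
        for (TextChunk c : ordered) {
            int r = nearestIndex(rowStarts, c.y);
            int col = nearestIndex(colStarts, c.x);
            grid[r][col] = grid[r][col].isEmpty() ? c.text : grid[r][col] + " " + c.text;
        }
        return grid;
    }

    public static void main(String[] args) {
        // Made-up coordinates with slight jitter, as is typical of extracted text.
        List<TextChunk> chunks = Arrays.asList(
                new TextChunk("Name", 50.0, 100.0), new TextChunk("Qty", 200.0, 100.5),
                new TextChunk("Apple", 50.0, 120.0), new TextChunk("3", 201.0, 120.0),
                new TextChunk("Pear", 50.4, 140.0), new TextChunk("7", 200.0, 139.8));
        for (String[] row : toGrid(chunks, 10.0, 3.0)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

The tolerances are exactly the kind of "column and row regions" guesswork the
original question calls hacky; a tool such as Tabula exists because getting
this right on real documents takes considerably more than the above.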
