pdfbox-users mailing list archives

From Frank van der Hulst <drifter.fr...@gmail.com>
Subject Re: Regarding Table in PdfBox
Date Tue, 30 Sep 2014 08:07:11 GMT
Hi Borris,

I've been working on that problem for a while, and I'm close to an answer
(actually, 2 answers). The problem is that there are more than 2 questions :)
A program clever enough to handle every way of drawing a table would be
huge and/or complex.

My first solution works OK for tables where you know the position of each
column boundary beforehand, so nothing in the PDF needs to delimit one
column from another. It also handles tables that wrap from one page to the
next. (You can get the source from
https://issues.apache.org/jira/browse/PDFBOX-2286).
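
To illustrate the known-boundaries idea, here is a minimal sketch. It is not
the code attached to PDFBOX-2286 and it does not call PDFBox at all: the
ColumnSplitter and Fragment names are hypothetical, and it assumes you have
already collected each piece of text on a line together with its starting x
coordinate (in PDFBox that would presumably come from the TextPosition
objects seen by a PDFTextStripper subclass).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class ColumnSplitter {

    /** A piece of text and the x coordinate where it starts (assumed input shape). */
    static final class Fragment {
        final double x;
        final String text;

        Fragment(double x, String text) {
            this.x = x;
            this.text = text;
        }
    }

    /**
     * Returns the index of the column containing x, or -1 if x is outside the table.
     * boundaries must be sorted ascending; column i spans [boundaries[i], boundaries[i + 1]).
     */
    static int columnOf(double x, double[] boundaries) {
        for (int i = 0; i + 1 < boundaries.length; i++) {
            if (x >= boundaries[i] && x < boundaries[i + 1]) {
                return i;
            }
        }
        return -1;
    }

    /** Builds the cells of one table row from fragments given in reading order. */
    static List<String> splitRow(List<Fragment> row, double[] boundaries) {
        int columns = boundaries.length - 1;
        List<StringBuilder> cells = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            cells.add(new StringBuilder());
        }
        for (Fragment f : row) {
            int c = columnOf(f.x, boundaries);
            if (c < 0) {
                continue; // text outside the table area is ignored
            }
            StringBuilder cell = cells.get(c);
            if (cell.length() > 0) {
                cell.append(' ');
            }
            cell.append(f.text);
        }
        List<String> out = new ArrayList<>();
        for (StringBuilder sb : cells) {
            out.add(sb.toString());
        }
        return out;
    }
}
```

With boundaries {50, 150, 300, 400}, fragments starting at x = 160 and
x = 200 both land in the second column and are joined with a space. A
fragment is assigned by its starting x only, so text that overflows its
column is not detected.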

My second solution (which isn't finished yet) uses the drawn graphic lines
to identify the column boundaries. This makes it easier to extract text
from the table automatically, but it doesn't work on tables that span
multiple pages. I also suspect that there are tables out there that draw
their lines differently from what my program expects.
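
The line-based idea can be sketched in the same hedged way. This is not the
unfinished program described above: RuleLineColumns and Segment are
hypothetical names, and the sketch assumes the page's straight line segments
have already been collected as end-point coordinates. It keeps the
near-vertical segments and merges x positions that lie within a tolerance of
each other into one column boundary. It only understands stroked line
segments, so a table whose rules are drawn some other way (for example as
thin filled rectangles) would need extra handling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
final class RuleLineColumns {

    /** A straight line segment in page coordinates (assumed input shape). */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1;
            this.y1 = y1;
            this.x2 = x2;
            this.y2 = y2;
        }
    }

    /**
     * Returns the sorted x positions of column boundaries found in the segments.
     * A segment counts as vertical when its end points differ in x by at most
     * tolerance and in y by more than tolerance. Sorted positions closer than
     * tolerance to the last accepted boundary are treated as the same boundary.
     */
    static List<Double> columnBoundaries(List<Segment> segments, double tolerance) {
        List<Double> xs = new ArrayList<>();
        for (Segment s : segments) {
            boolean vertical = Math.abs(s.x1 - s.x2) <= tolerance
                    && Math.abs(s.y1 - s.y2) > tolerance;
            if (vertical) {
                xs.add((s.x1 + s.x2) / 2.0);
            }
        }
        Collections.sort(xs);

        List<Double> boundaries = new ArrayList<>();
        for (double x : xs) {
            if (boundaries.isEmpty()
                    || x - boundaries.get(boundaries.size() - 1) > tolerance) {
                boundaries.add(x);
            }
        }
        return boundaries;
    }
}
```

The resulting boundary list has the same shape as the column positions that
the first approach needs to be told in advance.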

So both versions only handle small subsets of what I think is a very large
set of all possible tables. Both have difficulty with subscripts and
superscripts in the text, and with text that runs in any direction other
than horizontal, left to right.

I'm happy to share my source code so someone else can extend it to work
with more types of tables :)

Frank


On Tue, Sep 30, 2014 at 8:05 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> short answer: you can't, there is no "table" concept in PDF like in HTML
>
> long answer:
> https://stackoverflow.com/questions/3203790/parsing-pdf-files-especially-with-tables-with-pdfbox
>
> Tilman
>
> On 30.09.2014 at 08:57, Borris Bonafort wrote:
>
>> Hi,
>> How can I identify a table using PDFBox and extract the text from it?
>> Please help me with an idea of how to approach this.
>>
>> Thanks
>>   Borris
>>
>>
>
