pdfbox-users mailing list archives

From Ilija Pavlic <ilija.pav...@gmail.com>
Subject Re: How to use pdfbox extract table datas in pdf ?
Date Wed, 11 Jan 2012 14:51:33 GMT
There is no built-in functionality in PDFBox to retrieve tabular data
because there is (usually) no table markup in PDF documents.
Instead, tables are usually represented as absolutely positioned text,
with lines drawn around that text to form the borders of the table.

It is possible to find all the lines forming a table. Exactly how that
might work depends heavily on the document in question. For instance,
some documents draw three overlapping lines instead of one thick line.
See the answer to my recent question about finding lines in a document
for how to use PDF operators to do that. While it is certainly possible
with PDFBox, I haven't been able to do it yet, so I cannot give more
detailed information.
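
For the overlapping-lines case, here is a minimal sketch of one way to
collapse near-duplicate line positions into a single position per
border. It assumes the line coordinates have already been extracted
somehow; LineMerger, merge and the tolerance value are hypothetical
names for illustration, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class LineMerger {

    /**
     * Groups sorted coordinates into clusters in which each coordinate lies
     * within `tolerance` of the previous one, and returns the mean of each cluster.
     */
    static List<Double> merge(double[] positions, double tolerance) {
        double[] sorted = positions.clone();
        Arrays.sort(sorted);
        List<Double> merged = new ArrayList<>();
        int i = 0;
        while (i < sorted.length) {
            double sum = sorted[i];
            int count = 1;
            // Extend the cluster while the next coordinate is close to the last one added.
            while (i + count < sorted.length
                    && sorted[i + count] - sorted[i + count - 1] <= tolerance) {
                sum += sorted[i + count];
                count++;
            }
            merged.add(sum / count);
            i += count;
        }
        return merged;
    }

    public static void main(String[] args) {
        // Three strokes around y = 100 stand in for one thick border (made-up values).
        System.out.println(merge(new double[] {100.5, 99.5, 100.0, 250.0}, 1.0));
    }
}
```

A suitable tolerance depends on the document; anything smaller than the
smallest row height or column width should keep real borders apart.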

Another, somewhat more complex, option is:
1. Remove all text on a page.
2. Render the page to a PNG.
3. Find horizontal and vertical lines in the image using a line
detection algorithm such as the Hough transform.
4. Find the intersections of the detected lines. They form a grid whose
cells you can read with PDFTextStripperByArea.
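
Step 4 can be sketched roughly as follows, assuming steps 1 to 3 have
already produced the x positions of the vertical lines and the y
positions of the horizontal lines, and that these are in the coordinate
units PDFTextStripperByArea expects (positions measured on a rendered
image would first need scaling back to page units). TableGrid and cells
are hypothetical names, and the sketch assumes a regular grid without
merged cells.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class TableGrid {

    /**
     * Builds one rectangle per table cell, row by row, from the positions
     * of the vertical lines (x) and the horizontal lines (y).
     */
    static List<Rectangle2D> cells(double[] verticalXs, double[] horizontalYs) {
        double[] xs = verticalXs.clone();
        double[] ys = horizontalYs.clone();
        Arrays.sort(xs);
        Arrays.sort(ys);
        List<Rectangle2D> cells = new ArrayList<>();
        // Each pair of neighbouring lines bounds one row or one column.
        for (int row = 0; row + 1 < ys.length; row++) {
            for (int col = 0; col + 1 < xs.length; col++) {
                cells.add(new Rectangle2D.Double(
                        xs[col], ys[row],
                        xs[col + 1] - xs[col], ys[row + 1] - ys[row]));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        // Made-up line positions: 3 vertical and 3 horizontal lines give a 2x2 table.
        for (Rectangle2D cell : cells(new double[] {50, 150, 300},
                                      new double[] {100, 120, 140})) {
            System.out.println(cell);
        }
    }
}
```

Each rectangle can then be registered with
PDFTextStripperByArea.addRegion(name, rect), and after extractRegions(page)
the text of a cell is available through getTextForRegion(name).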

BR,
Ilija.

On Wed, Jan 11, 2012 at 3:42 PM, Kevin Brown <kb1381@gmail.com> wrote:
> I have not been able to do this. I am not sure it is possible with pdfbox.
> Have you had any luck? If you have, please post?
>
> Kevin
>
> 2012/1/10 金永梁 <jyl-tiger813@163.com>
>
>> Hi all,
>>
>> I have a requirement to extract table data from PDF files. The data
>> needs to keep its structure, for example by being stored in XML format.
>>
>> How can I fulfill this?
>>
>> The main difficulty for me is that I don't know where a table begins and
>> ends. How can I judge that? According to the lines?
