pdfbox-users mailing list archives

From Ray Weidner <ray.weidner.develo...@gmail.com>
Subject Re: extracting grid lines for PDF tables
Date Thu, 26 Jan 2012 22:58:35 GMT
Thanks Ilija.  It sounds like your suggestion might be the best approach.
I was under the impression that PDF documents represented grid lines with
something like vector graphics.  I suppose there is no reason to expect
this to always be the case, and I must make allowances for image
backgrounds.  Now all I need to do is find or write some code for line
detection in images...I've spent some time looking for this, but so far, no
dice.  Java Advanced Imaging looks promising, but I'm still learning what
that's all about.  Any suggestions are welcome.
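
One starting point that needs nothing beyond the standard library is sketched below. It assumes the page has already been rendered to a BufferedImage with the text removed, and it is a plain run-length scan for long runs of dark pixels, not a Hough transform, so it only finds axis-aligned lines. The class name, method names, and the threshold and minLength parameters are all made up for illustration.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds long axis-aligned runs of dark pixels in a rendered page.
class RunLengthLineFinder {

    // A pixel counts as "dark" when the average of its R, G, B channels is below threshold.
    static boolean isDark(BufferedImage img, int x, int y, int threshold) {
        int rgb = img.getRGB(x, y);
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        return (r + g + b) / 3 < threshold;
    }

    // Returns {x1, y, x2, y} (inclusive ends) for each horizontal run of at least minLength dark pixels.
    static List<int[]> horizontalRuns(BufferedImage img, int minLength, int threshold) {
        List<int[]> runs = new ArrayList<>();
        for (int y = 0; y < img.getHeight(); y++) {
            int start = -1; // -1 means "not currently inside a run"
            for (int x = 0; x <= img.getWidth(); x++) {
                // The extra iteration at x == width closes a run that touches the right edge.
                boolean dark = x < img.getWidth() && isDark(img, x, y, threshold);
                if (dark && start < 0) {
                    start = x;
                } else if (!dark && start >= 0) {
                    if (x - start >= minLength) {
                        runs.add(new int[] {start, y, x - 1, y});
                    }
                    start = -1;
                }
            }
        }
        return runs;
    }

    // Returns {x, y1, x, y2} (inclusive ends) for each vertical run of at least minLength dark pixels.
    static List<int[]> verticalRuns(BufferedImage img, int minLength, int threshold) {
        List<int[]> runs = new ArrayList<>();
        for (int x = 0; x < img.getWidth(); x++) {
            int start = -1;
            for (int y = 0; y <= img.getHeight(); y++) {
                boolean dark = y < img.getHeight() && isDark(img, x, y, threshold);
                if (dark && start < 0) {
                    start = y;
                } else if (!dark && start >= 0) {
                    if (y - start >= minLength) {
                        runs.add(new int[] {x, start, x, y - 1});
                    }
                    start = -1;
                }
            }
        }
        return runs;
    }
}
```

A line that is several pixels thick shows up as several adjacent runs, so the results would still need to be merged before being treated as table borders. Anti-aliased or slightly skewed lines would likely need a more tolerant approach.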

Ray


On Thu, Jan 26, 2012 at 5:44 PM, Ilija Pavlic <ilija.pavlic@gmail.com> wrote:

> There is no built in functionality to retrieve tabular data with
> pdfbox because there is (usually) no table mark-up in pdf documents.
> Instead, tables are usually represented as absolutely positioned text
> and lines around that text forming the borders of the table.
>
> It is possible to find all lines forming a table. Exactly how that
> might work depends heavily on the document in question. For instance,
> some documents use three overlapping lines instead of a thick line.
> See the answer to my recent question about finding lines in a document
> on how to use pdf operators to find lines in a document. While it is
> certainly possible with pdfbox, I haven't been able to do it yet.
> Therefore I cannot give more detailed information.
>
> Another (somewhat more complex) option is:
> 1. Remove all text on a page.
> 2. Render the page to a graphic format.
> 3. Find horizontal and vertical lines in the graphic using a line
> detection algorithm such as the Hough transform.
> 4. Find the intersections of the detected lines -- they will form a
> tabular grid whose cells you can read with PDFTextStripperByArea.
>
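
The last step above could be sketched roughly as follows, assuming the simplest case where every detected line spans the whole table, so the grid is fully described by the x positions of the vertical lines and the y positions of the horizontal ones. GridCells and cells are hypothetical names, and merged or partially ruled cells are not handled.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns line positions into one rectangle per table cell.
class GridCells {

    // xs: x positions of vertical lines; ys: y positions of horizontal lines.
    // Returns cells row by row, left to right within each row.
    static List<Rectangle2D> cells(double[] xs, double[] ys) {
        double[] sx = xs.clone();
        double[] sy = ys.clone();
        Arrays.sort(sx); // sort copies so the caller's arrays are left untouched
        Arrays.sort(sy);
        List<Rectangle2D> out = new ArrayList<>();
        for (int row = 0; row + 1 < sy.length; row++) {
            for (int col = 0; col + 1 < sx.length; col++) {
                out.add(new Rectangle2D.Double(
                        sx[col], sy[row],
                        sx[col + 1] - sx[col], sy[row + 1] - sy[row]));
            }
        }
        return out;
    }
}
```

Each rectangle could then be registered with PDFTextStripperByArea.addRegion and read back with getTextForRegion after calling extractRegions on the page. Coordinates measured on a rendered image would first have to be scaled back to page units.
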
> BR,
> Ilija.
>
> On 26. 1. 2012., at 22:39, Ray Weidner wrote:
>
> > Hi,
> >
> > I'm currently using PDFBox for an application that detects table structures
> > in PDF documents.  So far, I do this by extending PDFTextStripper, and
> > using the character position and font data to heuristically detect
> > table-like text formatting.  This is working pretty well, but we want to
> > improve this, if possible, by analyzing vector graphics to detect
> > table-like grid lines.  This will definitely improve accuracy, and make it
> > easier to parse more complex table structures.
> >
> > So how can I do this, and is it even possible?  I'm not at all an expert in
> > PDFBox or the PDF standard, so I don't yet know if this can be done (for
> > instance, if table grids are usually formed from background images, this
> > is probably not feasible within our time frame).  Please bear with my
> > newbishness.
> >
> > Thanks in advance!
> >
> > Ray Weidner
>
>
