pdfbox-users mailing list archives

From Ray Weidner <ray.weidner.develo...@gmail.com>
Subject extracting grid lines for PDF tables
Date Thu, 26 Jan 2012 21:39:39 GMT
Hi,

I'm currently using PDFBox for an application that detects table structures
in PDF documents.  So far, I do this by extending PDFTextStripper, and
using the character position and font data to heuristically detect
table-like text formatting.  This is working pretty well, but we want to
improve on it, if possible, by analyzing vector graphics to detect
table-like grid lines.  That should improve accuracy and make it easier
to parse more complex table structures.

So how can I do this, and is it even possible?  I'm not at all an expert in
PDFBox or the PDF standard, so I don't yet know if this can be done (for
instance, if table grids are usually formed from background images, this
is probably not feasible within our time frame).  Please bear with my
newbishness.

Thanks in advance!

Ray Weidner
