pdfbox-users mailing list archives

From Ilija Pavlic <ilija.pav...@gmail.com>
Subject Re: extracting grid lines for PDF tables
Date Fri, 27 Jan 2012 10:40:42 GMT
Sorry, must have sent something wrong.

Most PDFs do represent grid lines with something like vectors
-- they use operators to "move a pen" and "stroke a line". JAI seems
like a bad choice to me because of difficulties with portability and
deployment. Take a look at OpenCV for line detection algorithms; most
of them are also easy to find on the web.
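
To make the "move a pen" and "stroke a line" idea concrete, here is a minimal sketch of how stroked segments could be collected from a simplified content stream. It is a toy interpreter, not PDFBox code: the class `StrokedLineSketch`, its `Segment` type and the `strokedSegments` method are hypothetical names, the input is assumed to be already decoded and whitespace-separated, and only the operators `m`, `l`, `re`, `h`, `S` and `n` are handled. The transformation matrix (`cm`), curves and filled paths are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy interpreter for a few PDF path operators; hypothetical, not a PDFBox API. */
final class StrokedLineSketch {

    /** A straight segment in untransformed user-space coordinates. */
    static final class Segment {
        final double x1, y1, x2, y2;
        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
        boolean isHorizontal(double tol) { return Math.abs(y1 - y2) <= tol; }
        boolean isVertical(double tol) { return Math.abs(x1 - x2) <= tol; }
    }

    /** Returns the segments painted by S in a whitespace-separated operator string. */
    static List<Segment> strokedSegments(String content) {
        List<Segment> stroked = new ArrayList<>();
        List<Segment> pending = new ArrayList<>();   // current path, not yet painted
        List<Double> operands = new ArrayList<>();   // numbers seen since the last operator
        double cx = 0, cy = 0, sx = 0, sy = 0;       // current point and subpath start

        for (String token : content.trim().split("\\s+")) {
            try {
                operands.add(Double.parseDouble(token));
                continue;                            // numbers are operands (postfix syntax)
            } catch (NumberFormatException notANumber) {
                // fall through: the token is an operator
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {           // move the pen
                cx = sx = operands.get(n - 2);
                cy = sy = operands.get(n - 1);
            } else if (token.equals("l") && n >= 2) {    // line to a new point
                double x = operands.get(n - 2), y = operands.get(n - 1);
                pending.add(new Segment(cx, cy, x, y));
                cx = x; cy = y;
            } else if (token.equals("re") && n >= 4) {   // rectangle: x y width height
                double x = operands.get(n - 4), y = operands.get(n - 3);
                double w = operands.get(n - 2), h = operands.get(n - 1);
                pending.add(new Segment(x, y, x + w, y));
                pending.add(new Segment(x + w, y, x + w, y + h));
                pending.add(new Segment(x + w, y + h, x, y + h));
                pending.add(new Segment(x, y + h, x, y));
                cx = sx = x; cy = sy = y;
            } else if (token.equals("h")) {              // close the subpath
                if (cx != sx || cy != sy) pending.add(new Segment(cx, cy, sx, sy));
                cx = sx; cy = sy;
            } else if (token.equals("S")) {              // stroke: the path becomes visible lines
                stroked.addAll(pending);
                pending.clear();
            } else if (token.equals("n")) {              // end the path without painting
                pending.clear();
            }
            operands.clear();                            // unknown operators are skipped
        }
        return stroked;
    }
}
```

For example, `StrokedLineSketch.strokedSegments("72 700 m 300 700 l S")` yields one horizontal segment. In a real document the operators would have to come from a proper content stream parser; newer PDFBox releases (2.x) appear to expose callbacks for these operators through `PDFGraphicsStreamEngine`, which would be the natural place to hook in logic like this.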


On Thu, Jan 26, 2012 at 11:58 PM, Ray Weidner
<ray.weidner.developer@gmail.com> wrote:
> Thanks Ilija.  It sounds like your suggestion might be the best approach.
> I was under the impression that PDF documents represented grid lines with
> something like vector graphics.  I suppose there is no reason to expect
> this to always be the case, and I must make allowances for image
> backgrounds.  Now all I need to do is find or write some code for line
> detection in images...I've spent some time looking for this, but so far, no
> dice.  Java Advanced Imaging looks promising, but I'm still learning what
> that's all about.  Any suggestions are welcome.
> Ray
> On Thu, Jan 26, 2012 at 5:44 PM, Ilija Pavlic <ilija.pavlic@gmail.com> wrote:
>> There is no built-in functionality to retrieve tabular data with
>> PDFBox because there is (usually) no table mark-up in PDF documents.
>> Instead, tables are usually represented as absolutely positioned text
>> and lines around that text forming the borders of the table.
>> It is possible to find all lines forming a table. Exactly how that
>> might work depends heavily on the document in question. For instance,
>> some documents use three overlapping lines instead of a thick line.
>> See the answer to my recent question about how to use PDF operators
>> to find lines in a document. While it is certainly possible with
>> PDFBox, I haven't been able to do it yet, so I cannot give more
>> detailed information.
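
The "three overlapping lines instead of a thick line" case suggests a clean-up pass before any grid is built. The following is a minimal sketch of one way to do that for horizontal lines, assuming they have already been extracted as (y, x1, x2) triples. The names `OverlapMergeSketch`, `HLine` and `merge`, and the tolerance parameter, are hypothetical. It is a single greedy pass, so chains of lines that only become mergeable after an earlier merge are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Collapses near-duplicate horizontal rules into single lines; hypothetical helper. */
final class OverlapMergeSketch {

    /** A horizontal line at height y spanning [x1, x2], stored with x1 <= x2. */
    static final class HLine {
        double y, x1, x2;
        HLine(double y, double x1, double x2) {
            this.y = y;
            this.x1 = Math.min(x1, x2);
            this.x2 = Math.max(x1, x2);
        }
    }

    /** Merges lines whose heights differ by at most tol and whose x ranges overlap. */
    static List<HLine> merge(List<HLine> lines, double tol) {
        List<HLine> merged = new ArrayList<>();
        for (HLine line : lines) {
            HLine target = null;
            for (HLine m : merged) {
                boolean sameHeight = Math.abs(m.y - line.y) <= tol;
                boolean xOverlap = line.x1 <= m.x2 + tol && line.x2 >= m.x1 - tol;
                if (sameHeight && xOverlap) {
                    target = m;
                    break;
                }
            }
            if (target == null) {
                merged.add(new HLine(line.y, line.x1, line.x2));  // first line of a new group
            } else {
                target.x1 = Math.min(target.x1, line.x1);         // widen the existing group
                target.x2 = Math.max(target.x2, line.x2);
            }
        }
        return merged;
    }
}
```

Vertical lines can be treated the same way with the axes swapped. A suitable tolerance depends on the document; a value close to the typical stroke width is a reasonable starting guess.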
>> Another (somewhat more complex) option is:
>> 1. Remove all text on a page.
>> 2. Render the page to a graphic format.
>> 3. Find horizontal and vertical lines in the graphic using a line
>> detection algorithm like the Hough transform.
>> 4. Find intersections of the detected lines -- they will form a tabular
>> grid from which you can read with PDFTextStripperByArea.
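
Steps 3 and 4 of the list above can be sketched roughly as follows. This is a minimal Hough transform over an already rendered and binarized page, assuming `image[y][x]` is true for dark pixels. The class `HoughGridSketch` and its methods are hypothetical names, the vote threshold is a guess the caller has to tune, and only the peaks at 0 and 90 degrees are read back, since a table grid needs just vertical and horizontal lines. There is no peak clustering, so a line several pixels thick will show up as several adjacent peaks.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal Hough transform for grid lines in a binary image; hypothetical helper. */
final class HoughGridSketch {

    /** Votes indexed as acc[thetaDegrees][rho + offset], with rho = x*cos(theta) + y*sin(theta). */
    static int[][] accumulate(boolean[][] image) {
        int height = image.length, width = image[0].length;
        int offset = (int) Math.ceil(Math.hypot(width, height));  // |rho| never exceeds the diagonal
        int[][] acc = new int[180][2 * offset + 1];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                if (!image[y][x]) continue;                       // only dark pixels vote
                for (int t = 0; t < 180; t++) {
                    double theta = Math.toRadians(t);
                    int rho = (int) Math.round(x * Math.cos(theta) + y * Math.sin(theta));
                    acc[t][rho + offset]++;
                }
            }
        }
        return acc;
    }

    /** Rho values at one angle whose vote count reaches minVotes, in ascending order. */
    static List<Integer> peaksAt(int[][] acc, int thetaDegrees, int minVotes) {
        int offset = (acc[thetaDegrees].length - 1) / 2;
        List<Integer> rhos = new ArrayList<>();
        for (int i = 0; i < acc[thetaDegrees].length; i++) {
            if (acc[thetaDegrees][i] >= minVotes) rhos.add(i - offset);
        }
        return rhos;
    }

    /** Cells {x, y, width, height} in pixels, bounded by consecutive detected grid lines. */
    static List<int[]> gridCells(boolean[][] image, int minVotes) {
        int[][] acc = accumulate(image);
        List<Integer> xs = peaksAt(acc, 0, minVotes);    // theta = 0: rho is the x of a vertical line
        List<Integer> ys = peaksAt(acc, 90, minVotes);   // theta = 90: rho is the y of a horizontal line
        List<int[]> cells = new ArrayList<>();
        for (int r = 0; r + 1 < ys.size(); r++) {
            for (int c = 0; c + 1 < xs.size(); c++) {
                cells.add(new int[] { xs.get(c), ys.get(r),
                        xs.get(c + 1) - xs.get(c), ys.get(r + 1) - ys.get(r) });
            }
        }
        return cells;
    }
}
```

Each cell would still need to be converted from pixels back to page coordinates before it could be registered as a region with PDFTextStripperByArea. Reading only two angles keeps the sketch short; a skewed scan would need a search around those angles instead.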
>> BR,
>> Ilija.
>> On 26. 1. 2012., at 22:39, Ray Weidner wrote:
>> > Hi,
>> >
>> > I'm currently using PDFBox for an application that detects table
>> > structures in PDF documents.  So far, I do this by extending
>> > PDFTextStripper, and using the character position and font data to
>> > heuristically detect table-like text formatting.  This is working pretty
>> > well, but we want to improve this, if possible, by analyzing vector
>> > graphics to detect table-like grid lines.  This will definitely improve
>> > accuracy, and make it easier to parse more complex table structures.
>> >
>> > So how can I do this, and is it even possible?  I'm not at all an expert
>> > on PDFBox or the PDF standard, so I don't yet know if this can be done
>> > (for instance, if table grids are usually formed from background images,
>> > this is probably not feasible within our time frame).  Please bear with
>> > my newbishness.
>> >
>> > Thanks in advance!
>> >
>> > Ray Weidner
