pdfbox-users mailing list archives

From Gilad Denneboom <gilad.denneb...@gmail.com>
Subject Help identifying hair-lines in PDFs using PDFBox and tabula
Date Mon, 22 May 2017 20:07:12 GMT
Hi all,

I'm trying to identify hair-lines in my PDFs. I came across tabula,
which seems to be able to do it, but I can't quite get it to work with my
files in the way I need it to, so any help is greatly appreciated!

Here's what I've been doing so far: I used the Ruling object from tabula to
extract both the horizontal and vertical rules from a stripped version of
the PDF page (i.e., after removing all the text in it).
I'm getting results, but now I want to relate them back to the original PDF
page, and that's proving difficult. If I add a text field using the
coordinates of the Ruling objects, it ends up way off from where I would
expect it to be. I think it has to do with the DPI setting used to
convert the PDF page to an image, which is necessary for the rulings
extraction.
So my question is: How can I take these Ruling objects and convert them
back to the original coordinates of the PDF?
I would also like to be able to only identify lines of a certain width and
height, but if I get the rectangles to work correctly I think I can do that
in post-processing.
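
To make the question concrete, here is a minimal sketch of the kind of
conversion being asked about. It rests on assumptions that are not confirmed
anywhere in this message: that the ruling coordinates are in image pixels
with a top-left origin, that the page was rendered at a known DPI, and that
the page is not rotated. The class and method names are hypothetical and are
not part of tabula or PDFBox, and the thickness/length thresholds are
placeholders. PDF user space uses 72 units per inch with the origin at the
bottom-left, so the sketch scales by 72/dpi and flips the y axis.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper; not part of tabula or PDFBox. */
public final class RulingCoordinates {
    /** PDF default user space has 72 units per inch. */
    private static final double POINTS_PER_INCH = 72.0;

    private RulingCoordinates() {}

    /**
     * Maps a rectangle from image space (pixels, origin top-left, y down)
     * to PDF user space (points, origin bottom-left, y up).
     * Assumes an unrotated page; the page box values would come from the
     * page's crop box (lower-left corner and height, in points).
     */
    public static Rectangle2D.Double imageToPdf(Rectangle2D imageRect,
                                                double dpi,
                                                double pageLowerLeftX,
                                                double pageLowerLeftY,
                                                double pageHeightPts) {
        double scale = POINTS_PER_INCH / dpi;
        double w = imageRect.getWidth() * scale;
        double h = imageRect.getHeight() * scale;
        double x = pageLowerLeftX + imageRect.getX() * scale;
        // Flip vertically: the bottom edge in image space becomes the
        // lower-left y in PDF space.
        double y = pageLowerLeftY + pageHeightPts
                - (imageRect.getY() * scale + h);
        return new Rectangle2D.Double(x, y, w, h);
    }

    /**
     * Post-processing filter: true if the rectangle is thin and long enough.
     * Both thresholds are in points and are placeholders to be tuned.
     */
    public static boolean isHairline(Rectangle2D pdfRect,
                                     double maxThicknessPts,
                                     double minLengthPts) {
        double thickness = Math.min(pdfRect.getWidth(), pdfRect.getHeight());
        double length = Math.max(pdfRect.getWidth(), pdfRect.getHeight());
        return thickness <= maxThicknessPts && length >= minLengthPts;
    }
}
```

For example, a 600 x 3 pixel rule at pixel position (300, 150) on a US Letter
page (612 x 792 points) rendered at 300 DPI would map to a rectangle about
144 points long and 0.72 points thick, positioned just below the top of the
page. If the rulings turn out to be in points already, only the y flip would
apply.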

Thanks in advance!
Gilad
