pdfbox-users mailing list archives

From Gilad Denneboom <gilad.denneb...@gmail.com>
Subject Re: Help identifying hair-lines in PDFs using PDFBox and tabula
Date Tue, 23 May 2017 10:02:58 GMT
PS. I'm also happy to hear any ideas on how to achieve it using PDFBox on
its own, without tabula...

On Tue, May 23, 2017 at 12:01 PM, Gilad Denneboom <gilad.denneboom@gmail.com> wrote:

> There doesn't seem to be one... I guess I can try StackOverflow.
>
> On Tue, May 23, 2017 at 11:54 AM, Andreas Lehmkühler <andreas@lehmi.de>
> wrote:
>
>> > On 22 May 2017 at 22:07, Gilad Denneboom <gilad.denneboom@gmail.com> wrote:
>> >
>> >
>> > Hi all,
>> >
>> > So I'm trying to identify hair-lines in my PDFs. I came across tabula,
>> > which seems to be able to do it, but I can't quite get it to work with
>> > my files in the way I need it to, so any help is greatly appreciated!
>> >
>> > Here's what I've been doing so far: I used the Ruling object from
>> > tabula to extract both the horizontal and vertical rules from a
>> > stripped version of the PDF page (i.e., after removing all the text
>> > in it).
>> > I'm getting results, but now I want to relate them back to the
>> > original PDF page, and that's proving difficult. If I add a text field
>> > using the coordinates of the Ruling objects, it ends up far from where
>> > I would expect it to be. I think it has to do with the DPI setting
>> > used to convert the PDF page to an image, which is necessary for the
>> > rulings extraction.
>> > So my question is: How can I take these Ruling objects and convert
>> > them back to the original coordinates of the PDF?
>> > I would also like to be able to identify only lines of a certain width
>> > and height, but if I get the rectangles to work correctly I think I
>> > can do that in post-processing.
>> Sounds like a question for the tabulapdf community ...
>>
>> Andreas
>> >
>> > Thanks in advance!
>> > Gilad
>>
>
>
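
For reference, the coordinate question quoted above can be sketched in
plain Java. This is only an illustration under stated assumptions, not
tabula's or PDFBox's own method: it assumes the detected rulings are
available as bounding rectangles in image pixels with a top-left origin
and y growing downwards, that the page was rendered at a known DPI, and
that the page is unrotated with its crop box starting at (0, 0). The
class and method names (RulingCoords, toPdfSpace, hairlines) are
hypothetical. PDF user space uses 72 units per inch with y growing
upwards, so the conversion is a scale by 72/dpi plus a vertical flip.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class RulingCoords {
    // Default PDF user space: 1 unit = 1/72 inch.
    static final double POINTS_PER_INCH = 72.0;

    /**
     * Converts a rectangle from image pixels (top-left origin, y down)
     * to PDF user space (bottom-left origin, y up, units of 1/72 inch).
     * Assumes an unrotated page whose crop box starts at (0, 0).
     */
    static Rectangle2D.Double toPdfSpace(Rectangle2D.Double px,
                                         double dpi,
                                         double pageHeightPts) {
        double scale = POINTS_PER_INCH / dpi;
        double x = px.x * scale;
        double w = px.width * scale;
        double h = px.height * scale;
        // The rectangle's bottom edge in the image becomes its lower-left y.
        double y = pageHeightPts - (px.y + px.height) * scale;
        return new Rectangle2D.Double(x, y, w, h);
    }

    /**
     * Post-processing filter: keeps rectangles that are thin enough and
     * long enough to count as hair-lines. Both thresholds are in points
     * and are arbitrary, caller-chosen values.
     */
    static List<Rectangle2D.Double> hairlines(List<Rectangle2D.Double> rects,
                                              double maxThicknessPts,
                                              double minLengthPts) {
        List<Rectangle2D.Double> kept = new ArrayList<>();
        for (Rectangle2D.Double r : rects) {
            double thickness = Math.min(r.width, r.height);
            double length = Math.max(r.width, r.height);
            if (thickness <= maxThicknessPts && length >= minLengthPts) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

If the ruling coordinates are already in points rather than pixels,
passing dpi = 72 leaves the scale at 1 and only the vertical flip is
applied. A rotated page or a crop box with a non-zero origin would need
an extra offset or rotation step that this sketch leaves out.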
