pdfbox-users mailing list archives

From jorgeeflorez <jorgeeduardoflo...@gmail.com>
Subject Re: Extracting page "correctly"
Date Mon, 05 Nov 2018 15:15:02 GMT
Hello Tilman, thanks for your reply.

That's it. I want to extract the text the way you did. You rotated 90°
clockwise because you saw the text was rotated, right?

What I get from the page is that its rotation is 0° and the TextPosition
direction is 90° (unlike the page rotation, I assume this one is measured
counter-clockwise).

So the idea would be: rotate the page until the text appears unrotated, so
that PDFTextStripper can do its best to get the text, right? I mention this
because I have been trying to get the text from the same PDF with all
possible rotations (90, 180, 270). The PDF files I receive in the system can
have any rotation on their pages and on their text.
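
A minimal sketch of that idea in plain Java, so that one rotation is picked
instead of trying all of them. It is only an illustration: the class and
method names are hypothetical, and it assumes that the per-glyph directions
are 0, 90, 180 or 270, measured counter-clockwise in the unrotated page, and
that a clockwise page rotation by the same amount makes the text upright (as
in the case above: direction 90°, page rotated 90° clockwise).

```java
import java.util.List;

/** Hypothetical helper; not part of PDFBox. */
public final class TextRotationSketch {
    private TextRotationSketch() {}

    /** Snaps an angle in degrees to the nearest of 0, 90, 180, 270. */
    public static int normalize(float degrees) {
        int quarterTurns = Math.round(degrees / 90f);
        return Math.floorMod(quarterTurns, 4) * 90;
    }

    /**
     * Returns the most frequent glyph direction on a page.
     * Ties go to the smaller angle; an empty list yields 0.
     * Assumption: this value is also the clockwise page rotation
     * that would make most of the text read upright.
     */
    public static int dominantDirection(List<Float> glyphDirections) {
        int[] counts = new int[4]; // one bucket per quarter turn
        for (float dir : glyphDirections) {
            counts[normalize(dir) / 90]++;
        }
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }
        return best * 90;
    }
}
```

With PDFBox, the directions could presumably be collected from
TextPosition.getDir() in an overridden PDFTextStripper.writeString(String,
List<TextPosition>), and the result passed to PDPage.setRotation(int) before
extracting with setSortByPosition(false). Pages with text in several
directions would still need a second look.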

Thanks.

Jorge Eduardo Flórez

On Mon, Nov 5, 2018 at 2:08, <users-digest-help@pdfbox.apache.org>
wrote:

> ---------- Forwarded message ----------
> From: Tilman Hausherr <THausherr@t-online.de>
> To: users@pdfbox.apache.org
> Cc:
> Bcc:
> Date: Sat, 3 Nov 2018 10:35:30 +0100
> Subject: Re: Extracting page "correctly"
> On 02.11.2018 at 23:37, jorgeeflorez wrote:
> >
> > The text I get is better than the first one, but it mixes the text
> > from left and right "columns" (please see the bold text).
> > My question is: is it possible to get the text as one would naturally
> > read it? i.e. the text of the left column and then the text of the
> > right column?
>
>
> Is this what you'd like to have?
>
> All I did was rotate the page 90° and then extract without sorting. It
> works because many (but not all) PDFs with columns have their text
> operators in column order.
