pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extracting page "correctly"
Date Mon, 05 Nov 2018 17:08:29 GMT
On 05.11.2018 at 16:15, jorgeeflorez wrote:
>   Hello Tilman, thanks for your reply.
>
> That's it. I want to extract the text the way you did. You rotated 90°
> clockwise because you saw the text was rotated, right?
>
> What I get from the page is that it has 0° rotation and the TextPosition
> reports 90° (unlike the page rotation, this is counter-clockwise, I
> assume).
>
> So the idea would be: Rotate the page until the text appears without
> rotation so the PDFTextStripper does its best to get the text, right? I
> mention this because I have been trying to get the text from
> the same PDF with all possible rotations (90, 180, 270). The PDF files I
> receive in the system can have any rotation on their pages and on their text.

I've been thinking about similar strategies for the same problem for 
some time but never worked on it.

So yes, we could try all 4 rotations and then see which extraction makes 
the most sense.
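
A minimal sketch of that "try all four and compare" idea, in plain Java with no PDFBox calls. The `extractor` callback stands in for whatever code sets the page rotation and runs the stripper, and `longTokenShare` is a purely hypothetical placeholder score (share of tokens with at least 3 characters); what "makes sense" really means would need a better measure, e.g. a dictionary lookup.

```java
import java.util.function.IntFunction;
import java.util.function.ToDoubleFunction;

final class BestRotation {
    /**
     * Calls the extractor for 0, 90, 180 and 270 degrees and returns the
     * rotation whose text gets the highest score (first one wins on a tie).
     */
    static int pickBest(IntFunction<String> extractor, ToDoubleFunction<String> scorer) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int rotation = 0; rotation < 360; rotation += 90) {
            double score = scorer.applyAsDouble(extractor.apply(rotation));
            if (score > bestScore) {
                bestScore = score;
                best = rotation;
            }
        }
        return best;
    }

    /** Hypothetical heuristic: fraction of whitespace-separated tokens with 3+ characters. */
    static double longTokenShare(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int longTokens = 0;
        for (String token : tokens) {
            if (token.length() >= 3) {
                longTokens++;
            }
        }
        return (double) longTokens / tokens.length;
    }
}
```

The extractor is passed in as a callback so the comparison logic can be tried out without a PDF at hand.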

Another idea that I just came up with: take the 
DrawPrintTextLocations.java example from the source code download, then 
find this line

AffineTransform at = text.getTextMatrix().createAffineTransform();

below that, add this line:

System.out.println("Angle: " + Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY())));

Then look at the output....

This gets the rotation angle, which will hopefully be one of 0, 90, 180, 
270.
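
To illustrate what that expression computes, here is a small stand-alone sketch using only `java.awt.geom.AffineTransform`. The class and method names are made up for this example; the snapping to the nearest multiple of 90 is an assumption added here, for the case where the printed value is slightly off due to rounding.

```java
import java.awt.geom.AffineTransform;

final class TextAngle {
    /** Same formula as above: rotation of the transform in degrees, roughly -180 to 180. */
    static double angleDegrees(AffineTransform at) {
        return Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleY()));
    }

    /** Snaps an angle to the nearest of 0, 90, 180, 270 (negative angles wrap around). */
    static int snapToQuadrant(double degrees) {
        int snapped = (int) Math.round(degrees / 90.0) * 90;
        return Math.floorMod(snapped, 360);
    }
}
```

For example, a transform built with `AffineTransform.getQuadrantRotateInstance(1)` and scaled by a font size of 12 still gives 90, because a uniform scale cancels out inside `atan2`.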

Now run text extraction by preparing each page with 
page.setRotation(page.getRotation()-angle);
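
One detail to watch: the angle computed above is a double and can be negative, while a page rotation should be an int and a multiple of 90. A minimal sketch of the arithmetic, with a hypothetical helper name and no PDFBox dependency:

```java
final class PageRotation {
    /**
     * Returns (pageRotation - textAngle) as one of 0, 90, 180, 270.
     * The text angle is first rounded to the nearest multiple of 90.
     */
    static int correctedRotation(int pageRotation, double textAngle) {
        int snapped = (int) Math.round(textAngle / 90.0) * 90;
        // floorMod keeps the result non-negative, unlike the % operator
        return Math.floorMod(pageRotation - snapped, 360);
    }
}
```

The result could then be passed to `page.setRotation(...)` in place of the raw difference.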

However, this won't work with fine rotations (angles that are not multiples of 90°), e.g. the file from PDFBOX-4368.

That would need something different, e.g. collecting all rotations and 
then somehow running a filtered extract for each one.
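
A rough sketch of the "collect all rotations" part only, in plain Java. `Glyph` is a hypothetical stand-in for a piece of text plus the angle computed as above, and rounding to whole degrees is an assumption made here to keep the number of groups small. How to run the actual filtered extract per group is left open.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AngleGroups {
    /** Hypothetical holder for a text fragment and its rotation in degrees. */
    static final class Glyph {
        final String text;
        final double angle;

        Glyph(String text, double angle) {
            this.text = text;
            this.angle = angle;
        }
    }

    /** Groups fragments by angle rounded to whole degrees (0..359), keeping input order. */
    static Map<Integer, String> groupByAngle(List<Glyph> glyphs) {
        Map<Integer, StringBuilder> builders = new TreeMap<>();
        for (Glyph g : glyphs) {
            int key = Math.floorMod((int) Math.round(g.angle), 360);
            builders.computeIfAbsent(key, k -> new StringBuilder()).append(g.text);
        }
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, StringBuilder> e : builders.entrySet()) {
            result.put(e.getKey(), e.getValue().toString());
        }
        return result;
    }
}
```

Each key of the returned map is one rotation that occurs on the page, so a later pass could handle one angle at a time.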

Tilman


> Thanks.
>
> Jorge Eduardo Flórez
>
> On Mon, 5 Nov 2018 at 2:08, <users-digest-help@pdfbox.apache.org>
> wrote:
>
>> users Digest 5 Nov 2018 07:08:47 -0000 Issue 1772
>>
>> Topics (messages 11288 through 11288)
>>
>> Re: Extracting page "correctly"
>>          11288 by: Tilman Hausherr
>>
>> ---------- Forwarded message ----------
>> From: Tilman Hausherr <THausherr@t-online.de>
>> To: users@pdfbox.apache.org
>> Cc:
>> Bcc:
>> Date: Sat, 3 Nov 2018 10:35:30 +0100
>> Subject: Re: Extracting page "correctly"
>> On 02.11.2018 at 23:37, jorgeeflorez wrote:
>>> The text I get is better than the first one, but it mixes the text
>>> from left and right "columns" (please see the bold text).
>>> My question is: is it possible to get the text as one would naturally
>>> read it? i.e. the text of the left column and then the text of the
>>> right column?
>>
>> Is this what you'd like to have?
>>
>> All I did was to rotate 90° and then extract without sorting. It works
>> because many (but not all) PDFs with columns have the operators in the
>> column sequence.


