pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extracting rotated text
Date Mon, 25 Sep 2017 17:55:25 GMT
On 25.09.2017 at 19:43, Allison, Timothy B. wrote:
> Thank you, Tilman.  I haven't looked yet, but to confirm, there's no page parameter that
> specifies that the text has been rotated?

Yes and no: text can be rotated through page-level rotation, but also in 
the content stream with the "cm" or "Tm" operators, and maybe others.

In your file, there is no page level rotation. It is done in the content 
stream with commands like

     0 60 -60 0 192.84 160.08 Tm

And what gets really tricky is if you have diagonal rotations or mixed 
rotations...
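
To make the matrix above concrete: for a text matrix "a b c d e f Tm", the
baseline direction (1, 0) maps to (a, b), so the rotation is atan2(b, a) and
the scale is hypot(a, b). The following is only a minimal sketch in plain
Java, not PDFBox code. The class and method names are made up for
illustration, and it ignores any "cm" transform or page-level rotation that
would combine with the text matrix.

```java
import java.util.Locale;

public class TmRotation {
    // Counter-clockwise angle of the text baseline in degrees, normalized
    // to [0, 360). a and b are the first two operands of the Tm operator.
    static double rotationDegrees(double a, double b) {
        double deg = Math.toDegrees(Math.atan2(b, a));
        return (deg % 360 + 360) % 360;
    }

    // Scale factor applied along the baseline.
    static double scale(double a, double b) {
        return Math.hypot(a, b);
    }

    // True if the angle is within 'tolerance' degrees of 0, 90, 180 or 270.
    // Diagonal text (for example 45 degrees) returns false.
    static boolean isAxisAligned(double degrees, double tolerance) {
        double r = degrees % 90;
        return r < tolerance || 90 - r < tolerance;
    }

    public static void main(String[] args) {
        // Operands of "0 60 -60 0 192.84 160.08 Tm": a = 0, b = 60.
        double a = 0, b = 60;
        double deg = rotationDegrees(a, b);
        System.out.println(String.format(Locale.ROOT,
                "rotation=%.1f scale=%.1f axisAligned=%b",
                deg, scale(a, b), isAxisAligned(deg, 0.5)));
        // prints "rotation=90.0 scale=60.0 axisAligned=true"
    }
}
```

For the matrix in this file the baseline points straight up (90 degrees).
A diagonal matrix such as "1 1 -1 1 0 0 Tm" would give 45 degrees and fail
the axis-aligned check, which is the case that cannot be fixed by rotating
the page in steps of 90.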

Tilman

>
> Back to language modeling... 😊  Thank you, again!
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Monday, September 25, 2017 1:39 PM
> To: users@pdfbox.apache.org
> Subject: Re: Extracting rotated text
>
> No good idea except to call setRotation() on the page and then do text extraction.
>
> A possible strategy might be to try all rotations and see which one yields the most known
> words.
>
> Tilman
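
The "most known words" strategy can be sketched roughly as follows. This is
plain Java and only covers the scoring step. It assumes the text for each
candidate rotation (0, 90, 180, 270) has already been extracted, for example
by setting the page rotation (PDPage.setRotation(int) in PDFBox 2.x, as far
as I know) and running the text stripper. The class name, the tiny word list
and the sample strings are all made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class RotationPicker {
    // Counts tokens of 'text' that appear in 'dictionary' (case-insensitive).
    static int knownWordCount(String text, Set<String> dictionary) {
        int count = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && dictionary.contains(token)) {
                count++;
            }
        }
        return count;
    }

    // Returns the rotation whose extracted text contains the most known
    // words, or -1 if the map is empty. Ties keep the first entry seen.
    static int bestRotation(Map<Integer, String> textByRotation, Set<String> dictionary) {
        int best = -1;
        int bestScore = -1;
        for (Map.Entry<Integer, String> e : textByRotation.entrySet()) {
            int score = knownWordCount(e.getValue(), dictionary);
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical word list; a real one would be far larger.
        Set<String> dictionary = new HashSet<>(Arrays.asList("listeria", "risk", "assessment"));

        // Hypothetical extraction results keyed by page rotation in degrees.
        Map<Integer, String> textByRotation = new LinkedHashMap<>();
        textByRotation.put(0, "FS\nIS\nL\nis\nte\nria");
        textByRotation.put(90, "FSIS Listeria Risk Assessment");

        System.out.println("best rotation: " + bestRotation(textByRotation, dictionary));
    }
}
```

A raw count favours longer outputs, so dividing by the number of tokens may
work better on real documents. Neither variant helps with diagonal or mixed
rotations on the same page.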
>
>
> On 25.09.2017 at 19:31, Allison, Timothy B. wrote:
>> Colleagues,
>> Any recommendations for extracting rotated text such as: https://www.fsis.usda.gov/wps/wcm/connect/896bf55c-0d78-44a0-adfb-94f893eb0f72/GallagherEbelKause_74.pdf?MOD=AJPERES ?
>>
>> Adobe DC gets reasonable text with "save as text".  PDFBox's ExtractText (and Tika)
>> get something like this:
>>
>> FS
>> IS
>> L
>> is
>> te
>> ria
>> Li
>> st
>> er
>> ia
>> R
>> is
>> k
>> R
>> is
>> k
>> As
>> se
>> ss
>> m
>> en
>>
>> Thank you!
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>



