pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extract Skewed Text
Date Thu, 01 Nov 2018 04:36:01 GMT
On 31.10.2018 at 22:07, Luca Loiodice wrote:
> I am using 2.0.x and have to support arbitrary input PDFs (including
> 90, 180 and 270 degree orientations, multi-column text, etc.).
>
> Everything works fine except for text on an angle.
> I came up with two passes over PDFTextStripper: one with
> SortByPosition=false to get normally oriented text, and one with
> SortByPosition=true to get text rotated by 90, 180 or 270 degrees.
> I am not sure this is correct, but it seems to work.
>
> Is there any override I could try on the 2.0.x PDFTextStripper to make
> it work?

No, nothing out of the box. The reason is that PDFBox sees each glyph
by itself. You, a human, are smarter than PDFBox and notice that
these glyphs, seemingly on different "lines", are part of one skewed line.
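
To make the point about per-glyph positions concrete, here is a minimal sketch in plain Java of how glyphs on a skewed baseline could be regrouped by hand. It is not part of PDFBox. The Glyph type, the group method and the tolerance value are hypothetical names chosen for illustration, and the sketch assumes the position and baseline angle of each glyph have already been collected (for example from TextPosition.getTextMatrix() in a PDFTextStripper subclass, which is not shown here). The idea: rotate each glyph's origin by the negative of its angle, so that glyphs sharing a skewed baseline end up with the same perpendicular coordinate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: regroup individually positioned glyphs into skewed lines. */
final class SkewedLineGrouper {

    /** One glyph as an extractor might report it (illustrative type, not a PDFBox class). */
    static final class Glyph {
        final String text;
        final double x, y;      // glyph origin in page coordinates, y pointing up
        final double angleDeg;  // baseline direction, counter-clockwise from the x axis

        Glyph(String text, double x, double y, double angleDeg) {
            this.text = text; this.x = x; this.y = y; this.angleDeg = angleDeg;
        }

        /** Position measured along the baseline direction. */
        double along() {
            double a = Math.toRadians(angleDeg);
            return x * Math.cos(a) + y * Math.sin(a);
        }

        /** Signed distance perpendicular to the baseline; constant within one line. */
        double across() {
            double a = Math.toRadians(angleDeg);
            return -x * Math.sin(a) + y * Math.cos(a);
        }
    }

    /**
     * Groups glyphs whose angles agree within 1 degree and whose perpendicular
     * offsets agree within the given tolerance, then orders each group along
     * its baseline. Angle wrap-around (359 vs. 0 degrees) is not handled.
     */
    static List<String> group(List<Glyph> glyphs, double tolerance) {
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : glyphs) {
            List<Glyph> home = null;
            for (List<Glyph> line : lines) {
                Glyph first = line.get(0);
                boolean sameAngle = Math.abs(first.angleDeg - g.angleDeg) < 1.0;
                boolean sameBaseline = Math.abs(first.across() - g.across()) < tolerance;
                if (sameAngle && sameBaseline) { home = line; break; }
            }
            if (home == null) { home = new ArrayList<>(); lines.add(home); }
            home.add(g);
        }
        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble(Glyph::along));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : line) sb.append(g.text);
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        double a = Math.toRadians(30);
        List<Glyph> glyphs = new ArrayList<>();
        String[] words = {"Text", "two"};
        for (int w = 0; w < words.length; w++) {
            for (int i = 0; i < words[w].length(); i++) {
                // Line w starts at (100, 100 - 40 * w); glyphs advance 10 units along the baseline.
                glyphs.add(new Glyph(String.valueOf(words[w].charAt(i)),
                        100 + 10 * i * Math.cos(a), 100 - 40 * w + 10 * i * Math.sin(a), 30));
            }
        }
        for (String line : group(glyphs, 2.0)) System.out.println(line);
    }
}
```

The sketch does not insert spaces between words and compares each glyph only with the first glyph of a line, so a real document would need a tolerance tuned to the font size and some gap detection along the baseline.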

Tilman

> On Wed, Oct 31, 2018 at 3:34 PM Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> It might work with 1.8. However that version has other weaknesses.
>>
>> Tilman
>>
>> On 31.10.2018 at 21:19, Luca Loiodice wrote:
>>> Is it possible to extract the 2 lines of text from this page?
>>> https://www.dropbox.com/s/2uh3p464i7iwjwv/textonanangle.pdf?dl=0
>>>
>>> These are the text lines I get using the standard PDFTextStripper:
>>>
>>> Tex
>>> t on
>>> e o
>>> n a
>>> n a
>>> ngl
>>> e
>>> Text two on an angle
>>>
>>> Thanks a lot,
>>> Luca
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>

