pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Wrong space parsed pdf
Date Thu, 25 Jan 2018 19:05:43 GMT
The font has some extremely high values in fields that we use for our
heuristics, and these are misleading the software:



I'll see if something can be done, but I suspect that it would require a
change that breaks other text extractions, so we couldn't commit it to
the repository.

Tilman

On 25.01.2018 at 15:20, Hesham Gneady wrote:
> Hello,
>
> While reading a PDF using PDFBox v2.0.8, many spaces are ignored, so
> words are merged together in the extracted text. You can test with a
> 1-page sample PDF from here:
>
> https://www.dropbox.com/s/9i9ofl3tje4iy3k/wrong_space_parsed_sample.pdf?dl=1
>
> You can see incorrectly read words such as:
>
> aboutmidnight, andbefore, CountyDonegal, ...
>
> I have tried using PDFTextStripper.setAverageCharTolerance(...) to control
> space sensitivity, but it made no difference.
>
> Any idea why this happens and how to fix it?
>
> Best regards,
>
> Hesham
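
To illustrate why an inflated font metric can hide spaces, here is a minimal sketch of a gap-based word-break heuristic. It is a simplified illustration only and is not PDFBox's actual implementation: the class `SpaceHeuristicSketch`, the `Run` type, and the numbers are all hypothetical. The idea is that a space is emitted when the horizontal gap between two text runs exceeds the font's reported space width scaled by a tolerance, so a font that reports an extremely large space width raises the threshold far above any real gap. In a scheme like this, adjusting a tolerance factor would only help if it brought the threshold back below the real gaps.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified illustration of gap-based space detection; NOT PDFBox's real code. */
class SpaceHeuristicSketch {

    /** A run of text on one line: its content, left edge, and width (same units). */
    static final class Run {
        final String text;
        final float x;
        final float width;

        Run(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** Joins runs, inserting a space when the gap exceeds spaceWidth * tolerance. */
    static String join(List<Run> runs, float spaceWidth, float tolerance) {
        float threshold = spaceWidth * tolerance;
        StringBuilder sb = new StringBuilder();
        Run prev = null;
        for (Run run : runs) {
            if (prev != null) {
                // Gap between the right edge of the previous run and this run's left edge.
                float gap = run.x - (prev.x + prev.width);
                if (gap > threshold) {
                    sb.append(' ');
                }
            }
            sb.append(run.text);
            prev = run;
        }
        return sb.toString();
    }

    /** Hypothetical sample line: two words separated by a 3-unit gap. */
    static List<Run> sampleLine() {
        return Arrays.asList(new Run("about", 0f, 25f), new Run("midnight", 28f, 40f));
    }

    public static void main(String[] args) {
        List<Run> line = sampleLine();
        // Plausible space width: threshold is 1.4, so the 3-unit gap counts as a space.
        System.out.println(join(line, 2.8f, 0.5f)); // prints "about midnight"
        // Extremely high space width: threshold is 140, so the gap is ignored.
        System.out.println(join(line, 280f, 0.5f)); // prints "aboutmidnight"
    }
}
```

In PDFBox itself, the tolerances are set through `PDFTextStripper.setSpacingTolerance(float)` and `PDFTextStripper.setAverageCharTolerance(float)`; how those values interact with the font metrics is more involved than this sketch.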

