pdfbox-users mailing list archives

From "Hesham Gneady" <heshamgne...@gmail.com>
Subject RE: Wrong space parsed pdf
Date Thu, 25 Jan 2018 20:33:18 GMT


I reported this because the PDF appeared normal to me. If there is a
way to read the text in the PDF correctly, I hope you could help me with



Best regards,




Included Message:


The font has some extremely high values that we use for our heuristics;
these are misleading the software:

I'll see if something can be done... but I suspect that it requires a change
that would break other text extractions so we can't commit it to the


On 25.01.2018 at 15:20, Hesham Gneady wrote:

Hello,
When reading a PDF with PDFBox v2.0.8, many spaces are ignored, so
words are merged together in the extracted text. You can test a 1-page
sample PDF from here:
You can see incorrectly read words like:
aboutmidnight, andbefore, CountyDonegal, ...
I have tried to use PDFTextStripper.setAverageCharTolerance(...) to control
space sensitivity but it didn't make any change.
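
To illustrate why a tolerance setting can appear to have no effect, here is a minimal plain-Java sketch of a gap-based space heuristic. It is not PDFBox's actual code: the class SpaceHeuristicSketch, the Glyph type, and the methods layout and join are hypothetical, and the single "referenceWidth" stands in for whatever font-reported value the real heuristic relies on (the thread does not say which one is inflated). The assumption is simply that a word break is emitted when the horizontal gap between two glyphs exceeds referenceWidth * tolerance.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from PDFBox.
final class SpaceHeuristicSketch {

    /** One positioned glyph: character, left edge, advance width. */
    static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /** Lays out words left to right; word breaks exist only as gaps, not as space glyphs. */
    static List<Glyph> layout(String[] words, float charWidth, float gap) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = 0f;
        for (String word : words) {
            for (char c : word.toCharArray()) {
                glyphs.add(new Glyph(c, x, charWidth));
                x += charWidth;
            }
            x += gap; // visual gap between words
        }
        return glyphs;
    }

    /**
     * Rebuilds text, inserting a space when the gap to the previous glyph
     * exceeds referenceWidth * tolerance (assumed, simplified rule).
     */
    static String join(List<Glyph> glyphs, float referenceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        float threshold = referenceWidth * tolerance;
        boolean first = true;
        float prevEnd = 0f;
        for (Glyph g : glyphs) {
            if (!first && g.x - prevEnd > threshold) {
                out.append(' ');
            }
            out.append(g.ch);
            prevEnd = g.x + g.width;
            first = false;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = layout(new String[] {"about", "midnight"}, 5f, 3f);
        System.out.println(join(glyphs, 5f, 0.5f));   // plausible font value
        System.out.println(join(glyphs, 500f, 0.5f)); // inflated font value
        System.out.println(join(glyphs, 500f, 0.1f)); // inflated value, lower tolerance
    }
}
```

In this sketch, once the reference width is inflated by a factor of 100, even a fivefold reduction of the tolerance leaves the threshold far above the real gap, so the words stay merged. Whether the same mechanism explains the behaviour in this particular PDF is an assumption.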
Any idea why this happens and how to fix it?
Best regards,

