pdfbox-users mailing list archives

From "Hesham Gneady" <heshamgne...@gmail.com>
Subject RE: Wrong space parsed pdf
Date Thu, 25 Jan 2018 20:33:18 GMT
Tilman,

I reported this because the PDF looked normal to me. If there is a way to
extract the text from this PDF correctly, I would appreciate your help with
that.

Best regards,

Hesham

----------------------------------------------------------------------------

Included Message:

 

The font has some extremely high values that we use for our heuristics, and
these are misleading the software.



I'll see if something can be done... but I suspect that it requires a change
that would break other text extractions so we can't commit it to the
repository.
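
The effect described above can be illustrated with a minimal sketch. It
assumes a simplified model in which a space is emitted whenever the gap
between two neighbouring glyphs exceeds a reference width reported by the
font, multiplied by a tolerance. This is not PDFBox's actual implementation;
the class SpaceHeuristicSketch, the Glyph type, the layOut helper and all
numeric values are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SpaceHeuristicSketch {

    /** One positioned glyph: its character, left edge and width (arbitrary text units). */
    public static final class Glyph {
        final char ch;
        final float x;
        final float width;

        Glyph(char ch, float x, float width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Hypothetical helper: places the letters of a string on a line. Spaces in the
     * input produce no glyph, only a horizontal gap, as is common in PDF content.
     */
    public static List<Glyph> layOut(String text, float glyphWidth, float wordGap) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = 0f;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                x += wordGap;
            } else {
                glyphs.add(new Glyph(c, x, glyphWidth));
                x += glyphWidth;
            }
        }
        return glyphs;
    }

    /**
     * Rebuilds the text, inserting a space whenever the gap between two glyphs
     * is larger than referenceWidth * tolerance (assumed, simplified heuristic).
     */
    public static String joinWithSpaces(List<Glyph> glyphs, float referenceWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        float threshold = referenceWidth * tolerance;
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null) {
                float gap = g.x - (previous.x + previous.width);
                if (gap > threshold) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            previous = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = layOut("and before", 5f, 3f);
        // Plausible font metric: the word gap exceeds the threshold.
        System.out.println(joinWithSpaces(line, 5f, 0.3f));
        // Extremely high font metric: no real gap can exceed the threshold.
        System.out.println(joinWithSpaces(line, 5000f, 0.3f));
        // Lowering the tolerance a hundredfold still does not help.
        System.out.println(joinWithSpaces(line, 5000f, 0.003f));
    }
}
```

In this model the first call yields "and before", while both calls with the
inflated reference width yield "andbefore". That is consistent with the
report that changing the tolerance made no difference, although the real
heuristics in PDFBox combine more inputs than this sketch does.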

Tilman

On 25.01.2018 at 15:20, Hesham Gneady wrote:

Hello,

While reading a PDF with PDFBox v2.0.8, many spaces are ignored, so words
are merged together in the extracted text. You can test with a 1-page
sample PDF from here:

https://www.dropbox.com/s/9i9ofl3tje4iy3k/wrong_space_parsed_sample.pdf?dl=1

Examples of incorrectly merged words:

aboutmidnight, andbefore, CountyDonegal, ...

I tried using PDFTextStripper.setAverageCharTolerance(...) to control
space sensitivity, but it made no difference.

Any idea why this happens and how to fix it?

Best regards,

Hesham