pdfbox-users mailing list archives

From "Hesham G." <heshamgne...@gmail.com>
Subject Re: Softhyphens / white space
Date Fri, 10 Feb 2012 14:46:41 GMT
Dirk,

Have you tried PDFTextStripper.setAverageCharTolerance(float)?
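
To illustrate why a tolerance setting like this can matter, here is a rough sketch of gap-based word splitting. It is a simplified model and not PDFBox's actual implementation: the Fragment class, the join method, and all coordinates are made up for the example. The assumption is that a space is inserted whenever the horizontal gap between two text fragments exceeds the average character width multiplied by the tolerance, so a larger tolerance should produce fewer inserted spaces.

```java
import java.util.List;

public class GapSpacingSketch {

    /** Hypothetical text fragment with its horizontal extent on the page. */
    static final class Fragment {
        final String text;
        final float startX;
        final float endX;

        Fragment(String text, float startX, float endX) {
            this.text = text;
            this.startX = startX;
            this.endX = endX;
        }
    }

    /**
     * Joins fragments left to right, inserting a space only when the gap to
     * the previous fragment exceeds averageCharWidth * tolerance.
     * This is an assumed, simplified rule for illustration only.
     */
    static String join(List<Fragment> fragments, float averageCharWidth, float tolerance) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment current : fragments) {
            if (previous != null) {
                float gap = current.startX - previous.endX;
                if (gap > averageCharWidth * tolerance) {
                    out.append(' ');
                }
            }
            out.append(current.text);
            previous = current;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up layout: "Me", "di", "zin" separated by gaps of 2 units,
        // such as an invisible soft hyphen might leave behind.
        List<Fragment> word = List.of(
                new Fragment("Me", 0f, 12f),
                new Fragment("di", 14f, 22f),
                new Fragment("zin", 24f, 40f));
        float averageCharWidth = 5f;

        // Threshold 1.5 < gap 2, so spaces are inserted.
        System.out.println(join(word, averageCharWidth, 0.3f)); // prints "Me di zin"
        // Threshold 2.5 > gap 2, so the word stays whole.
        System.out.println(join(word, averageCharWidth, 0.5f)); // prints "Medizin"
    }
}
```

If the real stripper behaves along these lines, raising the value passed to setAverageCharTolerance(float) may keep words like Medizin together, at the risk of merging words that are genuinely separate, so it is worth testing a few values on your own PDFs.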



Best regards,
Hesham


---------------------------------------------
Included message :

> Hello,
>
> I use pdfbox 1.6.0 to extract text from PDFs, which often works fine.
>
> Unfortunately, it seems to insert a space character when there are
> soft hyphens in the content of the PDF.
> As a result, the extracted text is sometimes very fragmented. For example, the word
> Medizin is extracted as Me di zin.
> I also tried setting the new option "parser.setEnableAutoSpace(false);",
> but this had no effect on the output.
>
> Does anyone have a suggestion for how to extract the content of PDFs containing
> soft hyphens without fragmenting it?
>
> As I use the output of pdfbox for searching with Apache Solr, my search
> results sometimes get very strange...
>
> Best regards
> Dirk
> 
