pdfbox-users mailing list archives

From Dirk Högemann <dirk.hoegem...@googlemail.com>
Subject Re: Softhyphens / white space
Date Fri, 10 Feb 2012 15:27:28 GMT
Hi,
I have now tried it, but without success.
I experimented with the following settings (with varying values):

textStripper.setSpacingTolerance(0.5f);
textStripper.setAverageCharTolerance(0.3f);

What could be reasonable values? I also tried 0.999 for both.
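
If the tolerance settings turn out not to help, the extracted string could be post-processed instead. The following is only a minimal sketch under a loud assumption: that the soft hyphen survives extraction as the character U+00AD, directly followed by the spurious space. Nothing in this thread confirms that; if the stripper drops the soft hyphen and leaves only the space, this does nothing useful. The class and method names (SoftHyphenCleaner, clean) are hypothetical and not part of pdfbox.

```java
import java.util.regex.Pattern;

public class SoftHyphenCleaner {
    // Regex: one soft hyphen (code point U+00AD) plus any whitespace after it.
    // ASSUMPTION: the soft hyphen is still present in the extracted text.
    private static final Pattern SOFT_HYPHEN = Pattern.compile("\\u00AD\\s*");

    // Removes every soft hyphen together with the whitespace that follows it.
    public static String clean(String extracted) {
        return SOFT_HYPHEN.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input mimicking the fragmented output "Me di zin".
        System.out.println(clean("Me\u00AD di\u00AD zin")); // prints "Medizin"
    }
}
```

The result of textStripper.getText(...) would be passed through clean(...) before indexing. Note that this also joins words that were hyphenated across a line break, which is probably what a search index wants.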

Thanks so far
Dirk

2012/2/10 Hesham G. <heshamgneady@gmail.com>

> Dirk,
>
> Did you try PDFTextStripper.setAverageCharTolerance(float)?
>
>
>
> Best regards,
> Hesham
>
>
> ---------------------------------------------
> Included message:
>
>
>> Hello,
>>
>> I use pdfbox 1.6.0 to extract text from PDFs, which often works fine.
>>
>> Unfortunately, it seems to insert a space character wherever there is a
>> soft-hyphen in the content of the PDF.
>> As a result, the extracted text is sometimes very fragmented. For example,
>> the word Medizin is extracted as Me di zin.
>> I also tried setting the new option "parser.setEnableAutoSpace(false);",
>> but this had no effect on the output.
>>
>> Does anyone have a suggestion for how to extract the content of PDFs
>> containing soft-hyphens without fragmenting it?
>>
>> As I use the output of pdfbox for searching with Apache Solr, my search
>> results are sometimes very strange...
>>
>> Best regards
>> Dirk
>>
>>
