lucene-solr-user mailing list archives

From Jan Høydahl <jan....@cominvent.com>
Subject Re: Solr / Tika Integration
Date Fri, 10 Feb 2012 11:11:02 GMT
I think you need to control the "enableAutoSpace" parameter in PDFBox. There's a JIRA issue for it,
but it depends on some Tika 1.1 changes, as far as I can understand.

https://issues.apache.org/jira/browse/SOLR-2930
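
Until that setting can be controlled from Solr, one possible stopgap is to clean the extracted text before it is indexed. The following is only a minimal sketch of such post-processing, not the enableAutoSpace fix itself. It assumes the extracted text still contains the soft hyphen character (U+00AD) followed by the inserted space, which should be checked against real output first. The class name SoftHyphenCleaner is hypothetical and is not part of Solr, Tika, or PDFBox.

```java
import java.util.regex.Pattern;

// Hypothetical helper: removes soft hyphens (U+00AD) together with any
// whitespace that directly follows them, e.g. "Me\u00AD di\u00AD zin" -> "Medizin".
final class SoftHyphenCleaner {

    // Assumption: the extractor leaves U+00AD in the text and adds the space after it.
    private static final Pattern SOFT_HYPHEN_GAP = Pattern.compile("\u00AD\\s*");

    private SoftHyphenCleaner() {
    }

    static String clean(String extracted) {
        if (extracted == null) {
            return null;
        }
        return SOFT_HYPHEN_GAP.matcher(extracted).replaceAll("");
    }

    public static void main(String[] args) {
        // Spaces that do not follow a soft hyphen are left untouched.
        System.out.println(clean("Me\u00AD di\u00AD zin ist wichtig"));
    }
}
```

If the soft hyphens are already dropped during extraction and only the spaces remain, this approach cannot tell real word boundaries from inserted ones, so the extraction setting above would be the only reliable fix.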

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 10. feb. 2012, at 11:21, Dirk Högemann wrote:

> Hello,
> 
> we use Solr 3.5 and Tika to index a lot of PDFs. The content of those PDFs
> is searchable via a full-text search.
> Also the terms are used to make search suggestions.
> 
> Unfortunately, PDFBox seems to insert a space character wherever there is a
> soft hyphen in the content of the PDF.
> Thus the extracted text is sometimes very fragmented. For example the word
> Medizin is extracted as Me di zin.
> As a consequence the suggestions are often unusable and the search does not
> work as expected.
> 
> Does anyone have a suggestion for how to extract the content of PDFs containing
> soft hyphens without fragmenting it?
> 
> Best
> Dirk

