lucene-solr-user mailing list archives

From Olivier.Mass...@real.lu
Subject Re: Problem indexing email attachments
Date Wed, 23 Apr 2014 14:49:38 GMT
As I said, it is not a problem in the Tika library ;)

I have tried with Tika 1.5 jars and it gives the same results.



Guido Medina <guido.medina@temetra.com> wrote on 23/04/2014 16:15:11:

> From: Guido Medina <guido.medina@temetra.com>
> To: solr-user@lucene.apache.org
> Date: 23/04/2014 16:15
> Subject: Re: Problem indexing email attachments
> 
> We massage solr.war ourselves and drop in our own updated jars; maybe
> this helps:
> 
> http://www.apache.org/dist/tika/CHANGES-1.5.txt
> 
> We are using Tika 1.5 inside Solr with POI 3.10-Final, etc.
> 
> Guido.
> 
> On 23/04/14 14:38, Olivier.Masseau@real.lu wrote:
> > Hello,
> >
> > I'm trying to index email files with Solr (4.7.2)
> >
> > The files have the extension .eml (message/rfc822)
> >
> > The mail body is correctly indexed but attachments are not indexed if they
> > are not .txt files.
> >
> > If attachments are .txt files it works, but if attachments are .pdf or
> > .docx files they are not indexed.
> >
> >
> >
> > I checked the extracted text by calling:
> >
> > curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true&extractOnly=true&extractFormat=text" -F "myfile=@Test1.eml"
> >
> > The returned extracted text does not contain the content of the
> > attachments if they are not .txt files.
> >
> >
> > It is not a problem with the Apache Tika library not being able to process
> > attachments, because running the standalone Apache Tika app by calling:
> >
> >
> > java -jar tika-app-1.4.jar -t Test1.eml
> >
> >
> > on my eml files correctly displays the attachments' text.
> >
> >
> >
> > Maybe it is a problem with how Tika is called by Solr?
> >
> > Is there something to modify in the default configuration?
> >
> >
> > Thanks for any help ;)
> > 
> > Olivier
> 
