lucene-solr-user mailing list archives

From cloax <...@joekondel.com>
Subject ExtractRequestHandler - not properly indexing office docs?
Date Fri, 19 Jun 2009 23:20:14 GMT

Hi there, 

I've got a Solr instance running and am feeding it rich binary documents to
index from a Django application. The setup works fine with PDFs and similar
formats, but no matter which kind of MS Word document (doc or docx) I feed it,
content-related queries return no results.

I've curl'd with extract.only to verify that Solr (and Tika) can extract
the contents, and it happily spits the extracted XHTML back to me.
That content never seems to find its way into the ext.def.fl that I have
specified.
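
A minimal sketch of the two requests being compared above, assuming the extraction handler is mapped at /update/extract on a local Solr and that the default field is called "text" (both are placeholders, not taken from the actual setup). The parameter names are the ones mentioned in this message; they differ between Solr versions, so check them against the handler configuration in solrconfig.xml. The code only builds the URLs and sends nothing.

```python
from urllib.parse import urlencode

# Assumed values: adjust to the actual Solr host and handler mapping.
SOLR_BASE = "http://localhost:8983/solr"
EXTRACT_PATH = "/update/extract"  # assumed path, defined in solrconfig.xml


def build_extract_url(extract_only, default_field=None, base=SOLR_BASE):
    """Build the URL for an extraction request (hypothetical helper).

    Parameter names follow the ones mentioned in this message
    (extract.only, ext.def.fl) and may not match every Solr version.
    """
    params = {}
    if extract_only:
        # Ask Solr to return the extracted XHTML instead of indexing it.
        params["extract.only"] = "true"
    if default_field is not None:
        # Field meant to receive extracted content that has no other mapping.
        params["ext.def.fl"] = default_field
    query = urlencode(params)
    return base + EXTRACT_PATH + ("?" + query if query else "")


# Request that only checks extraction, as done with curl above.
check_url = build_extract_url(extract_only=True)
# Request that indexes the document; "text" is a placeholder field name.
index_url = build_extract_url(extract_only=False, default_field="text")
print(check_url)
print(index_url)
```

Comparing the two URLs side by side makes it easier to spot a parameter that is present in the extraction check but missing or misspelled in the indexing request.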

When I search for terms specific to the content of those documents, I get
zero hits. However, I do get hits on metadata-related queries (e.g. I store
the username of whoever uploaded the document).

Is there some magical bit I forgot to flip?

cheers,
joe
-- 
View this message in context: http://www.nabble.com/ExtractRequestHandler---not-properly-indexing-office-docs--tp24120125p24120125.html
Sent from the Solr - User mailing list archive at Nabble.com.

