lucene-solr-user mailing list archives

From "SDIS M. Beauchamp" <>
Subject RE: solr - other document formats
Date Wed, 14 Nov 2007 07:43:57 GMT
You should take a look at

It gives you a starting point to make the extractor you need 
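
Whatever tool does the extraction, its last step is to wrap the extracted text and metadata in Solr's XML update format. Here is a minimal sketch of that step in Python, assuming the extractor has already produced plain strings. The helper name build_solr_add_xml and the field names (id, title, text) are made up for illustration; real field names have to match your schema.xml.

```python
import xml.etree.ElementTree as ET


def build_solr_add_xml(doc_fields):
    """Build a Solr <add><doc>...</doc></add> update message from a dict.

    doc_fields maps a field name to a string, or to a list of strings
    (a list produces one <field> element per value).
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(v)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Hypothetical output of a PDF/Word/Excel extractor (assumed values).
extracted = {
    "id": "report-2007-11.pdf",
    "title": "Quarterly report",
    "text": "Plain text pulled out of the PDF by the extractor <draft>",
}

xml_message = build_solr_add_xml(extracted)
print(xml_message)
```

The resulting string is the kind of message you would POST to Solr's XML update handler, followed by a commit. The extraction itself (PDF, Word, Excel to text) stays outside Solr, in whatever parser you pick.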



-----Original Message-----
From: Dwarak R []
Sent: Wednesday, 14 November 2007 05:17
To:
Subject: solr - other document formats

Hey All

I read an article on

It states that

"As we've seen, the XML format used by Solr for indexing is quite simple. Extracting the relevant
metadata to create these XML documents from the many formats floating around, however, is
another story. Fortunately, Lucene users have the same problem and have been working on it
for quite a while; the Lucene FAQ lists a number of references to parsers and filters which
can be used to extract content and metadata from many common document formats. 
Solr won't index spreadsheets or other formats out of the box, but that is not its role: you
should see Solr as the "search engine" component of a broader "search system," where extraction
of content and metadata is handled by other components. This will help to keep your search
system maintainable and testable, and it helps the Solr team focus on doing one thing well."

So parsing documents such as PDF, MS Word, and Excel files into XML has to be done by another component?

Could somebody advise?


Dwarak R

