lucene-solr-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: document support for file system crawling
Date Wed, 30 Aug 2006 17:20:29 GMT

: the text out of these types of documents.  You could borrow the
: document parsing pieces from Lucene's contrib and Nutch and glue them
: together into your client that speaks to Solr, or perhaps Solr isn't
: the right approach for your needs?   It certainly is possible to add
: these capabilities into Solr, but it would be awkward to have to
: stream binary data into XML documents such that Solr could parse them
: on the server side.

Agreed.  Solr's focus is on indexing "Structured Data".  The support for
dynamic fields certainly allows you to deal with complex structured data,
and somewhat heterogeneous structured data -- but it's still structured
data.  If your goal is to do a lot of crawling of disparate physical
documents, extract the text, and build a "path,title,content" index
then Nutch is probably your best bet.
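
For anyone who still wants to try the client-side route described in the
quoted text, here is a rough sketch of what such a client might look like.
It only handles plain-text files; real parsing of PDF, Word, etc. would
have to come from whatever parser you plug in.  The field names "path",
"title" and "content" are assumptions taken from the index layout mentioned
above and would need matching entries in your schema.  The function names,
the ".txt" filter and the update URL are also hypothetical, so adjust them
to your install.

```python
import os
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(root_dir, extensions=(".txt",)):
    """Walk root_dir and return a Solr <add> message, one <doc> per file.

    Assumes the schema defines "path", "title" and "content" fields.
    """
    add = ET.Element("add")
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if not name.lower().endswith(extensions):
                continue  # skip anything we cannot read as plain text
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                content = fh.read()
            doc = ET.SubElement(add, "doc")
            fields = (
                ("path", path),
                ("title", os.path.splitext(name)[0]),  # file name as title
                ("content", content),
            )
            for field_name, value in fields:
                field = ET.SubElement(doc, "field", name=field_name)
                field.text = value  # ElementTree escapes &, < and >
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_text, url="http://localhost:8983/solr/update"):
    """POST the message to Solr. The default URL is an assumption."""
    request = urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Calling post_to_solr(build_add_xml("/some/dir")) and then sending a
separate <commit/> message would make the documents searchable.  Note
that the text extraction happens entirely in the client, so no binary
data ever has to be streamed into the XML.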


-Hoss

