nutch-user mailing list archives

From Andrzej Bialecki
Subject Re: Storing full HTML with nutch/solrindexer.
Date Mon, 09 Feb 2009 16:36:26 GMT
Felix Zimmermann wrote:
> Hi,
> I use the latest Nutch-trunk with "solrindex" (nutch for crawling and solr
> for searching). My Question is: How can I store the native content of
> html-pages including all tags in e.g. the Solr-field "caching"? While
> indexing, the field remains empty, all other fields like "title" or
> "content" works well.

Currently this is not possible out of the box; it would require some
changes to the indexer. Namely, the Content would have to be added as
one of the indexer's inputs and passed along in NutchDocument (which
currently handles only String values, while Content uses byte[] for its
payload). The raw content would then have to be either turned into a
String, or passed as is, assuming you have added a BinaryFieldType
extension to your Solr ...

So, it's possible to do it but it's not a simple config switch.
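
To illustrate only the "turn the raw content into a String" step, here is
a minimal sketch in plain Java. It is not Nutch code: the class
RawContentDecoder and its decode method are hypothetical names, and it
assumes you already have the page bytes and a charset name from somewhere
(e.g. the HTTP headers), falling back to UTF-8 when the name is missing or
unknown. Wiring this into the indexer is the part that still needs the
changes described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch or Solr.
final class RawContentDecoder {

    // Decode raw page bytes into a String so they fit a String-only field.
    // charsetName may be null; UTF-8 is an assumed fallback, not a Nutch default.
    static String decode(byte[] raw, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        if (charsetName != null) {
            try {
                charset = Charset.forName(charsetName);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Malformed or unsupported charset name: keep the UTF-8 fallback.
            }
        }
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Simulated raw payload of a Latin-1 encoded page.
        byte[] raw = "<html><body>caf\u00e9</body></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(raw, "ISO-8859-1"));
    }
}
```

Decoding with the wrong charset silently corrupts non-ASCII characters,
which is one reason passing the bytes through unchanged to a binary field
may be the safer of the two options.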

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
