nutch-user mailing list archives

From Marek Bachmann <m.bachm...@uni-kassel.de>
Subject Re: why nutch 1.4 don't set the origin html content field in solrindexer
Date Wed, 28 Dec 2011 14:44:40 GMT
Hey ho,

I think the question was why only the PARSED content is in the content
field.

As I understand it, Cube wants the raw page content to be stored
and/or indexed.

Cube, what do you need the raw content for? It is possible to add it
to Solr, even to index it in the content field. But I am not sure
whether that makes sense, because I don't know what you want to do. :)

On 28.12.2011 15:35, Markus Jelsma wrote:
> Check your Solr schema; it's likely set not to store.
> 
>> When I use the solrindex command to post the crawled content, I can find the
>> content field, which holds the parsed text. Why is there no raw content field?

