nutch-user mailing list archives

From Marek Bachmann
Subject Re: why nutch 1.4 don't set the origin html content field in solrindexer
Date Wed, 28 Dec 2011 14:44:40 GMT
Hey ho,

I think the question was why only the PARSED content is in the content
field.

As I understand it, Cube wants the raw page content to be stored
and/or indexed.

Cube, what do you need the raw content for? It is possible to add it
to Solr, and even to index it in the content field. But I am not sure
whether that makes sense, because I don't know what you want to do. :)

On 28.12.2011 15:35, Markus Jelsma wrote:
> Check your Solr schema; it's likely set not to store.
>> When I use the solrindex command to post the crawled content, I can find
>> the content field, which is the parsed text. Why is there no raw content field?
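
For reference, the schema setting Markus mentions would live in Solr's
schema.xml. A minimal sketch, assuming the field is named "content" and
uses a field type named "text" (both are assumptions; check the names in
your own schema):

```xml
<!-- Sketch only: field name and type are assumptions, not taken from this thread. -->
<!-- stored="true" makes Solr return the field value in search results;
     indexed="true" makes it searchable. -->
<field name="content" type="text" stored="true" indexed="true"/>
```

Note that this only controls whether the text sent to Solr is stored. It
would still be the parsed text, not the raw HTML of the page.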
