nutch-user mailing list archives

From Marek Bachmann <m.bachm...@uni-kassel.de>
Subject Re: why nutch 1.4 don't set the origin html content field in solrindexer
Date Wed, 28 Dec 2011 18:47:42 GMT
On 28.12.2011 15:56, Cube Agen wrote:
> Thanks, that is my question.
> 
> If I want to make an HTML snapshot, how should I do it? Modify the SolrIndexer
> and IndexerMapReduce?
Hmm, I have never used an HTML snapshot, just a textual one.

I don't know what the best practice would be. If you store the raw fetched
HTML in Solr, you somehow have to select the right part of it, where the
terms appear.
I am not very familiar with Solr yet, so I can't tell you what would be
the best way to do that.

But in theory:

I think you can take the raw fetched HTML content from the segments and
add it to the Solr index in its own field by implementing an IndexingFilter.
I think you can get the fetched content through the CrawlDatum object
(please correct me if I am wrong ;-) ).
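
As a rough sketch of that idea, the Java below uses small stand-in types instead of the real Nutch classes, so it runs on its own. `Doc`, `Filter`, `RawHtmlFilter` and the field name `rawcontent` are all hypothetical names chosen for illustration; how the raw bytes actually reach an indexing filter in Nutch 1.4 is left open here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RawHtmlFieldSketch {

    // Hypothetical stand-in for the document that gets sent to Solr:
    // a map from field name to the values added under that name.
    static class Doc {
        private final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }

        List<String> get(String name) {
            return fields.getOrDefault(name, Collections.emptyList());
        }
    }

    // Hypothetical stand-in for an indexing filter: it receives the document
    // built so far plus the raw fetched bytes and returns the document.
    interface Filter {
        Doc filter(Doc doc, byte[] rawContent);
    }

    // Adds the raw HTML under its own field ("rawcontent" is an assumed name)
    // and leaves the parsed-text "content" field untouched.
    static class RawHtmlFilter implements Filter {
        static final String FIELD = "rawcontent";

        @Override
        public Doc filter(Doc doc, byte[] rawContent) {
            if (rawContent != null && rawContent.length > 0) {
                // Assumes UTF-8; a real page may declare another charset.
                doc.add(FIELD, new String(rawContent, StandardCharsets.UTF_8));
            }
            return doc;
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc();
        doc.add("content", "Hello world"); // parsed text, as indexed today
        byte[] raw = "<html><body><p>Hello world</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        new RawHtmlFilter().filter(doc, raw);

        System.out.println(doc.get("content"));
        System.out.println(doc.get(RawHtmlFilter.FIELD));
    }
}
```

For the raw value to come back in search results, the new field would also have to be declared as stored in the Solr schema, which is the same point Markus raises about the content field.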

I recently used these tutorials:

http://shuyo.wordpress.com/2011/01/04/how-to-develop-apache-nutch%E2%80%99s-plugin-4-indexingfilter-extension-point/
and
http://wiki.apache.org/nutch/WritingPluginExample

to add some fields of my own to the Solr index. But in that case I used
data from the parse object.

Is the original HTML mark-up important to you?

> 
> 
> 2011/12/28 Marek Bachmann <m.bachmann@uni-kassel.de>
> 
>> Hey ho,
>>
>> I think the question was why only the PARSED content is in the content
>> field.
>>
>> As I understand it, Cube wants the raw page content to be
>> stored and/or indexed.
>>
>> Cube, what do you need the raw content for? It is possible to add it
>> to Solr, even to index it in the content field. But I am not sure whether it
>> makes sense, because I don't know what you want to do. :)
>>
>> On 28.12.2011 15:35, Markus Jelsma wrote:
>>> Check your Solr schema; it's likely set not to store.
>>>
>>>> When I use the solrindex command to post the crawled content, I can find the
>>>> content field, which holds the parsed text. Why is there no raw content field?
>>
>>
> 

