nutch-user mailing list archives

From Cube Agen <agen....@gmail.com>
Subject Re: why nutch 1.4 don't set the origin html content field in solrindexer
Date Thu, 29 Dec 2011 00:21:58 GMT
Thanks Marek.

The snapshot function is not so important for me.

I read the solrindex code in Nutch. It does not add the segment's raw
content as a job input, so the raw content is never handled.

I think there are two ways:

1. Modify IndexerMapReduce: add a new input path such as
   "FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME))"
   in the initMRJob method, and modify the reduce method to add
       if (value instanceof Content) {
           raw_content = ((Content) value).getContent();
       }

2. Just as Marek said, do something in a filter (you may have to read the
   raw HTML from the segment content yourself).
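
A rough sketch of the reduce-side idea in option 1, written without any
Nutch or Hadoop dependencies so it can be run on its own. RawContent and
ParsedText are hypothetical stand-ins for Nutch's Content and parsed-text
values, the "raw_content" field name is just the variable name used above,
and decoding the bytes as UTF-8 is an assumption (real pages would need
charset detection). It only shows the instanceof dispatch, not the actual
IndexerMapReduce code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RawContentSketch {

    // Hypothetical stand-in for Nutch's Content: holds the raw fetched bytes.
    static final class RawContent {
        private final byte[] bytes;
        RawContent(byte[] bytes) { this.bytes = bytes; }
        byte[] getContent() { return bytes; }
    }

    // Hypothetical stand-in for the parsed text that is already indexed.
    static final class ParsedText {
        private final String text;
        ParsedText(String text) { this.text = text; }
        String getText() { return text; }
    }

    // Mimics a reduce call: all values for one URL arrive with mixed types,
    // and each one is picked out by its type.
    static Map<String, String> reduce(String url, List<Object> values) {
        String parsedText = null;
        byte[] rawContent = null;
        for (Object value : values) {
            if (value instanceof ParsedText) {
                parsedText = ((ParsedText) value).getText();
            } else if (value instanceof RawContent) {
                rawContent = ((RawContent) value).getContent();
            }
        }
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        if (parsedText != null) {
            doc.put("content", parsedText);
        }
        if (rawContent != null) {
            // Assumption: the page is UTF-8 encoded.
            doc.put("raw_content", new String(rawContent, StandardCharsets.UTF_8));
        }
        return doc;
    }

    public static void main(String[] args) {
        byte[] html = "<html>Hello</html>".getBytes(StandardCharsets.UTF_8);
        List<Object> values = Arrays.<Object>asList(
                new ParsedText("Hello"), new RawContent(html));
        System.out.println(reduce("http://example.com/", values));
    }
}
```

If no raw content value arrives for a URL (for example, the extra input
path was not added), the sketch simply leaves the raw_content field out.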

This is my first time using Nutch and Solr, so I don't know which is better.

Thanks all.



2011/12/29 Marek Bachmann <m.bachmann@uni-kassel.de>

> On 28.12.2011 15:56, Cube Agen wrote:
> > Thanks, that is my question.
> >
> > If I want to make an HTML snapshot, what should I do? Modify
> > SolrIndexer and IndexerMapReduce?
> Hmm, I have never used an HTML snapshot, just a textual one.
>
> I don't know what the best practice would be. If you store the raw fetched
> HTML in Solr, you have to somehow select the right part of it where the
> terms appear.
> I am not very familiar with solr yet so I can't tell you what would be
> the best way to do that.
>
> But in theory:
>
> I think you can use the fetched raw HTML content from the segments and
> add it to the Solr index in its own field by extending an IndexingFilter.
> I think you can get the fetched content through the CrawlDatum object
> (Please correct me if I am wrong ;-) )
>
> I recently used these tutorials:
>
>
> http://shuyo.wordpress.com/2011/01/04/how-to-develop-apache-nutch%E2%80%99s-plugin-4-indexingfilter-extension-point/
> and
> http://wiki.apache.org/nutch/WritingPluginExample
>
> for adding some custom fields to the Solr index. But in that case I used
> data from the parse object.
>
> Is the original HTML mark-up important for you?
>
> >
> >
> > 2011/12/28 Marek Bachmann <m.bachmann@uni-kassel.de>
> >
> >> Hey ho,
> >>
> >> I think the question was why only the PARSED content is in the content
> >> field.
> >>
> >> As I understand it, Cube wants the raw page content to be
> >> stored and / or indexed.
> >>
> >> Cube, what do you need the raw content for? It is possible to add it
> >> to solr, even to index it in the content field. But I am not sure if it
> >> makes sense because I don't know what you want to do. :)
> >>
> >> On 28.12.2011 15:35, Markus Jelsma wrote:
> >>> Check your Solr schema; it's likely set not to store.
> >>>
> >>>> When I use the solrindex command to post the crawled content, I can find
> >>>> the content field, which is the parsed text. Why is there no raw content field?
> >>
> >>
> >
>
>
