nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Merging content of multiple pages into one single Solr document
Date Thu, 24 Nov 2011 15:51:40 GMT
Hi,

If you need to merge different NutchDocument objects into a single SolrDocument,
you need to partially redesign IndexerMapReduce and create some composite key.
The goal is to get the same map/reduce key for all documents that belong
together and process them accordingly in the reducer, where you can merge them.

This is not going to be easy if you're not familiar with map/reduce
programming. In the mapper you must check each input object for a marker that
tells you it belongs to a group, and emit one key, unique to that group, for
all objects in it. In the reducer you can then process them together. The hard
part is determining whether the various input objects are part of a group,
because you get ParseData, ParseText and CrawlDatum objects.
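
To make the idea concrete, here is a minimal sketch of the grouping logic in
plain Java. It only simulates the map, shuffle and reduce steps in memory and
does not use the Hadoop or Nutch APIs. The Page class, the "item.group" marker
and all method names are hypothetical, and it assumes your custom parsers
already write such a marker into each page's metadata. In the real
IndexerMapReduce you would have to derive the marker from the ParseData,
ParseText and CrawlDatum values instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupMergeSketch {

    // Hypothetical metadata key that a custom parser would set on every
    // page belonging to the same item (main page and sub-pages).
    static final String GROUP_MARKER = "item.group";

    // Hypothetical stand-in for one parsed page: URL plus extracted fields.
    static final class Page {
        final String url;
        final Map<String, String> fields = new LinkedHashMap<>();

        Page(String url) {
            this.url = url;
        }

        // Convenience factory: of(url, key1, value1, key2, value2, ...)
        static Page of(String url, String... keyValues) {
            if (keyValues.length % 2 != 0) {
                throw new IllegalArgumentException("keyValues must come in pairs");
            }
            Page page = new Page(url);
            for (int i = 0; i < keyValues.length; i += 2) {
                page.fields.put(keyValues[i], keyValues[i + 1]);
            }
            return page;
        }
    }

    // "Map" step: pages carrying the marker get the group id as key,
    // all other pages keep their own URL as key.
    static String mapKey(Page page) {
        String group = page.fields.get(GROUP_MARKER);
        return group != null ? group : page.url;
    }

    // Simulated shuffle: collect all pages that share a key.
    static Map<String, List<Page>> shuffle(List<Page> pages) {
        Map<String, List<Page>> groups = new LinkedHashMap<>();
        for (Page page : pages) {
            groups.computeIfAbsent(mapKey(page), k -> new ArrayList<>()).add(page);
        }
        return groups;
    }

    // "Reduce" step: merge the fields of every page in one group into a
    // single document. On a field clash the first value seen wins.
    static Map<String, String> reduce(List<Page> group) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Page page : group) {
            for (Map.Entry<String, String> field : page.fields.entrySet()) {
                if (!GROUP_MARKER.equals(field.getKey())) {
                    merged.putIfAbsent(field.getKey(), field.getValue());
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(Page.of("http://example.com/item/1",
                GROUP_MARKER, "item-1", "title", "Widget"));
        pages.add(Page.of("http://example.com/about", "title", "About"));
        pages.add(Page.of("http://example.com/item/1/specs",
                GROUP_MARKER, "item-1", "specs", "10 cm"));

        // One merged document is printed per key.
        for (Map.Entry<String, List<Page>> group : shuffle(pages).entrySet()) {
            System.out.println(group.getKey() + " -> " + reduce(group.getValue()));
        }
    }
}
```

Note that this only merges pages that reach the same indexing job. Pages
fetched in different runs only end up in the same reducer if the indexing job
reads all the relevant segments at once.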

However, there may be another user with a better idea ;)

Cheers,


On Thursday 24 November 2011 15:37:04 Jose Gil wrote:
> Hi,
> 
> we are crawling a site which splits the content about a single item across
> a main page and several sub-pages.
> 
> We've written custom parsers to extract the necessary data from each of
> those pages, but we can't think of a clean way to merge all that into one
> single document for indexing in Solr -- especially taking into account that
> Nutch doesn't provide guarantees that the sub-pages will be parsed just
> after the main page, or even in the same run.
> 
> One solution would be to store each of the information fragments as a
> separate document in Solr and run a batch process to merge them together
> and store the "complete" document -- but we would really prefer Nutch to
> index the complete document in the first place and not have the noise of
> incomplete documents in Solr.
> 
> Ideas welcome.
> 
> Thanks,

-- 
Markus Jelsma - CTO - Openindex
