nutch-user mailing list archives

From lewis john mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Nutch Crawl to Solr with separate cores for hosts.
Date Mon, 24 Oct 2011 06:15:00 GMT
Hi Sudip,

Can you elaborate on how you called "the run method of Crawl, specifying the
<urlDir> and <solrURL> for each host"?

My initial feeling is that every generated segment should contain only URLs
from a single host.
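
As a rough sketch of that idea, the seed list could be split by host before any
crawl is started, so that each host gets its own <urlDir> and its own Solr core
URL. The Python below is only an illustration: the function names, the
example.com/example.org seeds, the localhost Solr base URL and the
"dots to underscores" core naming are all assumptions, not anything Nutch or
Solr prescribes. Splitting the seeds does not by itself stop a crawl from
following outlinks to other hosts; that would still have to be restricted in
the Nutch configuration (the db.ignore.external.links property and the URL
filters are probably worth checking).

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_seeds_by_host(seed_urls):
    """Map each lower-cased host name to the seed URLs that belong to it."""
    groups = defaultdict(list)
    for url in seed_urls:
        url = url.strip()
        host = urlsplit(url).hostname  # None when the URL has no host part
        if host:
            groups[host].append(url)
    return dict(groups)


def core_url(solr_base, host):
    """Hypothetical naming scheme: one core per host, dots become underscores."""
    return solr_base.rstrip("/") + "/" + host.replace(".", "_")


# Assumed example seeds; a real seed list would be read from a file.
seeds = [
    "http://example.com/a",
    "http://example.org/",
    "http://example.com/b",
]

groups = group_seeds_by_host(seeds)
for host, urls in groups.items():
    # Each (urls, core URL) pair would feed one separate crawl and index run.
    print(host, core_url("http://localhost:8983/solr", host), urls)
```

Each host would then be crawled into its own crawl directory, so that segments
from different hosts never end up in the same index run.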

An additional thought: do you only want to capture the entire graph for each
distinct host, or do you want to collect an accurate webgraph that extends
beyond the host's domain space, e.g. via outlinks?

Lewis

On Mon, Oct 24, 2011 at 7:58 AM, Sudip Datta <sudipdatta10@gmail.com> wrote:

> Hi,
>
> I have just started using Nutch (and Solr). I am trying to crawl pages from
> a few hosts and wish to create individual cores for each host name. I tried
> the first cut solution - calling the run method of Crawl, specifying the
> <urlDir> and <solrURL> for each host. This doesn't work, with each Solr core
> (corresponding to host name) containing pages from multiple hosts. I tried
> searching on the web for any literature on this but while there exist
> documents describing integration of Solr with Nutch and multicore Solr
> indexes, I couldn't find anything that speaks about creating multicore Solr
> indexes using Nutch to crawl.
>
> It'd be great if somebody could shed some light on this.
>
> Thanks,
>
> --Sudip.
>



-- 
*Lewis*
