nutch-user mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Two Nutch parallel crawl with two conf folder.
Date Tue, 09 Mar 2010 17:07:47 GMT
coool answer



----- Original Message ----
> From: MilleBii <millebii@gmail.com>
> To: nutch-user@lucene.apache.org
> Sent: Tue, 9 March, 2010 8:35:42
> Subject: Re: Two Nutch parallel crawl with two conf folder.
> 
> Yes, it should work. I personally run some test crawls on the same
> hardware, even in the same Nutch directory, thus sharing the conf
> directory.
> But if you don't want that, I would use two Nutch directories and of
> course two different crawl directories, because with Hadoop they will
> end up on the same HDFS (assuming you run in distributed or pseudo-distributed mode).
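The layout MilleBii suggests can be sketched as follows. The install paths and crawl directory names here are hypothetical; the point is that each instance gets its own Nutch directory (and thus its own conf/), while the `-dir` arguments must differ because both instances write into the same shared HDFS:

```shell
# Two independent Nutch installs, each with its own conf/crawl-urlfilter.txt
# (paths are made up for illustration):
/opt/nutch-abc/bin/nutch crawl urls -dir crawl-abc -depth 1
/opt/nutch-xyz/bin/nutch crawl urls -dir crawl-xyz -depth 1
# Both point at the same HDFS, so giving them distinct -dir values is what
# keeps the two CrawlDBs from colliding.
```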
> 
> 2010/3/9, Pravin Karne :
> >
> > Can we share a Hadoop cluster between two Nutch instances?
> > So there will be two Nutch instances and they will point to the same Hadoop
> > cluster.
> >
> > This way I am able to share my hardware bandwidth. I know that Hadoop in
> > distributed mode serializes jobs.
> > But that will not affect my flow. I just want to share my hardware resources.
> >
> > I tried with two Nutch setups, but somehow the second instance is overriding
> > the first one's configuration.
> >
> >
> > Any pointers?
> >
> > Thanks
> > -Pravin
> >
> >
> > -----Original Message-----
> > From: MilleBii [mailto:millebii@gmail.com]
> > Sent: Monday, March 08, 2010 8:02 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: Two Nutch parallel crawl with two conf folder.
> >
> > How parallel is parallel in your case?
> > Don't forget Hadoop in distributed mode will serialize your jobs anyhow.
> >
> > For the rest, why don't you create two Nutch directories and run things
> > totally independently?
> >
> >
> > 2010/3/8, Pravin Karne :
> >> Hi guys, any pointer on the following?
> >> Your help will be highly appreciated.
> >>
> >> Thanks
> >> -Pravin
> >>
> >> -----Original Message-----
> >> From: Pravin Karne
> >> Sent: Friday, March 05, 2010 12:57 PM
> >> To: nutch-user@lucene.apache.org
> >> Subject: Two Nutch parallel crawl with two conf folder.
> >>
> >> Hi,
> >>
> >> I want to do two parallel Nutch crawls with two conf folders.
> >>
> >> I am using the crawl command to do this. I have two separate conf folders;
> >> all files in conf are the same except crawl-urlfilter.txt. In this file we
> >> have different filters (domain filters),
> >>
> >>  e.g. the 1st conf has:
> >>              +^http://([a-z0-9]*\.)*abc.com/
> >>
> >>        the 2nd conf has:
> >>         +^http://([a-z0-9]*\.)*xyz.com/
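As a quick sanity check of what such a filter line admits, the regex part can be exercised against the seed URLs outside of Nutch. This is only a stand-in demo: `grep -E` here substitutes for Nutch's own Java regex matching, and the URL list mirrors the `urls` file from later in this mail:

```shell
# Feed the three seed URLs through the abc.com filter regex.
# Only URLs on abc.com (or a subdomain of it) should pass; note the
# unescaped dot in "abc.com" follows the stock Nutch example style.
printf 'http://www.abc.com/\nhttp://www.xyz.com/\nhttp://www.pqr.com/\n' \
  | grep -E '^http://([a-z0-9]*\.)*abc.com/'
```

Run as-is, this prints only `http://www.abc.com/`, which is the behavior each per-domain crawl-urlfilter.txt is meant to enforce.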
> >>
> >>
> >> I am starting two crawls with the above configuration, on separate
> >> consoles (one followed by the other).
> >>
> >> I am using the following crawl commands:
> >>
> >>       bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
> >>
> >>       bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1
> >>
> >> [Note: We have modified nutch.sh to accept '--nutch_conf_dir']
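As an aside, the script modification may not be necessary: the stock `bin/nutch` launcher in Nutch 1.x consults the `NUTCH_CONF_DIR` environment variable when building its classpath, falling back to `$NUTCH_HOME/conf` when it is unset. Assuming an unmodified script, the two runs above could be written as:

```shell
# Sketch: point each run at its own conf directory via the environment
# variable that the stock bin/nutch script already honors, instead of a
# custom --nutch_conf_dir flag. Paths reuse those from the mail above.
NUTCH_CONF_DIR=/home/conf1 bin/nutch crawl urls -dir test1 -depth 1
NUTCH_CONF_DIR=/home/conf2 bin/nutch crawl urls -dir test2 -depth 1
```

Setting the variable per-command keeps the two invocations from leaking configuration into each other's shell environment.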
> >>
> >> urls file have following entries-
> >>
> >>    http://www.abc.com
> >>    http://www.xyz.com
> >>    http://www.pqr.com
> >>
> >>
> >> Expected result:
> >>
> >>      CrawlDB test1 should contain abc.com's data and CrawlDB test2 should
> >> contain xyz.com's data.
> >>
> >> Actual result:
> >>
> >>   The URL filter of the first run is overridden by the URL filter of the
> >> second run.
> >>
> >>   So both CrawlDBs have xyz.com's data.
> >>
> >>
> >> Please provide pointers regarding this.
> >>
> >> Thanks in advance.
> >>
> >> -Pravin
> >>
> >>
> >> DISCLAIMER
> >> ==========
> >> This e-mail may contain privileged and confidential information which is
> >> the
> >> property of Persistent Systems Ltd. It is intended only for the use of
> >> the
> >> individual or entity to which it is addressed. If you are not the
> >> intended
> >> recipient, you are not authorized to read, retain, copy, print,
> >> distribute
> >> or use this message. If you have received this communication in error,
> >> please notify the sender and delete all copies of this message.
> >> Persistent
> >> Systems Ltd. does not accept any liability for virus infected mails.
> >>
> >
> >
> > --
> > -MilleBii-
> >
> >
> 
> 
> -- 
> -MilleBii-


