nutch-user mailing list archives

From Pravin Karne <>
Subject Two Nutch parallel crawl with two conf folder.
Date Fri, 05 Mar 2010 07:26:58 GMT
I want to run two Nutch crawls in parallel, each with its own conf folder.

I am using the crawl command to do this. I have two separate conf folders; all files in them
are the same except crawl-urlfilter.txt. In this file each folder has different filters (domain filters).

 e.g. 1st conf has -

       2nd conf has -

I am starting the two crawls with the above configurations, on separate consoles (one followed by

I am using the following crawl commands -

      bin/nutch --nutch_conf_dir=/home/conf1 crawl urls -dir test1 -depth 1
      bin/nutch --nutch_conf_dir=/home/conf2 crawl urls -dir test2 -depth 1

[Note: We have modified the script to support '--nutch_conf_dir'.]
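The two-folder setup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: example.com and example.org are hypothetical placeholder domains (not the ones from my real filters), the scratch directory stands in for /home, and the one-line nutch-site.xml stands in for the shared conf files. The filter lines follow the usual crawl-urlfilter.txt pattern of one accept rule followed by a catch-all reject.

```shell
#!/bin/sh
# Sketch only: build two conf folders that differ only in crawl-urlfilter.txt.
# example.com / example.org are hypothetical placeholders, not the real domains.
set -eu

WORK=$(mktemp -d)                 # scratch area standing in for /home
mkdir -p "$WORK/conf_base"
# Stand-in for the shared Nutch config files (normally copied from the real conf/).
echo '<configuration/>' > "$WORK/conf_base/nutch-site.xml"

make_conf() {                     # usage: make_conf <target-dir> <domain>
    cp -R "$WORK/conf_base" "$1"
    # Accept URLs under the given domain, then reject everything else.
    printf '%s\n' "+^http://([a-z0-9]*\\.)*$2/" '-.' > "$1/crawl-urlfilter.txt"
}

make_conf "$WORK/conf1" example.com
make_conf "$WORK/conf2" example.org

# Confirm that crawl-urlfilter.txt is the only file that differs.
DIFF_FILES=$(diff -rq "$WORK/conf1" "$WORK/conf2" || true)
echo "$DIFF_FILES"
```

Each resulting folder would then be passed to one of the two crawl commands through our '--nutch_conf_dir' option.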

The urls file has the following entries -

Expected Result:

     CrawlDB test1 should contain only the 1st conf's domain data, and CrawlDB test2 should contain only the 2nd conf's domain data.

Actual Results:

  The URL filter of the first run is overridden by the URL filter of the second run.

  So both CrawlDBs contain the same domain's data.

Please provide pointers regarding this.

Thanks in advance.

