nutch-user mailing list archives

From Emmanuel <>
Subject Generate is very slow
Date Tue, 10 Jul 2007 08:45:06 GMT
I'm seeing very bad performance when I try to generate a segment from a
CrawlDb that contains 1M URLs.

I have a cluster of 2 machines, with 200 map tasks and 5 reduce tasks.

I set up 200 maps because I was hitting OutOfMemory errors otherwise.

Correct me if I'm wrong, but the process has two steps:
1- a first job extracts all URLs eligible for crawling, up to the limit of my
topN parameter
2- a second job partitions them by host and creates 200 outputs (the same number as the map count)
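If I understand it right, the host-based partitioning in that second job boils down to something like the sketch below. This is a simplified standalone version written from the description above, not the actual Nutch code; the class and method names are mine.

```java
import java.net.URL;

public class HostPartitioner {
    // Assign a URL to one of numPartitions buckets by hashing its host,
    // in the spirit of a Hadoop Partitioner's getPartition(key, value, numPartitions).
    public static int getPartition(String url, int numPartitions) {
        String host;
        try {
            host = new URL(url).getHost();
        } catch (Exception e) {
            host = url; // fall back to the raw string for malformed URLs
        }
        // Mask off the sign bit so the result is a valid partition index.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Two URLs from the same host always land in the same partition,
        // so a single fetcher ends up responsible for that host.
        int p1 = getPartition("http://example.com/page1", 200);
        int p2 = getPartition("http://example.com/page2", 200);
        System.out.println(p1 == p2); // prints true
    }
}
```

The point of partitioning by host is politeness: all URLs for one host go to the same partition, so one fetcher task handles them and can throttle itself.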

It's the second step that takes a long time: the process took
more than 5 hours, which seems huge.
What about you? Do you see similar performance?

One thing I found out is that it creates 200 output files
even when an output is empty.
For instance, my CrawlDb contains 1M URLs but only 5 different
hosts. That means the second job partitions the list into 5
output files containing the URLs, plus 195 empty output files.
This hurts performance, because time is wasted copying the empty
outputs from one server to the other.
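A quick back-of-the-envelope check of that claim, assuming the hash-by-host scheme above (the host names here are made up, standing in for my 5 real hosts): with only 5 distinct hosts, at most 5 of the 200 partitions can ever receive data, so at least 195 reducers produce empty files.

```java
import java.util.HashSet;
import java.util.Set;

public class EmptyPartitions {
    // Count how many of numPartitions buckets stay empty when the given
    // hosts are distributed by hashing, as in a host-based partitioner.
    public static int emptyPartitions(String[] hosts, int numPartitions) {
        Set<Integer> used = new HashSet<>();
        for (String h : hosts) {
            used.add((h.hashCode() & Integer.MAX_VALUE) % numPartitions);
        }
        return numPartitions - used.size();
    }

    public static void main(String[] args) {
        // 5 hypothetical hosts, 200 partitions: at least 195 partitions
        // get no URLs at all, whatever the hash values are.
        String[] hosts = {"a.com", "b.com", "c.com", "d.com", "e.com"};
        System.out.println(emptyPartitions(hosts, 200) >= 195); // prints true
    }
}
```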

Don't you think we could find a better way to partition the URLs, either to
avoid creating empty files or to get a better distribution over the whole set
of maps?

