nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] Commented: (NUTCH-721) Fetcher2 Slow
Date Sun, 09 Aug 2009 13:52:15 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741082#action_12741082 ]

Julien Nioche commented on NUTCH-721:
-------------------------------------

I had another look at this issue after applying the patch from NUTCH-719. I can easily reproduce
the situation from the original post by setting fetcher.threads.per.host.by.ip to true. The
nutch-site.xml file sent by Roger does not specify it, so the default value (true) applies.
Once it is set to false, all threads are active and the fetching is much faster.
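
For reference, the override would look roughly like this in nutch-site.xml. This is a minimal
sketch that assumes the usual Hadoop-style property layout used by Nutch configuration files;
only the property name and value come from this issue.

```xml
<!-- Sketch: group fetch queues by host name instead of resolved IP -->
<property>
  <name>fetcher.threads.per.host.by.ip</name>
  <value>false</value>
</property>
```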

I used the first 5K URLs from the fetchlist sent by Roger and compared the performance of the
two fetchers with fetcher.threads.per.host.by.ip set to false:

OldFetcher (run 1):
real	32m26.003s
user	1m11.768s
sys	0m10.337s

OldFetcher (run 2):
real	30m52.965s
user	1m10.696s
sys	0m10.425s

Fetcher (run 1):
real	31m21.924s
user	1m12.725s
sys	0m10.797s

Fetcher (run 2):
real	30m3.017s
user	1m15.509s
sys	0m10.909s

I ran each step twice and as we can see the results are comparable.

This explanation is also consistent with Steven's observation that we get 5-7 times the rate,
since subsequent lookups for URLs from the same site would hit the DNS cache. The IP
resolution is done by the QueueFeeder, which explains why it slows down the rate at which
URLs become available for fetching.
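
To illustrate the point, here is a minimal sketch of the two ways a queue key could be derived
from a URL. It is not the actual Nutch code: the class QueueKeySketch and the method queueKey
are hypothetical names used only for illustration. Keying by host name is a pure string
operation, whereas keying by IP needs a blocking DNS lookup for every host that is not
already cached, and a single feeder thread would perform those lookups one after another.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical sketch, not Nutch code: contrasts keying fetch queues
// by host name with keying them by resolved IP address.
final class QueueKeySketch {

    // Returns the key used to group a URL into a per-host fetch queue.
    static String queueKey(String url, boolean byIp) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        if (!byIp) {
            // Pure string operation: no network access is needed.
            return host.toLowerCase(Locale.ROOT);
        }
        try {
            // Blocking lookup unless the host is an IP literal or already cached.
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Neither call needs the network: the first skips resolution,
        // the second passes an IP literal.
        System.out.println(queueKey("http://Example.com/page", false));
        System.out.println(queueKey("http://127.0.0.1/page", true));
    }
}
```

With many distinct hosts in the fetchlist, the by-IP branch is the one that would keep the
queues empty while the fetcher threads spin-wait.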

I don't think the OldFetcher can group URLs by IP for politeness, in which case why not make
fetcher.threads.per.host.by.ip default to false in the new Fetcher?


> Fetcher2 Slow
> -------------
>
>                 Key: NUTCH-721
>                 URL: https://issues.apache.org/jira/browse/NUTCH-721
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.0.0
>         Environment: Fedora Core r6, Kernel 2.6.22-14,  jdk1.6.0_12
>            Reporter: Roger Dunk
>         Attachments: crawl_generate.tar.gz, nutch-site.xml
>
>
> Fetcher2 fetches far more slowly than Fetcher1.
> Config options:
> fetcher.threads.fetch = 80
> fetcher.threads.per.host = 80
> fetcher.server.delay = 0
> generate.max.per.host = 1
> With a queue size of ~40,000, the result is:
> activeThreads=80, spinWaiting=79, fetchQueues.totalSize=0
> with maybe a download of 1 page per second.
> Running with -noParse makes little difference.
> CPU load average is around 0.2. With Fetcher1 CPU load is around 2.0 - 3.0
> Hosts already cached by local caching NS appear to download quickly upon a re-fetch,
> so possible issue relating to NS lookups, however all things being equal Fetcher1 runs fast
> without pre-caching hosts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

