nutch-dev mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Niocchi - java asynchronous crawl library released
Date Mon, 19 Oct 2009 13:09:33 GMT

Hi Andrzej,

Yes, I measured and compared (two years ago). I am actually using simplified,
rewritten code based on Nutch, with a non-synchronized instance per thread.

Imagine 1024 threads, each with 100 outlinks, all trying to call a synchronized
method: a total of 102,400 calls contending for that method within a frame of
about 3 seconds on average (because of network delays). With the synchronized
code I was even able to have 1024 concurrent threads without any performance
impact! Also, each synchronization requires additional CPU cycles (500-1000)
even when concurrency is small.

With the non-synchronized code, I can't have more than 128 threads (the CPU is
overloaded). It runs faster.
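
A minimal sketch of the "non-synchronized instance per thread" idea, for
illustration only. SimpleUrlNormalizer, its two regex rules, and the pool size
are hypothetical stand-ins and are not the Nutch RegexURLNormalizer API; the
point is only that each worker thread gets its own instance through
ThreadLocal, so normalize() needs no lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

public class PerThreadNormalizerSketch {

    /** Hypothetical normalizer (assumption, not Nutch code). */
    static final class SimpleUrlNormalizer {
        // Compiled Pattern objects are immutable and safe to share;
        // the Matcher objects created per call are not shared.
        private static final Pattern SESSION_ID =
                Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);
        private static final Pattern DUP_SLASHES =
                Pattern.compile("(?<!:)/{2,}");

        // Deliberately NOT synchronized: each thread uses its own instance.
        String normalize(String url) {
            String s = SESSION_ID.matcher(url).replaceAll("");
            return DUP_SLASHES.matcher(s).replaceAll("/");
        }
    }

    // One normalizer per thread, created lazily on first use.
    private static final ThreadLocal<SimpleUrlNormalizer> NORMALIZER =
            ThreadLocal.withInitial(SimpleUrlNormalizer::new);

    static String normalize(String url) {
        return NORMALIZER.get().normalize(url);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String url =
                    "http://example.com//a//b;jsessionid=ABC" + i + "?q=" + i;
            results.add(pool.submit(() -> normalize(url)));
        }
        for (Future<String> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

The trade-off is one normalizer instance held per worker thread in exchange
for no lock contention on normalize().
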
-Fuad


> -----Original Message-----
> From: Andrzej Bialecki [mailto:ab@getopt.org]
> Sent: October-19-09 5:47 AM
> To: nutch-dev@lucene.apache.org
> Subject: Re: Niocchi - java asynchronous crawl library released
> 
> Fuad Efendi wrote:
> > Hi Andrzej,
> >
> > Real bottleneck of Nutch is RegexURLNormalizer, it is still synchronized
> > singleton (shared by multiple threads). And similar synchronized plugins which
> > should be probably refactored to Nutch core...
> 
> It's not a singleton, but it's true that the normalize() method is
> synchronized. Did you actually measure the impact of this
> synchronization on the crawling speed? I very much doubt it outweighs
> the impact of politeness limits.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com



