nutch-dev mailing list archives

From Ken Krugler <kkrugler_li...@transpac.com>
Subject Re: Nutch Performance Improvements
Date Tue, 25 Aug 2009 17:12:12 GMT

On Aug 25, 2009, at 9:50am, Fuad Efendi wrote:

> I forgot to add: for “Allow Redirects” to work properly we also need
> cookie handling in HttpClient. Most “stateful” websites generate
> links inside HTML with session tokens if they find that the client
> does not support cookies; but if HttpClient does support cookies, we
> are forced to allow redirects (although the new version of HttpClient
> may support a per-host cookie cache?). To be verified...

HttpClient 4.0 provides per-user/thread context, which includes  
cookies. I don't know of any per-host cookie support, just per-host  
routing.

-- Ken
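
To illustrate the per-thread idea, here is a minimal sketch that uses only
the JDK's java.net.CookieManager rather than HttpClient's own context
classes. The names PerThreadCookies, remember and count are hypothetical,
and the way HttpClient 4.0 attaches a cookie store to its context is not
shown; the only point is that each fetcher thread keeps its own cookies.

```java
import java.net.CookieManager;
import java.net.CookieStore;
import java.net.HttpCookie;
import java.net.URI;

// Hypothetical helper for illustration; not part of Nutch or HttpClient.
class PerThreadCookies {
    // Each thread lazily gets its own CookieManager and in-memory store.
    private static final ThreadLocal<CookieManager> MANAGER =
            ThreadLocal.withInitial(CookieManager::new);

    static CookieStore store() {
        return MANAGER.get().getCookieStore();
    }

    // Records a cookie that only the calling thread will see.
    static void remember(String url, String name, String value) {
        store().add(URI.create(url), new HttpCookie(name, value));
    }

    // Number of non-expired cookies visible to the calling thread.
    static int count() {
        return store().getCookies().size();
    }

    public static void main(String[] args) throws InterruptedException {
        remember("http://example.com/", "JSESSIONID", "main-thread");
        int[] seenByWorker = new int[1];
        Thread worker = new Thread(() -> seenByWorker[0] = count());
        worker.start();
        worker.join();
        System.out.println("main sees " + count()
                + " cookie(s), worker sees " + seenByWorker[0]);
    }
}
```

Because nothing here is keyed by host, a thread that fetches from several
sites holds all of their cookies in one store, which is per-thread rather
than per-host state.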

>
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: August-25-09 12:42 PM
> To: nutch-dev@lucene.apache.org
> Subject: Nutch Performance Improvements
>
> Hello,
>
>
> A few years ago I noticed some performance bottlenecks in Nutch;
> checking the source code now, they are still there.
>
>
> 1. RegexURLNormalizer and similar plugins
> It’s a singleton, and its main method is synchronized. It would be
> better to have a non-synchronized, per-thread instance; but how do we
> make it a plugin then?
>
>
> 2. “Allow Redirects” for HttpClient
> By allowing redirects we can avoid HttpSession-related tokens in
> final URLs
> (maybe it’s not acceptable for a general crawl, but it would be nice to
> have such a configuration option)
>
>
>
> Fuad Efendi
> ==================================
> http://www.linkedin.com/in/liferay
> http://www.tokenizer.org
> http://www.casaGURU.com
> ==================================
>
>
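
On point 1 above, one possible way to drop the synchronized method is to
give each thread its own normalizer through a ThreadLocal. This is a sketch
under assumptions, not Nutch's RegexURLNormalizer: the class name, the
single jsessionid rule and the normalize signature are all made up for
illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a URL normalizer; not the real Nutch class.
class PerThreadNormalizer {
    // Assumed example rule: strip a ";jsessionid=..." path parameter.
    // A compiled Pattern is immutable and safe to share across threads.
    private static final Pattern SESSION_ID =
            Pattern.compile(";jsessionid=[^?#]*", Pattern.CASE_INSENSITIVE);

    // One normalizer per thread, created on first use.
    private static final ThreadLocal<PerThreadNormalizer> INSTANCE =
            ThreadLocal.withInitial(PerThreadNormalizer::new);

    // A Matcher is NOT thread-safe, so each instance owns one and
    // reuses it without locking.
    private final Matcher matcher = SESSION_ID.matcher("");

    static PerThreadNormalizer get() {
        return INSTANCE.get();
    }

    // Deliberately not synchronized: the instance never leaves its thread.
    String normalize(String url) {
        return matcher.reset(url).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(
                get().normalize("http://example.com/a;jsessionid=ABC123?x=1"));
    }
}
```

A singleton plugin entry point could presumably keep its existing interface
and simply delegate to PerThreadNormalizer.get().normalize(url); that part
is untested against the Nutch plugin API.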

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378

