nutch-dev mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Nutch Performance Improvements
Date Tue, 25 Aug 2009 16:50:34 GMT
I forgot to add: for "Allow Redirects" to work properly, we also need cookie
handling in HttpClient. Most "stateful" websites generate links with session
tokens inside the HTML if they find that the client does not support cookies;
but if HttpClient supports cookies, we are forced to allow redirects (although
the new version of HttpClient supports a per-host cookie cache?!). To be
verified...

 

From: Fuad Efendi [mailto:fuad@efendi.ca] 
Sent: August-25-09 12:42 PM
To: nutch-dev@lucene.apache.org
Subject: Nutch Performance Improvements

 

Hello,

A few years ago I noticed some performance bottlenecks in Nutch; checking
the source code now... they are still the same.

1. RegexURLNormalizer and similar plugins

It's a singleton, and its main method is synchronized. It would be better to
have a per-thread, non-synchronized instance; but how would it then work as a
plugin?
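
One possible answer to the plugin question, as a rough sketch only: keep a
single entry point and let it hand out a per-thread worker through
ThreadLocal, so the normalize call itself needs no lock. The class name
PerThreadNormalizer and the two regex rules below are hypothetical stand-ins
for illustration; this is not Nutch's actual RegexURLNormalizer code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a regex-based URL normalizer (not Nutch code). */
class PerThreadNormalizer {

    /** One rewrite rule: a compiled pattern plus its replacement string. */
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Each thread lazily creates its own instance on first use,
    // so normalize() does not have to be synchronized.
    private static final ThreadLocal<PerThreadNormalizer> LOCAL =
        ThreadLocal.withInitial(PerThreadNormalizer::new);

    // Per-instance rule list; illustrative rules only (assumption).
    private final List<Rule> rules = new ArrayList<>();

    private PerThreadNormalizer() {
        rules.add(new Rule("(?i);jsessionid=[^?#]*", "")); // strip session token
        rules.add(new Rule("#.*$", ""));                   // strip fragment
    }

    /** Returns the instance that belongs to the calling thread. */
    static PerThreadNormalizer get() {
        return LOCAL.get();
    }

    /** Applies every rule in order; touches only this thread's state. */
    String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }
}
```

A caller would use PerThreadNormalizer.get().normalize(url) from any fetcher
thread. The trade-off is one rule set compiled per thread, which costs some
memory in exchange for removing the lock contention.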

 

 

2. "Allow Redirects" for HttpClient

By allowing redirects we can avoid HttpSession-related tokens in final URLs
(maybe this is not acceptable for a general crawl, but it would be nice to
have such a configuration option).
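
To illustrate what such an option might look like, here is a minimal sketch
that uses the JDK's own java.net.http.HttpClient (Java 11+), not the Commons
HttpClient library that Nutch uses. The allowRedirects flag and the class
name RedirectingClientFactory are hypothetical and not existing Nutch
settings. Cookie handling is enabled together with redirects, since
session-aware sites typically fall back to URL tokens only when the client
does not accept cookies.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.http.HttpClient;

/** Hypothetical factory; not part of Nutch. */
class RedirectingClientFactory {

    /**
     * Builds a client whose redirect and cookie behaviour depends on a
     * hypothetical "allow redirects" configuration flag.
     */
    static HttpClient build(boolean allowRedirects) {
        HttpClient.Builder builder = HttpClient.newBuilder();
        if (allowRedirects) {
            // In-memory cookie store, accepting cookies from the original server only.
            CookieManager cookies =
                new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
            builder.cookieHandler(cookies)
                   .followRedirects(HttpClient.Redirect.NORMAL);
        } else {
            // The crawler sees the redirect response and decides what to do with it.
            builder.followRedirects(HttpClient.Redirect.NEVER);
        }
        return builder.build();
    }
}
```

Whether a shared cookie store is acceptable across hosts in a large crawl is
exactly the part that still needs to be verified.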

 

 

 

Fuad Efendi
==================================
http://www.linkedin.com/in/liferay
http://www.tokenizer.org
http://www.casaGURU.com
==================================