nutch-dev mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: java.net.URL synchronization
Date Wed, 09 Dec 2009 22:55:48 GMT
Tomcat uses its own, slightly different version of the URL class:

http://tomcat.apache.org/tomcat-5.5-doc/catalina/docs/api/index.html
URL is designed to provide public APIs for parsing and synthesizing Uniform
Resource Locators as similar as possible to the APIs of java.net.URL, but
without the ability to open a stream or connection. One of the consequences
of this is that you can construct URLs for protocols for which a
URLStreamHandler is not available (such as an "https" URL when JSSE is not
installed).
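
To illustrate the point in the quoted documentation, here is a minimal sketch of java.net.URL rejecting a scheme that has no registered URLStreamHandler, while a parser that needs no handler accepts it. It uses java.net.URI as a stand-in, not Tomcat's own URL class. The class name UrlVsUri, its helper methods, and the "foo" scheme are made up for illustration, and the sketch assumes a default JVM with no handler registered for "foo".

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlVsUri {

    /** Returns true if java.net.URL accepts the spec, false if it throws. */
    static boolean urlAccepts(String spec) {
        try {
            // The String constructor resolves a URLStreamHandler for the scheme.
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // Thrown when no handler is available for the scheme.
            return false;
        }
    }

    /** Parses the spec with java.net.URI, which needs no stream handler. */
    static String uriHost(String spec) throws URISyntaxException {
        return new URI(spec).getHost();
    }

    public static void main(String[] args) throws Exception {
        String spec = "foo://example.com/path"; // "foo" is a made-up scheme
        System.out.println("java.net.URL accepts it: " + urlAccepts(spec));
        System.out.println("java.net.URI host: " + uriHost(spec));
    }
}
```

For code that only parses and normalizes links and never opens a connection, a handler-free parser of this kind may be enough.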



The synchronized stuff in java.net.URL is URLStreamHandler-related.
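
As a sketch of what that implies: the constructor excerpt quoted further down only calls getURLStreamHandler(protocol) when the handler argument is null, so passing an explicit handler skips that lookup. The class name ExplicitHandlerUrl, the PARSE_ONLY handler, and the parseOnly helper are hypothetical, and I have not measured whether this reduces contention on any particular JDK.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class ExplicitHandlerUrl {

    /** Hypothetical handler for parsing only; it refuses to open connections. */
    static final URLStreamHandler PARSE_ONLY = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("parse-only URL, connections are not supported");
        }
    };

    /** Builds a URL with an explicit handler, so no handler lookup by protocol is needed. */
    static URL parseOnly(String protocol, String host, int port, String file)
            throws MalformedURLException {
        return new URL(protocol, host, port, file, PARSE_ONLY);
    }

    public static void main(String[] args) throws Exception {
        // "foo" has no built-in handler, yet construction succeeds.
        URL u = parseOnly("foo", "example.com", 8080, "/a/b?x=1");
        System.out.println(u.getProtocol() + " " + u.getHost() + " "
                + u.getPort() + " " + u.getPath() + " " + u.getQuery());
    }
}
```

The trade-off is that such a URL cannot be opened, which is the same limitation the Tomcat class accepts by design.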


> -----Original Message-----
> From: Fuad Efendi [mailto:fuad@efendi.ca]
> Sent: December-09-09 5:40 PM
> To: nutch-dev@lucene.apache.org
> Subject: RE: java.net.URL synchronization
> 
> I checked java.net.URL; yes, Nutch and BIXO implicitly use a synchronized
> Hashtable:
> 
> 
>     public URL(String protocol, String host, int port, String file,
>                URLStreamHandler handler) throws MalformedURLException {
> 
>         ...
>         if (handler == null &&
>             (handler = getURLStreamHandler(protocol)) == null) {
>             throw new MalformedURLException("unknown protocol: " +
>                                             protocol);
>         }
> 
>         ...
> 
> 
> However, I don't think it hurts, because both architectures (at least
> BIXO) run a single thread in a single JVM to process, for instance,
> Outlinks. Only the "Fetch" part is multithreaded, but it doesn't use the
> URL class.
> 
> 
> I'm not sure how Nutch generates the fetch list... if that step is
> multithreaded, then a RegexUrlNormalizer "shared" between threads is an
> even bigger problem...
> 
> 
> Fuad Efendi
> +1 416-993-2060
> http://www.tokenizer.ca/
> Data Mining, Vertical Search
> 
> 
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:ogjunk-nutch@yahoo.com]
> > Sent: December-09-09 5:12 PM
> > To: nutch-dev@lucene.apache.org
> > Subject: java.net.URL synchronization
> >
> > Hello,
> >
> > Has anyone seen this:
> > http://www.supermind.org/blog/580/java-net-url-synchronization-bottleneck
> > ?
> >
> > Is this something that needs to be addressed in Nutch (and thus in Bixo,
> > and thus in the common crawler project)?
> >
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> 
> 


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay



