nutch-dev mailing list archives

From Santiago Pérez <elara...@gmail.com>
Subject RE: Mutithreaded parsing
Date Mon, 28 Dec 2009 20:23:54 GMT

Because parsing the outlinks of each fetched URL with a single thread is too
slow when I also have to detect the language of each outlink's content. I
would like to share the load between multiple threads, as the Fetcher does.
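
To make the idea concrete, here is a minimal sketch of spreading per-outlink
language checks over a fixed thread pool. It is only an illustration under
stated assumptions, not Nutch code: the class OutlinkLanguageFilter, the
method filter and the detectLanguage callback are hypothetical names, and
detectLanguage stands in for the "download, parse with Jericho, classify with
LingPipe" step described in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch.
final class OutlinkLanguageFilter {

    // Returns the outlinks whose detected language equals wantedLang,
    // in their original order. detectLanguage is a placeholder for the
    // real (slow) per-URL work: download, parse, classify.
    static List<String> filter(List<String> outlinks,
                               Function<String, String> detectLanguage,
                               String wantedLang,
                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per outlink; each task answers "is this URL wanted?".
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String url : outlinks) {
                tasks.add(() -> wantedLang.equals(detectLanguage.apply(url)));
            }
            // invokeAll blocks until all tasks are done and returns the
            // futures in the same order as the submitted tasks.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < outlinks.size(); i++) {
                if (results.get(i).get()) {
                    kept.add(outlinks.get(i));
                }
            }
            return kept;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            throw new IllegalStateException("interrupted while filtering", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("language detection failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

If the per-outlink work were purely CPU-bound, a thread count close to
Runtime.getRuntime().availableProcessors() would probably be enough. Since
each check here also downloads the page, a somewhat larger pool may help,
which is presumably why the Fetcher uses many threads.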


Funtick wrote:
> 
> Not sure about multithreading:
> - Parsing is CPU-bound
> - On a 4-core machine we need 3-4 threads at most
> - Map/Reduce can be configured with 3-4 Reducers so that it uses 3-4 cores
> 
> Why multithreading?
> 
> (with Map/Reduce on Hadoop, multithreading is a necessity only for fetching
> pages from the Internet, i.e. in the Fetcher...)
> 
> 
> Fuad Efendi
> +1 416-993-2060
> http://www.linkedin.com/in/liferay
> 
> Tokenizer Inc.
> http://www.tokenizer.ca/
> Data Mining, Vertical Search
> 
>> -----Original Message-----
>> From: Santiago Pérez [mailto:elaragon@gmail.com]
>> Sent: December-28-09 6:09 AM
>> To: nutch-dev@lucene.apache.org
>> Subject: Mutithreaded parsing
>> 
>> 
>> Hi,
>> 
>> I am developing a modification to Nutch that accepts only outlinks to
>> Spanish-language URLs. I have implemented downloading and parsing the
>> content of each outlink (in ParseOutputFormat) with Jericho, and detecting
>> the language with LingPipe.
>> 
>> This process seems too heavy, especially because it is done by only one
>> thread, so I would appreciate any ideas on the following:
>> 
>> Is there an easier way to detect the language of an outlink?
>> Is there a way to perform multithreaded outlink extraction, as the Fetcher
>> does?
>> 
>> Thanks in advance
>> --
>> View this message in context: http://old.nabble.com/Mutithreaded-parsing-tp26941947p26941947.html
>> Sent from the Nutch - Dev mailing list archive at Nabble.com.

-- 
View this message in context: http://old.nabble.com/Mutithreaded-parsing-tp26941947p26947396.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

