nutch-dev mailing list archives

From "Fuad Efendi" <f...@efendi.ca>
Subject RE: Multithreaded parsing
Date Mon, 28 Dec 2009 15:16:50 GMT
I'm not sure multithreading helps here:
- Parsing is CPU-bound.
- On a 4-core machine we need 3-4 threads at most.
- Map/Reduce can be configured with 3-4 Reducers to use 3-4 cores.

Why multithreading?

(With Map/Reduce on Hadoop, multithreading is a necessity only for fetching
pages from the Internet, i.e. in the Fetcher...)
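
To illustrate the sizing point for CPU-bound work, here is a minimal sketch (not Nutch code) of a fixed-size thread pool with one thread per core. The class name CpuBoundParsePool and the parse() method are hypothetical stand-ins for a real parse step; the only assumption is that each document can be parsed independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CpuBoundParsePool {

    // Hypothetical stand-in for a CPU-heavy parse step:
    // counts whitespace-separated tokens in a document.
    static int parse(String doc) {
        int tokens = 0;
        for (String t : doc.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens++;
            }
        }
        return tokens;
    }

    // Parses all documents on a fixed pool of `threads` workers.
    // invokeAll returns futures in the same order as the submitted tasks.
    static List<Integer> parseAll(List<String> docs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String d : docs) {
                tasks.add(() -> parse(d));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // For CPU-bound work, more threads than cores adds little.
        int cores = Runtime.getRuntime().availableProcessors();
        List<String> docs = Arrays.asList("hola mundo", "uno dos tres", "");
        System.out.println(parseAll(docs, cores));
    }
}
```

With a pool like this, going beyond the number of cores mostly adds context switching, which is why 3-4 threads are enough on a 4-core machine.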


Fuad Efendi
+1 416-993-2060
http://www.linkedin.com/in/liferay

Tokenizer Inc.
http://www.tokenizer.ca/
Data Mining, Vertical Search

> -----Original Message-----
> From: Santiago Pérez [mailto:elaragon@gmail.com]
> Sent: December-28-09 6:09 AM
> To: nutch-dev@lucene.apache.org
> Subject: Multithreaded parsing
> 
> 
> Hi,
> 
> I am developing a modification to Nutch that accepts only outlinks to
> Spanish URLs. I have implemented downloading and parsing the content of
> each outlink (in ParseOutputFormat) with Jericho, and detecting the
> language with LingPipe.
> 
> This process seems too heavy, especially because it is done by only one
> thread, so I would appreciate any ideas on the following:
> 
> Is there an easier way to detect the language of an outlink?
> Is there a way to perform multithreaded outlink extraction, as the Fetcher does?
> 
> Thanks in advance
> --
> View this message in context: http://old.nabble.com/Mutithreaded-parsing-tp26941947p26941947.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.



