nutch-dev mailing list archives

From Santiago Pérez <elara...@gmail.com>
Subject Multithreaded parsing
Date Mon, 28 Dec 2009 11:09:25 GMT

Hi,

I am developing a modification to Nutch that accepts only outlinks pointing to
Spanish-language pages. For each outlink, I download and parse the content
with Jericho (in ParseOutputFormat) and detect its language with LingPipe.

This process seems too heavy, especially because it runs in a single
thread, so I would appreciate any ideas on the following:

Is there an easier way to detect the language of an outlink?
Is there a way to extract outlinks with multiple threads, as the fetcher does?
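
To make the second question concrete, here is a rough sketch of the kind of parallel outlink check being asked about. It is plain java.util.concurrent code, not Nutch code: the class name ParallelOutlinkFilter, the filter method, and the isSpanish predicate are all hypothetical. The predicate is only a stand-in for the download, Jericho parse, and LingPipe step described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper, not part of Nutch: runs a per-outlink check on a thread pool.
class ParallelOutlinkFilter {

    // Returns the URLs accepted by `accept`, in their original order.
    // Each check runs as a separate task on a fixed pool of `threads` workers.
    static List<String> filter(List<String> urls, Predicate<String> accept, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Boolean>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> accept.test(url));
            }
            // invokeAll blocks until every task has finished and returns the
            // futures in the same order as the submitted tasks.
            List<Future<Boolean>> results = pool.invokeAll(tasks);
            List<String> kept = new ArrayList<>();
            for (int i = 0; i < urls.size(); i++) {
                if (results.get(i).get()) {
                    kept.add(urls.get(i));
                }
            }
            return kept;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder check so the sketch runs on its own. This is NOT language
        // detection; the real check would fetch the page and classify its text.
        Predicate<String> isSpanish = url -> url.contains(".es/");
        List<String> outlinks = Arrays.asList(
                "http://a.es/x", "http://b.com/y", "http://c.es/z");
        System.out.println(filter(outlinks, isSpanish, 2));
    }
}
```

Whether something like this can safely run inside ParseOutputFormat, and how it would interact with Nutch's own fetch politeness settings, is exactly the open part of the question.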

Thanks in advance
-- 
View this message in context: http://old.nabble.com/Mutithreaded-parsing-tp26941947p26941947.html
Sent from the Nutch - Dev mailing list archive at Nabble.com.

