nutch-user mailing list archives

From Julien Nioche <lists.digitalpeb...@gmail.com>
Subject Re: Nutch distributed on IBM BladeCenter
Date Thu, 06 Dec 2012 11:16:31 GMT
Hi Sourajit

On 6 December 2012 07:11, Sourajit Basak <sourajit.basac@gmail.com> wrote:

> We are running Nutch distributed on an IBM BladeCenter setup. Each blade is
> 2P8C with 4G RAM per core.
>
> The Nutch Hadoop jobs will do OCR (a plugged-in custom parser) and hence
> will be memory intensive. The jobs have a high initialization time. I am
> wondering if anyone can suggest which Hadoop parameters we should tune to
> utilize the blades to their fullest.
>
> I understand that arriving at an optimized solution is subject to trials.
> To start off, we have zeroed in on these params.
>

See http://svn.apache.org/viewvc/nutch/trunk/src/bin/crawl?view=markup for
a starting point on Hadoop params


>
> 1. mapred.tasktracker.map|reduce.tasks.maximum
> <http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.map.tasks.maximum>
> equal to or a multiple of total cores per blade/node = 16n (do we leave aside
> more room so as not to throttle system procs?)
>

Yes, you'll have at least the tasktracker and datanode running on each
machine, as well as the system, so you should leave them a bit of CPU.
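
A rough sketch of that arithmetic in Python. The reserve of 2 cores and the 3:1 split between map and reduce slots are assumptions for illustration only, not figures from this thread, and `slots_per_node` is a hypothetical helper:

```python
def slots_per_node(total_cores, reserved_cores=2, map_share=0.75):
    """Split the cores left after the reserve into map and reduce slots.

    reserved_cores and map_share are illustrative assumptions; tune them
    by experiment on your own cluster.
    """
    usable = max(total_cores - reserved_cores, 2)
    map_slots = max(int(usable * map_share), 1)
    reduce_slots = max(usable - map_slots, 1)
    return map_slots, reduce_slots


# 16 cores per blade, as stated in the question
map_slots, reduce_slots = slots_per_node(16)
print("mapred.tasktracker.map.tasks.maximum =", map_slots)
print("mapred.tasktracker.reduce.tasks.maximum =", reduce_slots)
```

The resulting values would go into mapred-site.xml on each tasktracker node, since the slot maximums are per-node settings rather than per-job ones.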


> 2. mapred.map|reduce.child.java.opts (Per my understanding, Nutch jobs do
> not spawn any child JVM?) How do we specify the amount of memory to use for
> any new job created?
>

well, any Hadoop-based application will have the mappers and reducers
running as separate JVMs. The options above will determine how much RAM the
mappers and reducers are allowed. See also mapred.child.java.opts

The main memory hog is typically the parsing step, even more so if you do
OCR, I expect.
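
To get a first guess at the heap size, one could divide the RAM left after a reserve by the number of concurrent tasks. This is only a sketch: the 8 GB reserve and the 14 concurrent tasks per node are assumptions, and `child_java_opts` is a hypothetical helper. Note that -Xmx bounds only the Java heap, so a parser that calls native code may need extra headroom.

```python
def child_java_opts(node_ram_mb, reserved_mb, task_slots):
    """Return a -Xmx setting that shares the non-reserved RAM across task JVMs."""
    heap_mb = (node_ram_mb - reserved_mb) // task_slots
    return "-Xmx%dm" % heap_mb


# 16 cores x 4 GB per core = 64 GB per blade (from the question)
node_ram_mb = 16 * 4 * 1024
# Assumed: 8 GB kept for the OS, tasktracker and datanode; 14 task slots
opts = child_java_opts(node_ram_mb, reserved_mb=8 * 1024, task_slots=14)
print("mapred.child.java.opts =", opts)
```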



> 3. mapred.job.reuse.jvm.num.tasks: does this mean that our custom parser
> will be initialized only once? If so, we need to take care of parser
> failures and take appropriate precautions.
>

IIRC, the mapper or reducer instances will be reused but reinitialised.
A bit of experimentation will tell you, but you have got the right idea
here, and it is the right approach if the initialisation is slow.
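
For that experiment, one option is to pass the property per job as a generic -D argument, assuming the job is launched through Hadoop's ToolRunner so that -D options are picked up. The sketch below only renders the arguments; `hadoop_d_options` is a hypothetical helper, and the value -1 (reuse a JVM for an unlimited number of tasks of the same job) is worth checking against the hadoop-default documentation for your Hadoop version.

```python
def hadoop_d_options(props):
    """Render properties as a flat argument list: ['-D', 'key=value', ...]."""
    args = []
    for key in sorted(props):
        args.extend(["-D", "%s=%s" % (key, props[key])])
    return args


props = {
    # -1: no limit on how many tasks of one job may share a JVM (assumed)
    "mapred.job.reuse.jvm.num.tasks": -1,
}
print(" ".join(hadoop_d_options(props)))
```

Comparing the parse job's run time with the value set to 1 and to -1 should show whether the slow initialisation is actually being avoided.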

HTH

Julien



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
