Subject: Re: How to do load control of MapReduce
From: Stefan Will
To: core-user@hadoop.apache.org
Date: Mon, 11 May 2009 20:39:46 -0700
I'm having similar performance issues, and have been running my Hadoop processes at a nice level of 10 for a while without noticing any improvement.

In my case, I believe what's happening is that the peak combined RAM usage of all the Hadoop task processes and the service processes exceeds the amount of RAM on my machines. This in turn causes parts of the server processes to get paged out to disk while the nightly Hadoop batch jobs are running. Since the swap space is typically on the same physical disks as the DFS and MapReduce working directories, I become heavily I/O bound, and real-time queries pretty much slow to a crawl.

I think the key is to make absolutely sure that all of your processes fit in the available RAM at all times. I'm actually having a hard time achieving this, since the virtual memory usage of the JVM is usually far higher than the maximum heap size (see my other thread).

-- Stefan

> From: zsongbo
> Reply-To:
> Date: Tue, 12 May 2009 10:58:49 +0800
> To:
> Subject: Re: How to do load control of MapReduce
>
> Thanks Billy, I am trying 'nice' and will report the result later.
>
> On Tue, May 12, 2009 at 3:42 AM, Billy Pearson
> wrote:
>
>> You might try setting the TaskTracker's Linux nice level to, say, 5 or 10,
>> leaving the DFS and HBase settings at 0.
>>
>> Billy
>>
>> "zsongbo" wrote in message
>> news:fa03480d0905110549j7f09be13qd434ca41c9f84d1d@mail.gmail.com...
>>
>>> Hi all,
>>> If we have a large dataset to process with MapReduce, the MapReduce job
>>> will take as many machine resources as it can get.
>>>
>>> So when one such big MapReduce job is running, the cluster becomes
>>> very busy and can hardly do anything else.
>>>
>>> For example, we have an HDFS + MapReduce + HBase cluster.
>>> There is a large dataset in HDFS that is processed by MapReduce
>>> periodically; the workload is CPU and I/O heavy. The cluster also
>>> provides other services for queries (querying HBase and reading
>>> files in HDFS).
>>> So, when the job is running, query latency becomes very long.
>>>
>>> Since the MapReduce job is not time sensitive, I want to control the
>>> load of MapReduce. Do you have any advice?
>>>
>>> Thanks in advance.
>>> Schubert
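[Archive note: Stefan's sizing advice above can be sketched as a quick back-of-the-envelope check before tuning nice levels. This is a hypothetical sketch, not part of the original thread; the task counts, per-task virtual size, and daemon footprint below are placeholder assumptions — substitute numbers observed on your own nodes (e.g. via `ps` or `top`), and note that per-task virtual size is typically well above the `-Xmx` heap limit, as Stefan points out.]

```shell
#!/bin/sh
# Hypothetical worst-case memory estimate for a Hadoop worker node.
# Concurrent task slots roughly correspond to the (real) config keys
# mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum.
MAX_MAP_TASKS=4      # assumed map slots per TaskTracker
MAX_REDUCE_TASKS=2   # assumed reduce slots per TaskTracker
TASK_VMEM_MB=1200    # assumed virtual size per task JVM (larger than -Xmx)
DAEMON_MB=3000       # assumed DataNode + TaskTracker + HBase region server, etc.

# Peak = all task slots full at once, plus the long-running daemons.
PEAK_MB=$(( (MAX_MAP_TASKS + MAX_REDUCE_TASKS) * TASK_VMEM_MB + DAEMON_MB ))
echo "Estimated peak usage: ${PEAK_MB} MB"
```

If the estimate approaches physical RAM, the node will swap under load regardless of nice levels, which matches the I/O-bound behavior described above; reducing the task slot counts is then more effective than renicing.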