hadoop-user mailing list archives

From Jakub Stransky <stransky...@gmail.com>
Subject Re: CPU utilization
Date Fri, 12 Sep 2014 19:31:08 GMT
Adam, how did you come to the conclusion that it is memory bound? I
haven't found any such sign: even though the map tasks were assigned 768MB,
the job counters reported that only around 600MB was used, and no
significant GC time was incurred.

To be more specific about the job: in essence it loads data out of
Kafka in protocol buffers format, deserializes the messages, and remaps them to
the Avro data format. That is performed on a per-record basis, except for the
Kafka reader, which performs bulk reads via a buffer. Increasing the buffer size
and the fetch size didn't have any significant impact.

Maybe a completely silly question: how do I recognize that I have a memory-bound
job? Having a ~600MB heap and GC time of around 30 seconds out of a
60-minute job doesn't seem to me like a sign of insufficient memory.
I don't see any apparent bottleneck except the one I mentioned, on CPU per task
process, observed via the top command.
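One rough heuristic (my own suggestion, not something stated in this thread) is the fraction of wall time spent in GC: if it is well under a few percent, the heap is unlikely to be the bottleneck. With the numbers quoted above:

```python
# Rough memory-pressure heuristic: fraction of job time spent in GC.
# Numbers taken from the message above: ~30 s GC over a 60-minute job.
gc_time_sec = 30
job_time_sec = 60 * 60

gc_fraction = gc_time_sec / job_time_sec
print(f"GC fraction: {gc_fraction:.2%}")
```

At well under 1%, this supports the suspicion that the job is CPU bound rather than memory bound.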

On 12 September 2014 20:57, Adam Kawa <kawa.adam@gmail.com> wrote:

> Your NodeManager can use 2048 MB (yarn.nodemanager.resource.memory-mb) for
> allocating containers.
> If you run a map task, you need 768 MB (mapreduce.map.memory.mb).
> If you run a reduce task, you need 1024 MB (mapreduce.reduce.memory.mb).
> If you run the MapReduce app master, you need 1024 MB
> (yarn.app.mapreduce.am.resource.mb).
> Therefore, when you run a MapReduce job, you can run only 2 containers per
> NodeManager (since 3 x 768 = 2304 > 2048) on your setup.
> 2014-09-12 20:37 GMT+02:00 Jakub Stransky <stransky.ja@gmail.com>:
>> I thought that the memory assigned has to be a multiple of
>> yarn.scheduler.minimum-allocation-mb and is rounded up accordingly.
> That's right. It also specifies the minimum size of a container, to prevent
> requests for unreasonably small containers (which are likely to cause
> task failures).
>> any other I am not aware of. Are there any additional parameters like
>> the one you mentioned which should be set?
> There are also settings related to vcores in mapred-site.xml and
> yarn-site.xml. But they don't change anything in your case (as you are
> limited by memory, not vcores).
>> The job wasn't the smallest, but it wasn't petabytes of data either. It
>> ran on 1.5GB of data for 60 minutes, and I wasn't able to make any
>> significant improvement. It is a map-only job, and I wasn't able to achieve
>> more than 30% of total machine CPU utilization. However, top was displaying
>> 100% CPU for the process running on the data node, which is why I was
>> thinking about a per-container CPU limit. I didn't find any other
>> bottleneck, such as IO, network, or memory.
> CPU utilization depends on the type of your jobs (e.g. doing complex math
> operations vs. just counting words) and the number of containers you run. If
> you want to play with this, you can run more CPU-bound jobs or increase the
> number of containers running on a node.
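Putting the numbers from the quoted message together, the scheduler arithmetic can be sketched like this. The rounding rule follows yarn.scheduler.minimum-allocation-mb as discussed above; the helper names and the 256 MB minimum-allocation value are my own assumptions for illustration.

```python
import math

def round_up_allocation(requested_mb, min_alloc_mb):
    """YARN rounds a container request up to the next multiple of the
    scheduler's minimum allocation (yarn.scheduler.minimum-allocation-mb)."""
    return math.ceil(requested_mb / min_alloc_mb) * min_alloc_mb

def containers_per_node(node_mb, container_mb):
    """How many equal-sized containers fit on one NodeManager
    (yarn.nodemanager.resource.memory-mb)."""
    return node_mb // container_mb

# Numbers from the thread: 2048 MB per NodeManager, 768 MB map containers.
# The 256 MB minimum allocation is an assumed value, not from the thread.
node_mb = 2048
map_mb = round_up_allocation(768, 256)

print(containers_per_node(node_mb, map_mb))  # 2: a third map would need 2304 MB
```

This reproduces Adam's point: two 768 MB map containers fit in 2048 MB, but a third does not (3 x 768 = 2304 > 2048).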

Jakub Stransky
