hadoop-mapreduce-user mailing list archives

From Xuri Nagarin <secs...@gmail.com>
Subject Re: Improving MR job disk IO
Date Thu, 10 Oct 2013 19:29:08 GMT
Thanks Pradeep. Does it mean this job is a bad candidate for MR?

Interestingly, running the command-line '/bin/grep' under a streaming job
gives (1) much better disk throughput and (2) CPU load spread almost evenly
across all cores/threads (no CPU gets pegged at 100%).
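
One way to check the CPU-bound versus IO-bound distinction on a single node, outside of MapReduce, is to time a grep-like scan of a local sample file and compare CPU time with wall-clock time. The following is only a minimal sketch under assumptions: the function name `scan` and its return keys are hypothetical, it runs as one local Python process, and it is not the bundled Grep example or the streaming job discussed here.

```python
import re
import time


def scan(path, pattern):
    """Count lines matching a bytes regex and report how the time was spent.

    `path` is a local file; `pattern` is a bytes pattern such as rb"error".
    Both the function name and the returned keys are illustrative only.
    """
    regex = re.compile(pattern)
    matches = 0
    total_bytes = 0

    wall_start = time.perf_counter()   # elapsed wall-clock time
    cpu_start = time.process_time()    # user + system CPU time of this process

    with open(path, "rb") as f:
        for line in f:
            total_bytes += len(line)
            if regex.search(line):
                matches += 1

    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start

    return {
        "matches": matches,
        "bytes": total_bytes,
        "wall_seconds": wall,
        "cpu_seconds": cpu,
        # Close to 1.0: the process spent most of its time computing.
        # Well below 1.0: it spent most of its time waiting, e.g. on disk.
        "cpu_fraction": cpu / wall if wall > 0 else 0.0,
        "mb_per_second": (total_bytes / 1e6) / wall if wall > 0 else 0.0,
    }
```

Calling `scan` with a path to a sample file and a bytes pattern returns the match count together with the timing figures. A `cpu_fraction` near 1.0 combined with a low `mb_per_second` would suggest the matching code, not the disk, is the limit. A file that is already in the OS page cache will read much faster than one read from disk, so the numbers are only indicative.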




On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

> Actually... I believe that is expected behavior. Since your CPU is pegged
> at 100% you're not going to be IO bound. Typically jobs tend to be CPU
> bound or IO bound. If you're CPU bound you expect to see low IO throughput.
> If you're IO bound, you expect to see low CPU usage.
>
>
> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <secsubs@gmail.com> wrote:
>
>> Hi,
>>
>> I have a simple Grep job (from the bundled examples) that I am running on an
>> 11-node cluster. Each node is 2x8-core Intel Xeons (shows 32 CPUs with HT
>> on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
>>
>> When I run the Grep job, I notice that CPU gets pegged to 100% on
>> multiple cores but disk throughput remains a dismal 1-2 Mbytes/sec on a
>> single disk on each node. So I guess the cluster is performing poorly in
>> terms of disk IO. Running Terasort, I see each disk puts out 25-35
>> Mbytes/sec with a total cluster throughput of above 1.5 Gbytes/sec.
>>
>> How do I go about re-configuring or re-writing the job to utilize maximum
>> disk IO?
>>
>> TIA,
>>
>> Xuri
>>
>>
>>
>
