hadoop-user mailing list archives

From Xuri Nagarin <secs...@gmail.com>
Subject Re: Improving MR job disk IO
Date Thu, 10 Oct 2013 20:50:46 GMT
On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

> I don't think it necessarily means that the job is a bad candidate for MR.
> It's a different type of workload. Hortonworks has a great article on the
> different types of workloads you might see and how that affects your
> provisioning choices at
> http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

One statement that stood out to me in the link above is "For these reasons,
Hortonworks recommends that you either use the Balanced workload
configuration or invest in a pilot Hadoop cluster and plan to evolve as you
analyze the workload patterns in your environment."

Now, this is not a critique of HW but rather of Hadoop. What
if my workloads can be both CPU- and IO-intensive? Do I take the approach of

> I have not looked at the Grep code so I'm not sure why it's behaving the
> way it is. Still curious that streaming has a higher IO throughput and
> lower CPU usage. It may have to do with the fact that /bin/grep is a native
> implementation and Grep (Hadoop) is probably using Java Pattern/Matcher api.

The Grep code is from the bundled examples in CDH. I made a one-line
modification so that it reads Sequence files. The streaming job probably does
not have lower CPU utilization, but I see that it does even out the CPU
utilization among all the available processors. I guess the native grep
binary threads better than the Java MR job?
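
For reference, here is a minimal sketch of the kind of per-record loop a
Java regex job runs. It is not the bundled CDH Grep code; it assumes, as
Pradeep suggests, that the Java side spends its time in java.util.regex.
RegexCount and countMatches are hypothetical names used only for illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the bundled Grep example: counts regex matches
// per record the way a regex mapper's inner loop might.
final class RegexCount {
    private RegexCount() {}

    static long countMatches(Pattern pattern, Iterable<String> records) {
        // The Pattern is compiled once by the caller, and a single Matcher
        // is reset for each record instead of being reallocated.
        Matcher matcher = pattern.matcher("");
        long matches = 0;
        for (String record : records) {
            matcher.reset(record);
            while (matcher.find()) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("ab+");
        long n = countMatches(pattern, List.of("abb ab", "none", "xab"));
        System.out.println(n);
    }
}
```

Every byte read from disk passes through matcher.find(), so if that scan is
slow, the cores saturate well before the disks do.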

Which brings me to ask: if you have the mapper/reducer functionality built
into a platform-specific binary, won't it always be more efficient
than a Java MR job? And, in such cases, am I better off with streaming than
with Java MR?
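
To make the comparison concrete: a streaming mapper is just an executable
that reads records on stdin and writes its output lines to stdout. Below is a
rough sketch of that contract in Java, doing the same job /bin/grep does in
my streaming run. StdinGrep, filter, and the default pattern "ERROR" are
hypothetical; this is not code from either job.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a streaming mapper: copies matching input
// lines to the output, like a stripped-down grep.
final class StdinGrep {
    private StdinGrep() {}

    static long filter(Pattern pattern, Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        PrintWriter writer = new PrintWriter(out);
        Matcher matcher = pattern.matcher("");
        long emitted = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            // Emit the whole line whenever the pattern occurs anywhere in it.
            if (matcher.reset(line).find()) {
                writer.println(line);
                emitted++;
            }
        }
        writer.flush();
        return emitted;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default pattern, for illustration only.
        Pattern pattern = Pattern.compile(args.length > 0 ? args[0] : "ERROR");
        filter(pattern, new InputStreamReader(System.in), new PrintWriter(System.out));
    }
}
```

Either way the per-record work is a regex scan, so what I am really asking
is whether the native scan is cheaper than the Java one.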

Thanks for your responses.

> On Thu, Oct 10, 2013 at 12:29 PM, Xuri Nagarin <secsubs@gmail.com> wrote:
>> Thanks Pradeep. Does it mean this job is a bad candidate for MR?
>> Interestingly, running the cmdline '/bin/grep' under a streaming job
>> provides (1) much better disk throughput and (2) a CPU load that is almost
>> evenly spread across all cores/threads (no CPU gets pegged at 100%).
>> On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:
>>> Actually... I believe that is expected behavior. Since your CPU is
>>> pegged at 100% you're not going to be IO bound. Typically jobs tend to be
>>> CPU bound or IO bound. If you're CPU bound you expect to see low IO
>>> throughput. If you're IO bound, you expect to see low CPU usage.
>>> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <secsubs@gmail.com> wrote:
>>>> Hi,
>>>> I have a simple Grep job (from bundled examples) that I am running on an
>>>> 11-node cluster. Each node has 2 x 8-core Intel Xeons (shows 32 CPUs with HT
>>>> on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
>>>> When I run the Grep job, I notice that CPU gets pegged to 100% on
>>>> multiple cores but disk throughput remains a dismal 1-2 Mbytes/sec on a
>>>> single disk on each node. So I guess the cluster is performing poorly in
>>>> terms of disk IO. Running Terasort, I see each disk puts out 25-35
>>>> Mbytes/sec with a total cluster throughput of above 1.5 Gbytes/sec.
>>>> How do I go about re-configuring or re-writing the job to utilize
>>>> maximum disk IO?
>>>> TIA,
>>>> Xuri
