Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of secsubs@gmail.com designates
 209.85.212.42 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CACG-F-vAqs18fHyk1TpNSd5dRBEcfK84oQYvtgohzF5DTOYgUA@mail.gmail.com>
References: 
 <CADPi3fhdLAkHmDk3nOQuNV17POyObWMsvmTo4BaOoRf0YKfkGg@mail.gmail.com>
	<CACG-F-vAqs18fHyk1TpNSd5dRBEcfK84oQYvtgohzF5DTOYgUA@mail.gmail.com>
Date: Thu, 10 Oct 2013 12:29:08 -0700
Message-ID: 
 <CADPi3fgOKbO26S7tHhp_oRxe0us6-HQMNJwjuJx2PpgBZbdqxA@mail.gmail.com>
Subject: Re: Improving MR job disk IO
From: Xuri Nagarin <secsubs@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=bcaec547c91703e13b04e8680495

--bcaec547c91703e13b04e8680495
Content-Type: text/plain; charset=ISO-8859-1

Thanks Pradeep. Does it mean this job is a bad candidate for MR?

Interestingly, running the command-line '/bin/grep' under a streaming job
gives (1) much better disk throughput and (2) CPU load that is spread almost
evenly across all cores/threads (no CPU gets pegged at 100%).
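
For a rough sense of scale, here is a back-of-the-envelope sketch using only
the figures quoted below. It is not a measurement. The helper name is made up
for illustration, and treating the Grep case as one busy disk per node is an
assumption about how to read "1-2 Mbytes/sec on a single disk on each node".

```python
# Figures quoted in this thread; everything else is an assumption.
NODES = 11
DISKS_PER_NODE = 8


def cluster_mb_per_sec(per_disk_mb_s, active_disks_per_node, nodes=NODES):
    """Aggregate throughput if every active disk sustains per_disk_mb_s."""
    return per_disk_mb_s * active_disks_per_node * nodes


# Terasort: 25-35 Mbytes/sec on each of the 8 disks per node.
terasort_low = cluster_mb_per_sec(25, DISKS_PER_NODE)
terasort_high = cluster_mb_per_sec(35, DISKS_PER_NODE)

# Grep: 1-2 Mbytes/sec, ASSUMING only one disk per node is busy.
grep_low = cluster_mb_per_sec(1, 1)
grep_high = cluster_mb_per_sec(2, 1)

print(f"Terasort: {terasort_low}-{terasort_high} Mbytes/sec cluster-wide")
print(f"Grep:     {grep_low}-{grep_high} Mbytes/sec cluster-wide")
```

Under these assumptions the Terasort range is consistent with the "above 1.5
Gbytes/sec" I observed, while the Grep job reads roughly two orders of
magnitude less, which fits the CPU-bound explanation.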


On Thu, Oct 10, 2013 at 11:15 AM, Pradeep Gollakota <pradeepg26@gmail.com> wrote:

> Actually... I believe that is expected behavior. Since your CPU is pegged
> at 100% you're not going to be IO bound. Typically jobs tend to be CPU
> bound or IO bound. If you're CPU bound you expect to see low IO throughput.
> If you're IO bound, you expect to see low CPU usage.
>
>
> On Thu, Oct 10, 2013 at 11:05 AM, Xuri Nagarin <secsubs@gmail.com> wrote:
>
>> Hi,
>>
>> I have a simple Grep job (from the bundled examples) that I am running on
>> an 11-node cluster. Each node has 2 x 8-core Intel Xeons (shows 32 CPUs with
>> HT on), 64GB RAM and 8 x 1TB disks. I have mappers set to 20 per node.
>>
>> When I run the Grep job, I notice that CPU gets pegged at 100% on
>> multiple cores but disk throughput remains a dismal 1-2 Mbytes/sec on a
>> single disk on each node. So I guess the cluster is performing poorly in
>> terms of disk IO. Running Terasort, I see each disk put out 25-35
>> Mbytes/sec, with a total cluster throughput of above 1.5 Gbytes/sec.
>>
>> How do I go about re-configuring or re-writing the job to utilize maximum
>> disk IO?
>>
>> TIA,
>>
>> Xuri
>>
>>
>>
>

--bcaec547c91703e13b04e8680495--