Return-Path:
X-Original-To: apmail-hadoop-common-user-archive@www.apache.org
Delivered-To: apmail-hadoop-common-user-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 42060E4CD for ; Fri, 4 Jan 2013 02:04:19 +0000 (UTC)
Received: (qmail 18781 invoked by uid 500); 4 Jan 2013 02:04:14 -0000
Delivered-To: apmail-hadoop-common-user-archive@hadoop.apache.org
Received: (qmail 18641 invoked by uid 500); 4 Jan 2013 02:04:14 -0000
Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: user@hadoop.apache.org
Delivered-To: mailing list user@hadoop.apache.org
Received: (qmail 18633 invoked by uid 99); 4 Jan 2013 02:04:14 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 04 Jan 2013 02:04:14 +0000
X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (nike.apache.org: domain of john.lilley@redpoint.net designates 206.225.164.222 as permitted sender)
Received: from [206.225.164.222] (HELO hub021-nj-6.exch021.serverdata.net) (206.225.164.222) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 04 Jan 2013 02:04:06 +0000
Received: from MBX021-E3-NJ-2.exch021.domain.local ([10.240.4.78]) by HUB021-NJ-6.exch021.domain.local ([10.240.4.92]) with mapi id 14.02.0318.001; Thu, 3 Jan 2013 18:03:45 -0800
From: John Lilley
To: "user@hadoop.apache.org"
Subject: RE: Hadoop throughput question
Thread-Topic: Hadoop throughput question
Thread-Index: Ac3p/SHw1SvtjA3FRb6wvLoM0YfjTgAAXskgAAFxWuAAAJyzQAAB21IgABHIl4AAATvngP//issV
Date: Fri, 4 Jan 2013 02:03:45 +0000
Message-ID: <1qrxtbd61ywm2wm772sfmioa.1357264838352@email.android.com>
References: <22945_1357250425_0MG200FEIL4LY760_99DD75DC8938B743BBBC2CA54F7224A706D293F6@NYSGMBXB06.a.wcmc-ad.net>
 <869970D71E26D7498BDAC4E1CA92226B3FCD63BB@MBX021-E3-NJ-2.exch021.domain.local>
 <22945_1357254138_0MG200A81NZT5750_99DD75DC8938B743BBBC2CA54F7224A706D294B1@NYSGMBXB06.a.wcmc-ad.net>
 <869970D71E26D7498BDAC4E1CA92226B3FCD64B9@MBX021-E3-NJ-2.exch021.domain.local>
 <5555_1357257622_0MG200C3QQOL2GB0_99DD75DC8938B743BBBC2CA54F7224A706D29506@NYSGMBXB06.a.wcmc-ad.net>
 <-4387359799626623061@unknownmsgid>,<5555_1357261396_0MG200LAWTLFHM00_99DD75DC8938B743BBBC2CA54F7224A706D2958F@NYSGMBXB06.a.wcmc-ad.net>
In-Reply-To: <5555_1357261396_0MG200LAWTLFHM00_99DD75DC8938B743BBBC2CA54F7224A706D2958F@NYSGMBXB06.a.wcmc-ad.net>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
Content-Type: multipart/alternative; boundary="_000_1qrxtbd61ywm2wm772sfmioa1357264838352emailandroidcom_"
MIME-Version: 1.0
X-Virus-Checked: Checked by ClamAV on apache.org

--_000_1qrxtbd61ywm2wm772sfmioa1357264838352emailandroidcom_
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

Perhaps if Artem posted the presumably-simple code we could get other users to benchmark other 4-node systems and compare.
--John Lilley

Artem Ervits <are9004@nyp.org> wrote:

Setting the property to 64k made the throughput jump to 36 MB/sec, and to 39 MB/sec for 128k.

Thank you for the tip.

From: Michael Katzenellenbogen [mailto:michael@cloudera.com]
Sent: Thursday, January 03, 2013 7:28 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop throughput question

What is the value of the io.file.buffer.size property? Try tuning it up to 64k or 128k and see if this improves performance when reading SequenceFiles.

-Michael

On Jan 3, 2013, at 7:00 PM, Artem Ervits <are9004@nyp.org> wrote:

I will follow up on that certainly, thank you for the information.

So further investigation showed that counting SequenceFile records runs at about 26 MB/sec. If I simply read bytes on the same cluster and the same file, the speed is 70 MB/sec. Is there a configuration for optimizing SequenceFile processing?

Thank you.

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, January 03, 2013 6:09 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Unless the Hadoop processing and the OneFS storage are co-located, MapReduce can’t schedule tasks so as to take advantage of data locality. You would basically be doing a distributed computation against a separate NAS, so throughput would be limited by the performance properties of the Isilon NAS and the network switch architecture. Still, 26MB/sec in aggregate is far worse than what I’d expect Isilon to deliver, even over a single 1 Gb connection.

john

From: Artem Ervits [mailto:are9004@nyp.org]
Sent: Thursday, January 03, 2013 4:02 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and the Hadoop nodes are in the same datacenter, but I cannot tell how they are placed across racks.

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, January 03, 2013 5:15 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Let’s suppose you are doing a read-intensive job like, for example, counting records. This will be disk-bandwidth limited. On a 4-node cluster with 2 local SATA disks on each node you should easily read 400MB/sec in aggregate. When you are running the Hadoop cluster, is the Hadoop processing co-located with the Isilon nodes? Is Hadoop configured to use OneFS or HDFS?

John

From: Artem Ervits [mailto:are9004@nyp.org]
Sent: Thursday, January 03, 2013 3:00 PM
To: user@hadoop.apache.org
Subject: Hadoop throughput question

Hello all,

I’d like to pick the community’s brain on average throughput speeds for a moderately specced 4-node Hadoop cluster with 1GigE networking. Is it reasonable to expect constant average speeds of 150-200 MB/sec on such a setup? Forgive me if the question is loaded, but we’re running a Hadoop cluster with HDFS served via EMC Isilon storage.
We’re getting about 30 MB/sec with our machines and we do not see a difference in job speed between a 2-node cluster and a 4-node cluster.

Thank you.

--------------------

This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.

--------------------
Confidential Information subject to NYP's (and its affiliates') information management and security policies (http://infonet.nyp.org/QA/HospManual/).
--------------------

--_000_1qrxtbd61ywm2wm772sfmioa1357264838352emailandroidcom_
Content-Type: text/html; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable

Perhaps if Artem posted the presumably-simple code we could get other users to benchmark other 4-node systems and compare.
--John Lilley

Artem Ervits <are9004@nyp.org> wrote:

Setting the property to 64k made the throughput jump to 36 MB/sec, and to 39 MB/sec for 128k.

 

Thank you for the tip.

 

From: Michael Katzenellenbogen [mailto:michael@cloudera.com]
Sent: Thursday, January 03, 2013 7:28 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop throughput question

 

What is the value of the io.file.buffer.size property? Try tuning it up to 64k or 128k and see if this improves performance when reading SequenceFiles.

-Michael
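
To illustrate why a larger read buffer can help, here is a minimal Python sketch that reads the same local file with several chunk sizes and reports the rate. It is not Hadoop code and does not touch `io.file.buffer.size`; the helper names `read_with_buffer` and `make_test_file` are made up for this example, and a local file served from the page cache will not reproduce the numbers seen against a NAS. It only shows the shape of the experiment: same data, different buffer size, compare MB/sec.

```python
import os
import tempfile
import time


def read_with_buffer(path, buffer_size):
    """Read the whole file in chunks of buffer_size bytes; return (bytes_read, seconds)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 turns off Python's own buffering, so each read() is a single raw read call
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(buffer_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def make_test_file(size_bytes):
    """Create a temporary file of the given size filled with random bytes; return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    path = make_test_file(8 * 1024 * 1024)
    try:
        # 4k is used here as a small baseline; 64k and 128k are the sizes suggested above
        for size in (4 * 1024, 64 * 1024, 128 * 1024):
            total, seconds = read_with_buffer(path, size)
            rate = total / 1e6 / max(seconds, 1e-9)
            print(f"{size // 1024:>4}k buffer: {rate:.1f} MB/sec")
    finally:
        os.remove(path)
```

Smaller buffers mean more read calls for the same amount of data, and each call carries a fixed cost; against remote storage that fixed cost likely includes a network round trip, which is one plausible reason the setting matters more there.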


On Jan 3, 2013, at 7:00 PM, Artem Ervits <are9004@nyp.org> wrote:

I will follow up on that certainly, thank you for the information.

 

So further investigation showed that counting SequenceFile records runs at about 26 MB/sec. If I simply read bytes on the same cluster and the same file, the speed is 70 MB/sec. Is there a configuration for optimizing SequenceFile processing?

 

Thank you.
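
The gap between reading raw bytes and counting records can be sketched in Python with a toy record format. This is an illustration only: the length-prefixed layout below is a hypothetical stand-in, not the SequenceFile format, and the function names are invented for the example. It shows the two measurements being compared: one pass that consumes bytes in large chunks, and one pass that has to parse every record boundary.

```python
import io
import struct
import time


def write_records(stream, records):
    """Write each record as a 4-byte big-endian length followed by the payload."""
    for payload in records:
        stream.write(struct.pack(">I", len(payload)))
        stream.write(payload)


def count_bytes(stream, buffer_size=64 * 1024):
    """Baseline: consume the stream in large chunks without interpreting it."""
    total = 0
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            return total
        total += len(chunk)


def count_records(stream):
    """Parse record boundaries one at a time and return the number of records."""
    count = 0
    while True:
        header = stream.read(4)
        if not header:
            return count
        if len(header) < 4:
            raise ValueError("truncated record header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        count += 1


if __name__ == "__main__":
    buf = io.BytesIO()
    write_records(buf, [b"x" * 100] * 200_000)
    for name, fn in (("raw bytes", count_bytes), ("records", count_records)):
        buf.seek(0)
        start = time.perf_counter()
        result = fn(buf)
        print(f"{name}: result={result}, {time.perf_counter() - start:.3f}s")
```

The record pass issues many small reads and does per-record work, so it is expected to be slower than the raw pass on the same data; how much slower on a real cluster depends on the file format, compression, and buffer settings.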

 

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, January 03, 2013 6:09 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

 

Unless the Hadoop processing and the OneFS storage are co-located, MapReduce can’t schedule tasks so as to take advantage of data locality. You would basically be doing a distributed computation against a separate NAS, so throughput would be limited by the performance properties of the Isilon NAS and the network switch architecture. Still, 26MB/sec in aggregate is far worse than what I’d expect Isilon to deliver, even over a single 1 Gb connection.

john
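
As a rough check on that last point, the arithmetic can be written out in Python. This is a back-of-the-envelope sketch: it assumes 1 MB = 10^6 bytes and ignores protocol overhead, so the real ceiling of a 1 Gb link is somewhat lower. The function names are invented for the example; the rates are the ones reported in this thread.

```python
def link_ceiling_mb_per_sec(gigabits_per_sec):
    """Theoretical ceiling of a link in MB/sec, ignoring protocol overhead."""
    return gigabits_per_sec * 1000 / 8


def fraction_of_link(observed_mb_per_sec, gigabits_per_sec=1.0):
    """Share of the theoretical link ceiling that an observed rate uses."""
    return observed_mb_per_sec / link_ceiling_mb_per_sec(gigabits_per_sec)


if __name__ == "__main__":
    # Rates reported in the thread, in MB/sec
    observed = (
        ("SequenceFile record count", 26),
        ("same job, 64k buffer", 36),
        ("same job, 128k buffer", 39),
        ("raw byte read", 70),
    )
    print(f"1 Gb link ceiling: {link_ceiling_mb_per_sec(1.0):.0f} MB/sec")
    for label, rate in observed:
        print(f"{label}: {rate} MB/sec = {fraction_of_link(rate):.0%} of one link")
```

Under these assumptions a single 1 Gb link tops out at 125 MB/sec, so 26 MB/sec uses roughly a fifth of one link.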

 

From: Artem Ervits [mailto:are9004@nyp.org]
Sent: Thursday, January 03, 2013 4:02 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

 

Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and the Hadoop nodes are in the same datacenter, but I cannot tell how they are placed across racks.

 

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, January 03, 2013 5:15 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

 

Let’s suppose you are doing a read-intensive job like, for example, counting records. This will be disk-bandwidth limited. On a 4-node cluster with 2 local SATA disks on each node you should easily read 400MB/sec in aggregate. When you are running the Hadoop cluster, is the Hadoop processing co-located with the Isilon nodes? Is Hadoop configured to use OneFS or HDFS?

John
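
The 400MB/sec estimate above can be unpacked with a short Python sketch. The per-disk rate is not stated in the message; the 50 MB/sec figure below is simply back-calculated from 400 MB/sec across 4 nodes with 2 disks each, and the function names are invented for the example.

```python
def aggregate_read_mb_per_sec(nodes, disks_per_node, mb_per_sec_per_disk):
    """Aggregate sequential read rate if every local disk streams in parallel."""
    return nodes * disks_per_node * mb_per_sec_per_disk


def implied_per_disk_mb_per_sec(aggregate_mb_per_sec, nodes, disks_per_node):
    """Per-disk rate implied by an aggregate estimate."""
    return aggregate_mb_per_sec / (nodes * disks_per_node)


if __name__ == "__main__":
    per_disk = implied_per_disk_mb_per_sec(400, nodes=4, disks_per_node=2)
    print(f"implied per-disk rate: {per_disk:.0f} MB/sec")
    print(f"aggregate: {aggregate_read_mb_per_sec(4, 2, per_disk):.0f} MB/sec")
```

This scaling only applies when tasks read from local disks; with storage on a separate NAS, adding nodes does not add disk bandwidth, which would be consistent with seeing no difference between 2 and 4 nodes.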

 

From: Artem Ervits [mailto:are9004@nyp.org]
Sent: Thursday, January 03, 2013 3:00 PM
To: user@hadoop.apache.org
Subject: Hadoop throughput question

 

Hello all,

 

I’d like to pick the community’s brain on average throughput speeds for a moderately specced 4-node Hadoop cluster with 1GigE networking. Is it reasonable to expect constant average speeds of 150-200 MB/sec on such a setup? Forgive me if the question is loaded, but we’re running a Hadoop cluster with HDFS served via EMC Isilon storage. We’re getting about 30 MB/sec with our machines and we do not see a difference in job speed between a 2-node cluster and a 4-node cluster.

 

Thank you.

 
 
--------------------
 
This electronic message is intended to be for the use only of the named
recipient, and may contain information that is confidential or privileged.
If you are not the intended recipient, you are hereby notified that any
disclosure, copying, distribution or use of the contents of this message is
strictly prohibited. If you have received this message in error or are not
the named recipient, please notify us immediately by contacting the sender
at the electronic mail address noted above, and delete and destroy all
copies of this message. Thank you.
 
 
--------------------
Confidential Information subject to NYP's (and its affiliates') information management and security policies (http://infonet.nyp.org/QA/HospManual/).
--------------------



--_000_1qrxtbd61ywm2wm772sfmioa1357264838352emailandroidcom_--