Date: Thu, 3 Jan 2013 16:10:12 -0800
Subject: Re: Hadoop throughput question
From: Aaron Eng
To: user@hadoop.apache.org

If from the same machine, you can read the raw data of the file at 70MB/s and when reading it using SequenceFile you get 26MB/sec, I would presume that the speed difference comes down to the read pattern as well as the Isilon file system implementation.

For the 70MB/s, if you are doing something like "hadoop fs -cat <file> > /dev/null" then it's probably doing individual read operations of 64KB or 128KB or whatever Isilon supports. Then when you use the sequence file format to read record by record, instead of reading 64KB, maybe it's reading a 16KB record at a time, and each record requires an operation to be sent to Isilon to retrieve the data. Hence, I would presume the difference comes down to your file system implementation. Of course, if your record reader is poorly written or doing a lot of processing for each record, you might bottleneck on CPU. Presuming you aren't bottlenecked on CPU, it would seem to be the IO pattern and the file system implementation.
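
As a rough illustration of why the read size matters, here is a back-of-the-envelope sketch. It assumes every read is a separate round trip that does not overlap with the next one. The 0.5 ms per-operation cost and the 125 MB/s link rate are made-up inputs, not measurements of Isilon or of any particular cluster, and the function name is hypothetical.

```python
def effective_throughput_mb_s(read_size_kb, op_latency_ms, link_mb_s):
    """Throughput when each read is one non-overlapping round trip."""
    size_mb = read_size_kb / 1024.0
    transfer_s = size_mb / link_mb_s                 # time spent moving the bytes
    total_s = op_latency_ms / 1000.0 + transfer_s    # plus the fixed per-operation cost
    return size_mb / total_s


# Assumed inputs: 0.5 ms per operation, 125 MB/s link.
for size_kb in (16, 64, 128):
    rate = effective_throughput_mb_s(size_kb, op_latency_ms=0.5, link_mb_s=125.0)
    print(f"{size_kb:>4} KB reads: {rate:6.1f} MB/s")
```

With these inputs, 16KB reads come out at 25 MB/s and 64KB reads at 62.5 MB/s. The resemblance to the 26 and 70 figures in this thread comes from the latency value chosen, so treat it as showing the shape of the effect rather than as a diagnosis.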

If it's the IO pattern and file system implementation, you can check whether Isilon supports read-ahead at all. As a contrived example, with MapRFS, your user-level process may issue a 16KB read to the MapRFS library, and in turn the MapRFS library can read ahead 128KB so that the next series of 16KB reads in your program are served out of the local cache on your client, reducing the effects of network latency, etc.
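
The read-ahead idea can be sketched in a few lines. This is a toy model only: `CountingBackend` and `ReadAheadReader` are hypothetical names, the "remote file" is an in-memory buffer, and nothing here reflects how MapRFS or Isilon actually implement caching. It just counts how many backend operations a series of 16KB reads costs with and without a 128KB prefetch window.

```python
import io


class CountingBackend:
    """Stand-in for a remote file; counts the read operations it serves."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.ops = 0

    def read_at(self, offset, length):
        self.ops += 1
        self._buf.seek(offset)
        return self._buf.read(length)


class ReadAheadReader:
    """Serves small sequential reads out of a larger prefetched window."""

    def __init__(self, backend, window=128 * 1024):
        self.backend = backend
        self.window = window
        self.pos = 0           # next file offset handed to the caller
        self.cache = b""       # prefetched bytes
        self.cache_start = 0   # file offset of cache[0]

    def read(self, length):
        out = bytearray()
        while length > 0:
            cache_off = self.pos - self.cache_start
            if not (0 <= cache_off < len(self.cache)):
                # Cache miss: fetch a whole window in one backend operation.
                self.cache = self.backend.read_at(self.pos, self.window)
                self.cache_start = self.pos
                cache_off = 0
                if not self.cache:
                    break      # end of file
            chunk = self.cache[cache_off:cache_off + length]
            out += chunk
            self.pos += len(chunk)
            length -= len(chunk)
        return bytes(out)


RECORD = 16 * 1024
data = bytes(1024 * 1024)      # 1 MiB of zeros as a dummy file
n_records = len(data) // RECORD

# One backend operation per 16KB record.
direct = CountingBackend(data)
for i in range(n_records):
    direct.read_at(i * RECORD, RECORD)

# Same 16KB reads, served through a 128KB read-ahead window.
buffered_backend = CountingBackend(data)
reader = ReadAheadReader(buffered_backend)
total = sum(len(reader.read(RECORD)) for _ in range(n_records))

print(f"direct: {direct.ops} ops, read-ahead: {buffered_backend.ops} ops")
```

For the 1 MiB dummy file, the direct loop issues 64 backend operations while the read-ahead reader issues 8, one per 128KB window.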


On Thu, Jan 3, 2013 at 4:00 PM, Artem Ervits <are9004@nyp.org> wrote:

I will follow up on that certainly, thank you for the information.

So further investigation showed that counting SequenceFile records runs at about 26MB/sec. If I simply read bytes on the same cluster and the same file, the speed is 70MB/sec. Is there a configuration for optimizing SequenceFile processing?

Thank you.

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, January 03, 2013 6:09 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Unless the Hadoop processing and the OneFS storage are co-located, MapReduce can't schedule tasks so as to take advantage of data locality. You would basically be doing a distributed computation against a separate NAS, so throughput would be limited by the performance properties of the Isilon NAS and the network switch architecture. Still, 26MB/sec in aggregate is far worse than what I'd expect Isilon to deliver, even over a single 1Gb connection.

john
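
As a quick check on the comparison above, the arithmetic for a single gigabit link can be written out. This assumes the nominal 1 Gbit/s line rate and ignores protocol overhead, so the real ceiling is somewhat lower. The variable names are illustrative.

```python
LINK_GBIT_S = 1.0        # nominal 1GigE line rate
OBSERVED_MB_S = 26.0     # throughput reported in this thread

# 1 Gbit/s = 1000 Mbit/s, and 8 bits per byte gives 125 MB/s before overhead.
link_mb_s = LINK_GBIT_S * 1000 / 8
utilisation = OBSERVED_MB_S / link_mb_s

print(f"link ceiling: {link_mb_s:.0f} MB/s")
print(f"observed rate uses {utilisation:.0%} of one link")
```

On these numbers, 26MB/sec is about a fifth of what one gigabit link can carry in theory.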

From: Artem Ervits [mailto:are9004@nyp.org]
Sent: Thursday, January 03, 2013 4:02 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Hadoop is using OneFS, not HDFS, in our configuration. The Isilon NAS and the Hadoop nodes are in the same datacenter, but as far as rack locations go, I cannot tell.

From: John Lilley [mailto:john.lilley@redpoint.net]
Sent: Thursday, January 03, 2013 5:15 PM
To: user@hadoop.apache.org
Subject: RE: Hadoop throughput question

Let's suppose you are doing a read-intensive job like, for example, counting records. This will be disk bandwidth limited. On a 4-node cluster with 2 local SATA disks on each node you should easily read 400MB/sec in aggregate. When you are running the Hadoop cluster, is the Hadoop processing co-located with the Isilon nodes? Is Hadoop configured to use OneFS or HDFS?

John

From: Artem Ervits [mailto:are9004@nyp.org]
Sent: Thursday, January 03, 2013 3:00 PM
To: user@hadoop.apache.org
Subject: Hadoop throughput question

Hello all,

I'd like to pick the community brain on average throughput speeds for a moderately specced 4-node Hadoop cluster with 1GigE networking. Is it reasonable to expect constant average speeds of 150-200MB/sec on such a setup? Forgive me if the question is loaded, but we're running a Hadoop cluster with HDFS served via EMC Isilon storage. We're getting about 30MB/sec with our machines and we do not see a difference in job speed between a 2-node cluster and a 4-node cluster.

Thank you.

--------------------

This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.