From: Scott Carey
To: "common-user@hadoop.apache.org"
Date: Wed, 10 Nov 2010 11:32:29 -0800
Subject: Re: Read() block mysteriously when using big BytesPerChecksum size

On Oct 7,
2010, at 2:35 AM, elton sky wrote:

> Hello experts,
>
> I was benchmarking sequential write throughput of HDFS.
>
> To test the effect of bytesPerChecksum (bpc) size on write performance, I am
> using different bpc sizes: 2M, 256K, 32K, 4K, 512B.
>
> My cluster has 1 name node and 5 data nodes. They are Xen VMs, and each of
> them is configured with a 56MB/s duplex ethernet connection.
>
> I try to create a 10G file with each bpc value. When bpc is 2M, the
> throughput drops dramatically compared with the others:
>
> time(ms): 333008  bpc: 2M
> time(ms): 234180  bpc: 256K
> time(ms): 223737  bpc: 32K
> time(ms): 228842  bpc: 4K
> time(ms): 228238  bpc: 512
>
> After digging into the source, I found the problem happens on the data nodes,
> in org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():
>
>   private int readNextPacket() throws IOException {
>     ...
>     while (buf.remaining() < SIZE_OF_INTEGER) {
>       if (buf.position() > 0) {
>         shiftBufData();
>       }
>       readToBuf(-1); // this line takes 30ms or more for each packet before returning
>     }
>     ...
>     while (toRead > 0) { // this loop also takes around 30 ms
>       toRead -= readToBuf(toRead);
>     }
>     ...
>   }
>
>   private long readToBufTime(int toRead) throws IOException {
>     ...
>     int nRead = in.read(buf.array(), buf.limit(), toRead); // this line actually causes the delay
>     ...
>   }
>
> The in.read() takes around 30ms waiting for data before it returns, and
> when it returns it has read only a few KB of data. The while loop that comes
> later takes a similar time to finish, reading the remainder (2MB minus the
> few KB read before).
>
> I can't understand the reason for the pause in in.read(). Why does the data
> node need to wait? Why is the data not available then?

It is probably waiting on disk or network.

> Why does this happen when using a big bpc?

Linux tends to asynchronously 'read-ahead' from disks if sequential access is detected in a file.
The default is to read ahead in chunks of up to 128K. You can change this on a per-device level with "blockdev --setra" (google it). Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS asynchronous read-ahead past 128K unless you change that setting.

I recommend a readahead value of ~2MB for today's SATA drives if you need top sequential access performance from linux. This would look something like this for 2MB:

# blockdev --setra 4096 /dev/sda

> any idea will be appreciated!
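To unpack that command a bit (a sketch; /dev/sda is only an example device, and blockdev needs root): --setra takes a count of 512-byte sectors rather than bytes, which is why 2MB comes out as 4096.

```shell
# blockdev --setra takes a count of 512-byte sectors, not bytes.
# Compute the sector count for a 2 MiB readahead:
BYTES=$((2 * 1024 * 1024))
SECTORS=$((BYTES / 512))
echo "$SECTORS"   # prints 4096

# Then, as root on each datanode (substitute your actual device):
#   blockdev --getra /dev/sda              # show current readahead, in sectors
#   blockdev --setra "$SECTORS" /dev/sda   # set readahead to 2 MiB
```

Check it with --getra afterwards; the setting does not persist across reboots, so you would typically put it in a boot script.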