From: Scott Carey
To: "common-user@hadoop.apache.org"
Date: Wed, 10 Nov 2010 11:32:29 -0800
Subject: Re: Read() block mysteriously when using big BytesPerChecksum size

On Oct 7,
2010, at 2:35 AM, elton sky wrote:

> Hello experts,
>
> I was benchmarking sequential write throughput of HDFS.
>
> To test the effect of bytesPerChecksum (bpc) size on write performance, I am
> using different bpc sizes: 2M, 256K, 32K, 4K, 512B.
>
> My cluster has 1 name node and 5 data nodes. They are Xen VMs, and each of
> them is configured with a 56MB/s duplex ethernet connection.
>
> I try to create a 10G file with each bpc value. When bpc is 2M, the
> throughput drops dramatically compared with the others:
>
> time(ms): 333008  bpc: 2M
> time(ms): 234180  bpc: 256K
> time(ms): 223737  bpc: 32K
> time(ms): 228842  bpc: 4K
> time(ms): 228238  bpc: 512
>
> After digging into the source, I found the problem happens on the data nodes,
> in org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket():
>
>   private int readNextPacket() throws IOException {
>     ...
>     while (buf.remaining() < SIZE_OF_INTEGER) {
>       if (buf.position() > 0) {
>         shiftBufData();
>       }
>       readToBuf(-1); // this line takes 30ms or more for each packet before returning
>     }
>     ...
>     while (toRead > 0) { // this loop also takes around 30 ms
>       toRead -= readToBuf(toRead);
>     }
>     ...
>   }
>
>   private long readToBufTime(int toRead) throws IOException {
>     ...
>     int nRead = in.read(buf.array(), buf.limit(), toRead); // this line actually causes the delay
>     ...
>   }
>
> The in.read() takes around 30ms waiting for data before it returns, and
> when it returns it has read only a few KB of data. The while loop that comes
> later takes a similar time to finish, reading the remainder (2MB minus the
> few KB read before).
>
> I can't understand the reason for the pause in in.read(). Why does the data
> node need to wait? Why is the data not available then?

It is probably waiting on disk or network.

> Why does this happen when using a big bpc?

Linux tends to asynchronously 'read-ahead' from disks if sequential access is detected in a file.
The default is to read ahead in chunks of up to 128K. You can change this on a per-device level with "blockdev --setra" (google it). Since Hadoop fetches data in a synchronous loop, it loses the benefit of the OS asynchronous read-ahead past 128K unless you change that setting.

I recommend a readahead value of ~2MB for today's SATA drives if you need top sequential access performance from linux. This would look something like this for 2MB:

# blockdev --setra 4096 /dev/sda

> any idea will be appreciated!
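To unpack that command a bit (a sketch; /dev/sda is only an example device, and blockdev needs root): --setra takes a count of 512-byte sectors rather than bytes, which is why 2MB comes out as 4096.

```shell
# blockdev --setra takes a count of 512-byte sectors, not bytes.
# Compute the sector count for a 2 MiB readahead:
BYTES=$((2 * 1024 * 1024))
SECTORS=$((BYTES / 512))
echo "$SECTORS"   # prints 4096

# Then, as root on each datanode (substitute your actual device):
#   blockdev --getra /dev/sda              # show current readahead, in sectors
#   blockdev --setra "$SECTORS" /dev/sda   # set readahead to 2 MiB
```

Check it with --getra afterwards; the setting does not persist across reboots, so you would typically put it in a boot script.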