Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of stas.oskin@gmail.com
 designates 209.85.218.176 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <1038C178-357B-4A3F-90F6-9D0F509733E3@cse.unl.edu>
References: <77938bc20904091545x623893f6jef73eaa4cac429f0@mail.gmail.com>
	 <B8A87480-E36F-4E87-8CAB-32C65D821D5B@cse.unl.edu>
	 <77938bc20904100740r37c25a0dwa7f473ac90f62593@mail.gmail.com>
	 <1038C178-357B-4A3F-90F6-9D0F509733E3@cse.unl.edu>
Date: Fri, 10 Apr 2009 21:51:48 +0300
Message-ID: <77938bc20904101151r23f10826ie1559e6fe9192d7@mail.gmail.com>
Subject: Re: HDFS read/write speeds, and read optimization
From: Stas Oskin <stas.oskin@gmail.com>
To: core-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=001636c5afb6688d6c046737db4a

--001636c5afb6688d6c046737db4a
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Hi.

>
> Depends on what kind of I/O you do - are you going to be using MapReduce
> and co-locating jobs and data?  If so, it's possible to get close to those
> speeds if you are I/O bound in your job and read right through each chunk.
>  If you have multiple disks mounted individually, you'll need the number of
> streams equal to the number of disks.  If you're going to do I/O that's not
> through MapReduce, you'll probably be bound by the network interface.
>

Btw, this is what I wanted to ask as well:

Is it more efficient to unify the disks into one volume (RAID or LVM) and
present them to HDFS as a single space? Or is it better to specify each disk
separately?

Reliability-wise, the latter sounds safer, since one or several (up to 3)
disks going down won't take the whole node with them. But perhaps there
is a performance penalty?
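
For reference, by "specify each disk separately" I mean something like the
fragment below in the DataNode's site configuration. This is only a sketch:
the mount points are made-up examples, and I'm assuming dfs.data.dir takes a
comma-separated list of directories, one per physical disk.

    <!-- hypothetical paths, one directory per individually mounted disk -->
    <property>
      <name>dfs.data.dir</name>
      <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data,/mnt/disk3/hdfs/data</value>
    </property>

With RAID or LVM the value would instead be a single directory on the
combined volume.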

--001636c5afb6688d6c046737db4a--