Subject: Re: Datanode disk configuration
From: daemeon reiydelle
To: user@hadoop.apache.org
Date: Wed, 12 Nov 2014 08:55:39 -0800

I would consider JBOD with a 16-64 MB stride. That is the better choice when
one or more steps (e.g. MapReduce) will be I/O bound; otherwise some tasks
will be hit with the poor read/write throughput of having large amounts of
data behind a single spindle.

On Nov 12, 2014 8:37 AM, "Brian C. Huffman" <bhuffman@etinternational.com> wrote:
> All,
>
> I'm setting up a 4-node Hadoop 2.5.1 cluster. Each node has the following
> drives:
> 1 - 500 GB drive (OS disk)
> 1 - 500 GB drive
> 1 - 2 TB drive
> 1 - 3 TB drive
>
> In past experience I've had lots of issues with non-uniform drive sizes
> for HDFS, but unfortunately getting all 3 TB or 2 TB drives wasn't an
> option for this cluster.
>
> My thought is to set up the 2 TB and 3 TB drives for HDFS data and the
> 500 GB drive for intermediate data. Most of our jobs don't make heavy use
> of intermediate data, but at least this way I get a good amount of space
> (2 TB) per node before I run into issues. Then I may end up using the
> AvailableSpaceVolumeChoosingPolicy to help with balancing the blocks.
>
> If necessary I could put intermediate data on one of the OS partitions
> (/home), but that doesn't seem ideal.
>
> Does anybody have recommendations on the optimal use of storage in this
> scenario?
>
> Thanks,
> Brian
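For reference, a minimal sketch of the layout Brian describes, as hdfs-site.xml
and yarn-site.xml fragments for Hadoop 2.5.x. The mount points (/data/disk2tb,
/data/disk3tb, /data/scratch) are placeholders rather than paths from the
original message, and the threshold/fraction values shown are the Hadoop
defaults, spelled out here only to make the policy's behavior explicit:

  <!-- hdfs-site.xml: keep HDFS block data on the two large disks only -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/disk2tb/dfs/dn,/data/disk3tb/dfs/dn</value>
  </property>

  <!-- Round-robin is the default volume chooser; the available-space policy
       lets the 3 TB volume absorb proportionally more new blocks -->
  <property>
    <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
    <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
  </property>

  <!-- Volumes within 10 GB of free space of each other count as balanced (default) -->
  <property>
    <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-threshold</name>
    <value>10737418240</value>
  </property>

  <!-- When unbalanced, send ~75% of new blocks to the volumes with more free space (default) -->
  <property>
    <name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
    <value>0.75</value>
  </property>

  <!-- yarn-site.xml: put NodeManager local (intermediate/shuffle) data on the spare 500 GB disk -->
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/data/scratch/yarn/local</value>
  </property>

The main thing to watch with this split is whether shuffle-heavy jobs fit in
500 GB per node; if they don't, yarn.nodemanager.local-dirs accepts a
comma-separated list, so the intermediate data could spill onto additional
directories as well.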