Subject: Re: Controlling the block placement and the file placement in HDFS writes
From: Ananth Gundabattula <agundabattula@gmail.com>
To: hdfs-dev@hadoop.apache.org
Date: Sat, 20 Dec 2014 08:52:59 +1100

Hello Zhe,

Thanks a lot for the inputs. Storage policies are really what I was
looking for, for one of the problems.

@Nick: I agree that it would be a nice feature to have. Thanks for the
info.

Regards,
Ananth

On Fri, Dec 19, 2014 at 10:49 AM, Nick Dimiduk wrote:

> HBase would enjoy similar functionality. In our case, we'd like all
> replicas for all files in a given HDFS path to land on the same set of
> machines. That way, in the event of a failover, regions can be assigned
> to one of these other machines, which has local access to all blocks for
> all region files.
>
> On Thu, Dec 18, 2014 at 3:36 PM, Zhe Zhang wrote:
>
> > > The second aspect is that our queries are time based and this time
> > > window follows a familiar pattern of old data not being queried much.
> > > Hence we would like to preserve the most recent data in the HDFS
> > > cache (Impala is helping us manage this aspect via its command set),
> > > but we would like the next most recent data chunks to land on an SSD
> > > that is present on every datanode. The remaining set of blocks, which
> > > are "very old but in large quantities", would land on spinning disks.
> > > The decision to choose a given volume is based on the file name, as
> > > we can control the filename that is used to generate the file.
> >
> > Have you tried the 'setStoragePolicy' command? It's part of the HDFS
> > "Heterogeneous Storage Tiers" work and seems to address your scenario.
> >
> > > 1. Is there a way to control that all file blocks belonging to a
> > > particular hdfs directory & file go to the same physical datanode
> > > (and their corresponding replicas as well?)
> >
> > This seems inherently hard: the file/dir could have more data than a
> > single DataNode can host. Implementation-wise, it requires some sort
> > of a map in BlockPlacementPolicy from inode or file path to DataNode
> > address.
> >
> > My 2 cents..
> >
> > --
> > Zhe Zhang
> > Software Engineer, Cloudera
> > https://sites.google.com/site/zhezhangresearch/
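
For completeness, a minimal sketch of the 'setStoragePolicy' approach Zhe
describes, assuming a Hadoop 2.6+ cluster whose DataNode data directories
are tagged with storage types (e.g. "[SSD]/data/ssd" in
dfs.datanode.data.dir). The paths and policy names below are illustrative,
and the set of built-in policies depends on the release.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class StoragePolicyExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      throw new IllegalStateException("Storage policies are HDFS-specific");
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;

    // Hypothetical layout: recent partitions prefer SSD-tagged volumes;
    // older partitions stay on ordinary spinning-disk volumes (HOT is the
    // default all-DISK policy).
    dfs.setStoragePolicy(new Path("/warehouse/events/recent"), "ALL_SSD");
    dfs.setStoragePolicy(new Path("/warehouse/events/old"), "HOT");

    // The policy applies to blocks written after it is set; existing
    // blocks only move when rewritten (or via the HDFS Mover tool in
    // releases that ship it).
  }
}

Since the original use case keys the tier off the file name, the writing
application could pick the target directory (and hence the policy) from
the same naming convention it already controls.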
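
On Nick's point: HDFS already exposes a client-side "favored nodes" hint
(HDFS-2576, the mechanism HBase's favored-nodes work builds on) that gets
part of the way there. It is only a hint (the NameNode may ignore it, and
re-replication after a failure does not preserve it), but a sketch with
placeholder hostnames looks like this:

import java.net.InetSocketAddress;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesExample {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());

    // Placeholder DataNodes; 50010 is the default DataNode transfer port
    // (dfs.datanode.address). Creating every file of a group with the
    // same hint tends to co-locate their replicas on these machines.
    InetSocketAddress[] favored = {
        new InetSocketAddress("dn1.example.com", 50010),
        new InetSocketAddress("dn2.example.com", 50010),
        new InetSocketAddress("dn3.example.com", 50010)
    };

    FSDataOutputStream out = dfs.create(
        new Path("/data/group1/file1"),
        FsPermission.getFileDefault(),
        true,                  // overwrite
        4096,                  // buffer size
        (short) 3,             // replication
        128L * 1024 * 1024,    // block size
        null,                  // no progress callback
        favored);
    try {
      out.writeBytes("example payload");
    } finally {
      out.close();
    }
  }
}

A guaranteed mapping from an HDFS path to a fixed DataNode set would
still need the custom BlockPlacementPolicy map that Zhe mentions.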