hadoop-hdfs-dev mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: Controlling the block placement and the file placement in HDFS writes
Date Fri, 19 Dec 2014 22:24:40 GMT
Interesting - HDFS-6133 would directly help the HBase data locality use case.

On Fri, Dec 19, 2014 at 2:20 PM, Yongjun Zhang <yzhang@cloudera.com> wrote:

> Hi,
>
> FYI,
>
> A relevant JIRA, HDFS-6133, tries to tell the Balancer not to move the
> blocks stored on the favored nodes that the application selected. I
> reviewed the patch, and the latest one looks good to me. I hope some
> committers can pick it up and push it forward.
>
> Thanks.
>
> --Yongjun
>
>
> On Fri, Dec 19, 2014 at 1:52 PM, Ananth Gundabattula <
> agundabattula@gmail.com> wrote:
> >
> > Hello Zhe,
> >
> > Thanks a lot for the inputs. Storage policies are really what I was
> > looking for, for one of the problems.
> >
> > @Nick: I agree that it would be a nice feature to have. Thanks for the
> > info.
> >
> > Regards,
> > Ananth
> >
> > On Fri, Dec 19, 2014 at 10:49 AM, Nick Dimiduk <ndimiduk@gmail.com>
> wrote:
> >
> > > HBase would enjoy similar functionality. In our case, we'd like all
> > > replicas for all files in a given HDFS path to land on the same set of
> > > machines. That way, in the event of a failover, regions can be assigned
> > > to one of these other machines that has local access to all blocks for
> > > all region files.
> > >
> > > On Thu, Dec 18, 2014 at 3:36 PM, Zhe Zhang <
> zhe.zhang.research@gmail.com
> > >
> > > wrote:
> > > >
> > > > > The second aspect is that our queries are time based, and this time
> > > > > window follows a familiar pattern of old data not being queried
> > > > > much. Hence we would like to keep the most recent data in the HDFS
> > > > > cache (Impala is helping us manage this aspect via its command set),
> > > > > but we would like the next-most-recent chunks of data to land on an
> > > > > SSD that is present on every datanode. The remaining set of blocks,
> > > > > which are "very old but in large quantities", would land on spinning
> > > > > disks. The decision to choose a given volume is based on the file
> > > > > name, as we can control the filename that is used to generate the
> > > > > file.
> > > > >
> > > >
> > > > Have you tried the 'setStoragePolicy' command? It's part of the HDFS
> > > > "Heterogeneous Storage Tiers" work and seems to address your scenario.
> > > > > 1. Is there a way to control that all file blocks belonging to a
> > > > > particular hdfs directory & file go to the same physical datanode
> > > > > (and their corresponding replicas as well)?
> > > >
> > > > This seems inherently hard: the file/dir could have more data than a
> > > > single DataNode can host. Implementation-wise, it requires some sort
> > > > of a map in BlockPlacementPolicy from inode or file path to DataNode
> > > > address.
> > > >
> > > > My 2 cents..
> > > >
> > > > --
> > > > Zhe Zhang
> > > > Software Engineer, Cloudera
> > > > https://sites.google.com/site/zhezhangresearch/
> > > >
> > >
> >
>
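[Editor's note] The storage-policy suggestion in the thread above can be applied from the command line. A minimal sketch, assuming a Hadoop 2.7+ cluster whose DataNodes have both SSD and DISK storage types configured; the paths and policy choices below are hypothetical, chosen to mirror the tiering described in the thread:

```shell
# Pin the most recent partition to SSD (ALL_SSD places every replica on SSD;
# ONE_SSD places one replica on SSD and the rest on DISK).
hdfs storagepolicies -setStoragePolicy -path /data/2014/12 -policy ALL_SSD

# Let an older partition fall back to spinning disks (HOT = all replicas
# on DISK, the default).
hdfs storagepolicies -setStoragePolicy -path /data/2014/01 -policy HOT

# Confirm what was applied.
hdfs storagepolicies -getStoragePolicy -path /data/2014/12
```

Note that setting a policy affects newly written blocks; existing blocks are moved to the right storage type only when the Mover tool is run.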

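[Editor's note] The favored-nodes placement that HDFS-6133 builds on is exposed through a `create()` overload on `DistributedFileSystem` (added by HDFS-2576). A minimal sketch, with hypothetical host names and file paths; a running Hadoop 2.x cluster and its client jars are required, and the favored nodes are a hint to the NameNode, not a guarantee:

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class FavoredNodesExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);

        // Hypothetical DataNodes to favor for all replicas of this file.
        InetSocketAddress[] favoredNodes = {
            new InetSocketAddress("dn1.example.com", 50010),
            new InetSocketAddress("dn2.example.com", 50010),
            new InetSocketAddress("dn3.example.com", 50010)
        };

        // The favored-nodes create() overload: the NameNode will try to
        // place replicas on these nodes if they have capacity.
        try (FSDataOutputStream out = dfs.create(
                new Path("/hbase/region-x/file"),
                FsPermission.getFileDefault(),
                true,                        // overwrite
                conf.getInt("io.file.buffer.size", 4096),
                (short) 3,                   // replication factor
                dfs.getDefaultBlockSize(),
                null,                        // no progress callback
                favoredNodes)) {
            out.write("placed near the favored nodes".getBytes("UTF-8"));
        }
    }
}
```

This is the hint that the thread notes the Balancer could undo, which is what HDFS-6133 aims to prevent.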