hadoop-common-dev mailing list archives

From Wei-Chiu Chuang <weic...@apache.org>
Subject Re: [DISCUSS] HDFS roadmap/wish list
Date Thu, 13 Jun 2019 18:40:47 GMT
Thank you. I really appreciate your feedback as I don't always know the
detailed use case for a feature. (For me, it's mostly "hey, this thing is
broken, fix it")

What does the rest of the community think? This is a great opportunity to
share your views.

My answers inline:

On Wed, Jun 12, 2019 at 1:12 AM Julien Laurenceau <
julien.laurenceau@pepitedata.com> wrote:

> Hi,
>
> I am not sure whether these are already on a roadmap or supported, but
> I would appreciate these two features:
>
> - First feature: I would like to be able to use a dedicated directory in
> HDFS as a /tmp directory backed by RAMFS, for high-performance
> checkpointing of Spark jobs without using Alluxio or Ignite.
>
> My current issue is that the RAMFS is only useful with replication factor
> x1 (in order to avoid network traffic).
> My default replication factor is x3, but I would need a way to set
> replication factor x1 on a specific directory (/tmp) for all new writes
> coming to this directory.
> Currently, if I use "hdfs dfs -setrep 1 /tmp", it only applies to files
> already written.
> For example, this could be done by specifying the replication factor at
> the storage policy level.
> In my view, this would make the Lazy-persist storage policy much more
> useful.
>

I am told LAZY_PERSIST was never considered a completed feature, and two
Hadoop distros, CDH and HDP, don't support it.

But now that I understand the use case, it does look useful.

> From the doc:
> Note 1: The Lazy_Persist policy is useful only for single
> replica blocks. For blocks with more than one replicas, all the replicas
> will be written to DISK since writing only one of the replicas to RAM_DISK
> does not improve the overall performance.
> With HDFS as currently configurable, I only see the following hack
> (not tested) to implement such a solution: configure HDFS replication x1
> as the default and use Erasure Coding RS(6,3) for the main storage by
> setting an erasure coding policy on all directories except /tmp.
>
> hdfs ec -setPolicy -path <directory> [-policy <policyName>]
>
>
>
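
For what it's worth, here is a rough sketch of the storage arithmetic behind
this workaround. The function names are made up for illustration; the only
inputs are the replication factors and the RS(6,3) layout mentioned above.

```python
# Raw storage consumed per byte of user data, and how many lost
# copies/units each scheme survives. Names are illustrative only.

def replication_overhead(factor: int) -> float:
    """N-way replication stores N full copies of every block."""
    return float(factor)

def replication_tolerance(factor: int) -> int:
    """Data survives as long as one replica remains."""
    return factor - 1

def rs_overhead(data_units: int, parity_units: int) -> float:
    """RS(k, m) stores k data units plus m parity units per stripe."""
    return (data_units + parity_units) / data_units

def rs_tolerance(parity_units: int) -> int:
    """RS(k, m) can rebuild a stripe after losing up to m units."""
    return parity_units

if __name__ == "__main__":
    schemes = [
        ("replication x3", replication_overhead(3), replication_tolerance(3)),
        ("replication x1 (/tmp)", replication_overhead(1), replication_tolerance(1)),
        ("RS(6,3)", rs_overhead(6, 3), rs_tolerance(3)),
    ]
    for name, overhead, tolerated in schemes:
        print(f"{name}: {overhead:.2f}x raw storage, tolerates {tolerated} losses")
```

So the main storage would go from 3x to 1.5x raw usage, while anything left
at the x1 default (here /tmp) has no redundancy at all.
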
> - Second feature: a bandwidth throttling dedicated to the re-replication
> in case of a failed datanode.
> Something similar to the option dedicated to the balancing algorithm,
> dfs.datanode.balance.bandwidthPerSec, but only for re-replication.
>
I am pretty sure people have asked me about this a few times before.
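
To make the trade-off concrete, here is a back-of-the-envelope sketch of how
a per-datanode cap on re-replication traffic would bound recovery time.
Everything in it is hypothetical: the function, the cluster size, and the cap
are illustrative numbers, not an existing HDFS setting, and it assumes the
work is spread evenly across the surviving nodes.

```python
def min_rereplication_hours(lost_tb: float, surviving_nodes: int,
                            cap_mb_per_s: float) -> float:
    """Lower bound on the hours needed to re-create `lost_tb` of replicas
    when each surviving node may send at most `cap_mb_per_s`.

    Uses decimal units (1 TB = 1,000,000 MB) and ignores scheduling
    overhead, so real recovery would take longer.
    """
    if surviving_nodes <= 0 or cap_mb_per_s <= 0:
        raise ValueError("surviving_nodes and cap_mb_per_s must be positive")
    lost_mb = lost_tb * 1_000_000
    aggregate_mb_per_s = surviving_nodes * cap_mb_per_s
    return lost_mb / aggregate_mb_per_s / 3600

if __name__ == "__main__":
    # Hypothetical cluster: a failed datanode held 36 TB, 20 nodes remain.
    for cap in (25, 50, 100):
        hours = min_rereplication_hours(36, 20, cap)
        print(f"cap {cap} MB/s per node -> at least {hours:.1f} h")
```

A lower cap protects foreground traffic but lengthens the window in which
blocks stay under-replicated.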

>
> Thanks and regards
> JL
>
> On Mon, Jun 10, 2019 at 19:08, Wei-Chiu Chuang <weichiu@cloudera.com.invalid>
> wrote:
>
>> Hi!
>>
>> I am soliciting feedback on HDFS roadmap items and wish-list items for
>> future Hadoop releases. A community meetup
>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>> is happening soon, and perhaps we can use this thread to converge on things
>> we should talk about there.
>>
>> I am aware of several major features that were merged into trunk, such as
>> RBF and Consistent Standby Serving Reads, as well as some recent features
>> that were merged into the 3.2.0 release (storage policy satisfier).
>>
>> What else should we be doing? I have a laundry list of supportability
>> improvement projects, mostly about improving performance or making
>> performance diagnostics easier. I can share the list if folks are
>> interested.
>>
>> Are there things we should do to make developers' lives easier or things
>> that would be nice to have for downstream applications? I know Sahil Takiar
>> made a series of improvements in HDFS for Impala recently, and those
>> improvements are applicable to other downstreamers such as HBase. Or would
>> it help if we provide more Hadoop API examples?
>>
>
