hadoop-hdfs-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Arpit Agarwal <aagar...@hortonworks.com>
Subject Re: [VOTE] Merge HDFS-6581 to trunk - Writing to replicas in memory.
Date Wed, 24 Sep 2014 23:03:11 GMT
> I would appreciate in the future if benchmarks were posted
> a day or two before a merge vote of a performance improvement was
> suggested.
> I feel that it would have been better to float the idea of a
> merge on the JIRA before actually calling it,

Colin, I noted my intention to start the merge vote on the Jira last week.
In response you asked for verification that read performance will not
regressed. Read results were posted prior to starting the vote. I agree
that the write numbers were not posted until one day after the merge vote
started. If we delay the vote expiration by a day will it help address any
remaining timing concerns?


> There are two areas where I think we need more clarification.  The
> first is whether other eviction strategies can be implemented besides
> LRU.

If 2Q or another scheme requires more hooks we can certainly add to the
pluggable interface. It is not a public interface that is set in stone. It
is similar to BlockPlacementPolicy and VolumeChoosingPolicy. Both are
HDFS-internal interfaces with alternative implementations that we rev
frequently.


> The other area is how the memory management for HDFS-6581 fits in with
> the memory management for HDFS-4949.  I am very concerned that this
> has been handwaved away as future work. If we create a configuration
> mess for system administrators as a result, I will be sad.

You asked about memory management on 8/21. I responded the same day stating
that we are not introducing any configuration settings since mounting the
disk is done by an administrator. Your response did not indicate any
concern with this approach. You can see the comment history on HDFS-6581. So
I am surprised you raise it as a blocker one month later. We have not
introduced new limits as part of this change so there is no concern of
future config incompatibilities. The Cache Executor and Lazy Writer can be
updated to be aware of the memory usage of each other in a compatible way.

What we are voting on here is merging to trunk. If you have additional and
reasonable concerns that you would like to see addressed prior to 2.6 we
can have a separate discussion about 2.6 blockers.

Regards,
Arpit


On Wed, Sep 24, 2014 at 2:19 PM, Colin McCabe <cmccabe@alumni.cmu.edu>
wrote:

> On Wed, Sep 24, 2014 at 11:12 AM, Suresh Srinivas
> <suresh@hortonworks.com> wrote:
> > Features are done in a separate branch for two reasons: 1) During a
> feature
> > development the branch may be not functional 2) The high level approach
> and
> > design is not very clear and development can continue while that is being
> > sorted out.
>
> Up until the last two days, we had no benchmarks at all, so we didn't
> have any way to evaluate whether this performance improvement was
> functional.  Andrew commented on this as well in this thread, and we
> also raised the issue on the JIRA.  I am glad that a few benchmarks
> have finally been posted.  I would appreciate in the future if
> benchmarks were posted a day or two before a merge vote of a
> performance improvement was suggested.  As it is, it feels like we are
> racing the clock right now to figure out how well this works, and it
> puts the reviewers in an unpleasant position.
>
> >In case of this feature, clearly (2) is not an issue. We have
> > had enough discussion about the approach. I also think this branch is
> ready
> > for merge without rendering trunk not functional.
>
> I agree that this can be merged without rendering trunk
> non-functional.  I don't agree that we have achieved consensus on all
> the high-level approach and design.
>
> There are two areas where I think we need more clarification.  The
> first is whether other eviction strategies can be implemented besides
> LRU.  Yes, there is a pluggable interface, but I am concerned that it
> may not be flexible enough to implement anything besides LRU.  I am
> working on a patch now that implements 2Q.  I will find out for
> myself, I guess.
>
> The other area is how the memory management for HDFS-6581 fits in with
> the memory management for HDFS-4949.  I am very concerned that this
> has been handwaved away as future work.  If we create a configuration
> mess for system administrators as a result, I will be sad.  This is
> the kind of design that I would have liked to see done a while ago.  I
> would certainly veto any merge to 2.6 until we have figured this out.
>
> I commented about this a month ago:
>
> https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14106025&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14106025
> But my concerns were not addressed in the branch.
>
> If these two design clarifications can be made in time for the merge,
> I will be -0 on this.  I have opened HDFS-7141 and HDFS-7142 to
> continue the discussion on these points.
>
> > That leaves us with one
> > objection that is being raised; the content is not complete. This did not
> > prevent recently merged crypto file system work from being merged to
> trunk.
> > We discussed and decided that it can be merged to trunk and we will
> finish
> > the remaining work that is a must have in trunk before merging to
> branch-2.
> > That is a reasonable approach for this feature as well. In fact there
> are a
> > lot of bug fixes and improvements still happening on crypto work even
> after
> > branch-2 merge (which is a good thing in my opinion, it shows that the
> > testing is continuing and feature is still being improved).
>
> I am OK with merging this before it is complete.  I was just pointing
> out that this could have benefited from more time in development.  It
> would have been a much smoother merge and less stressful for everyone.
>
> >
> > As regards to the eviction policy we have had discussion in the jira. I
> > disagree that there exists a "usable" or "better" strategy that works
> well
> > for all the work loads. I also disagree that additional policy to the
> > existing LRU is needed, leave alone it being needed for trunk merge. New
> > eviction policies need to be developed as people experiment with it.
> Hence
> > making the eviction policy pluggable is great way to approach it.
>
> LRU is a poor policy for scan workloads.  For example, see here:
> http://www.cse.iitd.ernet.in/~sbansal/os/lec/l30.html
> "In general, LRU works well, but there are certain workloads where LRU
> performs very poorly. One well-known workload is a "scan" wherein a
> large memory region (larger than physical memory) is accessed in quick
> succession. e.g... Such scans are often quite common, as some thread
> may want to go through all its data. And so, plain LRU (or CLOCK) does
> not work as well in practice. Instead a variant of LRU that considers
> both recency and frequency of access to a page, is typically used in
> an OS (e.g., 2Q, CLOCK with Adaptive Replacement, etc.)."
>
> So is your argument that we are not using this for scanning workloads?
>  That seems tough to believe, and as I said earlier, you have posted
> no system-level benchmarks, so how can I evaluate that claim?
>
> > There are some testing information posted on the jira, such as
> performance.
> > But I think we need more details on how this feature is being tested.
> > Arpit, can you please post test details. As for me, given how much work
> > (both discussion and coding) has gone into it, and given it is a very
> > important feature that allows experimentation to use memory tier, I would
> > like to see this available in release 2.6.
>
> Just to reiterate, I am -0 provided we can address HDFS-7141 and
> HDFS-7142 before this gets set in stone.  I would hate to -1 this,
> because it would mean that you could not call another vote for a week.
> But I feel that it would have been better to float the idea of a merge
> on the JIRA before actually calling it, to avoid having discussions
> like this where we are racing the clock.
>
> thanks,
> Colin
>
> >
> > On Tue, Sep 23, 2014 at 6:09 PM, Colin McCabe <cmccabe@alumni.cmu.edu>
> > wrote:
> >
> >> This seems like a really aggressive timeframe for a merge.  We still
> >> haven't implemented:
> >>
> >> * Checksum skipping on read and write from lazy persisted replicas.
> >> * Allowing mmaped reads from the lazy persisted data.
> >> * Any eviction strategy other than LRU.
> >> * Integration with cache pool limits (how do HDFS-4949 and lazy
> >> persist replicas share memory)?
> >> * Eviction from RAM disk via truncation (HDFS-6918)
> >> * Metrics
> >> * System testing to find out how useful this is, and what the best
> >> eviction strategy is.
> >>
> >> I see why we might want to defer checksum skipping, metrics, allowing
> >> mmap, eviction via truncation, and so forth until later.  But I feel
> >> like we need to figure out how this will integrate with the memory
> >> used by HDFS-4949 before we merge.  I also would like to see another
> >> eviction strategy other than LRU, which is a very poor eviction
> >> strategy for scanning workloads.  I mentioned this a few times on the
> >> JIRA.
> >>
> >> I'd also like to get some idea of how much testing this has received
> >> in a multi-node cluster.  What makes us confident that this is the
> >> right time to merge, rather than in a week or two?
> >>
> >> best,
> >> Colin
> >>
> >>
> >> On Tue, Sep 23, 2014 at 4:55 PM, Arpit Agarwal <
> aagarwal@hortonworks.com>
> >> wrote:
> >> > I have posted write benchmark results to the Jira.
> >> >
> >> > On Tue, Sep 23, 2014 at 3:41 PM, Arpit Agarwal <
> aagarwal@hortonworks.com
> >> >
> >> > wrote:
> >> >
> >> >> Hi Andrew, I said "it is not going to be a substantial fraction of
> >> memory
> >> >> bandwidth". That is certainly not the same as saying it won't be
> good or
> >> >> there won't be any improvement.
> >> >>
> >> >> Any time you have transfers over RPC or the network stack you will
> not
> >> get
> >> >> close to the memory bandwidth even for intra-host transfers.
> >> >>
> >> >> I'll add some micro-benchmark results to the Jira shortly.
> >> >>
> >> >> Thanks,
> >> >> Arpit
> >> >>
> >> >> On Tue, Sep 23, 2014 at 2:33 PM, Andrew Wang <
> andrew.wang@cloudera.com>
> >> >> wrote:
> >> >>
> >> >>> Hi Arpit,
> >> >>>
> >> >>> Here is the comment. It was certainly not my intention to misquote
> >> anyone.
> >> >>>
> >> >>>
> >> >>>
> >>
> https://issues.apache.org/jira/browse/HDFS-6581?focusedCommentId=14138223&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14138223
> >> >>>
> >> >>> Quote:
> >> >>>
> >> >>> It would be nice to see that would could get a substantial fraction
> of
> >> >>> memory bandwidth when writing to a single replica in-memory.
> >> >>>
> >> >>> The comparison will be interesting but I can tell you without
> >> measurement
> >> >>> it is not going to be a substantial fraction of memory bandwidth.
We
> >> are
> >> >>> still going through DataTransferProtocol with all the copies and
> >> overhead
> >> >>> that involves.
> >> >>>
> >> >>> When the goal is in-memory writes and we are unable to achieve
a
> >> >>> substantial fraction of memory bandwidth, to me that is "not good
> >> >>> performance."
> >> >>>
> >> >>> I also looked through the subtasks, and AFAICT the only one related
> to
> >> >>> improving this is deferring checksum computation. The benchmarking
> we
> >> did
> >> >>> on HDFS-4949 showed that this only really helps when you're down
to
> >> single
> >> >>> copy or zero copies with SCR/ZCR. DTP reads didn't see much of
an
> >> >>> improvement, so I'd guess the same would be true for DTP writes.
> >> >>>
> >> >>> I think my above three questions are still open, as well as my
> question
> >> >>> about why we're merging now, as opposed to when the performance
of
> the
> >> >>> branch is proven out.
> >> >>>
> >> >>> Thanks,
> >> >>> Andrew
> >> >>>
> >> >>> On Tue, Sep 23, 2014 at 2:10 PM, Arpit Agarwal <
> >> aagarwal@hortonworks.com>
> >> >>> wrote:
> >> >>>
> >> >>> > Andrew, don't misquote me. Can you link the comment where
I said
> >> >>> > performance wasn't going to be good?
> >> >>> >
> >> >>> > I will add some add some preliminary write results to the
Jira
> later
> >> >>> today.
> >> >>> >
> >> >>> > > What's the plan to improve write performance?
> >> >>> > I described this in response to your and Colin's comments
on the
> >> Jira.
> >> >>> >
> >> >>> > For the benefit of folks not following the Jira, the immediate
> task
> >> we'd
> >> >>> > like to get done post-merge is moving checksum computation
off the
> >> write
> >> >>> > path. Also see open subtasks of HDFS-6581 for other planned
perf
> >> >>> > improvements.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Arpit
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Sep 23, 2014 at 1:07 PM, Andrew Wang <
> >> andrew.wang@cloudera.com>
> >> >>> > wrote:
> >> >>> >
> >> >>> > > Hi Arpit,
> >> >>> > >
> >> >>> > > On HDFS-6581, I asked for write benchmarks on Sep 19th,
and you
> >> >>> responded
> >> >>> > > that the performance wasn't going to be good. However,
I thought
> >> the
> >> >>> > > primary goal of this JIRA was to improve write performance,
and
> >> write
> >> >>> > > performance is listed as the first feature requirement
in the
> >> design
> >> >>> doc.
> >> >>> > >
> >> >>> > > So, this leads me to a few questions, which I also asked
last
> week
> >> on
> >> >>> the
> >> >>> > > JIRA (I believe still unanswered):
> >> >>> > >
> >> >>> > > - What's the plan to improve write performance?
> >> >>> > > - What kind of performance can we expect after the plan
is
> >> completed?
> >> >>> > > - Can this expected performance be validated with a prototype?
> >> >>> > >
> >> >>> > > Even with these questions answered, I don't understand
the need
> to
> >> >>> merge
> >> >>> > > this before the write optimization work is completed.
Write
> perf is
> >> >>> > listed
> >> >>> > > as a feature requirement, so the branch can reasonably
be called
> >> not
> >> >>> > > feature complete until it's shown to be faster.
> >> >>> > >
> >> >>> > > Thanks,
> >> >>> > > Andrew
> >> >>> > >
> >> >>> > > On Tue, Sep 23, 2014 at 11:47 AM, Jitendra Pandey <
> >> >>> > > jitendra@hortonworks.com>
> >> >>> > > wrote:
> >> >>> > >
> >> >>> > > > +1. I have reviewed most of the code in the branch,
and I
> think
> >> its
> >> >>> > ready
> >> >>> > > > to be merged to trunk.
> >> >>> > > >
> >> >>> > > >
> >> >>> > > > On Mon, Sep 22, 2014 at 5:24 PM, Arpit Agarwal <
> >> >>> > aagarwal@hortonworks.com
> >> >>> > > >
> >> >>> > > > wrote:
> >> >>> > > >
> >> >>> > > > > HDFS Devs,
> >> >>> > > > >
> >> >>> > > > > We propose merging the HDFS-6581 development
branch to
> trunk.
> >> >>> > > > >
> >> >>> > > > > The work adds support to write to HDFS blocks
in memory. The
> >> >>> target
> >> >>> > use
> >> >>> > > > > case covers applications writing relatively
small,
> intermediate
> >> >>> data
> >> >>> > > sets
> >> >>> > > > > with low latency. We introduce a new CreateFlag
for the
> >> existing
> >> >>> > > > CreateFile
> >> >>> > > > > API. HDFS will subsequently attempt to place
replicas of
> file
> >> >>> blocks
> >> >>> > in
> >> >>> > > > > local memory with disk writes occurring off
the hot path.
> The
> >> >>> current
> >> >>> > > > > design is a simplification of original ideas
from Sanjay
> Radia
> >> on
> >> >>> > > > > HDFS-5851.
> >> >>> > > > >
> >> >>> > > > > Key goals of the feature were minimal API changes
to reduce
> >> >>> > application
> >> >>> > > > > burden and best effort data durability. The
feature is
> optional
> >> >>> and
> >> >>> > > > > requires appropriate DN configuration from
administrators.
> >> >>> > > > >
> >> >>> > > > > Design doc:
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>>
> >>
> https://issues.apache.org/jira/secure/attachment/12661926/HDFSWriteableReplicasInMemory.pdf
> >> >>> > > > >
> >> >>> > > > > Test plan:
> >> >>> > > > >
> >> >>> > > > >
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>>
> >>
> https://issues.apache.org/jira/secure/attachment/12669452/Test-Plan-for-HDFS-6581-Memory-Storage.pdf
> >> >>> > > > >
> >> >>> > > > > There are 28 resolved sub-tasks under HDFS-6581,
3 open
> tasks
> >> for
> >> >>> > > > > tests+Jenkins issues  and 7 open subtasks tracking
planned
> >> >>> > > improvements.
> >> >>> > > > > The latest merge patch is 3300 lines of changed
code of
> which
> >> 1300
> >> >>> > > lines
> >> >>> > > > is
> >> >>> > > > > new and updated tests. Merging the branch to
trunk will
> allow
> >> HDFS
> >> >>> > > > > applications to start evaluating the feature.
We will
> continue
> >> >>> work
> >> >>> > on
> >> >>> > > > > documentation, performance tuning and metrics
in parallel
> with
> >> the
> >> >>> > vote
> >> >>> > > > and
> >> >>> > > > > post-merge.
> >> >>> > > > >
> >> >>> > > > > Contributors to design and code include Xiaoyu
Yao, Sanjay
> >> Radia,
> >> >>> > > > Jitendra
> >> >>> > > > > Pandey, Tassapol Athiapinya, Gopal V, Bikas
Saha, Vikram
> Dixit,
> >> >>> > Suresh
> >> >>> > > > > Srinivas and Chris Nauroth.
> >> >>> > > > >
> >> >>> > > > > Thanks to Haohui Mai, Colin Patrick McCabe,
Andrew Wang,
> Todd
> >> >>> Lipcon,
> >> >>> > > > Eric
> >> >>> > > > > Baldeschwieler and Vinayakumar B for providing
useful
> feedback
> >> on
> >> >>> > > > > HDFS-6581, HDFS-5851 and sub-tasks.
> >> >>> > > > >
> >> >>> > > > > The vote runs for the usual 7 days and will
expire at 12am
> PDT
> >> on
> >> >>> Sep
> >> >>> > > 30.
> >> >>> > > > > Here is my +1 for the merge.
> >> >>> > > > >
> >> >>> > > > > Regards,
> >> >>> > > > > Arpit
> >> >>> > > > >
> >> >>> > > > > --
> >> >>> > > > > CONFIDENTIALITY NOTICE
> >> >>> > > > > NOTICE: This message is intended for the use
of the
> individual
> >> or
> >> >>> > > entity
> >> >>> > > > to
> >> >>> > > > > which it is addressed and may contain information
that is
> >> >>> > confidential,
> >> >>> > > > > privileged and exempt from disclosure under
applicable law.
> If
> >> the
> >> >>> > > reader
> >> >>> > > > > of this message is not the intended recipient,
you are
> hereby
> >> >>> > notified
> >> >>> > > > that
> >> >>> > > > > any printing, copying, dissemination, distribution,
> disclosure
> >> or
> >> >>> > > > > forwarding of this communication is strictly
prohibited. If
> you
> >> >>> have
> >> >>> > > > > received this communication in error, please
contact the
> sender
> >> >>> > > > immediately
> >> >>> > > > > and delete it from your system. Thank You.
> >> >>> > > > >
> >> >>> > > >
> >> >>> > > >
> >> >>> > > >
> >> >>> > > > --
> >> >>> > > > <http://hortonworks.com/download/>
> >> >>> > > >
> >> >>> > > > --
> >> >>> > > > CONFIDENTIALITY NOTICE
> >> >>> > > > NOTICE: This message is intended for the use of
the
> individual or
> >> >>> > entity
> >> >>> > > to
> >> >>> > > > which it is addressed and may contain information
that is
> >> >>> confidential,
> >> >>> > > > privileged and exempt from disclosure under applicable
law. If
> >> the
> >> >>> > reader
> >> >>> > > > of this message is not the intended recipient, you
are hereby
> >> >>> notified
> >> >>> > > that
> >> >>> > > > any printing, copying, dissemination, distribution,
> disclosure or
> >> >>> > > > forwarding of this communication is strictly prohibited.
If
> you
> >> have
> >> >>> > > > received this communication in error, please contact
the
> sender
> >> >>> > > immediately
> >> >>> > > > and delete it from your system. Thank You.
> >> >>> > > >
> >> >>> > >
> >> >>> >
> >> >>> > --
> >> >>> > CONFIDENTIALITY NOTICE
> >> >>> > NOTICE: This message is intended for the use of the individual
or
> >> >>> entity to
> >> >>> > which it is addressed and may contain information that is
> >> confidential,
> >> >>> > privileged and exempt from disclosure under applicable law.
If the
> >> >>> reader
> >> >>> > of this message is not the intended recipient, you are hereby
> >> notified
> >> >>> that
> >> >>> > any printing, copying, dissemination, distribution, disclosure
or
> >> >>> > forwarding of this communication is strictly prohibited. If
you
> have
> >> >>> > received this communication in error, please contact the sender
> >> >>> immediately
> >> >>> > and delete it from your system. Thank You.
> >> >>> >
> >> >>>
> >> >>
> >> >>
> >> >
> >> > --
> >> > CONFIDENTIALITY NOTICE
> >> > NOTICE: This message is intended for the use of the individual or
> entity
> >> to
> >> > which it is addressed and may contain information that is
> confidential,
> >> > privileged and exempt from disclosure under applicable law. If the
> reader
> >> > of this message is not the intended recipient, you are hereby notified
> >> that
> >> > any printing, copying, dissemination, distribution, disclosure or
> >> > forwarding of this communication is strictly prohibited. If you have
> >> > received this communication in error, please contact the sender
> >> immediately
> >> > and delete it from your system. Thank You.
> >>
> >
> >
> >
> > --
> > http://hortonworks.com/download/
> >
> > --
> > CONFIDENTIALITY NOTICE
> > NOTICE: This message is intended for the use of the individual or entity
> to
> > which it is addressed and may contain information that is confidential,
> > privileged and exempt from disclosure under applicable law. If the reader
> > of this message is not the intended recipient, you are hereby notified
> that
> > any printing, copying, dissemination, distribution, disclosure or
> > forwarding of this communication is strictly prohibited. If you have
> > received this communication in error, please contact the sender
> immediately
> > and delete it from your system. Thank You.
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message