hadoop-hdfs-dev mailing list archives

From Scott Carey <sc...@richrelevance.com>
Subject Re: [DISCUSS] Remove append?
Date Mon, 26 Mar 2012 20:53:09 GMT

On 3/26/12 12:53 PM, "Colin McCabe" <cmccabe@alumni.cmu.edu> wrote:

>On Fri, Mar 23, 2012 at 7:44 PM, Scott Carey <scott@richrelevance.com> wrote:
>> On 3/22/12 10:25 AM, "Eli Collins" <eli@cloudera.com> wrote:
>>>On Thu, Mar 22, 2012 at 1:26 AM, Konstantin Shvachko
>>><shv.hadoop@gmail.com> wrote:
>>>> Eli,
>>>> I went over the entire discussion on the topic, and did not get it. Is
>>>> there a problem with append? We know it does not work in hadoop-1,
>>>> only flush() does. Is there anything wrong with the new append
>>>> (HDFS-265)? If so please file a bug.
>>>> I tested it in Hadoop-0.22 branch it works fine.
>>>> I agree with people who were involved with the implementation of the
>>>> new append that the complexity is mainly in
>>>> 1. pipeline recovery
>>>> 2. consistent client reading while writing, and
>>>> 3. hflush()
>>>> Once that is done, append itself, which is the reopening of previously
>>>> closed files to add data, is not complex.
>>>I agree that much of the complexity is in #1-3 above, which is why
>>>HDFS-265 is leveraged.
>>>The primary simplicity of not having append (and truncate) comes from
>>>being able to rely on the invariant that finalized blocks are immutable,
>>>that blocks once written won't, e.g., shrink in size (which we assume today).
>> That invariant can co-exist with append via copy-on-write.  The new
>> and old state would co-exist until the old state was no longer needed;
>> the block map would have to use a persistent data structure.  Copy-on-write
>> semantics for blocks in file systems is all the rage these days: free
>> snapshots, atomic transactions for operations on multiple blocks, etc.
>Hi Scott,
>If a client accesses a file, and then the client becomes unresponsive,
>how long should you wait before declaring the blocks he was looking at
>no longer referenced?  No matter how long or how short a period you
>choose, someone will argue with it.

How long does the NN wait now?  What happens today if a client is reading
a file, becomes unresponsive, and another client then deletes the file?  At
some point the NN has to unlock the file and allow the delete.
If you choose locking, you face the question of when to expire a lock.  With
MVCC you face the question of when to retire a reference.  It is the exact
same problem.

>And having to track this kind of state in the
>NameNode introduces a huge amount of complexity, not to mention extra
>memory consumption.  Basically, we would have to track the ID of every
>block that any client looked at, at all times.

There are simple, almost trivial solutions.  java.lang.ref.WeakReference
makes it trivial to track when an object (a block reference) is no longer
referenced by client objects, so that it can be logged as dead.  Persistent
data structures make it truly trivial to reference only exactly what is
visible to open transactions.  I strongly feel that the result would be
far fewer lines of code and far less complexity.
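A minimal sketch of the WeakReference idea: hold block references weakly, and let a ReferenceQueue report when no client object refers to a block any more.  The Block class here is a stand-in, not an actual HDFS type, and GC timing is JVM-dependent, so the loop below only polls with a bound:

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;

// Sketch: a ReferenceQueue is notified once the collector clears a
// WeakReference, i.e. once no strong reference to the block remains.
public class WeakBlockTracking {
    static class Block {
        final long id;
        Block(long id) { this.id = id; }
    }

    public static void main(String[] args) throws InterruptedException {
        ReferenceQueue<Block> deadQueue = new ReferenceQueue<>();
        Block b = new Block(42);
        WeakReference<Block> ref = new WeakReference<>(b, deadQueue);

        b = null;                       // drop the last strong reference
        boolean dead = false;
        for (int i = 0; i < 200 && !dead; i++) {
            System.gc();                // a hint only; collection timing is up to the JVM
            Thread.sleep(10);
            dead = (deadQueue.poll() != null);
        }
        // On typical JVMs this reports true once the reference is enqueued.
        System.out.println("block reference is dead: " + dead);
    }
}
```

In a real NN the enqueued reference would be the trigger to "log as dead" the block the departed client was looking at, replacing explicit per-client bookkeeping.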

Solutions for the sort of data structures required have been worked out by
others over the last 35 years, mostly for functional languages, and there
is still plenty of innovation; the immutable bitmapped vector trie is a
powerful and fascinating example.  The following presentation is
excellent, and covers the sort of data structures that solve the problems
you list above without the complexity that would be required if the NN
block map were an ephemeral data structure:

In addition to allowing for atomic transaction batches and lockless file
access, file system snapshots become trivial as well -- they are
equivalent to a permanently open transaction.  The space needed for such a
snapshot is proportional to the delta between the snapshot and the current
state.
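The "snapshot = permanently open transaction" idea can be sketched with any path-copying persistent structure.  This toy persistent sorted map is illustrative only (it is not the NN block map, and a real implementation would use a balanced tree or a trie): an update copies just the path from the root to the changed node, so an old root remains a valid immutable snapshot whose cost is proportional to the delta.

```java
// Sketch: path-copying makes old roots into free, immutable snapshots.
public class PersistentTreeSketch {
    static final class Node {
        final long key; final String value; final Node left, right;
        Node(long key, String value, Node left, Node right) {
            this.key = key; this.value = value; this.left = left; this.right = right;
        }
    }

    // Insert by copying only the root-to-node path; the old root is untouched.
    static Node put(Node n, long key, String value) {
        if (n == null)   return new Node(key, value, null, null);
        if (key < n.key) return new Node(n.key, n.value, put(n.left, key, value), n.right);
        if (key > n.key) return new Node(n.key, n.value, n.left, put(n.right, key, value));
        return new Node(key, value, n.left, n.right);   // replace value, sharing both subtrees
    }

    static String get(Node n, long key) {
        if (n == null)   return null;
        if (key < n.key) return get(n.left, key);
        if (key > n.key) return get(n.right, key);
        return n.value;
    }

    public static void main(String[] args) {
        Node v1 = put(put(null, 1, "blk_1"), 2, "blk_2");
        Node snapshot = v1;                    // a snapshot is just a retained root
        Node v2 = put(v1, 2, "blk_2_rewritten");
        System.out.println(get(snapshot, 2));  // prints blk_2 (old view unchanged)
        System.out.println(get(v2, 2));        // prints blk_2_rewritten (new view)
    }
}
```

Only the copied path is new storage; everything unchanged is shared between snapshot and current state, which is why the snapshot's space cost is the delta.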

