hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4849) Idempotent create, append and delete operations.
Date Wed, 29 May 2013 10:59:27 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669149#comment-13669149

Steve Loughran commented on HDFS-4849:

bq. concurrency logic is always an issue in distributed storage.

Which is why I tend to defer to Lamport in such situations, here [Lamport78], and the notion
of a transient {{happens-before}} relation. This relation is antireflexive {{happens-before(A,A)
is false for all A}}, and antisymmetric {{happens-before(A,B) <=> !happens-before(B,A)
for all A, B}}. Currently HDFS provides happens-before guarantees, which is a guarantee which
applications rely on.

bq.  when B exists, which should replace B with contents of A. Suppose then that at the same
time client2 deletes file B. Since there is no guarantee which operation is executed first
you can either end up with A renamed to B, if the delete goes first, or with no files if the
rename prevails followed by the deletion of B. This is similar to your case. When client1
retries from its perspective delete was not completed, so it deletes again. And it is not
different from the case  when client1 is slow and executes delete after rename.

Cos, your proposed scenario has an initial state of both paths existing and two clients executing
un-coordinated requests against the filesystem.
0: exists(A),  exists(B) 
1. client1: mv(A,B)
2. client2: rm(B)

There are two orderings here, which each lead to two different outcomes
order1: happens-before(cient1, client2)

client1: mv(A,B) - failure
client2: rm(B) -success

result: exists(A), !exists(B) 

the rename fails, the data is where it was.
order2: happens-before( client2, client1)

client2: rm(B) -success 
client1: mv(A,B) - success 
result: !exists(A), exists(B) 

The rename has succeeded. 

In both cases, the NN has imposes a strict ordering the outcomes -and both clients know the
outcome. They may not know what happened after, but they both can infer what happened before.

Now, let's add the ability for operations that fail over the network to be retried, and, as
we cannot determine if the failure was before or after the operation was executed, re-execution.
This is what you appear to have proposed

order1: happens-before( client1, client2);
        initial rm(B) lost after execution and retried.

client1: mv(A,B) - failure 
client2: rm(B) -success/fail => result lost
client2: retry: rm(B) -success

result: exists(A), !exists(B) 

The outcome is as before: the happens-before ordering is the same. I think this is what you
are considering when we talk about idempotent deletes.
order2a: happens-before( client2, client1);
         initial rm(B) operation fails before execution
         retry happens before any client1 calls are executed.

client2: rm(B) -fail-
client2: retry rm(B) - success
client1: mv(A,B) - success
result: !exists(A), exists(B) 

Again, the retry has not affected the outcome, because the first delete operation never took
because the {{happens-before}} ordering is the same, client 2 actions executed before client

I believe these are the scenarios which you are considering when you say that making {{delete()}}
idempotent is harmless. The problem we have is the third ordering

order2b: client2 rm(B) operation is first and succeeds, but result lost
         retry after client1 operations

client2: rm(B) -succes, => result lost
client1: mv(A,B) - success => true
client2: retry rm(B) success 
result: !exists(A), !exists(B) 

Can you see what has happened here? The ordering {{happens-before(2,1)}} is no longer antisymmetric;
we have {{happens-before(2,1) && happens-before(1,2)}}. The outcome is that the rename
succeeded, {{!exists(A)}} now holds, and then the delete succeeded: {{!exists(B)}}. A new
final state of the operation sequence has been reached, resulting in "data lost" . *This is
the issue.*

bq. I see only one issue with delete that prevents it from being idempotent - it's the return
value, which must be true only if the deleted object existed and was actually deleted. This
cannot be guaranteed through retries. The semantics of delete should be that "object does
not exist after delete completes". This seems idempotent to me. The return value should be
treated as success or failure. Same as in mkdir.

The issue is not the final state of the object, but the fact that the filesystem state after
the first invocation is visible to other clients -and that they perform work which depends
on it. The real change of the semantics that you are proposing is {{an rm() call may be repeated
1+ times until the client finally receives a response -irrespective of what other operations
take place on the same path}}. It's not what the client sees, it is the external visibility
of the retried actions.

bq. My point is that if you need to coordinate clients you should do it with some external
tools, like ZK.

You know how MapReduce co-ordinates commits between speculating tasks, don't you? It relies
on {{rename()}} being atomic and failing if the destination path exists. It does this because
of the guarantees that the FS makes of any pair of atomic operations being serialized into
a strict order -either {{(1,2)}} or {{(2,1)}}. 

The proposal breaks this fundamental guarantee -a guarantee which allows client operations
to use the filesystem itself to co-ordinate actions that consist of single side-effecting
operations. Making delete idempotent to aid on MR job cleanup is unimportant if it can stop
the output from being committed reliably.

I am reasonably confident that the blobstores break these rules as rename is not-atomic; [HADOOP-9577]
is proof of this, [HADOOP-9565] the beginnings of a workaround, one which will require new
committers for MR jobs. Because today *the MR engine relies on atomic FS options to co-ordinate
loosely coupled nodes during is execution*.

This is why - and I'm stating this a warning before you invest significant time in implementing
retries - myself, and presumably others, will veto any change to the semantics of {{delete()}}
that breaks any part of the {{happens-before()}} relation, i.e. allows for the remote path
to be deleted more than once in a way that would be visible to other clients. 

Further reading [Lamport78: Time, Clocks and the Ordering of Events in a Distributed System|http://research.microsoft.com/en-us/um/people/lamport/pubs/time-clocks.pdf]
> Idempotent create, append and delete operations.
> ------------------------------------------------
>                 Key: HDFS-4849
>                 URL: https://issues.apache.org/jira/browse/HDFS-4849
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.0.4-alpha
>            Reporter: Konstantin Shvachko
>            Assignee: Konstantin Shvachko
> create, append and delete operations can be made idempotent. This will reduce chances
for a job or other app failures when NN fails over.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

View raw message