Subject: Hadoop Committers Meeting at Yahoo on append/flush/sync
From: stack <saint.ack@gmail.com>
To: hbase-dev@hadoop.apache.org
Date: Sat, 23 May 2009 16:01:51 -0700

A few of us went to a Hadoop Committers Meeting kindly hosted by Yahoo! yesterday. HBase was represented by Chad Walters, Jim Kellerman, Ryan Rawson, and myself. The rest of the attendees were a bunch of the Y! HDFS team, plus meeting leader MapReducer Owen O'Malley, along with Facebookees (Dhruba, Ashish, etc.) and Luke Liu of HyperTable/Zvents.

The meeting topic was append/flush/sync in HDFS. After some back and forth over a set of slides presented by Sanjay on the work being done by Hairong as part of HADOOP-5744, "Revising append", the room settled on API3 from the list of options below as the priority feature needed in Hadoop 0.21.0: readers must be able to read up to the writer's last 'successful' flush, and it is not important that the reported file length be inexact in the meantime.
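To make the API3 semantics concrete, here is a rough writer/reader sketch against the FileSystem API. It assumes the call lands on FSDataOutputStream under the 'hflush' name HADOOP-5744 proposes (still provisional), and the path is made up:

    // Sketch only: assumes HADOOP-5744's 'hflush' lands on FSDataOutputStream
    // with API3 semantics (visible to new readers once the call returns).
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HflushSketch {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/hbase/wal-sketch");  // hypothetical WAL path

        FSDataOutputStream out = fs.create(log);
        byte[] edit = "put row1/cf:col".getBytes();
        out.write(edit);
        out.hflush();  // API3: returns once every replica DN has buffered the bytes

        // A reader opened now must see everything up to that hflush, even
        // though the namenode may still report a stale (inexact) file length.
        FSDataInputStream in = fs.open(log);
        byte[] buf = new byte[edit.length];
        in.readFully(0, buf);  // positioned read (pread): open socket, read, close
        in.close();

        out.close();
        fs.close();
      }
    }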
Hairong's revisit builds on the work done in HADOOP-4379, etc., but is a different effort. It was reported that the latest HADOOP-4379 patch works pretty well and is a million times better than nothing, though there is some lag while the lease is recovered (Hairong and Dhruba, chatting, think the cycle of waiting on a successful append so we can close, and then opening to read, may not actually be necessary -- they will update HADOOP-4379 after trying it out). Dhruba notes that HADOOP-4379 alone is not enough; HADOOP-4663 is also needed. We still need to test, but from the discussion, a patched Hadoop 0.20.0 with a working flush may be possible.

Before the above meeting, a few of us met with the Y! HDFS team to chat. On DFSClient recovery, while in the room, Raghu may have fingered our problem: HADOOP-5903. On xceiver count, because TRUNK uses pread in HDFS, the number of occupied threads in the datanodes may actually be much lower, since pread opens a socket, reads, and then closes the socket. We need to test. On occasional slow writes into HDFS, we need to check what the datanode is doing at the time.

St.Ack

Below are the options presented by Sanjay:

> Below is a list of the API/semantics variations we are considering.
> Which ones do you absolutely need for HBase in the short term, and
> which ones may be useful to HBase in the longer term?
>
> API1: flushes out from the address space of the client into the socket to the datanodes.
>
> On the return of the call there is no guarantee that the data is
> out of the underlying node, and no guarantee of it having reached a
> DN. Readers will see this data soon if there are no failures.
>
> For example, I suspect Scribe and Chukwa will like the lower
> latency of this API and are prepared to lose some records
> occasionally in case of failures. Clearly a journal will not find
> this API acceptable.
>
> API2: flushes out to at least one datanode and receives an ack.
>
> New readers will eventually see the data.
>
> API3: flushes out to all replicas of the block. The data is in the buffers of the DNs but not yet in the DNs' OS buffers.
>
> New readers will see the data after the call has returned.
> (HADOOP-5744 calls API3 'hflush' for now.)
>
> API4: flushes out to all replicas, and all replica DNs have done a posix fflush equivalent -- i.e., the data is out to the underlying OS file system of the DNs.
>
> API5: flushes out to all replicas, and all replicas have done a posix fsync equivalent -- i.e., the OS has flushed it to the disk device (though the disk may still have it in its cache).
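If the spectrum above hardens into an API, the writer-side choice might reduce to something like the sketch below. Only the API3 name ('hflush') comes from HADOOP-5744; 'hsync' is hypothetical shorthand here for an API4/API5-style durable flush, and plain flush() is just the closest existing analogue to API1:

    // Sketch mapping Sanjay's options onto writer-side calls. Nothing here
    // is committed API yet; names other than hflush are placeholders.
    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;

    public final class FlushPolicy {
      private FlushPolicy() {}

      /** API1-style: push what the client has buffered; no DN ack awaited. */
      static void flushLossy(FSDataOutputStream out) throws IOException {
        out.flush();  // data may not have reached any DN; records can be lost
      }

      /** API3: all replica DNs have the data buffered when this returns. */
      static void flushVisible(FSDataOutputStream out) throws IOException {
        out.hflush();  // the name HADOOP-5744 proposes for API3
      }

      /** API4/API5 territory: fflush/fsync equivalent on every replica's OS. */
      static void flushDurable(FSDataOutputStream out) throws IOException {
        out.hsync();  // hypothetical; no such call is named in the slides yet
      }
    }

Per the discussion above, API3 (flushVisible here) is the one HBase needs in 0.21.0; the API4/API5 variants trade latency for journal-grade durability.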