From: Todd Lipcon
Date: Thu, 11 Nov 2010 13:10:03 -0800
Subject: Re: Why datanode does a flush to disk after receiving a packet
To: hdfs-dev@hadoop.apache.org

On Thu, Nov 11, 2010 at 12:43 PM, Thanh Do wrote:

> Thank you all for the clarification, guys.
> I also looked at the 0.20-append trunk and saw that the order is
> totally different.
>
> One more thing: do you plan to implement hsync(), i.e. API3,
> in the near future? Is there any class of application that requires
> such a strong guarantee?

I don't personally have any plans - everyone I've talked to who cares
about data durability is OK with potential file truncation if power is
lost across all DNs simultaneously. I'm sure there are some applications
where this isn't acceptable, but people aren't using HBase for those
applications yet :)

-Todd

> On Thu, Nov 11, 2010 at 2:27 PM, Todd Lipcon wrote:
>
> > On Thu, Nov 11, 2010 at 11:55 AM, Hairong Kuang wrote:
> >
> > > A few clarifications on API2 semantics.
> > >
> > > 1. The ack gets sent back to the client before a packet gets
> > > written to the local files.
> >
> > Ah, I see that in trunk this is the case. In 0.20-append it's the
> > other way around - we only enqueue the ack after the flush.
> >
> > > 2. Data becomes visible to new readers on the condition that at
> > > least one DataNode does not have an error.
> > > 3. The reason that the flush is done after a write is more for the
> > > purpose of implementation simplification. Currently readers do not
> > > read from the DataNode's buffer; they only read from the system
> > > buffer. A flush makes the data visible to readers sooner.
> > >
> > > Hairong
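To make that ordering concrete, here is a rough sketch of the per-packet
path as described in this thread. It's a paraphrase, NOT the actual
BlockReceiver code; the class, field, and method names below are
illustrative only:

    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // Sketch of the per-packet ordering on a DataNode (names made up).
    class PacketPathSketch {
      static class Packet { long seqno; byte[] data; byte[] checksum; }

      private final Queue<Long> ackQueue = new ConcurrentLinkedQueue<>();
      private DataOutputStream mirrorOut;   // downstream DN; null at the
                                            // end of the pipeline
      private DataOutputStream dataOut;     // local block file
      private DataOutputStream checksumOut; // local meta file

      void receivePacket(Packet pkt) throws IOException {
        if (mirrorOut != null) {
          mirrorOut.write(pkt.data);        // 1b. push downstream first
        }
        ackQueue.add(pkt.seqno);            // 3. the responder thread may
                                            //    send the ack from here,
                                            //    before the write below
        dataOut.write(pkt.data);            // 2. data -> block file
        checksumOut.write(pkt.checksum);    //    crc  -> meta file
        dataOut.flush();                    // drains to the OS buffer so
        checksumOut.flush();                // new readers see the data
      }                                     // sooner; this is NOT an fsync
    }

In 0.20-append the ackQueue.add() would sit after the flush instead,
per the exchange above.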
> > > On 11/11/10 7:31 AM, "Thanh Do" wrote:
> > >
> > > > Thanks Todd,
> > > >
> > > > In HDFS-6313, I see three APIs (sync, hflush, hsync),
> > > > and I assume hflush corresponds to:
> > > >
> > > > "API2: flushes out to all replicas of the block.
> > > > The data is in the buffers of the DNs but not on the DN's OS
> > > > buffers. New readers will see the data after the call has
> > > > returned."
> > > >
> > > > I am still confused: once the client calls hflush,
> > > > the client will wait for all outstanding packets to be acked
> > > > before sending subsequent packets.
> > > > But at the DataNode, it is possible that the ack to the client is
> > > > sent before the data and checksum are written to the replica. So
> > > > if the DataNode crashes just after sending the ack and before
> > > > writing to the replica, will the semantics be violated here?
> > > >
> > > > Thanks
> > > > Thanh
> > > >
> > > > On Wed, Nov 10, 2010 at 11:11 PM, Todd Lipcon wrote:
> > > >
> > > > > Nope, flush just flushes the Java-side buffer to the Linux
> > > > > buffer cache -- not all the way to the media.
> > > > >
> > > > > Hsync is the API that will eventually go all the way to disk,
> > > > > but it has not yet been implemented.
> > > > >
> > > > > -Todd
> > > > >
> > > > > On Wednesday, November 10, 2010, Thanh Do wrote:
> > > > >
> > > > > > Or another way to rephrase my question:
> > > > > > do data.flush and checksumOut.flush guarantee that the data
> > > > > > is synchronized with the underlying disk, just like fsync()?
> > > > > >
> > > > > > Thanks
> > > > > > Thanh
> > > > > >
> > > > > > On Wed, Nov 10, 2010 at 10:26 PM, Thanh Do wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > After reading appenddesign3.pdf in HDFS-256
> > > > > > > and looking at the BlockReceiver.java code in 0.21.0,
> > > > > > > I am confused by the following.
> > > > > > >
> > > > > > > The document says that:
> > > > > > > "For each packet, a DataNode in the pipeline has to do 3
> > > > > > > things.
> > > > > > > 1. Stream data
> > > > > > >    a. Receive data from the upstream DataNode or the client
> > > > > > >    b. Push the data to the downstream DataNode if there is
> > > > > > >       any
> > > > > > > 2. Write the data/crc to its block file/meta file.
> > > > > > > 3. Stream ack
> > > > > > >    a. Receive an ack from the downstream DataNode if there
> > > > > > >       is any
> > > > > > >    b. Send an ack to the upstream DataNode or the client"
> > > > > > >
> > > > > > > And "...there is no guarantee on the order of (2) and (3)".
> > > > > > >
> > > > > > > In BlockReceiver.receivePacket(), after reading the packet
> > > > > > > buffer, the DataNode does:
> > > > > > > 1) put the packet seqno in the ack queue
> > > > > > > 2) write the data and checksum to disk
> > > > > > > 3) flush the data and checksum (to disk)
> > > > > > >
> > > > > > > What confuses me is that the streaming of the ack does not
> > > > > > > necessarily depend on whether the data has been flushed to
> > > > > > > disk. So my question is:
> > > > > > > why does the DataNode need to flush the data and checksum
> > > > > > > every time it receives a packet? This flush may be costly.
> > > > > > > Why can't the DataNode just batch several writes (after
> > > > > > > receiving several packets) and flush all at once?
> > > > > > > Is there any particular reason for doing so?
> > > > > > >
> > > > > > > Can somebody clarify this for me?
> > > > > > >
> > > > > > > Thanks so much.
> > > > > > > Thanh
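For anyone finding this thread later: the crux is the difference between
a Java flush() (which only reaches the OS page cache) and a real
fsync(). A minimal, self-contained illustration using plain java.io -
nothing HDFS-specific, and the file name and bytes here are made up:

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    public class FlushVsSync {
      public static void main(String[] args) throws IOException {
        FileOutputStream fos = new FileOutputStream("blk_dummy");
        BufferedOutputStream out = new BufferedOutputStream(fos);

        out.write(new byte[] {1, 2, 3});

        out.flush();        // drains the Java-side buffer into the OS
                            // page cache; new readers can see the data,
                            // but it can still be lost on power failure

        fos.getFD().sync(); // fsync(2): forces the page cache onto the
                            // media; this is what hsync() would have to
                            // do, and BlockReceiver does not do it today

        out.close();
      }
    }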
--
Todd Lipcon
Software Engineer, Cloudera