hadoop-hdfs-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Why datanode does a flush to disk after receiving a packet
Date Thu, 11 Nov 2010 20:27:38 GMT
On Thu, Nov 11, 2010 at 11:55 AM, Hairong Kuang <kuang.hairong@gmail.com> wrote:

> A few clarifications on API2 semantics.
>
> 1. Ack gets sent back to the client before a packet gets written to local
> files.
>

Ah, I see in trunk this is the case. In 0.20-append, it's the other way
around - we only enqueue after flush.
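
To make the two orderings concrete, here is a toy model in Java. It is not the BlockReceiver code: the class, enum, and step names are hypothetical and only record the order of events as described in this thread (trunk enqueues the ack seqno before the write and flush; 0.20-append enqueues it after the flush).

```java
import java.util.ArrayList;
import java.util.List;

public class AckOrderingSketch {

    /** Hypothetical names for the two orderings discussed in the thread. */
    public enum Policy { ENQUEUE_BEFORE_WRITE, ENQUEUE_AFTER_FLUSH }

    /** Returns the order of steps a receiver would take for one packet. */
    public static List<String> receivePacket(Policy policy) {
        List<String> steps = new ArrayList<>();
        if (policy == Policy.ENQUEUE_BEFORE_WRITE) {
            steps.add("enqueue-ack"); // ack may go out while data is still unwritten
        }
        steps.add("write");  // data and checksum go to the Java-side streams
        steps.add("flush");  // Java buffers are pushed to the OS buffer cache
        if (policy == Policy.ENQUEUE_AFTER_FLUSH) {
            steps.add("enqueue-ack"); // ack is only possible once the flush is done
        }
        return steps;
    }

    /** True if an ack could be sent before the local flush completes. */
    public static boolean ackMayPrecedeFlush(Policy policy) {
        List<String> steps = receivePacket(policy);
        return steps.indexOf("enqueue-ack") < steps.indexOf("flush");
    }

    public static void main(String[] args) {
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + receivePacket(p));
        }
    }
}
```

With the first policy, a crash between the ack and the write is possible on a single node, which is the window Thanh asks about further down.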


> 2. Data become visible to new readers on the condition that at least one
> DataNode does not have an error.
> 3. The reason that flush is done after a write is more for the purpose of
> implementation simplification. Currently readers do not read from DataNode
> buffer. They only read from system buffer. A flush makes the data visible
> to readers sooner.
>
> Hairong
>
> On 11/11/10 7:31 AM, "Thanh Do" <thanhdo@cs.wisc.edu> wrote:
>
> > Thanks Todd,
> >
> > In HDFS-6313, I see three APIs (sync, hflush, hsync),
> > and I assume hflush corresponds to:
> >
> > *"API2: flushes out to all replicas of the block.
> > The data is in the buffers of the DNs but not on the DN's OS buffers.
> > New readers will see the data after the call has returned.*"
> >
> > I am still confused: once the client calls hflush,
> > it waits for all outstanding packets to be acked
> > before sending subsequent packets.
> > But at the DataNode, it is possible that the ack to the client is sent
> > before the data and checksum are written to the replica. So if
> > the DataNode crashes just after sending the ack and before
> > writing to the replica, will the semantics be violated?
> >
> > Thanks
> > Thanh
> >
> > On Wed, Nov 10, 2010 at 11:11 PM, Todd Lipcon <todd@cloudera.com> wrote:
> >
> >> Nope, flush just flushes the java side buffer to the Linux buffer
> >> cache -- not all the way to the media.
> >>
> >> Hsync is the API that will eventually go all the way to disk, but it
> >> has not yet been implemented.
> >>
> >> -Todd
> >>
> >> On Wednesday, November 10, 2010, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> >>> Or another way to rephrase my question:
> >>> do data.flush and checksumOut.flush guarantee
> >>> that the data is synchronized with the underlying disk,
> >>> just like fsync()?
> >>>
> >>> Thanks
> >>> Thanh
> >>>
> >>> On Wed, Nov 10, 2010 at 10:26 PM, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> After reading the appenddesign3.pdf in HDFS-256,
> >>>> and looking at the BlockReceiver.java code in 0.21.0,
> >>>> I am confused by the following.
> >>>>
> >>>> The document says that:
> >>>> *For each packet, a DataNode in the pipeline has to do 3 things.
> >>>> 1. Stream data
> >>>>       a. Receive data from the upstream DataNode or the client
> >>>>       b. Push the data to the downstream DataNode if there is any
> >>>> 2. Write the data/crc to its block file/meta file.
> >>>> 3. Stream ack
> >>>>       a. Receive an ack from the downstream DataNode if there is any
> >>>>       b. Send an ack to the upstream DataNode or the client*
> >>>>
> >>>> And *"...there is no guarantee on the order of (2) and (3)"*
> >>>>
> >>>> In BlockReceiver.receivePacket(), after reading the packet buffer,
> >>>> the datanode does:
> >>>> 1) put the packet seqno in the ack queue
> >>>> 2) write data and checksum to disk
> >>>> 3) flush data and checksum (to disk)
> >>>>
> >>>> The thing that confuses me is that the streaming of the ack does not
> >>>> necessarily depend on whether the data has been flushed to disk or not.
> >>>> So my question is:
> >>>> Why does the DataNode need to flush data and checksum
> >>>> every time it receives a packet? This flush may be costly.
> >>>> Why can't the DataNode just batch several writes (after receiving
> >>>> several packets) and flush them all at once?
> >>>> Is there any particular reason for doing so?
> >>>>
> >>>> Can somebody clarify this for me?
> >>>>
> >>>> Thanks so much.
> >>>> Thanh
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >> --
> >> Todd Lipcon
> >> Software Engineer, Cloudera
> >>
>
>
>

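The flush-versus-fsync distinction quoted above can be sketched with plain java.io. This is a minimal illustration, not the DataNode code: the class and method names are made up for the example. flush() only moves bytes from the Java-side buffer into the OS buffer cache, where other readers of the file can already see them; FileChannel.force(true) is the call that asks the OS to push them to the storage device, analogous to fsync().

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushVsSync {

    /**
     * Writes payload through a Java-side buffer and flushes it to the OS.
     * If syncToMedia is true, additionally asks the OS to write it to the device.
     * Returns the file length that a new reader would observe afterwards.
     */
    public static long write(Path file, byte[] payload, boolean syncToMedia)
            throws IOException {
        try (FileOutputStream raw = new FileOutputStream(file.toFile());
             BufferedOutputStream buffered = new BufferedOutputStream(raw)) {
            buffered.write(payload);
            // Small payloads may still be held in the BufferedOutputStream here.
            buffered.flush();
            // Now the bytes are in the OS buffer cache: visible to readers,
            // but not guaranteed to survive a power failure.
            if (syncToMedia) {
                raw.getChannel().force(true); // analogous to fsync()
            }
            return Files.size(file);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("packet", ".bin");
        try {
            byte[] payload = "hello".getBytes(StandardCharsets.US_ASCII);
            System.out.println("flush only:    " + write(tmp, payload, false));
            System.out.println("flush + force: " + write(tmp, payload, true));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Both calls report the same length, since visibility to readers comes from the flush alone; the difference is only in durability, and force() is typically the more expensive step.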

-- 
Todd Lipcon
Software Engineer, Cloudera
