hadoop-hdfs-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Why datanode does a flush to disk after receiving a packet
Date Thu, 11 Nov 2010 21:37:39 GMT
On Thu, Nov 11, 2010 at 1:26 PM, Thanh Do <thanhdo@cs.wisc.edu> wrote:

> Got it!
>
> Currently, the model is single writer/multiple reader.
> In the GFS paper, I see they have *record append*
> semantics, that is, allowing multiple clients to write to the
> same file. Do you guys have any plan to implement
> this...
>
>
Not that I'm aware of - as a community project I can't speak for everyone
else, though :)

It's interesting to note that the GFS designers are on record in an ACM
Queue interview[1] saying that this feature was a mistake. It was too hard
to implement correctly and it has some really strange semantics that users
found difficult to understand (e.g. different replicas of a block could
contain records in different orders!)

[1] http://queue.acm.org/detail.cfm?id=1594206
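
For contrast, the single-writer model under discussion looks roughly like this
from a client's point of view (a minimal sketch -- the path is made up, and
hflush() is the trunk spelling; 0.20 calls it sync()):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // only one client at a time holds the lease on this file
    FSDataOutputStream out = fs.create(new Path("/tmp/single-writer-demo"));
    out.write("one writer, many readers\n".getBytes());
    out.hflush();   // new readers can see the data once this returns
    out.close();
    fs.close();
  }
}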

Todd


> On Thu, Nov 11, 2010 at 3:10 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
> > On Thu, Nov 11, 2010 at 12:43 PM, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> >
> > > Thank you all for the clarification, guys.
> > > I also looked at 0.20-append trunk and see that the order is totally
> > > different.
> > >
> > > One more thing: do you guys plan to implement hsync(), i.e. API3,
> > > in the near future? Is there any class of application that requires
> > > such a strong guarantee?
> > >
> > >
> > I don't personally have any plans - everyone I've talked to who cares
> > about data durability is OK with potential file truncation if power is
> > lost across all DNs simultaneously.
> >
> > I'm sure there are some applications where this isn't acceptable, but
> > people aren't using HBase for those applications yet :)
> >
> > -Todd
> >
> > >
> > > On Thu, Nov 11, 2010 at 2:27 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > >
> > > > On Thu, Nov 11, 2010 at 11:55 AM, Hairong Kuang <kuang.hairong@gmail.com> wrote:
> > > >
> > > > > A few clarifications on API2 semantics.
> > > > >
> > > > > 1. Ack gets sent back to the client before a packet gets written to
> > > > > local files.
> > > > >
> > > >
> > > > Ah, I see in trunk this is the case. In 0.20-append, it's the other way
> > > > around - we only enqueue after flush.
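> > > >
> > > > To make the difference concrete, a self-contained sketch of the two
> > > > orderings (purely illustrative stand-ins, not the actual BlockReceiver
> > > > code):
> > > >
> > > > import java.io.BufferedOutputStream;
> > > > import java.io.IOException;
> > > > import java.util.Queue;
> > > >
> > > > class PacketOrderSketch {
> > > >   // trunk-style: the ack is enqueued before the packet hits the local files
> > > >   static void receivePacketTrunk(long seqno, byte[] data, byte[] crc,
> > > >       Queue<Long> ackQueue, BufferedOutputStream dataOut,
> > > >       BufferedOutputStream checksumOut) throws IOException {
> > > >     ackQueue.add(seqno);      // responder may ack upstream before the flush below
> > > >     dataOut.write(data);
> > > >     checksumOut.write(crc);
> > > >     dataOut.flush();          // into the OS buffer cache, not onto the media
> > > >     checksumOut.flush();
> > > >   }
> > > >
> > > >   // 0.20-append-style: enqueue the ack only after the write has been flushed
> > > >   static void receivePacketAppend(long seqno, byte[] data, byte[] crc,
> > > >       Queue<Long> ackQueue, BufferedOutputStream dataOut,
> > > >       BufferedOutputStream checksumOut) throws IOException {
> > > >     dataOut.write(data);
> > > >     checksumOut.write(crc);
> > > >     dataOut.flush();
> > > >     checksumOut.flush();
> > > >     ackQueue.add(seqno);      // the ack goes upstream only after the flush
> > > >   }
> > > > }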
> > > >
> > > >
> > > > > 2. Data becomes visible to new readers on the condition that at least
> > > > > one DataNode does not have an error.
> > > > > 3. The reason that flush is done after a write is more for the purpose
> > > > > of implementation simplification. Currently readers do not read from the
> > > > > DataNode buffer. They only read from the system buffer. A flush makes the
> > > > > data visible to readers sooner.
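> > > > >
> > > > > To make point 3 concrete, a small plain-Java sketch (illustrative only,
> > > > > not HDFS code): bytes become readable through the filesystem only after
> > > > > the writer's flush pushes them out of the JVM-side buffer into the
> > > > > system buffer.
> > > > >
> > > > > import java.io.BufferedOutputStream;
> > > > > import java.io.File;
> > > > > import java.io.FileOutputStream;
> > > > > import java.io.IOException;
> > > > >
> > > > > class VisibilityAfterFlush {
> > > > >   public static void main(String[] args) throws IOException {
> > > > >     File f = new File("blk_visibility_demo");
> > > > >     BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(f));
> > > > >     out.write("packet".getBytes());
> > > > >     System.out.println("before flush: " + f.length() + " bytes visible");  // prints 0
> > > > >     out.flush();   // now in the system buffer; readers can see the bytes
> > > > >     System.out.println("after flush:  " + f.length() + " bytes visible");  // prints 6
> > > > >     out.close();
> > > > >   }
> > > > > }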
> > > > >
> > > > > Hairong
> > > > >
> > > > > On 11/11/10 7:31 AM, "Thanh Do" <thanhdo@cs.wisc.edu> wrote:
> > > > >
> > > > > > Thanks Todd,
> > > > > >
> > > > > > In HDFS-6313, I see three APIs (sync, hflush, hsync),
> > > > > > and I assume hflush corresponds to:
> > > > > >
> > > > > > *"API2: flushes out to all replicas of the block.
> > > > > > The data is in the buffers of the DNs but not on the DN's OS buffers.
> > > > > > New readers will see the data after the call has returned."*
> > > > > >
> > > > > > I am still confused: once the client calls hflush,
> > > > > > the client will wait for all outstanding packets to be acked
> > > > > > before sending subsequent packets.
> > > > > > But at the DataNode, it is possible that the ack to the client is sent
> > > > > > before the data and checksum are written to the replica. So if
> > > > > > the DataNode crashes just after sending the ack and before
> > > > > > writing to the replica, are the semantics violated here?
> > > > > >
> > > > > > Thanks
> > > > > > Thanh
> > > > > >
> > > > > > On Wed, Nov 10, 2010 at 11:11 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > > > > >
> > > > > >> Nope, flush just flushes the Java-side buffer to the Linux buffer
> > > > > >> cache -- not all the way to the media.
> > > > > >>
> > > > > >> Hsync is the API that will eventually go all the way to disk, but it
> > > > > >> has not yet been implemented.
> > > > > >>
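> > > > > >> In plain Java terms (an illustrative sketch, not HDFS code), the
> > > > > >> distinction is:
> > > > > >>
> > > > > >> import java.io.BufferedOutputStream;
> > > > > >> import java.io.FileOutputStream;
> > > > > >> import java.io.IOException;
> > > > > >>
> > > > > >> class FlushVsSync {
> > > > > >>   public static void main(String[] args) throws IOException {
> > > > > >>     FileOutputStream fos = new FileOutputStream("blk_demo");
> > > > > >>     BufferedOutputStream out = new BufferedOutputStream(fos);
> > > > > >>     out.write("packet bytes".getBytes());
> > > > > >>     out.flush();          // Java-side buffer -> Linux buffer cache (what flush does)
> > > > > >>     fos.getFD().sync();   // buffer cache -> media (what hsync would add)
> > > > > >>     out.close();
> > > > > >>   }
> > > > > >> }
> > > > > >>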
> > > > > >> -Todd
> > > > > >>
> > > > > >> On Wednesday, November 10, 2010, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> > > > > >>> Or, to rephrase my question:
> > > > > >>> do data.flush and checksumOut.flush guarantee that the
> > > > > >>> data is synchronized with the underlying disk,
> > > > > >>> just like fsync()?
> > > > > >>>
> > > > > >>> Thanks
> > > > > >>> Thanh
> > > > > >>>
> > > > > >>> On Wed, Nov 10, 2010 at 10:26 PM, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> > > > > >>>
> > > > > >>>> Hi all,
> > > > > >>>>
> > > > > >>>> After reading the appenddesign3.pdf in HDFS-256,
> > > > > >>>> and looking at the BlockReceiver.java code in 0.21.0,
> > > > > >>>> I am confused by the following.
> > > > > >>>>
> > > > > >>>> The document says that:
> > > > > >>>> *For each packet, a DataNode in the pipeline has to do 3 things.
> > > > > >>>> 1. Stream data
> > > > > >>>>       a. Receive data from the upstream DataNode or the client
> > > > > >>>>       b. Push the data to the downstream DataNode if there is any
> > > > > >>>> 2. Write the data/crc to its block file/meta file.
> > > > > >>>> 3. Stream ack
> > > > > >>>>       a. Receive an ack from the downstream DataNode if there is any
> > > > > >>>>       b. Send an ack to the upstream DataNode or the client*
> > > > > >>>>
> > > > > >>>> And *"...there is no guarantee on the order of (2) and (3)"*
> > > > > >>>>
> > > > > >>>> In BlockReceiver.receivePacket(), after reading the packet buffer,
> > > > > >>>> the DataNode does:
> > > > > >>>> 1) put the packet seqno in the ack queue
> > > > > >>>> 2) write data and checksum to disk
> > > > > >>>> 3) flush data and checksum (to disk)
> > > > > >>>>
> > > > > >>>> The thing that confuses me is that the streaming of the ack does not
> > > > > >>>> necessarily depend on whether the data has been flushed to disk or not.
> > > > > >>>> Then, my question is:
> > > > > >>>> Why does the DataNode need to flush data and checksum
> > > > > >>>> every time it receives a packet? This flush may be costly.
> > > > > >>>> Why can't the DataNode just batch several writes (after receiving
> > > > > >>>> several packets) and flush all at once?
> > > > > >>>> Is there any particular reason for doing so?
> > > > > >>>>
> > > > > >>>> Can somebody clarify this for me?
> > > > > >>>>
> > > > > >>>> Thanks so much.
> > > > > >>>> Thanh
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > > >> --
> > > > > >> Todd Lipcon
> > > > > >> Software Engineer, Cloudera
> > > > > >>
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
