hadoop-hdfs-dev mailing list archives

From Thanh Do <than...@cs.wisc.edu>
Subject Re: Why datanode does a flush to disk after receiving a packet
Date Thu, 11 Nov 2010 21:26:21 GMT
Got it!

Currently, the model is single writer/multiple reader.
In the GFS paper, I see they have *record append*
semantics, which allow multiple clients to write to the
same file concurrently. Do you have any plans to implement
this?

Thanh

On Thu, Nov 11, 2010 at 3:10 PM, Todd Lipcon <todd@cloudera.com> wrote:

> On Thu, Nov 11, 2010 at 12:43 PM, Thanh Do <thanhdo@cs.wisc.edu> wrote:
>
> > Thank you all for the clarification.
> > I also looked at 0.20-append trunk and see that the order is totally
> > different.
> >
> > One more thing: do you plan to implement hsync(), i.e. API3,
> > in the near future? Is there any class of applications that requires such
> > a strong guarantee?
> >
> >
> I don't personally have any plans - everyone I've talked to who cares about
> data durability is OK with potential file truncation if power is lost across
> all DNs simultaneously.
>
> I'm sure there are some applications where this isn't acceptable, but people
> aren't using HBase for those applications yet :)
>
> -Todd
>
> >
> > On Thu, Nov 11, 2010 at 2:27 PM, Todd Lipcon <todd@cloudera.com> wrote:
> >
> > > On Thu, Nov 11, 2010 at 11:55 AM, Hairong Kuang <kuang.hairong@gmail.com> wrote:
> > >
> > > > A few clarifications on API2 semantics.
> > > >
> > > > 1. Ack gets sent back to the client before a packet gets written to
> > > > local files.
> > > >
> > >
> > > Ah, I see in trunk this is the case. In 0.20-append, it's the other way
> > > around - we only enqueue after flush.
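
A minimal sketch of the two orderings described here, as a toy model rather than actual HDFS code. PacketOrderSketch, its enum, and the event strings are hypothetical names chosen for illustration; the only thing taken from the thread is the relative position of the "enqueue ack" step (before the write in trunk, after the flush in 0.20-append).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not the real BlockReceiver.
class PacketOrderSketch {
    enum Order { ENQUEUE_ACK_FIRST, ENQUEUE_ACK_AFTER_FLUSH }

    // Returns the sequence of steps taken for one packet under the given ordering.
    static List<String> receivePacket(long seqno, Order order) {
        List<String> events = new ArrayList<>();
        if (order == Order.ENQUEUE_ACK_FIRST) {
            // Ordering described for trunk: the ack can go out before the write.
            events.add("enqueue-ack:" + seqno);
        }
        events.add("write:" + seqno);
        events.add("flush:" + seqno);
        if (order == Order.ENQUEUE_ACK_AFTER_FLUSH) {
            // Ordering described for 0.20-append: ack is enqueued only after the flush.
            events.add("enqueue-ack:" + seqno);
        }
        return events;
    }
}
```

In the first ordering a crash between "enqueue-ack" and "write" can leave an acked packet that never reached the local file on that node; in the second that window does not exist on the node itself.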
> > >
> > >
> > > > 2. Data becomes visible to new readers on the condition that at least
> > > > one DataNode does not have an error.
> > > > 3. The flush is done after each write mainly to simplify the
> > > > implementation. Currently readers do not read from the DataNode
> > > > buffer; they only read from the system buffer. A flush makes the data
> > > > visible to readers sooner.
> > > >
> > > > Hairong
> > > >
> > > > On 11/11/10 7:31 AM, "Thanh Do" <thanhdo@cs.wisc.edu> wrote:
> > > >
> > > > > Thanks Todd,
> > > > >
> > > > > In HDFS-6313, I see three APIs (sync, hflush, hsync),
> > > > > and I assume hflush corresponds to:
> > > > >
> > > > > *"API2: flushes out to all replicas of the block.
> > > > > The data is in the buffers of the DNs but not on the DN's OS buffers.
> > > > > New readers will see the data after the call has returned.*"
> > > > >
> > > > > I am still confused. Once the client calls hflush,
> > > > > it waits for all outstanding packets to be acked
> > > > > before sending subsequent packets.
> > > > > But at the DataNode, it is possible that the ack to the client is sent
> > > > > before the data and checksum are written to the replica. So if
> > > > > the DataNode crashes just after sending the ack and before
> > > > > writing to the replica, will the semantics be violated?
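
The client-side wait described in this question can be sketched roughly as follows. AckTracker and its method names are hypothetical and are not the HDFS client implementation; the sketch models only "block until every outstanding packet has been acked" and says nothing about what a DataNode has done with a packet before acking it.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: tracks packets that were sent but not yet acked.
class AckTracker {
    private final Set<Long> outstanding = new HashSet<>();

    // Record that a packet with this sequence number was sent.
    synchronized void sent(long seqno) {
        outstanding.add(seqno);
    }

    // Record an ack; wake up waiters once nothing is outstanding.
    synchronized void acked(long seqno) {
        outstanding.remove(seqno);
        if (outstanding.isEmpty()) {
            notifyAll();
        }
    }

    // Blocks until all sent packets are acked or the timeout elapses.
    // Returns true if everything was acked in time.
    synchronized boolean awaitAllAcked(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!outstanding.isEmpty()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            wait(remaining);
        }
        return true;
    }
}
```

An hflush-like call in this model would invoke awaitAllAcked() before returning, so its guarantee is only as strong as whatever the ack itself implies.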
> > > > >
> > > > > Thanks
> > > > > Thanh
> > > > >
> > > > > On Wed, Nov 10, 2010 at 11:11 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > > > >
> > > > >> Nope, flush just flushes the java side buffer to the Linux buffer
> > > > >> cache -- not all the way to the media.
> > > > >>
> > > > > >> Hsync is the API that will eventually go all the way to disk, but it
> > > > > >> has not yet been implemented.
> > > > >>
> > > > >> -Todd
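
The distinction drawn here can be seen with plain JDK streams. The sketch below is an illustration, not DataNode code: FlushVsSync is a hypothetical demo class, and it uses only the standard BufferedOutputStream.flush() and FileDescriptor.sync() calls. flush() hands the bytes from the Java-side buffer to the operating system, where other readers of the file can see them; sync() additionally asks the OS to push its buffers to the underlying device, which is the fsync()-like step.

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of HDFS.
class FlushVsSync {
    // Returns the file size observed {before flush, after flush, after sync}.
    static long[] demo() throws IOException {
        Path file = Files.createTempFile("flush-vs-sync", ".dat");
        try (FileOutputStream fos = new FileOutputStream(file.toFile());
             BufferedOutputStream out = new BufferedOutputStream(fos, 8192)) {
            out.write("packet".getBytes(StandardCharsets.US_ASCII));
            // The 6 bytes are still in the Java-side buffer at this point.
            long beforeFlush = Files.size(file);
            // Java buffer -> OS buffer cache; the file now shows the bytes.
            out.flush();
            long afterFlush = Files.size(file);
            // OS buffer cache -> storage device, comparable to fsync().
            fos.getFD().sync();
            long afterSync = Files.size(file);
            return new long[] {beforeFlush, afterFlush, afterSync};
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = demo();
        System.out.println(sizes[0] + " " + sizes[1] + " " + sizes[2]);
    }
}
```

The visible file size changes at flush(), not at sync(); sync() affects durability across power loss, which a size check cannot observe.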
> > > > >>
> > > > > >> On Wednesday, November 10, 2010, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> > > > > >>> Or another way to rephrase my question:
> > > > > >>> do data.flush and checksumOut.flush guarantee that
> > > > > >>> the data is synchronized with the underlying disk,
> > > > > >>> just like fsync()?
> > > > >>>
> > > > >>> Thanks
> > > > >>> Thanh
> > > > >>>
> > > > > >>> On Wed, Nov 10, 2010 at 10:26 PM, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> > > > >>>
> > > > >>>> Hi all,
> > > > >>>>
> > > > >>>> After reading the appenddesign3.pdf in HDFS-256,
> > > > >>>> and looking at the BlockReceiver.java code in 0.21.0,
> > > > >>>> I am confused by the following.
> > > > >>>>
> > > > > >>>> The document says that:
> > > > > >>>> *For each packet, a DataNode in the pipeline has to do 3 things.
> > > > > >>>> 1. Stream data
> > > > > >>>>       a. Receive data from the upstream DataNode or the client
> > > > > >>>>       b. Push the data to the downstream DataNode if there is any
> > > > > >>>> 2. Write the data/crc to its block file/meta file.
> > > > > >>>> 3. Stream ack
> > > > > >>>>       a. Receive an ack from the downstream DataNode if there is any
> > > > > >>>>       b. Send an ack to the upstream DataNode or the client*
> > > > > >>>>
> > > > > >>>> And *"...there is no guarantee on the order of (2) and (3)"*
> > > > > >>>>
> > > > > >>>> In BlockReceiver.receivePacket(), after reading the packet buffer,
> > > > > >>>> the DataNode does:
> > > > > >>>> 1) put the packet seqno in the ack queue
> > > > > >>>> 2) write data and checksum to disk
> > > > > >>>> 3) flush data and checksum (to disk)
> > > > > >>>>
> > > > > >>>> What confuses me is that the streaming of the ack does not
> > > > > >>>> necessarily depend on whether the data has been flushed to disk.
> > > > > >>>> So my question is:
> > > > > >>>> Why does the DataNode need to flush data and checksum
> > > > > >>>> every time it receives a packet? This flush may be costly.
> > > > > >>>> Why can't the DataNode just batch several writes (after receiving
> > > > > >>>> several packets) and flush them all at once?
> > > > > >>>> Is there any particular reason for doing so?
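
The trade-off behind this question can be sketched with a small counting experiment. FlushBatchingSketch and CountingStream are hypothetical names, the 1 KB packet size and 1 MB buffer are arbitrary assumptions, and an in-memory stream stands in for the block file; none of this is HDFS code. It only counts how often flush() reaches the underlying stream when flushing per packet versus once per batch.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch; not the DataNode implementation.
class FlushBatchingSketch {
    // Counts how many times flush() reaches the underlying stream.
    static class CountingStream extends FilterOutputStream {
        int flushes = 0;

        CountingStream(OutputStream out) {
            super(out);
        }

        @Override
        public void flush() throws IOException {
            flushes++;
            super.flush();
        }
    }

    // Writes `packets` packets and flushes after every `batch` packets
    // (plus once at the end for a partial batch). Returns the flush count.
    static int run(int packets, int batch) throws IOException {
        CountingStream sink = new CountingStream(new ByteArrayOutputStream());
        // Buffer is large enough that nothing spills before an explicit flush.
        BufferedOutputStream out = new BufferedOutputStream(sink, 1 << 20);
        byte[] packet = new byte[1024];
        for (int i = 1; i <= packets; i++) {
            out.write(packet);
            if (i % batch == 0) {
                out.flush();
            }
        }
        if (packets % batch != 0) {
            out.flush();
        }
        return sink.flushes;
    }
}
```

Batching reduces the number of flush calls, but until a flush happens the batched bytes sit in the Java-side buffer, where other readers of the file cannot see them.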
> > > > >>>>
> > > > >>>> Can somebody clarify this for me?
> > > > >>>>
> > > > >>>> Thanks so much.
> > > > >>>> Thanh
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > > >> --
> > > > >> Todd Lipcon
> > > > >> Software Engineer, Cloudera
> > > > >>
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
