hadoop-hdfs-dev mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Why datanode does a flush to disk after receiving a packet
Date Thu, 11 Nov 2010 19:20:59 GMT
On Thu, Nov 11, 2010 at 7:31 AM, Thanh Do <thanhdo@cs.wisc.edu> wrote:

> Thanks Todd,
>
> In HDFS-6313, I see three APIs (sync, hflush, hsync),
> And I assume hflush corresponds to :
>
> *"API2: flushes out to all replicas of the block.
> The data is in the buffers of the DNs but not on the DN's OS buffers.
> New readers will see the data after the call has returned.*"
>

I think the way it got implemented, hflush() actually does flush to OS
buffers, since BlockReceiver calls flush() before it enqueues the sequence
number in the responder pending ack queue in receivePacket().
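To make that ordering concrete, here is a hypothetical, heavily simplified Java sketch (the names `ReceiveOrderSketch`, `osBufferCache`, and `pendingAcks` are invented for illustration; the real logic lives in BlockReceiver.receivePacket()): the flush happens strictly before the seqno is enqueued for the responder, so an ack is never sent for data still sitting in a Java-side buffer.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch, not the real BlockReceiver: data is flushed
// before the seqno is enqueued for the ack responder, so an ack can
// never race ahead of the flush to the OS buffer cache.
public class ReceiveOrderSketch {
    static final StringBuilder osBufferCache = new StringBuilder(); // stand-in for the flushed stream
    static final Queue<Long> pendingAcks = new ArrayDeque<>();      // seqnos the responder may ack

    static void receivePacket(long seqno, String data) {
        osBufferCache.append(data);  // write data + checksum, then flush()
        pendingAcks.add(seqno);      // only after the flush may seqno be acked
    }

    public static void main(String[] args) {
        receivePacket(1, "abc");
        receivePacket(2, "def");
        System.out.println("flushed=" + osBufferCache + " nextAck=" + pendingAcks.peek());
        // prints flushed=abcdef nextAck=1
    }
}
```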


> I am still confused: once the client calls hflush(),
> the client is going to wait for all outstanding packets to be acked
> before sending subsequent packets.
>

Currently, yes. HDFS-895, which will hopefully be committed this week, adds
the ability to "pipeline" the packets -- e.g., an hflush() only blocks the
caller of hflush() until previously written data has been flushed, but
doesn't stop other writers from appending more on top. This is a big speed
improvement for HBase in particular.
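A hypothetical sketch of that semantic (this is not the real DFSOutputStream code; two seqno counters stand in for the packet queues): hflush() snapshots the last seqno written at call time and waits only for acks up to that point, so later writes are never blocked by an in-flight hflush().

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the HDFS-895 behavior, not the real client code:
// hflush() captures the highest seqno written so far and waits only for
// acks up to that seqno; later writes proceed independently.
public class PipelinedHflushSketch {
    final AtomicLong lastQueuedSeqno = new AtomicLong(0);
    final AtomicLong lastAckedSeqno = new AtomicLong(0);

    // Enqueue one packet and return its seqno.
    long write() { return lastQueuedSeqno.incrementAndGet(); }

    // Ack from the pipeline covering everything up to seqno.
    synchronized void onAck(long seqno) {
        lastAckedSeqno.set(seqno);
        notifyAll();
    }

    // Blocks only until data written *before* this call is acked.
    void hflush() throws InterruptedException {
        long target = lastQueuedSeqno.get();
        synchronized (this) {
            while (lastAckedSeqno.get() < target) wait();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        PipelinedHflushSketch s = new PipelinedHflushSketch();
        s.write(); s.write();   // packets 1 and 2
        s.onAck(2);             // pipeline has acked through seqno 2
        s.hflush();             // returns: nothing written earlier is unacked
        long next = s.write();  // packet 3 queued after hflush returned
        System.out.println("hflush done; packet " + next + " queued, unacked");
    }
}
```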


> But at DataNode, it is possible that the ack to client is sent
> before the write of data and checksum to replica. So if
> the DataNode crashes just after sending the ack and before
> writing to replica, will the semantics be violated here?
>
>
The DN will forward the packet to its downstream mirror in the pipeline, but
doesn't actually enqueue the seqno on the pending ack queue until it has
flushed to disk. So the different replicas may end up writing to disk in
different orders, but the client won't get the ack until all have flushed.
If any fails to flush, it will break the pipeline and initiate replica
recovery -- but the client still has all of the unacked packets in its
"ackQueue", so after recovery it simply flips those back onto "dataQueue"
for the new pipeline.
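The recovery step can be sketched as follows (the names ackQueue and dataQueue come from the description above; everything else is a simplified, hypothetical stand-in for the real client code): packets move from dataQueue to ackQueue when sent, leave ackQueue when acked, and on pipeline failure every unacked packet is flipped back onto the front of dataQueue, in order, for resend on the new pipeline.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the client-side recovery described above;
// only the ackQueue/dataQueue names come from the real client.
public class PipelineRecoverySketch {
    final Deque<Long> dataQueue = new ArrayDeque<>();
    final Deque<Long> ackQueue = new ArrayDeque<>();

    void send() { ackQueue.addLast(dataQueue.removeFirst()); }  // packet now in flight
    void ack()  { ackQueue.removeFirst(); }                     // oldest packet durable on all replicas

    // Pipeline broke: flip unacked packets back, preserving order.
    void recover() {
        while (!ackQueue.isEmpty()) dataQueue.addFirst(ackQueue.removeLast());
    }

    public static void main(String[] args) {
        PipelineRecoverySketch s = new PipelineRecoverySketch();
        for (long i = 1; i <= 3; i++) s.dataQueue.addLast(i);
        s.send(); s.send(); s.send();  // all three in flight
        s.ack();                       // only packet 1 fully acked
        s.recover();                   // pipeline failure
        System.out.println("to resend: " + s.dataQueue);  // prints to resend: [2, 3]
    }
}
```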

-Todd

On Wed, Nov 10, 2010 at 11:11 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
> > Nope, flush just flushes the java side buffer to the Linux buffer
> > cache -- not all the way to the media.
> >
> > Hsync is the API that will eventually go all the way to disk, but it
> > has not yet been implemented.
> >
> > -Todd
> >
> > On Wednesday, November 10, 2010, Thanh Do <thanhdo@cs.wisc.edu> wrote:
> > > Or another way to rephrase my question:
> > > do data.flush and checksumOut.flush guarantee
> > > that data is synchronized with the underlying disk,
> > > just like fsync()?
> > >
> > > Thanks
> > > Thanh
> > >
> > > On Wed, Nov 10, 2010 at 10:26 PM, Thanh Do <thanhdo@cs.wisc.edu>
> wrote:
> > >
> > >> Hi all,
> > >>
> > >> After reading the appenddesign3.pdf in HDFS-256,
> > >> and looking at the BlockReceiver.java code in 0.21.0,
> > >> I am confused by the following.
> > >>
> > >> The document says that:
> > >> *For each packet, a DataNode in the pipeline has to do 3 things.
> > >> 1. Stream data
> > >>       a. Receive data from the upstream DataNode or the client
> > >>       b. Push the data to the downstream DataNode if there is any
> > >> 2. Write the data/crc to its block file/meta file.
> > >> 3. Stream ack
> > >>       a. Receive an ack from the downstream DataNode if there is any
> > >>       b. Send an ack to the upstream DataNode or the client*
> > >>
> > >> And *"...there is no guarantee on the order of (2) and (3)"*
> > >>
> > >> In BlockReceiver.receivePacket(), after reading the packet buffer,
> > >> the DataNode does:
> > >> 1) put the packet seqno in the ack queue
> > >> 2) write data and checksum to disk
> > >> 3) flush data and checksum (to disk)
> > >>
> > >> What confuses me is that the streaming of the ack does not
> > >> necessarily depend on whether the data has been flushed to disk.
> > >> Then, my question is:
> > >> Why does the DataNode need to flush data and checksum
> > >> every time it receives a packet? This flush may be costly.
> > >> Why can't the DataNode just batch several writes (after receiving
> > >> several packets) and flush all at once?
> > >> Is there any particular reason for doing so?
> > >>
> > >> Can somebody clarify this for me?
> > >>
> > >> Thanks so much.
> > >> Thanh
> > >>
> > >>
> > >>
> > >>
> > >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
