hadoop-common-dev mailing list archives

From "dhruba borthakur (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1707) Remove the DFS Client disk-based cache
Date Fri, 12 Oct 2007 22:23:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12534427 ]

dhruba borthakur commented on HADOOP-1707:

I have the following proposal in mind:

1. The client uses a small pool of memory buffers per DFS output stream, for example 10
buffers of 64 KB each.
2. A write to the output stream actually copies the user data into one of the buffers, if
available. Otherwise the user-write blocks.
3. A separate thread (one per output stream) sends buffers that are full. Each buffer has
metadata that contains a sequence number (locally generated on the client), the length of
the buffer, and its offset in this block.
4. Another thread (one per output stream) processes incoming responses. Each incoming response
carries the sequence number of the buffer that the datanode has processed. The client removes
that buffer from its queue.
5. The client gets an exception if the primary datanode fails. If a secondary datanode fails,
the primary informs the client about this event.
6. If any datanode fails, the client removes it from the pipeline and resends all pending
buffers to all known good datanodes.
7. A target datanode remembers the last sequence number that it has processed. It
forwards the buffer to the next datanode in the pipeline. If the datanode receives a buffer
that it has not processed earlier, it writes it to local disk. When the response arrives,
it forwards the response back to the client.
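
The bookkeeping in steps 1 to 7 might look roughly like the Java sketch below. It is only an
illustration of the proposal, not code from Hadoop: the class names Packet, WriteWindow and
DatanodeSketch and all of their methods are hypothetical. It also simplifies several things.
Every write becomes one buffer instead of filling fixed 64 KB buffers, the sender and
response threads are left out, local disk is replaced by an in-memory list, and the datanode
assumes that buffers arrive in sequence-number order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** One buffer plus the metadata from step 3 (hypothetical names). */
final class Packet {
    final long seqno;          // generated locally on the client
    final long offsetInBlock;  // where this buffer starts within the block
    final byte[] data;         // data.length is the buffer length

    Packet(long seqno, long offsetInBlock, byte[] data) {
        this.seqno = seqno;
        this.offsetInBlock = offsetInBlock;
        this.data = data;
    }
}

/** Client-side queue of unacknowledged buffers for one output stream. */
final class WriteWindow {
    private final int maxBuffers;
    private final Deque<Packet> pending = new ArrayDeque<>();
    private long nextSeqno = 0;
    private long nextOffset = 0;

    WriteWindow(int maxBuffers) {
        this.maxBuffers = maxBuffers;
    }

    /** Step 2: copy the user data into a buffer, blocking while none is free. */
    synchronized Packet write(byte[] userData) {
        while (pending.size() >= maxBuffers) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted waiting for a buffer", e);
            }
        }
        Packet p = new Packet(nextSeqno++, nextOffset, userData.clone());
        nextOffset += userData.length;
        pending.addLast(p);
        return p;
    }

    /** Step 4: a response names a seqno; drop that buffer and free its slot. */
    synchronized boolean ack(long seqno) {
        boolean removed = pending.removeIf(p -> p.seqno == seqno);
        if (removed) {
            notifyAll(); // wake any writer blocked in write()
        }
        return removed;
    }

    /** Step 6: after a datanode failure, every unacknowledged buffer is resent. */
    synchronized List<Packet> pendingForResend() {
        return new ArrayList<>(pending);
    }
}

/** Step 7: a datanode that remembers the last seqno it processed. */
final class DatanodeSketch {
    private long lastSeqno = -1;
    private final List<Packet> written = new ArrayList<>(); // stand-in for local disk

    /** Writes the buffer only if it is new; returns the seqno to acknowledge. */
    synchronized long receive(Packet p) {
        if (p.seqno > lastSeqno) { // assumes in-order delivery
            written.add(p);
            lastSeqno = p.seqno;
        }
        return p.seqno;
    }

    synchronized int packetsWritten() {
        return written.size();
    }
}
```

With a WriteWindow of 10 buffers, an eleventh write() blocks until ack() removes an earlier
buffer. After a pipeline failure, the client would pass pendingForResend() to the remaining
datanodes, and the sequence-number check in receive() keeps a datanode from writing a
resent buffer twice.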

> Remove the DFS Client disk-based cache
> --------------------------------------
>                 Key: HADOOP-1707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1707
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>             Fix For: 0.16.0
> The DFS client currently uses a staging file on local disk to cache all user-writes to
> a file. When the staging file accumulates 1 block worth of data, its contents are flushed
> to an HDFS datanode. These operations occur sequentially.
> A simple optimization of allowing the user to write to another staging file while simultaneously
> uploading the contents of the first staging file to HDFS will improve file-upload performance.

