hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1707) DFS client can allow user to write data to the next block while uploading previous block to HDFS
Date Wed, 10 Oct 2007 20:09:51 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533859 ]

Doug Cutting commented on HADOOP-1707:

> The client will stream data to the datanodes directly [ ... ]

Some history to be aware of.  Long ago, writes were tee'd directly to the datanodes, and the
local file was used only to replay data after a failure.  Switching so that writes were always
buffered to a local file had two advantages: it radically simplified the code (the tee multiplied
the number of failure modes), and it improved performance and reliability.  Each datanode had
far fewer active connections, since blocks were written in a burst rather than as a trickle.

How will you handle datanode failures?  Since you have no local file to replay, won't those
always cause an exception in the client?  That will cause tasks to fail.  This might be acceptable
now that things are more reliable overall, but at the time I looked at this (again, long
ago), datanode timeouts were frequent enough that this would cause job failures.
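
To make the replay concern concrete, here is a minimal sketch, not the actual DFS client code.
It assumes the whole block is buffered locally, so a failed transfer can be retried against
another datanode.  The names upload_block, send, flaky_send and DatanodeError are hypothetical.

```python
class DatanodeError(Exception):
    """Raised by a (hypothetical) datanode transport when a write fails."""


def upload_block(block, datanodes, send):
    """Try each datanode in turn, replaying the fully buffered block.

    `send(node, block)` is a hypothetical transport function that raises
    DatanodeError on failure. Because `block` is held locally in full,
    a failed attempt can simply be replayed against the next node.
    """
    failures = []
    for node in datanodes:
        try:
            send(node, block)
            return node  # success: report which node accepted the block
        except DatanodeError as err:
            failures.append((node, err))  # remember the failure, try elsewhere
    raise DatanodeError(f"all datanodes failed: {failures}")


def flaky_send(node, block):
    # Stand-in transport for illustration: pretend "dn1" always times out.
    if node == "dn1":
        raise DatanodeError("timeout")


accepted_by = upload_block(b"x" * 8, ["dn1", "dn2"], flaky_send)
```

Without the local copy there is nothing to hand to the second call of send, so the first
DatanodeError would have to surface to the caller, which is the situation the question above
is asking about.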

> DFS client can allow user to write data to the next block while uploading previous block
> ------------------------------------------------------------------------------------------------
>                 Key: HADOOP-1707
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1707
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
> The DFS client currently uses a staging file on local disk to cache all user-writes to
> a file. When the staging file accumulates 1 block worth of data, its contents are flushed
> to an HDFS datanode. These operations occur sequentially.
> A simple optimization of allowing the user to write to another staging file while simultaneously
> uploading the contents of the first staging file to HDFS will improve file-upload performance.
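
The optimization described in the issue can be sketched roughly as follows.  This is an
illustration only, not the patch attached to HADOOP-1707: it uses in-memory buffers in place of
staging files on local disk, and DoubleBufferedWriter, upload and block_size are hypothetical
names.

```python
import io
from concurrent.futures import ThreadPoolExecutor


class DoubleBufferedWriter:
    """Fill one staging buffer while the previous one uploads.

    `upload(data)` is a hypothetical callable standing in for the
    transfer of one block to a datanode.
    """

    def __init__(self, upload, block_size):
        self._upload = upload
        self._block_size = block_size
        self._staging = io.BytesIO()
        self._pool = ThreadPoolExecutor(max_workers=1)  # one upload at a time
        self._pending = None  # Future for the block currently uploading

    def write(self, data):
        view = memoryview(data)
        while len(view) > 0:
            room = self._block_size - self._staging.tell()
            self._staging.write(view[:room])
            view = view[room:]
            if self._staging.tell() == self._block_size:
                self._flush_block()

    def _flush_block(self):
        # Wait for the previous upload so at most one extra block is staged;
        # result() also re-raises any upload error in the writer's thread.
        if self._pending is not None:
            self._pending.result()
        block = self._staging.getvalue()
        self._staging = io.BytesIO()  # the writer continues into a fresh buffer
        self._pending = self._pool.submit(self._upload, block)

    def close(self):
        if self._staging.tell() > 0:
            self._flush_block()  # upload the final partial block
        if self._pending is not None:
            self._pending.result()
        self._pool.shutdown()


uploaded = []
writer = DoubleBufferedWriter(uploaded.append, block_size=4)
writer.write(b"abcdefghij")
writer.close()
```

After close() returns, uploaded holds the blocks b"abcd", b"efgh" and b"ij" in that order.
The single-worker pool keeps uploads sequential, so only the user's writes overlap with an
upload in progress.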

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
