hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-66) dfs client writes all data for a chunk to /tmp
Date Tue, 07 Mar 2006 20:14:44 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-66?page=comments#action_12369299 ] 

Sameer Paranjpye commented on HADOOP-66:

It doesn't make a lot of sense to buffer the entire block in RAM. On the other hand, an application
ought to be able to control the buffering strategy to some extent. Most stream implementations
have a setBufferSize() or equivalent method that allows programmers to do this. The default
buffer size is kept reasonably small so that many files can be open at once without worrying
too much about the memory consumed by buffering.
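
Something along these lines, as a rough illustration only; the class and method
names here are made up, not the actual DFSClient API:

    import java.io.BufferedOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;

    public class BufferSizeSketch {
        // Small default, so having many files open stays cheap.
        static final int DEFAULT_BUFFER_SIZE = 4 * 1024;

        // The application can override the default per stream.
        static OutputStream openForWrite(String path, int bufferSize)
                throws IOException {
            return new BufferedOutputStream(
                    new FileOutputStream(path), bufferSize);
        }

        public static void main(String[] args) throws IOException {
            // Caller opts into a 64KB buffer instead of the small default.
            try (OutputStream out = openForWrite("part-00000", 64 * 1024)) {
                out.write("hello".getBytes());
            }
        }
    }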

Besides the issue of filling up /tmp (32MB is a pretty large chunk to be writing there),
it's unclear that the scheme adds much value; it may even be detrimental. If a connection
to a Datanode fails, why not try to recover by reconnecting, and throw an exception if
that fails? If it's just the connection that has failed, the client should be able to reconnect
pretty easily. If the Datanode is down for the count, the odds are low (or are they?) that
it'll come back by the time the client finishes writing the block, so the write will fail
anyway; why write to a temp file? If it's common for Datanodes to bounce and come back,
then Datanode stability is a problem that we should be working on. In that case, the temp
file is only a workaround and not a real solution; it might even be masking the problem in
many cases.
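
A minimal sketch of the reconnect-then-fail approach, with the retry count and
timeout as assumptions rather than anything in the current client:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class ReconnectSketch {
        // Try to re-establish the connection a few times; if the Datanode
        // is really gone, surface the failure instead of spooling to /tmp.
        static Socket reconnect(InetSocketAddress datanode, int attempts)
                throws IOException {
            IOException last = new IOException("no attempts made");
            for (int i = 0; i < attempts; i++) {
                try {
                    Socket s = new Socket();
                    s.connect(datanode, 5000);  // assumed 5s connect timeout
                    return s;                   // transient failure recovered
                } catch (IOException e) {
                    last = e;                   // may be down for good
                }
            }
            throw last;  // let the write fail rather than mask the problem
        }
    }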

> dfs client writes all data for a chunk to /tmp
> ----------------------------------------------
>          Key: HADOOP-66
>          URL: http://issues.apache.org/jira/browse/HADOOP-66
>      Project: Hadoop
>         Type: Bug
>   Components: dfs
>     Versions: 0.1
>     Reporter: Sameer Paranjpye
>      Fix For: 0.1

> The dfs client writes all the data for the current chunk to a file in /tmp; when the
> chunk is complete, it is shipped out to the Datanodes. This can cause /tmp to fill up fast
> when a lot of files are being written. A potentially better scheme is to buffer the written
> data in RAM (application code can set the buffer size) and flush it to the Datanodes when
> the buffer fills up.
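
The buffer-and-flush scheme in the description might look roughly like this
(the class and field names are hypothetical, for illustration only):

    import java.io.IOException;
    import java.io.OutputStream;

    // Hypothetical: data accumulates in an in-RAM buffer of the size the
    // application chose, and each full buffer is pushed to the Datanode.
    class RamBufferedChunkStream extends OutputStream {
        private final byte[] buf;
        private int count;
        private final OutputStream datanode;  // e.g. a socket stream

        RamBufferedChunkStream(OutputStream datanode, int bufferSize) {
            this.datanode = datanode;
            this.buf = new byte[bufferSize];
        }

        @Override
        public void write(int b) throws IOException {
            if (count == buf.length) {
                flushBuffer();  // ship the filled buffer, no /tmp involved
            }
            buf[count++] = (byte) b;
        }

        private void flushBuffer() throws IOException {
            if (count > 0) {
                datanode.write(buf, 0, count);
                count = 0;
            }
        }

        @Override
        public void flush() throws IOException {
            flushBuffer();
            datanode.flush();
        }
    }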

This message is automatically generated by JIRA.