hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1046) Datanode should periodically clean up /tmp from partially received (and not completed) block files
Date Fri, 02 Mar 2007 17:44:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477378 ]

Owen O'Malley commented on HADOOP-1046:

This implies that there is also a failure to clean up blocks properly when exceptions occur.
That should be addressed directly rather than with a timeout. My worry is that the timeout
could fire on a slow datanode and/or client, and then we'd see more mysterious errors as one
thread deletes block files out from under another.
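The race described above can be made concrete with a small sketch. This is illustrative code, not Hadoop's actual implementation: the names (GuardedCleanup, activeWrites, tryDelete) are hypothetical, showing how a cleanup pass could be gated on a registry of in-flight writes so it never deletes a tmp file that a slow writer thread still holds.

```java
import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Hadoop code: a purely age-based cleanup can race
// with a slow writer, so deletion is gated on a registry of in-flight writes.
public class GuardedCleanup {
    // Tmp files with an in-flight write; a real datanode would add an entry
    // when a block write starts and remove it when the block is finalized
    // or the write fails.
    static final Set<String> activeWrites = ConcurrentHashMap.newKeySet();

    // Delete a stale tmp file only if no writer thread still owns it;
    // otherwise leave it alone, however old it looks.
    static boolean tryDelete(File tmpFile) {
        if (activeWrites.contains(tmpFile.getName())) {
            return false; // slow writer still active: do not pull the
                          // file out from under it
        }
        return tmpFile.delete();
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("blk_", ".tmp");
        activeWrites.add(f.getName());
        System.out.println("while active: " + tryDelete(f)); // false: protected
        activeWrites.remove(f.getName());
        System.out.println("after done:   " + tryDelete(f)); // true: deleted
    }
}
```

With such a guard, a timeout-based sweep only touches files whose writer has already registered completion or failure, which addresses the concurrent-deletion worry.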

> Datanode should periodically clean up /tmp from partially received (and not completed)
> block files
> --------------------------------------------------------------------------------------------------
>                 Key: HADOOP-1046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1046
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.9.2, 0.12.0
>         Environment: Cluster of 10 machines, running Hadoop 0.9.2 + Nutch
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.12.0
>         Attachments: fsdataset.patch
> Cluster is set up with tasktrackers running on the same machines as datanodes. Tasks
> create heavy load in terms of local CPU/RAM/diskIO. I noticed a lot of the following
> messages from the datanodes in such situations:
> 2007-02-15 05:30:53,298 WARN  dfs.DataNode - Failed to transfer blk_-4590782726923911824
> to xxx.xxx.xxx/
> java.net.SocketException: Connection reset 
> ....
> java.io.IOException: Block blk_71053993347675204 has already been started (though not
> completed), and thus cannot be created.
> My reading of the code in DataNode.DataXceiver.writeBlock() and FSDataset.writeToBlock()
> (FSDataset.java:459) suggests the following scenario: there is no cleanup of the temporary
> files in /tmp that are used to store incomplete blocks being transferred. If the datanode
> is CPU-starved and drops the connection while creating this temp file, the source datanode
> will attempt to transfer the block again - but a file under this name already exists in
> /tmp, because the target datanode didn't clean up when the connection was dropped.
> I also see that this section is unchanged in trunk/.
> The solution would be to check the age of the physical file in the /tmp dir, in
> FSDataset.java:436 - if it's older than a few hours or so, we should delete it and
> proceed as if there were no ongoing create op for this block.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
