hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1046) Datanode should periodically clean up /tmp from partially received (and not completed) block files
Date Fri, 02 Mar 2007 18:00:53 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12477390 ]

Doug Cutting commented on HADOOP-1046:
--------------------------------------

> Owen: I'm just worried that the timeout could happen on a slow datanode and/or client
> and then we'll have more mysterious errors [...]

Then perhaps we should log a warning whenever this happens?

> Andrzej: the timeout value here is 1 hour - this is way way longer than the ipc.timeout

Perhaps this should then instead be expressed as a function of that or some other parameter
rather than hard-coded?
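
A minimal sketch of how the two suggestions above could fit together: the staleness threshold is derived from the IPC timeout instead of being hard-coded to one hour, and a warning is logged whenever a partial block file is judged stale. The class `StaleBlockPolicy`, its method names, and the multiplier are hypothetical and are not taken from fsdataset.patch; the timeout is passed in as a plain value, on the assumption that the caller reads it from the Hadoop configuration.

```java
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not part of Hadoop or of fsdataset.patch. */
public class StaleBlockPolicy {
    private static final Logger LOG = Logger.getLogger(StaleBlockPolicy.class.getName());

    private final long staleAfterMillis;

    /**
     * @param ipcTimeoutMillis the IPC timeout, as read from configuration by the caller
     * @param multiplier       how many IPC timeouts must elapse before a file counts as stale
     */
    public StaleBlockPolicy(long ipcTimeoutMillis, int multiplier) {
        if (ipcTimeoutMillis <= 0 || multiplier <= 0) {
            throw new IllegalArgumentException("timeout and multiplier must be positive");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        this.staleAfterMillis = Math.multiplyExact(ipcTimeoutMillis, (long) multiplier);
    }

    public long staleAfterMillis() {
        return staleAfterMillis;
    }

    /** True if the file was last modified more than the threshold ago. */
    public boolean isStale(long lastModifiedMillis, long nowMillis) {
        return nowMillis - lastModifiedMillis > staleAfterMillis;
    }

    /** Same check as isStale, but logs a warning when the timeout fires. */
    public boolean checkAndWarn(String blockName, long lastModifiedMillis, long nowMillis) {
        boolean stale = isStale(lastModifiedMillis, nowMillis);
        if (stale) {
            LOG.warning("Discarding partially received block " + blockName
                + ": no progress for more than " + staleAfterMillis + " ms");
        }
        return stale;
    }
}
```

With a 60-second IPC timeout and a multiplier of 60, the threshold works out to the same one hour that is currently hard-coded, but it now scales if the timeout is changed.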

> Datanode should periodically clean up /tmp from partially received (and not completed)
> block files
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-1046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1046
>             Project: Hadoop
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.9.2, 0.12.0
>         Environment: Cluster of 10 machines, running Hadoop 0.9.2 + Nutch
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.12.0
>
>         Attachments: fsdataset.patch
>
>
> Cluster is set up with tasktrackers running on the same machines as datanodes. Tasks
> create heavy load in terms of local CPU/RAM/diskIO. I noticed a lot of the following messages
> from the datanodes in such situations:
> 2007-02-15 05:30:53,298 WARN  dfs.DataNode - Failed to transfer blk_-4590782726923911824 to xxx.xxx.xxx/10.10.16.109:50010
> java.net.SocketException: Connection reset 
> ....
> java.io.IOException: Block blk_71053993347675204 has already been started (though not completed), and thus cannot be created.
> My reading of the code in DataNode.DataXceiver.writeBlock() and FSDataset.writeToBlock()
> + FSDataset.java:459 suggests the following scenario: there is no cleanup of temporary files
> in /tmp that are used to store the incomplete blocks being transferred. If the datanode is
> CPU-starved and drops the connection while creating this temp file, the source datanode will
> attempt to transfer it again - but there is already a file under this name in /tmp, because
> when the connection was dropped the target datanode didn't bother to clean up.
> I also see that this section is unchanged in trunk/.
> The solution to this would be to check the age of the physical file in the /tmp dir,
> in FSDataset.java:436 - if it's older than a few hours or so, we should delete it and proceed
> as if there were no ongoing create op for this block.
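
The age check proposed in the issue description could look roughly like the following. This is a sketch under stated assumptions, not the code in fsdataset.patch: the class `TmpBlockCleaner` and the method `reclaimIfStale` are hypothetical names, and the sketch uses only `java.io.File` rather than any FSDataset internals.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical sketch; names are not taken from FSDataset.java. */
public class TmpBlockCleaner {
    /**
     * Decides whether a new temp file may be created for a block.
     *
     * @return true if no leftover file exists, or a stale leftover was deleted;
     *         false if a recent file exists, i.e. a create may still be in progress
     * @throws IOException if a stale file exists but cannot be deleted
     */
    public static boolean reclaimIfStale(File tmpFile, long maxAgeMillis, long nowMillis)
            throws IOException {
        if (!tmpFile.exists()) {
            return true; // nothing left over from an earlier attempt
        }
        long ageMillis = nowMillis - tmpFile.lastModified();
        if (ageMillis <= maxAgeMillis) {
            return false; // recent enough that another writer may still be active
        }
        if (!tmpFile.delete()) {
            throw new IOException("Could not delete stale temp file " + tmpFile);
        }
        return true; // stale leftover removed; proceed as if no create were ongoing
    }
}
```

A caller would invoke this before reporting that the block "has already been started", and raise that error only when the method returns false.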

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

