hadoop-common-dev mailing list archives

From "Brian Bockelman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4775) FUSE crashes reliably on 0.19.0
Date Fri, 05 Dec 2008 23:25:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653976#action_12653976 ]

Brian Bockelman commented on HADOOP-4775:
-----------------------------------------

Hey Pete,

I'll have our sysadmins try out the 4616 and 4635 patches.

There were no messages in syslog, which suggests it didn't segfault (is this correct?).

Here's what the failure looks like:
http://jobrobot.web.cern.ch/JobRobot/errors_081205.html#T2_US_Nebraska
http://jobrobot.web.cern.ch/JobRobot/errors_081204.html#T2_US_Nebraska

I have a hard time believing that a memory leak alone could disconnect the FUSE endpoint.
A third of the workers have 4GB of RAM, a third have 8GB, and a third have 16GB.  It would
take quite a large leak to cause problems on the 16GB nodes.  Plus, I didn't see the OOM
killer killing anything in dmesg.

I set up a debug FUSE instance on a node and hit it with a similar workflow.  No problems
at all; it may be that, in debug mode, FUSE doesn't allow multiple threads?

My suspicion is that either FUSE-DFS or libhdfs has a problem with error recovery which causes
an infinite loop (like we've seen in other places).  The interesting thing about the "ps" output
I showed above is that the fuse_dfs process was using 30% CPU *when nothing was using FUSE*
and the node wasn't swapping.

Nagios now restarts FUSE-DFS whenever the problem occurs, so I don't get much of a chance
to debug.  Still, about 7% of our jobs die because FUSE conks out mid-job.
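
For anyone who wants to catch the failure before the restart kicks in, here is a minimal
sketch of how a mount check could tell the "Transport endpoint not connected" state apart
from other errors.  This is not the actual Nagios check we run; the function name and the
stat_fn parameter are made up for illustration, and the mount path is whatever you pass in.
It relies only on the fact that a stat() on a dead FUSE mount fails with ENOTCONN.

```python
import errno
import os


def fuse_mount_status(path, stat_fn=os.stat):
    """Classify a mount point as 'ok', 'disconnected', or 'error'.

    stat_fn is injectable (hypothetical parameter) so the check can be
    exercised without a real FUSE mount.
    """
    try:
        stat_fn(path)
    except OSError as exc:
        # ENOTCONN is the errno behind "Transport endpoint is not connected".
        if exc.errno == errno.ENOTCONN:
            return "disconnected"
        # Anything else (missing path, permission problem, ...) is reported
        # separately so it is not mistaken for a FUSE disconnect.
        return "error"
    return "ok"
```

Logging the "disconnected" result (and grabbing a stack trace of fuse_dfs) before
restarting would at least leave something behind to debug.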

> FUSE crashes reliably on 0.19.0
> -------------------------------
>
>                 Key: HADOOP-4775
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4775
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/fuse-dfs
>            Reporter: Brian Bockelman
>            Priority: Critical
>
> Every morning I come in and find many nodes which have developed the dreaded "Transport
> endpoint not connected" error overnight.  This has only started after the 0.19.0 upgrade.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

