hbase-user mailing list archives

From Jack Levin <magn...@gmail.com>
Subject Re: Question about dead datanode
Date Sun, 16 Feb 2014 04:01:27 GMT
Looks like I patched it in DFSClient.java; here is the patch:
https://gist.github.com/anonymous/9028934

So, the issue was this:

public class DFSInputStream is the class that is run per thread, and it
maintained a 'deadNodes' list of datanodes that had problems (in our
case a datanode lost power and was down).  Since each thread that ran
DFSInputStream had its own deadNodes instance that started out empty,
there were _tons_ of errors (over a period of 4 days!).  My changes
are simple.

I moved the 'deadNodes' list out into a global field that is accessible
by all running threads, so the moment a datanode goes down, every
thread is informed that the datanode _is_ down.
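
Roughly, the first change looks like this (a sketch with illustrative
names, not the literal patch; the real diff is in the gist above):

    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    class DFSInputStreamSketch {
      // Before: a per-instance map, so each reader thread rediscovered
      // the dead datanode on its own and logged its own errors:
      //   private ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes = ...;
      //
      // After: one shared, thread-safe map visible to all running threads.
      private static final ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes =
          new ConcurrentHashMap<DatanodeInfo, DatanodeInfo>();

      void addToDeadNodes(DatanodeInfo dnInfo) {
        // Any thread marking a node dead makes it dead for every thread.
        deadNodes.put(dnInfo, dnInfo);
      }
    }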

I did not want to mess with the caching of locatedBlocks, so I
installed a dampening counter that keeps track of how many times
DFSClient tries to access a 'bad/dead' datanode; I arbitrarily chose
the value '10'.  After 10 attempts the DFSClient resumes trying to
contact the datanode, by which time it is hopefully back up.
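
The dampening part, again as a rough sketch with made-up names
(DeadNodeDampener, shouldSkip, and RETRY_THRESHOLD are mine): every
read that would skip a dead node bumps its counter, and once the
counter hits the threshold the node is dropped from deadNodes and
tried again:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    class DeadNodeDampener {
      private static final int RETRY_THRESHOLD = 10; // arbitrarily chosen

      private final ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes;
      private final ConcurrentHashMap<DatanodeInfo, AtomicInteger> skipCounts =
          new ConcurrentHashMap<DatanodeInfo, AtomicInteger>();

      DeadNodeDampener(ConcurrentHashMap<DatanodeInfo, DatanodeInfo> deadNodes) {
        this.deadNodes = deadNodes;
      }

      // Returns true if the node should still be avoided.
      boolean shouldSkip(DatanodeInfo dn) {
        if (!deadNodes.containsKey(dn)) {
          return false; // node is considered healthy
        }
        AtomicInteger count = skipCounts.get(dn);
        if (count == null) {
          AtomicInteger fresh = new AtomicInteger(0);
          count = skipCounts.putIfAbsent(dn, fresh);
          if (count == null) {
            count = fresh;
          }
        }
        if (count.incrementAndGet() >= RETRY_THRESHOLD) {
          // After 10 attempts, give the node another chance: drop it
          // from deadNodes and reset its counter, matching the "Remove
          // Node from deadNodes ... at counter" lines in the log below.
          deadNodes.remove(dn);
          count.set(0);
          return false;
        }
        return true; // still dampened; pick another replica instead
      }
    }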

In summary, all threads are informed of bad datanodes, so there are no
attempts to contact one unless its entry in the <datanode, count>
counter is greater than 10.  An even better solution would have been
to also invalidate the locatedBlocks cache, but this is already a huge
improvement.

Here is the log of my testing in our live cluster:

At 19:34:42, I kill the datanode and it is put on the deadNodes list;
at 19:47:05, it is back up and the counter is > 10, so it is used again.


2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Failed
to connect to /10.101.5.5:50010 for file
/hbase/img863/36b17cc018e4b8494ef700523628054a/att/7640828832753135438
for block -4025527892682081728: Will add to deadNodes:
java.net.ConnectException: Connection refused
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Adding
server to deadNodes, maybe? 10.101.5.5:50010
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient: Inside
addToDeadNodes Print All DeadNodes:: 10.101.5.5:50010
2014-02-15 19:34:42,036 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:35:49,881 WARN org.apache.hadoop.hdfs.DFSClient: Inside
addToDeadNodes Print All DeadNodes:: 10.101.5.5:50010
2014-02-15 19:36:32,547 WARN org.apache.hadoop.hdfs.DFSClient: Remove
Node from deadNodes:: 10.103.2.5:50010 at counter
{10.103.2.5:50010=10, 10.101.5.5:50010=1}
2014-02-15 19:39:23,662 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,878 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,944 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,962 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:39:23,979 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,667 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,708 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,718 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:45:15,933 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient:
Considering Node:: 10.101.5.5:50010
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient: Remove
Node from deadNodes:: 10.101.5.5:50010 at counter {10.103.2.5:50010=0,
10.101.5.5:50010=10}
2014-02-15 19:47:05,686 WARN org.apache.hadoop.hdfs.DFSClient: Found
bestNode:: 10.101.5.5:50010
2014-02-15 19:47:05,686 INFO org.apache.hadoop.hdfs.DFSClient:
Datanode available for block: 10.101.5.5:50010


-Jack

On Fri, Feb 14, 2014 at 10:16 AM, Jack Levin <magnito@gmail.com> wrote:
> I found the code path that does not work and patched it. Will report if it
> fixes the problem.
>
> On Feb 14, 2014 8:19 AM, "Jack Levin" <magnito@gmail.com> wrote:
>>
>> 0.20.2-cdh3u2 --
>>
>> "add to deadNodes and continue" would solve this issue.  For some reason
>> its not getting into this code path.
>>
>> If it's a matter of adding a quick line of code to make this work, then we
>> would rather recompile with that and upgrade later when we have a better
>> backup.
>>
>> -Jack
>>
>>
>> On Thu, Feb 13, 2014 at 10:55 PM, Stack <stack@duboce.net> wrote:
>>>
>>> On Thu, Feb 13, 2014 at 9:18 PM, Jack Levin <magnito@gmail.com> wrote:
>>>
>>> > One other question, we get this:
>>> >
>>> > 2014-02-13 02:46:12,768 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>>> > to
>>> > connect to /10.101.5.5:50010 for file
>>> > /hbase/img32/b97657bfcbf922045d96315a4ada0782/att/4890606694307129591
>>> > for
>>> > block -9099107892773428976:java.net.SocketTimeoutException: 60000
>>> > millis
>>> > timeout while waiting for channel to be ready for connect. ch :
>>> > java.nio.channels.SocketChannel[connection-pending remote=/
>>> > 10.101.5.5:50010]
>>> >
>>> >
>>> > Why can't RS do this instead:
>>> >
>>> >
>>> > hbase-root-regionserver-mtab5.prod.imageshack.com.log.2014-02-10:2014-02-10
>>> > 22:05:11,763 INFO org.apache.hadoop.hdfs.DFSClient: Failed to connect
>>> > to /
>>> > 10.103.8.109:50010, add to deadNodes and continue
>>> >
>>> > "add to deadNodes and continue" specifically?
>>> >
>>>
>>>
>>> The regionserver runs on the HDFS API.  The implementations can vary.
>>> The
>>> management of nodes -- their coming and going -- is done inside the HDFS
>>> client code.  The regionserver is insulated from all that goes on
>>> therein.
>>>
>>> What version of HDFS are you on, Jack?
>>>
>>> St.Ack
>>
>>
>
