hadoop-common-dev mailing list archives

From "Konstantin Shvachko (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3633) Uncaught exception in DataXceiveServer
Date Thu, 26 Jun 2008 02:40:45 GMT

https://issues.apache.org/jira/browse/HADOOP-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608254#action_12608254

Konstantin Shvachko commented on HADOOP-3633:

True. It looks like that one data-node received ~2000 blocks in one second.
This was the destination node during block replication, so many data-nodes were sending blocks to it.
I don't know why this happened. Maybe there is a flaw in the random number generator in
ChooseTargets(), or it could be that most of the other nodes in the cluster are pretty much full.
This occurred between two heartbeats, when the name-node had not yet received the information
that this particular data-node is too busy.

I propose introducing a parameter on the data-node that would limit the number of concurrent
BlockReceives the data-node can handle.
This means that if D1 sends a block to D2 and D2 is already receiving its maximum number of blocks, then D2
sends a BusyException back to D1 and the transfer fails. The name-node will later reschedule the block
to be replicated to another node, which happens now anyway because D2 is too slow and D1 gets a
SocketTimeoutException (after 8 minutes).
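The admission check in that proposal can be sketched roughly as follows. This is not actual Hadoop code; the class and method names (ReceiveLimiter, tryBeginReceive, endReceive) are hypothetical, and the real patch would also need to wire the rejection into the data-transfer protocol so D1 actually sees a BusyException.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap the number of concurrent block receives on a
// data-node. When the cap is reached, the caller would send a BusyException
// back to the sending data-node instead of accepting the block.
public class ReceiveLimiter {
    private final int maxConcurrent;            // the proposed configurable limit
    private final AtomicInteger active = new AtomicInteger(0);

    public ReceiveLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
    }

    /** Try to admit one more block receive; caller must call endReceive() when done. */
    public boolean tryBeginReceive() {
        while (true) {
            int cur = active.get();
            if (cur >= maxConcurrent) {
                return false;                   // at the limit: reject, reply busy
            }
            if (active.compareAndSet(cur, cur + 1)) {
                return true;                    // admitted
            }
            // lost a race with another thread; retry
        }
    }

    /** Release one slot after a block receive finishes or fails. */
    public void endReceive() {
        active.decrementAndGet();
    }
}
```

A rejected transfer is cheap for both sides: D1 fails fast and the name-node reschedules the replication, instead of D1 blocking for the full socket timeout.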

> Uncaught exception in DataXceiveServer
> --------------------------------------
>                 Key: HADOOP-3633
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3633
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.17.0
>         Environment: 17.0 + H1979-H2159-H3442
>            Reporter: Koji Noguchi
>         Attachments: jstack-H3633.txt
> Observed dfsclients timing out to some datanodes.
> The datanode's '.out' file had
> {noformat}
> Exception in thread "org.apache.hadoop.dfs.DataNode$DataXceiveServer@82d37" java.lang.OutOfMemoryError: unable to create new native thread
>   at java.lang.Thread.start0(Native Method)
>   at java.lang.Thread.start(Thread.java:597)
>   at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:906)
>   at java.lang.Thread.run(Thread.java:619)
> {noformat}
> Datanode was still running but showed little activity besides block verification.
> Jstack showed no DataXceiveServer running.
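The failure mode in the report is that the OutOfMemoryError thrown by Thread.start() escapes the accept loop, so the server thread dies while the rest of the process keeps running. A minimal sketch of hardening such a loop, assuming a hypothetical BoundedSpawner helper rather than the real DataXceiveServer internals, is to bound thread creation with a semaphore and treat the error as a rejection:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: bound the number of worker threads a server spawns,
// and survive "unable to create new native thread" instead of letting the
// error kill the accept loop.
public class BoundedSpawner {
    private final Semaphore permits;

    public BoundedSpawner(int maxThreads) {
        this.permits = new Semaphore(maxThreads);
    }

    /** Run the task on a new thread if under the limit; return false to reject. */
    public boolean trySpawn(Runnable task) {
        if (!permits.tryAcquire()) {
            return false;                       // at the limit: reject the connection
        }
        try {
            Thread t = new Thread(() -> {
                try {
                    task.run();
                } finally {
                    permits.release();          // free the slot when the worker exits
                }
            });
            t.start();
            return true;
        } catch (OutOfMemoryError e) {
            // Thread.start() can still fail for native reasons; reject
            // rather than propagating and killing the server thread.
            permits.release();
            return false;
        }
    }
}
```

With the limit from the comment above in place, the process would typically never reach the native-thread ceiling, so the catch clause is a last-resort guard rather than the primary control.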

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
