hadoop-common-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1924) [hbase] TestDFSAbort failed in nightly #242
Date Thu, 04 Oct 2007 23:52:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532549
] 

stack commented on HADOOP-1924:
-------------------------------

I posted the message below to the list.

On Hudson, we've been seeing tests sporadically hang on an ipc Client flush of params.  I'm
writing the list for suggestions or opinions on what folks think might be happening or ideas
on what to try next.  See below for the latest example: a thread dump from a recent patch
build.

The usual scenario is that we are trying to simulate failed servers in a mini-cluster.  All
servers -- hbase + dfs servers -- are up and running inside the same JVM.  The remote ipc
Server will all of a sudden have its stop method run to simulate a server crash.  The Client,
unaware, tries to go about its usual business.

   [junit] "HMaster.metaScanner" daemon prio=10 tid=0x091ecde0 nid=0x4a runnable [0xe2af9000..0xe2af9b38]
   [junit]     at java.net.SocketOutputStream.socketWrite0(Native Method)
   [junit]     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
   [junit]     at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
   [junit]     at org.apache.hadoop.ipc.Client$Connection$2.write(Client.java:190)
   [junit]     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
   [junit]     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
   [junit]     - locked <0xf7bb40e0> (a java.io.BufferedOutputStream)
   [junit]     at java.io.DataOutputStream.flush(DataOutputStream.java:106)
   [junit]     at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:325)
   [junit]     - locked <0xf7bb3f68> (a java.io.DataOutputStream)
   [junit]     at org.apache.hadoop.ipc.Client.call(Client.java:462)
   [junit]     - locked <0xf7bb3fa8> (a org.apache.hadoop.ipc.Client$Call)
   [junit]     at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:165)
   [junit]     at $Proxy8.openScanner(Unknown Source)
   [junit]     at org.apache.hadoop.hbase.HMaster$BaseScanner.scanRegion(HMaster.java:207)
   [junit]     at org.apache.hadoop.hbase.HMaster$MetaScanner.scanOneMetaRegion(HMaster.java:643)
   [junit]     - locked <0xf7b6b460> (a java.lang.Integer)
   [junit]     at org.apache.hadoop.hbase.HMaster$MetaScanner.maintenanceScan(HMaster.java:694)
   [junit]     at org.apache.hadoop.hbase.HMaster$BaseScanner.chore(HMaster.java:188)
   [junit]     at org.apache.hadoop.hbase.Chore.run(Chore.java:59)


Other threads in the thread dump will be parked at the DataOutputStream synchronized block.

Please correct me if I am wrong, but it is my understanding that socket writes do not time out,
nor is this type of I/O interruptible.  The connection is probably already established, else it
would have timed out trying to connect to the non-existent server.  Besides, the ipc Client
pattern seems to be to keep the connection up, multiplexing 'commands' to the remote server...

I'm wondering why we don't get an exception on the client side when the remote side of the socket
goes away.
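
For illustration only, here is a minimal sketch of how a write can hang like the one in the dump. It assumes (this is a guess, not something confirmed for the ipc Server) that the stopped server keeps its accepted socket open but no longer reads from it. The class and field names (BlockedWriteDemo, Result, stuckAfterInterrupt, releasedByClose) are made up for the sketch. SO_TIMEOUT is set on the client to show that it bounds reads only; the write still blocks once the TCP buffers fill, and closing the socket from another thread is what releases the writer.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicLong;

public class BlockedWriteDemo {

    /** Outcome of one run; the names are hypothetical, chosen for this sketch. */
    public static final class Result {
        public final boolean stuckAfterInterrupt;
        public final boolean releasedByClose;

        Result(boolean stuckAfterInterrupt, boolean releasedByClose) {
            this.stuckAfterInterrupt = stuckAfterInterrupt;
            this.releasedByClose = releasedByClose;
        }
    }

    public static Result run() throws Exception {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             // Assumption: the "stopped" server holds the socket open and never reads.
             Socket accepted = server.accept()) {

            client.setSoTimeout(1000); // bounds reads only; writes are unaffected

            AtomicLong written = new AtomicLong();
            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = client.getOutputStream();
                    while (true) {
                        out.write(chunk); // blocks once send and receive buffers are full
                        written.addAndGet(chunk.length);
                    }
                } catch (IOException e) {
                    // Reached when the socket is closed underneath the writer.
                }
            }, "blocked-writer");
            writer.setDaemon(true);
            writer.start();

            // Wait until the byte count stops growing, i.e. the writer is stuck.
            long last = -1;
            while (written.get() != last) {
                last = written.get();
                Thread.sleep(500);
            }

            // Interrupting a platform thread is not expected to release the write.
            writer.interrupt();
            writer.join(1000);
            boolean stuckAfterInterrupt = writer.isAlive();

            // Closing the socket from another thread makes the blocked write throw.
            client.close();
            writer.join(5000);
            return new Result(stuckAfterInterrupt, !writer.isAlive());
        }
    }

    public static void main(String[] args) throws Exception {
        Result r = run();
        System.out.println("still blocked after interrupt: " + r.stuckAfterInterrupt);
        System.out.println("released by close():           " + r.releasedByClose);
    }
}
```

If this is what is happening in the tests, a watchdog that closes the connection after a deadline would be one way to get the caller unstuck; whether that suits the ipc Client is a separate question.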

Am unable to reproduce locally.

Thanks for any input,
St.Ack 


Also, yesterday's patch added thread dumping every 30 seconds while waiting on the terminate condition.
A patch build from last night had the same hang: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/887/testReport/
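
As a rough sketch of the periodic thread dumping mentioned above (this is not the actual patch; the class name PeriodicThreadDump and its methods are invented for illustration), a scheduled task can print every live thread's stack at a fixed interval using Thread.getAllStackTraces():

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicThreadDump {

    /** Formats the stack of every live thread, roughly in the style of a JVM thread dump. */
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" ")
              .append(t.isDaemon() ? "daemon " : "")
              .append("state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    /** Starts a daemon task that prints a dump to stderr every periodSeconds seconds. */
    public static ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "thread-dumper");
            t.setDaemon(true); // do not keep the JVM alive just for dumping
            return t;
        });
        ses.scheduleAtFixedRate(() -> System.err.print(dump()),
                periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService ses = start(30); // 30 seconds, as in the text above
        System.err.print(dump());                 // one immediate dump for the demo
        ses.shutdownNow();
    }
}
```

A test would call start(30) before waiting on its terminate condition and shutdownNow() once the wait completes.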

TestDFSAbort has just been removed.

> [hbase] TestDFSAbort failed in nightly #242
> -------------------------------------------
>
>                 Key: HADOOP-1924
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1924
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: stack
>            Assignee: stack
>            Priority: Minor
>         Attachments: testdfsabort.patch, testdfsabort_patchbuild798.txt
>
>
> TestDFSAbort and TestBloomFilters failed in last night's nightly build (#242).  This issue
is about trying to figure out what's up w/ TDFSA.
> Studying console logs, HRegionServer stopped logging any activity and HMaster for its
part did not expire the HRegionServer lease.  On top of it all, continued tests of the state
of HDFS -- the test is meant to ensure HBase shuts down when HDFS is pulled from under it -- seem
to have kept reporting it healthy though it had been closed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

