hbase-user mailing list archives

From "Veeramachaneni, Ravi" <ravi.veeramachan...@navteq.com>
Subject RE: regionserver crash under heavy load
Date Wed, 14 Jul 2010 02:04:35 GMT
Just curious, how big is the load we are talking about? 100s or 1000s of inserts/second? We
are planning on moving to CDH3 with HBase soon.

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, July 13, 2010 6:24 PM
To: user@hbase.apache.org
Subject: Re: regionserver crash under heavy load

Your region server doesn't look very loaded from the metrics point of
view. But I specifically asked for the lines around that, not just the
dump itself, since they will contain the reason for the shutdown.

> I do notice that the disk usage is pretty high.  I am just thinking that our
> problem is probably a hardware limit, but the server should not crash when
> that limit is reached.

We still don't know why it crashed, and it may not even be related to
HW limits; we need those bigger log traces. Also, use pastebin.com or
anything like that.

>
> do you have any idea when CDH3 official release will be out ?

I don't work for Cloudera, but IIRC the next CDH3 beta is due in September.

>
> Jimmy
>
> --------------------------------------------------
> From: "Jean-Daniel Cryans" <jdcryans@apache.org>
> Sent: Tuesday, July 13, 2010 2:55 PM
> To: <user@hbase.apache.org>
> Subject: Re: regionserver crash under heavy load
>
>> Please use a pasting service for the log traces. I personally use
>> pastebin.com
>>
>> You probably had a GC pause that lasted too long; this is something out of
>> the control of the application (apart from trying to put as little data in
>> memory as possible, but you are inserting, so...). Your log doesn't
>> contain enough information for us to tell; please look for a "Dump of
>> metrics" line and paste the lines around that.
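>>
>> To confirm a long pause, the usual route is to turn on GC logging (e.g. add
>> -verbose:gc -XX:+PrintGCDetails to HBASE_OPTS in conf/hbase-env.sh) and
>> look for multi-second pauses right before the crash. The same counters are
>> also visible through the standard JVM management API; a tiny sketch (it
>> only sees the JVM it runs in, so it would have to run inside the region
>> server process to be useful):
>>
>>   import java.lang.management.GarbageCollectorMXBean;
>>   import java.lang.management.ManagementFactory;
>>
>>   public class GcStats {
>>     public static void main(String[] args) {
>>       // Cumulative count and total time for each collector in this JVM.
>>       for (GarbageCollectorMXBean gc :
>>           ManagementFactory.getGarbageCollectorMXBeans()) {
>>         System.out.println(gc.getName() + ": " + gc.getCollectionCount()
>>             + " collections, " + gc.getCollectionTime() + " ms total");
>>       }
>>     }
>>   }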
>>
>> J-D
>>
>> On Tue, Jul 13, 2010 at 2:49 PM, Jinsong Hu <jinsong_hu@hotmail.com>
>> wrote:
>>>
>>> Hi, Todd:
>>> I downloaded hadoop-0.20.2+320 and hbase-0.89.20100621+17 from CDH3 and
>>> inserted data at full load; after a while the HBase region server crashed.
>>> I checked the system with "iostat -x 5" and noticed the disk is pretty
>>> busy. Then I modified my client code, reduced the insertion rate by a
>>> factor of 6, and the test runs fine.  Is there any way the region server
>>> could be modified so that it at least doesn't crash under heavy load?  I
>>> used the Apache HBase 0.20.5 distribution and the same problem happens.
>>> I am thinking that when the region server is too busy, it should throttle
>>> the incoming data rate to protect itself.  Could this be done?
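>>>
>>> The client-side throttling I mean can be as simple as pacing the puts,
>>> along the lines of the sketch below (not my exact code; the cap of 500
>>> puts/second and the "f"/"q" column names are placeholders, and the API is
>>> the 0.20-era client):
>>>
>>>   import org.apache.hadoop.hbase.HBaseConfiguration;
>>>   import org.apache.hadoop.hbase.client.HTable;
>>>   import org.apache.hadoop.hbase.client.Put;
>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>
>>>   public class PacedInserter {
>>>     public static void main(String[] args) throws Exception {
>>>       HTable table = new HTable(new HBaseConfiguration(),
>>>           "Spam_MsgEventTable");
>>>       final int maxPutsPerSecond = 500; // arbitrary cap, tune to cluster
>>>       long windowStart = System.currentTimeMillis();
>>>       int sent = 0;
>>>       for (long i = 0; i < 1000000; i++) {
>>>         Put put = new Put(Bytes.toBytes("row-" + i));
>>>         put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
>>>         table.put(put);
>>>         if (++sent >= maxPutsPerSecond) {
>>>           long elapsed = System.currentTimeMillis() - windowStart;
>>>           if (elapsed < 1000) {
>>>             Thread.sleep(1000 - elapsed); // sleep off the rest of the window
>>>           }
>>>           windowStart = System.currentTimeMillis();
>>>           sent = 0;
>>>         }
>>>       }
>>>       table.close();
>>>     }
>>>   }
>>>
>>> Pacing on the client is crude, but it at least bounds the pressure put on
>>> the region servers.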
>>>  Do you also know when the official CDH3 release will come out? The one I
>>> downloaded is a beta version.
>>>
>>> Jimmy
>>>
>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegion: Closed Spam_MsgEventTable,56-2010-05-19 10:09:02\x099a420f4f31748828fd24aeea1d06b294,1278973678315.01dd22f517dabf53ddd135709b68ba6c.
>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: aborting server at: m0002029.ppops.net,60020,1278969481450
>>> 2010-07-13 02:24:34,389 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Closed connection with ZooKeeper; /hbase/root-region-server
>>> 2010-07-13 02:24:34,389 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020 exiting
>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-10,5,main]
>>> 2010-07-13 02:24:34,608 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown hook thread.
>>> 2010-07-13 02:24:34,608 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/m0002029.ppops.net,60020,1278969481450/10.110.24.79%3A60020.1278987220794 : java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
>>> java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: IOException flush:java.io.IOException: Error Recovery for block blk_-1605696159279298313_2395924 failed because recovery from primary datanode 10.110.24.80:50010 failed 6 times. Pipeline was 10.110.24.80:50010. Aborting...
>>>      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3214)
>>>      at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
>>>      at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
>>>      at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown Source)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:124)
>>>      at org.apache.hadoop.hbase.regionserver.wal.HLog.hflush(HLog.java:826)
>>>      at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1004)
>>>      at org.apache.hadoop.hbase.regionserver.wal.HLog.append(HLog.java:817)
>>>      at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchPut(HRegion.java:1531)
>>>      at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1447)
>>>      at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1703)
>>>      at org.apache.hadoop.hbase.regionserver.HRegionServer.multiPut(HRegionServer.java:2361)
>>>      at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
>>>      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>      at java.lang.reflect.Method.invoke(Method.java:597)
>>>      at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:576)
>>>      at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:919)
>>> 2010-07-13 02:24:34,610 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/Spam_MsgEventTable/079c7de876422e57e5f09fef5d997e06/.tmp/6773658134549268273 : java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>> java.io.IOException: All datanodes 10.110.24.80:50010 are bad. Aborting...
>>>      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2603)
>>>      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>>>      at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
>>> 2010-07-13 02:24:34,729 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook finished.
>>>
>>
>

