hbase-user mailing list archives

From schubert zhang <zson...@gmail.com>
Subject Re: RegionServer failure and recovery take a long time
Date Sat, 21 Mar 2009 17:34:31 GMT
Jean Daniel,
Thanks for your kindness.
Yes, I want more machines, and we will get them soon. :-)
My application is very write-heavy. Since my cluster is really small, I
will slow down the inserts for now.

One more question about your patch HBASE-1008, which is really helpful for me:
does this patch use more memory? It does not seem to be based on 0.19.1. Can
it be applied to 0.19.1?

Schubert

On Sun, Mar 22, 2009 at 12:47 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Schubert,
>
> I have no problem at all with your English, since my first language is
> French and I must be making loads of grammatical errors too ;)
>
> Regarding the heap, make sure that 300MB is enough memory for your tasks, or
> you might get an OOME.
>
> Increasing the lease period is a good idea, I have done the same. Our
> jobs take 13 hours so it avoids many restarts.
>
> Swappiness at 0 => no swap at all... so if your system needs to swap
> you might be in trouble. The advantage I see in a very low swappiness
> value (but not 0) is that it will only swap if absolutely necessary.
>
> On a final note, using the block caching feature is a bit of a risk
> in versions < 0.20. It does make random reads a lot faster (most of
> the time) but the eviction of blocks produces a lot of garbage. The
> guys from Stream.com are implementing something better at this very
> moment.
>
> You may also want more machines :P. 6 is a very small number; we
> usually see a lot more stability past 10. Or instead you might want
> to slow down the inserts... It's good to be realistic about the
> stress you put on the cluster versus the actual resources.
>
> J-D
>
> On Sat, Mar 21, 2009 at 12:24 PM, schubert zhang <zsongbo@gmail.com>
> wrote:
> > Hi Jean Daniel,
> > Your help is so great. Thank you very much.
> >
> > After reading the HBase Troubleshooting page
> > (http://wiki.apache.org/hadoop/Hbase/Troubleshooting), I also suspected
> > the garbage collector and added the -XX:+UseConcMarkSweepGC option 4
> > hours ago. I checked the region servers just now; one had shut down from
> > the same cause. But it's better than before.
> >
> > Now I will do the following tuning according to your guide:
> > (1)  add -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode, and study the GC
> > in detail.
> > (2)  decrease the heap size of the MapReduce child. I am using 1024MB now
> > and want to change it to 300MB.
> > (3)  increase the lease period of the master to 180 sec.
> > (4)  apply your great patch.
> >
> > By the way, to avoid swap, I have now changed vm.swappiness to 0 (you
> > told me 20 in another email). Do you think that is OK?
> >
> > Thank you again. My English is not good, please bear with me.
> >
> > Schubert
> >
> > On Sat, Mar 21, 2009 at 11:39 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Schubert,
> >>
> >> Yeah that's the good old problem with the garbage collector. In your
> >> logs I see a lot of :
> >>
> >> 2009-03-21 05:59:06,498 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >> slept 144233ms, ten times longer than scheduled: 3000
> >> 2009-03-21 05:59:06,600 WARN
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report
> >> to master for 144335 milliseconds - retrying
> >> 2009-03-21 05:59:06,512 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >> slept 143279ms, ten times longer than scheduled: 10000
> >> 2009-03-21 05:59:06,701 INFO
> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> MSG_CALL_SERVER_STARTUP: safeMode=false
> >>
> >> That usually means that the garbage collector blocked all threads to
> >> do its stuff. But, when it happens, it takes more time than the lease
> >> the master maintains on the region servers (120 sec) so the master
> >> considers this region server dead. Then the log splitting takes
> >> over on the master, which is a very long process. During that
> >> time, sometimes more than 10 minutes, the regions from that region
> >> server are unavailable. If the cluster is small, that makes things
> >> even worse.
> >>
> >> We had these kinds of errors on our cluster during the last weeks and
> >> here is how I solved them:
> >>
> >> - Regarding the log splitting, I suggest you take a look at this issue
> >> https://issues.apache.org/jira/browse/HBASE-1008 as it has a patch I
> >> made to speed up the process. See if it helps you.
> >>
> >> - Regarding the garbage collector, I found that the options
> >> "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode" were really really
> >> helpful. See
> >> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
> >> for more information. Set this in the hbase-env.sh file on the line
> >> export HBASE_OPTS=...
> >>
> >> - Finally, to make sure that the garbage collection is fast, check if
> >> there is swap. If so, set lower heaps for the MR child processes in
> >> hadoop-site.xml (the mapred.child.java.opts property).
> >>
> >> J-D
> >>
> >> On Sat, Mar 21, 2009 at 2:31 AM, schubert zhang <zsongbo@gmail.com> wrote:
> >> > Hi Jean Daniel,
> >> >
> >> > I want your help with this issue. I attach the log files; please help
> >> > analyse them. Thanks.
> >> >
> >> > Test env.
> >> >  5+1 nodes cluster.
> >> >  table: create 'TESTA', {NAME => 'info', VERSIONS => 1,
> >> > COMPRESSION => 'BLOCK', BLOCKCACHE => true}
> >> >
> >> > All test data is generated randomly by a program.
> >> >
> >> > HRegionServer failure (2009-03-21 08:27:41,090):
> >> >  After running for about 8 hours, the region server on node-5 failed and
> >> > the HRegionServer shut down.
> >> >  It seems to have been caused by DFSClient exceptions. (I cannot work out
> >> > what happened on HDFS, but HDFS itself seems OK.)
> >> >
> >> > Then I started the HRegionServer on node-5 (2009-03-21 10:53:42,747):
> >> >  After the HRegionServer started, regions were reassigned. I can see the
> >> > reassignment in the HBase web UI, since some regions are now on this node.
> >> >  But the following things were blocked for a long time:
> >> >  (1) The HBase client application could not insert data for a long time
> >> > (until 2009/03/21 11:11:27, about 18 minutes). The application side
> >> > (a MapReduce job) got a RetriesExhaustedException.
> >> >  (2) Some regions could not be accessed (I could not scan/get rows in these
> >> > regions). The exception is NotServingRegionException in getRegion.
> >> >  (3) I checked the history of the region from (2) in the web UI. The
> >> > history shows it was assigned at 11:04:15, which is very late.
> >> > The history is:
> >> > Sat, 21 Mar 2009 11:10:39  open        Region opened on server : nd1-rack0-cloud
> >> > Sat, 21 Mar 2009 11:04:15  assignment  Region assigned to server 10.24.1.12:60020
> >> > Sat, 21 Mar 2009 06:48:03  open        Region opened on server : nd1-rack0-cloud
> >> > Sat, 21 Mar 2009 06:47:57  assignment  Region assigned to server 10.24.1.12:60020
> >> > Sat, 21 Mar 2009 06:27:25  open        Region opened on server : nd5-rack0-cloud
> >> > Sat, 21 Mar 2009 06:27:21  assignment  Region assigned to server 10.24.1.20:60020
> >> > Sat, 21 Mar 2009 06:26:13  open        Region opened on server : nd5-rack0-cloud
> >> > Sat, 21 Mar 2009 06:24:53  assignment  Region assigned to server 10.24.1.20:60020
> >> > Sat, 21 Mar 2009 06:24:28  open        Region opened on server : nd3-rack0-cloud
> >> > Sat, 21 Mar 2009 06:24:13  assignment  Region assigned to server 10.24.1.16:60020
> >> > Sat, 21 Mar 2009 06:19:08  open        Region opened on server : nd4-rack0-cloud
> >> > Sat, 21 Mar 2009 06:19:02  assignment  Region assigned to server 10.24.1.18:60020
> >> > Sat, 21 Mar 2009 05:59:39  open        Region opened on server : nd5-rack0-cloud
> >> > Sat, 21 Mar 2009 05:59:36  assignment  Region assigned to server 10.24.1.20:60020
> >> > Sat, 21 Mar 2009 03:50:15  open        Region opened on server : nd3-rack0-cloud
> >> > Sat, 21 Mar 2009 03:50:12  assignment  Region assigned to server 10.24.1.16:60020
> >> > Sat, 21 Mar 2009 03:50:08  split       Region split from:
> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237569164012
> >> > <http://nd0-rack0-cloud:60010/regionhistorian.jsp?regionname=CDR,13576334163@2009-03-21%2000:35:57.526,1237569164012>
> >> >
> >> >
> >> > And the following is the exception I get when I scan a row key range.
> >> >
> >> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> >> > region server 10.24.1.12:60020 for region
> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237578615553, row
> >> > '13576334163@2009-03-21 00:35:57.526', but failed after 5 attempts.
> >> > Exceptions:
> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237578615553
> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(Unknown Source)
> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(Unknown Source)
> >> >        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >> >        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(Unknown Source)
> >> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(Unknown Source)
> >> >
> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237578615553
> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(Unknown Source)
> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(Unknown Source)
> >> >        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >> >        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(Unknown Source)
> >> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(Unknown Source)
> >> >
> >> > I will send the log files to your email address.
> >> >
> >>
> >
>
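
The Sleeper warnings quoted in the thread can be checked against the master
lease mechanically. What follows is only a minimal Python sketch and is not
part of HBase or of the original messages: the function name
find_long_pauses, the regular expression, and the way mail quote markers are
stripped are all assumptions made for illustration. The 120 sec default is
the lease period J-D mentions, and 180 sec is the value Schubert plans to set.

```python
import re

# Matches the HBase Sleeper warning text quoted in the thread, e.g.
# "We slept 144233ms, ten times longer than scheduled: 3000".
SLEEPER_RE = re.compile(
    r"We slept (\d+)ms, ten times longer than scheduled: (\d+)"
)


def find_long_pauses(log_text, lease_ms=120_000):
    """Return (slept_ms, scheduled_ms) pairs whose pause exceeded the lease.

    lease_ms defaults to the 120 sec master lease mentioned in the thread.
    """
    # Hypothetical clean-up step: drop mail quote markers ("> >> ") and
    # re-join lines that the mail client wrapped.
    lines = [re.sub(r"^[>\s]+", "", line) for line in log_text.splitlines()]
    flat = " ".join(" ".join(lines).split())
    pauses = [(int(slept), int(sched)) for slept, sched in SLEEPER_RE.findall(flat)]
    return [pause for pause in pauses if pause[0] > lease_ms]


# Log excerpt copied from the quoted message, quote markers included.
SAMPLE = """\
> >> 2009-03-21 05:59:06,498 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >> slept 144233ms, ten times longer than scheduled: 3000
> >> 2009-03-21 05:59:06,600 WARN
> >> org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report
> >> to master for 144335 milliseconds - retrying
> >> 2009-03-21 05:59:06,512 WARN org.apache.hadoop.hbase.util.Sleeper: We
> >> slept 143279ms, ten times longer than scheduled: 10000
"""

if __name__ == "__main__":
    for slept_ms, scheduled_ms in find_long_pauses(SAMPLE):
        print(f"pause of {slept_ms} ms (scheduled {scheduled_ms} ms) exceeds the lease")
```

With the 120 sec lease both quoted pauses are reported; calling
find_long_pauses(SAMPLE, lease_ms=180_000) reports none, which is consistent
with the idea of raising the lease period to 180 sec.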
