hbase-user mailing list archives

From schubert zhang <zson...@gmail.com>
Subject Re: RegionServer failure and recovery take a long time
Date Sat, 21 Mar 2009 19:01:36 GMT
Yes, I missed " ". Thank you.
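
For anyone hitting the same error, here is a minimal sketch of the fix. It assumes hbase-env.sh is sourced by bash and that the failing line is the `export HBASE_OPTS=...` line. Without quotes, bash splits the line at the space and tries to export the second flag as a variable name, which is what produces the "not a valid identifier" message.

```shell
#!/usr/bin/env bash
# Unquoted form (fails): the shell sees two words after `export`, so the
# second flag is treated as a variable name and rejected.
#   export HBASE_OPTS=-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
#   -> export: `-XX:+CMSIncrementalMode': not a valid identifier

# Quoted form (works): both flags become one value of HBASE_OPTS.
export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

# Show what the JVM launcher script will receive.
echo "$HBASE_OPTS"
```

Any other JVM flags would go inside the same pair of quotes, separated by spaces.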

On Sun, Mar 22, 2009 at 2:17 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> Put the options between " "
>
> J-D
>
> On Sat, Mar 21, 2009 at 2:15 PM, schubert zhang <zsongbo@gmail.com> wrote:
> > It's strange when I add -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode:
> > ./hbase/bin/../conf/hbase-env.sh: line 37: export:
> > `-XX:+CMSIncrementalMode': not a valid identifier
> >
> > My jdk version is jdk-6u6-linux-x64, I will try the
> > latest jdk-6u12-linux-x64 now.
> >
> > Schubert
> >
> > On Sun, Mar 22, 2009 at 1:40 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >
> >> Schubert,
> >>
> >> It's based on the 0.19 branch in svn but it should patch with no
> >> problem. In this state the patch is still just a test I'm doing but,
> >> unless you write to thousands of regions at the same time when the
> >> region server fails, there should be no problem. If it does cause you
> >> trouble, please leave a comment in the jira. As you can see, it was
> >> able to process a huge amount of logs without any problem. Also this
> >> process is only done on the master which never receives any load so
> >> it's even safer.
> >>
> >> J-D
> >>
> >> On Sat, Mar 21, 2009 at 1:34 PM, schubert zhang <zsongbo@gmail.com> wrote:
> >> > Jean Daniel,
> >> > Thanks for your kindness.
> >> > Yes, I want more machines, and we will get them soon. :-)
> >> > My application is very write-heavy. Since my cluster is really small, I
> >> > will slow down the inserts for now.
> >> >
> >> > One more question about your patch HBASE-1008, which is really helpful
> >> > for me: does this patch take more memory? It does not seem to be based
> >> > on 0.19.1. Can it be applied to 0.19.1?
> >> >
> >> > Schubert
> >> >
> >> > On Sun, Mar 22, 2009 at 12:47 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >> >
> >> >> Schubert,
> >> >>
> >> >> I have no problem at all with your English, since my first language is
> >> >> French and I must be making loads of grammatical errors too ;)
> >> >>
> >> >> Regarding the heap, make sure that 300MB fits your needs in memory or
> >> >> you might OOME.
> >> >>
> >> >> Increasing the lease period is a good idea, I have done the same. Our
> >> >> jobs take 13 hours so it avoids many restarts.
> >> >>
> >> >> Swappiness at 0 => no swap at all... so if your system needs to swap
> >> >> you might be in trouble. The advantage I see in a very low swappiness
> >> >> value (but not 0) is that it will only swap if absolutely necessary.
> >> >>
> >> >> On a final note, using the block caching feature is a bit of a risk
> >> >> in versions < 0.20. It does make random reads a lot faster (most of
> >> >> the time) but the eviction of blocks produces a lot of garbage. The
> >> >> guys from Stream.com are implementing something better at this very
> >> >> moment.
> >> >>
> >> >> You may also want more machines :P. 6 is a very small number, we
> >> >> usually see a lot more stability past 10. Or instead you might want
> >> >> to slow down the inserts... It's good to be realistic regarding what
> >> >> stress you put on the cluster vs. the actual resources.
> >> >>
> >> >> J-D
> >> >>
> >> >> On Sat, Mar 21, 2009 at 12:24 PM, schubert zhang <zsongbo@gmail.com> wrote:
> >> >> > Hi Jean Daniel,
> >> >> > Your help is so great. Thank you very much.
> >> >> >
> >> >> > After reading the HBase Troubleshooting page,
> >> >> > http://wiki.apache.org/hadoop/Hbase/Troubleshooting, I also suspected
> >> >> > the garbage collector and added the -XX:+UseConcMarkSweepGC option 4
> >> >> > hours ago. I checked the region servers just now; one was shut down
> >> >> > from the same cause. But it's better than before.
> >> >> >
> >> >> > Now, I will do the following tuning according to your guide:
> >> >> > (1)  add -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode, and study
> >> >> > the GC in detail.
> >> >> > (2)  decrease the heap size of the mapreduce child; I am now using
> >> >> > 1024MB and want to change it to 300MB.
> >> >> > (3)  increase the lease period of the master to 180 sec.
> >> >> > (4)  apply your great patch.
> >> >> >
> >> >> > By the way, to avoid swap, I have now changed vm.swappiness to 0 (you
> >> >> > told me 20 in another email). Do you think that is OK?
> >> >> >
> >> >> > Thank you again. My English is not good, please bear with me.
> >> >> >
> >> >> > Schubert
> >> >> >
> >> >> > On Sat, Mar 21, 2009 at 11:39 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
> >> >> >
> >> >> >> Schubert,
> >> >> >>
> >> >> >> Yeah that's the good old problem with the garbage collector. In your
> >> >> >> logs I see a lot of:
> >> >> >>
> >> >> >> 2009-03-21 05:59:06,498 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 144233ms, ten times longer than scheduled: 3000
> >> >> >> 2009-03-21 05:59:06,600 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: unable to report to master for 144335 milliseconds - retrying
> >> >> >> 2009-03-21 05:59:06,512 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 143279ms, ten times longer than scheduled: 10000
> >> >> >> 2009-03-21 05:59:06,701 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_CALL_SERVER_STARTUP: safeMode=false
> >> >> >>
> >> >> >> That usually means that the garbage collector blocked all threads to
> >> >> >> do its stuff. When that happens, it takes more time than the lease
> >> >> >> the master maintains on the region servers (120 sec), so the master
> >> >> >> considers this region server dead. Then the log splitting takes
> >> >> >> over on the master, which is a very, very long process. During that
> >> >> >> time, sometimes more than 10 minutes, the regions from that region
> >> >> >> server are unavailable. If the cluster is small, that makes things
> >> >> >> even worse.
> >> >> >>
> >> >> >> We had these kinds of errors on our cluster during the last few weeks,
> >> >> >> and here is how I solved them:
> >> >> >>
> >> >> >> - Regarding the log splitting, I suggest you take a look at this issue,
> >> >> >> https://issues.apache.org/jira/browse/HBASE-1008, as it has a patch I
> >> >> >> made to speed up the process. See if it helps you.
> >> >> >>
> >> >> >> - Regarding the garbage collector, I found that the options
> >> >> >> "-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode" were really, really
> >> >> >> helpful. See
> >> >> >> http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html
> >> >> >> for more information. Set this in the hbase-env.sh file on the line
> >> >> >> export HBASE_OPTS=...
> >> >> >>
> >> >> >> - Finally, to make sure that the garbage collection is fast, check if
> >> >> >> there is swap. If so, set lower heaps for the MR child processes in
> >> >> >> hadoop-site.xml (the mapred.child.java.opts property).
> >> >> >>
> >> >> >> J-D
> >> >> >>
> >> >> >> On Sat, Mar 21, 2009 at 2:31 AM, schubert zhang <zsongbo@gmail.com> wrote:
> >> >> >> > Hi Jean Daniel,
> >> >> >> >
> >> >> >> > I would like your help with this issue. I have attached the log
> >> >> >> > files; please help me analyse them. Thanks.
> >> >> >> >
> >> >> >> > Test env.
> >> >> >> >  5+1 nodes cluster.  table: create 'TESTA', {NAME => 'info', VERSIONS => 1,
> >> >> >> > COMPRESSION => 'BLOCK', BLOCKCACHE => true}
> >> >> >> >
> >> >> >> > All test data is generated randomly by a program.
> >> >> >> >
> >> >> >> > HRegionServer Failure (2009-03-21 08:27:41,090):
> >> >> >> >  After about 8 hours of running, the region server on my node-5 failed
> >> >> >> > and the HRegionServer shut down.
> >> >> >> >  It seems to have been caused by DFSClient exceptions. (I cannot work
> >> >> >> > out what happened on HDFS, but HDFS itself seems OK.)
> >> >> >> >
> >> >> >> > Then I started HRegionServer on node-5 (2009-03-21 10:53:42,747):
> >> >> >> >  After the HRegionServer started, regions were reassigned. I can see
> >> >> >> > the reassignment in the HBase WebGUI, since some regions are now on
> >> >> >> > this node.
> >> >> >> >  But the following things were blocked for a long time:
> >> >> >> >  (1) The HBase client application could not insert data for a long
> >> >> >> > time (until 2009/03/21 11:11:27, about 18 minutes). It got a
> >> >> >> > RetriesExhaustedException on the application side (MapReduce job).
> >> >> >> >  (2) Some regions could not be accessed (I could not scan/get rows in
> >> >> >> > these regions). The exception is NotServingRegionException from
> >> >> >> > getRegion.
> >> >> >> >  (3) I checked the history of the region in (2) from the WebGUI. I can
> >> >> >> > see in the history that it was assigned at 11:04:15, which is very late.
> >> >> >> > The history is:
> >> >> >> > Sat, 21 Mar 2009 11:10:39  open        Region opened on server : nd1-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 11:04:15  assignment  Region assigned to server 10.24.1.12:60020
> >> >> >> > Sat, 21 Mar 2009 06:48:03  open        Region opened on server : nd1-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 06:47:57  assignment  Region assigned to server 10.24.1.12:60020
> >> >> >> > Sat, 21 Mar 2009 06:27:25  open        Region opened on server : nd5-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 06:27:21  assignment  Region assigned to server 10.24.1.20:60020
> >> >> >> > Sat, 21 Mar 2009 06:26:13  open        Region opened on server : nd5-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 06:24:53  assignment  Region assigned to server 10.24.1.20:60020
> >> >> >> > Sat, 21 Mar 2009 06:24:28  open        Region opened on server : nd3-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 06:24:13  assignment  Region assigned to server 10.24.1.16:60020
> >> >> >> > Sat, 21 Mar 2009 06:19:08  open        Region opened on server : nd4-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 06:19:02  assignment  Region assigned to server 10.24.1.18:60020
> >> >> >> > Sat, 21 Mar 2009 05:59:39  open        Region opened on server : nd5-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 05:59:36  assignment  Region assigned to server 10.24.1.20:60020
> >> >> >> > Sat, 21 Mar 2009 03:50:15  open        Region opened on server : nd3-rack0-cloud
> >> >> >> > Sat, 21 Mar 2009 03:50:12  assignment  Region assigned to server 10.24.1.16:60020
> >> >> >> > Sat, 21 Mar 2009 03:50:08  split       Region split from: TESTA,13576334163@2009-03-21 00:35:57.526,1237569164012
> >> >> >> > <http://nd0-rack0-cloud:60010/regionhistorian.jsp?regionname=CDR,13576334163@2009-03-21%2000:35:57.526,1237569164012>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> > And the following is the exception when I scan a rowkey range.
> >> >> >> >
> >> >> >> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact
> >> >> >> > region server 10.24.1.12:60020 for region
> >> >> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237578615553, row
> >> >> >> > '13576334163@2009-03-21 00:35:57.526', but failed after 5 attempts.
> >> >> >> > Exceptions:
> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> >> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237578615553
> >> >> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(Unknown Source)
> >> >> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(Unknown Source)
> >> >> >> >        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >> >> >> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >> >> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >> >> >> >        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(Unknown Source)
> >> >> >> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(Unknown Source)
> >> >> >> >
> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
> >> >> >> > TESTA,13576334163@2009-03-21 00:35:57.526,1237578615553
> >> >> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(Unknown Source)
> >> >> >> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(Unknown Source)
> >> >> >> >        at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >> >> >> >        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >> >> >> >        at java.lang.reflect.Method.invoke(Method.java:597)
> >> >> >> >        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(Unknown Source)
> >> >> >> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(Unknown Source)
> >> >> >> >
> >> >> >> > I will send the log files to your email address.
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>
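
For reference, the symptom described in the quoted thread (a pause longer than the master's 120 sec lease) can be spotted in a region server log with a small filter. This is only a sketch: the function name `long_pauses` and the `LEASE_MS` variable are made up for illustration, and it assumes each log entry sits on a single line in the format shown in the quoted excerpt.

```shell
#!/usr/bin/env bash
# Lease the master holds on each region server, in milliseconds.
# 120000 matches the 120 sec mentioned in the thread; override as needed.
LEASE_MS="${LEASE_MS:-120000}"

# long_pauses (hypothetical helper): reads log lines on stdin and prints
# "<date> <time> <pause_ms>" for every Sleeper warning longer than the lease.
long_pauses() {
  sed -n 's/^\([^ ]* [^ ]*\) WARN .*Sleeper: We slept \([0-9][0-9]*\)ms.*/\1 \2/p' |
    awk -v lease="$LEASE_MS" '$3 + 0 > lease + 0 { print }'
}

# Sample input: the two Sleeper lines from the quoted log excerpt.
long_pauses <<'EOF'
2009-03-21 05:59:06,498 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 144233ms, ten times longer than scheduled: 3000
2009-03-21 05:59:06,512 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 143279ms, ten times longer than scheduled: 10000
EOF
```

Against a real log this would be run as `long_pauses < regionserver.log`, where the file name is a placeholder. Any line it prints marks a pause long enough for the master to treat the region server as dead.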
