Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 582BBD5CC for ; Mon, 11 Mar 2013 02:24:39 +0000 (UTC) Received: (qmail 17919 invoked by uid 500); 11 Mar 2013 02:24:37 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 17859 invoked by uid 500); 11 Mar 2013 02:24:37 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 17851 invoked by uid 99); 11 Mar 2013 02:24:37 -0000 Received: from minotaur.apache.org (HELO minotaur.apache.org) (140.211.11.9) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 11 Mar 2013 02:24:37 +0000 Received: from localhost (HELO mail-vb0-f46.google.com) (127.0.0.1) (smtp-auth username apurtell, mechanism plain) by minotaur.apache.org (qpsmtpd/0.29) with ESMTP; Mon, 11 Mar 2013 02:24:36 +0000 Received: by mail-vb0-f46.google.com with SMTP id b13so1419512vby.33 for ; Sun, 10 Mar 2013 19:24:35 -0700 (PDT) MIME-Version: 1.0 X-Received: by 10.52.76.103 with SMTP id j7mr3681066vdw.90.1362968675370; Sun, 10 Mar 2013 19:24:35 -0700 (PDT) Received: by 10.58.128.170 with HTTP; Sun, 10 Mar 2013 19:24:35 -0700 (PDT) In-Reply-To: References: <513A0758.3090507@psafe.com> <513A34D0.80507@psafe.com> <513CD815.9070005@psafe.com> <513D092F.1020106@psafe.com> Date: Mon, 11 Mar 2013 10:24:35 +0800 Message-ID: Subject: Re: RegionServers Crashing every hour in production env From: Andrew Purtell To: "user@hbase.apache.org" Content-Type: multipart/alternative; boundary=bcaec5016233bc809604d79cdf96 --bcaec5016233bc809604d79cdf96 Content-Type: text/plain; charset=ISO-8859-1 Be careful with GC tuning, throwing changes at an application without analysis of what is going on with the heap is shooting in the dark. One particular good treatment of the subject is here: http://java.dzone.com/articles/how-tame-java-gc-pauses If you have made custom changes to blockcache or memstore configurations, back them out until you're sure everything else is ok. Watch carefully for swapping. Set the vm.swappiness sysctl to 0. Monitor for spikes in page scanning or any swap activity. Nothing brings on "Juliette" pauses better than a JVM partially swapped out. The Java GC starts collection by examining the oldest pages, and those are the first pages the OS swaps out... On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu wrote: > Hi Pablo, > It'a terrible for a long minor GC. I don't think there are swaping from > your vmstat log. > but I just suggest you > 1) add following JVM options: > -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19 > -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2 > -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods > > 2) -Xmn is two small, your total Mem is 74GB, just make -Xmn2g > 3) what are you doing during long GC happened? read or write? if reading, > what the block cache size? > > > > > On Mon, Mar 11, 2013 at 6:41 AM, Stack wrote: > > > You could increase your zookeeper session timeout to 5 minutes while you > > are figuring why these long pauses. > > http://hbase.apache.org/book.html#zookeeper.session.timeout > > > > Above, there is an outage for almost 5 minutes: > > > > >> We slept 225100ms instead of 3000ms, this is likely due to a long > > > > You have ganglia or tsdb running? When you see the big pause above, can > > you see anything going on on the machine? (swap, iowait, concurrent fat > > mapreduce job?) > > > > St.Ack > > > > > > > > On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa wrote: > > > > > Hi Sreepathi, > > > they say in the book (or the site), we could try it to see if it is > > really > > > a timeout error > > > or there is something more. But it is not recomended for production > > > environments. > > > > > > I could give it a try if five minutes will ensure to us that the > problem > > > is the GC or > > > elsewhere!! Anyway, I think it is hard to beleive a GC is taking 2:30 > > > minutes. > > > > > > Abs, > > > Pablo > > > > > > > > > On 03/10/2013 04:06 PM, Sreepathi wrote: > > > > > >> Hi Stack/Ted/Pablo, > > >> > > >> Should we increase the hbase.rpc.timeout property to 5 minutes ( > 300000 > > ms > > >> ) ? > > >> > > >> Regards, > > >> - Sreepathi > > >> > > >> On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa wrote: > > >> > > >> That combo should be fine. > > >>>> > > >>> Great!! > > >>> > > >>> > > >>> If JVM is full GC'ing, the application is stopped. > > >>>> The below does not look like a full GC but that is a long pause in > > >>>> system > > >>>> time, enough to kill your zk session. > > >>>> > > >>> Exactly. This pause is really making the zk expire the RS which > > shutsdown > > >>> (logs > > >>> in the end of the email). > > >>> But the question is: what is causing this pause??!! > > >>> > > >>> You swapping? > > >>>> > > >>> I don't think so (stats below). > > >>> > > >>> Hardware is good? > > >>>> > > >>> Yes, it is a 16 processor machine with 74GB of RAM and plenty disk > > space. > > >>> Below are some metrics I have heard about. Hope it helps. > > >>> > > >>> > > >>> ** I am having some problems with the datanodes[1] which are having > > >>> trouble to > > >>> write. I really think the issues are related, but cannot solve any of > > >>> them > > >>> :( > > >>> > > >>> Thanks again, > > >>> Pablo > > >>> > > >>> [1] http://mail-archives.apache. > ****org/mod_mbox/hadoop-hdfs-user/**** > > >>> > 201303.mbox/%3CCAJzooYfS-F1KS+******jGOPUt15PwFjcCSzigE0APeM9FXaCr**** > > >>> QfVbyQ@mail.gmail.com%3E > >>> mod_mbox/hadoop-hdfs-user/**201303.mbox/%3CCAJzooYfS-F1KS+** > > >>> jGOPUt15PwFjcCSzigE0APeM9FXaCr**QfVbyQ@mail.gmail.com%3E< > > > http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCrQfVbyQ@mail.gmail.com%3E > > > > > >>> > > > >>> > > >>> top - 15:38:04 up 297 days, 21:03, 2 users, load average: 4.34, > 2.55, > > >>> 1.28 > > >>> Tasks: 528 total, 1 running, 527 sleeping, 0 stopped, 0 zombie > > >>> Cpu(s): 0.1%us, 0.2%sy, 0.0%ni, 99.7%id, 0.0%wa, 0.0%hi, 0.0%si, > > >>> 0.0%st > > >>> Mem: 74187256k total, 29493992k used, 44693264k free, 5836576k > > buffers > > >>> Swap: 51609592k total, 128312k used, 51481280k free, 1353400k > cached > > >>> > > >>> ]$ vmstat -w > > >>> procs -------------------memory-----****------------- ---swap-- > > >>> -----io---- > > >>> --system-- -----cpu------- > > >>> r b swpd free buff cache si so bi bo > > >>> in > > >>> cs us sy id wa st > > >>> 2 0 128312 32416928 5838288 5043560 0 0 202 53 > > >>> 0 > > >>> 0 2 1 96 1 0 > > >>> > > >>> ]$ sar > > >>> 02:20:01 PM all 26.18 0.00 2.90 0.63 0.00 > > >>> 70.29 > > >>> 02:30:01 PM all 1.66 0.00 1.25 1.05 0.00 > > >>> 96.04 > > >>> 02:40:01 PM all 10.01 0.00 2.14 0.75 0.00 > > >>> 87.11 > > >>> 02:50:01 PM all 0.76 0.00 0.80 1.03 0.00 > > >>> 97.40 > > >>> 03:00:01 PM all 0.23 0.00 0.30 0.71 0.00 > > >>> 98.76 > > >>> 03:10:01 PM all 0.22 0.00 0.30 0.66 0.00 > > >>> 98.82 > > >>> 03:20:01 PM all 0.22 0.00 0.31 0.76 0.00 > > >>> 98.71 > > >>> 03:30:01 PM all 0.24 0.00 0.31 0.64 0.00 > > >>> 98.81 > > >>> 03:40:01 PM all 1.13 0.00 2.97 1.18 0.00 > > >>> 94.73 > > >>> Average: all 3.86 0.00 1.38 0.88 0.00 > > >>> 93.87 > > >>> > > >>> ]$ iostat > > >>> Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002) 03/10/2013 _x86_64_ > > >>> (16 CPU) > > >>> > > >>> avg-cpu: %user %nice %system %iowait %steal %idle > > >>> 1.86 0.00 0.96 0.78 0.00 96.41 > > >>> > > >>> Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn > > >>> sda 1.23 20.26 23.53 521533196 605566924 > > >>> sdb 6.51 921.55 241.90 23717850730 > 6225863488 > > >>> sdc 6.22 921.83 236.41 23725181162 > 6084471192 > > >>> sdd 6.25 925.13 237.26 23810004970 > 6106357880 > > >>> sde 6.19 913.90 235.60 23521108818 > 6063722504 > > >>> sdh 6.26 933.08 237.77 24014594546 > 6119511376 > > >>> sdg 6.18 914.36 235.31 23532747378 > 6056257016 > > >>> sdf 6.24 923.66 235.33 23772251810 > 6056604008 > > >>> > > >>> Some more logging which reinforce that the RS crash is happening > > because > > >>> of > > >>> timeout. However this time the GC log is not accusing a big time. > > >>> > > >>> #####RS LOG##### > > >>> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.****ClientCnxn: > > Client > > >>> session timed out, have not heard from server in 257739ms for > sessionid > > >>> 0x13d3c4bcba6014a, closing socket connection and attempting reconnect > > >>> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.****ClientCnxn: > > Client > > >>> session timed out, have not heard from server in 226785ms for > sessionid > > >>> 0x13d3c4bcba60149, closing socket connection and attempting reconnect > > >>> 2013-03-10 15:37:46,712 DEBUG org.apache.hadoop.hbase.io.**** > > >>> hfile.LruBlockCache: > > >>> Stats: total=61.91 MB, free=1.94 GB, max=2 GB, blocks=1254, > > >>> accesses=60087, > > >>> hits=58811, hitRatio=97.87%, , cachingAccesses=60069, > > cachingHits=58811, > > >>> cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN > > >>> 2013-03-10 15:37:46,712 WARN > org.apache.hadoop.hbase.util.****Sleeper: > > >>> We > > >>> slept 225100ms instead of 3000ms, this is likely due to a long > garbage > > >>> collecting pause and it's usually bad, see > > >>> http://hbase.apache.org/book.**** > > >>> html#trouble.rs.runtime.****zkexpired > >>> org/book.html#trouble.rs.**runtime.zkexpired< > > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired> > > >>> > > > >>> 2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.****DFSClient: > > >>> DFSOutputStream ResponseProcessor exception for block > > >>> BP-43236042-172.17.2.10-****1362490844340:blk_-** > > >>> 6834190810033122569_25150229 > > >>> > > >>> java.io.EOFException: Premature EOF: no length prefix available > > >>> at org.apache.hadoop.hdfs.****protocol.HdfsProtoUtil.** > > >>> vintPrefixed(HdfsProtoUtil.****java:171) > > >>> at org.apache.hadoop.hdfs.****protocol.datatransfer.** > > >>> PipelineAck.readFields(****PipelineAck.java:114) > > >>> at > > org.apache.hadoop.hdfs.****DFSOutputStream$DataStreamer$**** > > >>> ResponseProcessor.run(****DFSOutputStream.java:670) > > >>> 2013-03-10 15:37:46,716 ERROR org.apache.hadoop.hbase.** > > >>> regionserver.HRegionServer: > > >>> org.apache.hadoop.hbase.ipc.****CallerDisconnectedException: Aborting > > >>> call > > >>> get([B@7caf69ed, {"timeRange":[0,****9223372036854775807],"** > > >>> totalColumns":1,"cacheBlocks":****true,"families":{"details":[**"** > > >>> ALL"]},"maxVersions":1,"row":"****\\x00\\x00\\x00\\x00\\x00\\*** > > >>> *x12\\x93@"}), > > >>> rpc version=1, client version=29, methodsFingerPrint=1891768260 from > > >>> 172.17.1.71:51294 after 0 ms, since caller disconnected > > >>> > > >>> > > >>> #####GC LOG##### > > >>> 2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 > secs] > > >>> 354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 > sys=0.00, > > >>> real=0.00 secs] > > >>> 2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370 > secs] > > >>> 353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04 > sys=0.00, > > >>> real=0.00 secs] > > >>> 2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 > secs] > > >>> 353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 > sys=0.00, > > >>> real=0.01 secs] > > >>> 2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440 > secs] > > >>> 353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05 > sys=0.00, > > >>> real=0.01 secs] > > >>> 2984.239: [GC 2984.239: [ParNew: 58680K->5282K(59008K), 0.0048070 > secs] > > >>> 362921K->310977K(1152704K), 0.0049180 secs] [Times: user=0.03 > sys=0.00, > > >>> real=0.01 secs] > > >>> 3050.620: [GC 3050.620: [ParNew: 57524K->5292K(59008K), 0.0055510 > secs] > > >>> 363219K->313384K(1152704K), 0.0056520 secs] [Times: user=0.04 > sys=0.00, > > >>> real=0.01 secs] > > >>> 3111.331: [GC 3111.331: [ParNew: 57542K->5637K(59008K), 0.0056900 > secs] > > >>> 365634K->318079K(1152704K), 0.0057950 secs] [Times: user=0.03 > sys=0.00, > > >>> real=0.01 secs] > > >>> 3163.510: [GC 3163.510: [ParNew: 58117K->1138K(59008K), 0.0046540 > secs] > > >>> 370559K->316086K(1152704K), 0.0047470 secs] [Times: user=0.03 > sys=0.00, > > >>> real=0.01 secs] > > >>> 3217.642: [GC 3217.643: [ParNew: 53618K->1001K(59008K), 0.0038530 > secs] > > >>> 368566K->315949K(1152704K), 0.0039450 secs] [Times: user=0.03 > sys=0.00, > > >>> real=0.00 secs] > > >>> 3464.067: [GC 3464.067: [ParNew: 53481K->5348K(59008K), 0.0068370 > secs] > > >>> 368429K->322343K(1152704K), 0.0069170 secs] [Times: user=0.07 > sys=0.01, > > >>> real=0.00 secs] > > >>> 3465.648: [GC 3465.648: [ParNew: 57828K->1609K(59008K), 0.0111520 > secs] > > >>> 374823K->321496K(1152704K), 0.0112170 secs] [Times: user=0.07 > sys=0.00, > > >>> real=0.01 secs] > > >>> 3466.291: [GC 3466.291: [ParNew: 54089K->2186K(59008K), 0.0093190 > secs] > > >>> 373976K->322073K(1152704K), 0.0093800 secs] [Times: user=0.06 > sys=0.00, > > >>> real=0.01 secs] > > >>> Heap > > >>> par new generation total 59008K, used 15497K [0x00000005fae00000, > > >>> 0x00000005fee00000, 0x00000005fee00000) > > >>> eden space 52480K, 25% used [0x00000005fae00000, > > 0x00000005fbaff8c8, > > >>> 0x00000005fe140000) > > >>> from space 6528K, 33% used [0x00000005fe140000, > 0x00000005fe362b70, > > >>> 0x00000005fe7a0000) > > >>> to space 6528K, 0% used [0x00000005fe7a0000, > 0x00000005fe7a0000, > > >>> 0x00000005fee00000) > > >>> concurrent mark-sweep generation total 1093696K, used 319887K > > >>> [0x00000005fee00000, 0x0000000641a10000, 0x00000007fae00000) > > >>> concurrent-mark-sweep perm gen total 30464K, used 30427K > > >>> [0x00000007fae00000, 0x00000007fcbc0000, 0x0000000800000000) > > >>> > > >>> > > >>> > > >>> On 03/08/2013 07:02 PM, Stack wrote: > > >>> > > >>> On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa > wrote: > > >>>> > > >>>> 0.94 currently doesn't support hadoop 2.0 > > >>>> > > >>>>> Can you deploy hadoop 1.1.1 instead ? > > >>>>>> > > >>>>>> I am using cdh4.2.0 which uses this version as default > > installation. > > >>>>>> > > >>>>> I think it will be a problem for me to deploy 1.1.1 because I would > > >>>>> need > > >>>>> to > > >>>>> "upgrade" the whole cluster with 70TB of data (backup everything, > go > > >>>>> offline, etc.). > > >>>>> > > >>>>> Is there a problem to use cdh4.2.0? > > >>>>> I should send my email to cdh list? > > >>>>> > > >>>>> > > >>>>> That combo should be fine. > > >>>>> > > >>>> > > >>>> You Full GC'ing around this time? > > >>>> > > >>>>> The GC shows it took a long time. However it does not make any > sense > > >>>>> to be it, since the same ammount of data was cleaned before and > AFTER > > >>>>> in just 0.01 secs! > > >>>>> > > >>>>> > > >>>>> If JVM is full GC'ing, the application is stopped. > > >>>>> > > >>>> > > >>>> [Times: user=0.08 sys=137.62, real=137.62 secs] > > >>>> > > >>>>> Besides the whole time was used by system. That is what is bugging > > me. > > >>>>> > > >>>>> > > >>>>> The below does not look like a full GC but that is a long pause > in > > >>>>> > > >>>> system > > >>>> time, enough to kill your zk session. > > >>>> > > >>>> You swapping? > > >>>> > > >>>> Hardware is good? > > >>>> > > >>>> St.Ack > > >>>> > > >>>> > > >>>> > > >>>> ... > > >>>> > > >>>>> > > >>>>> 1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K), 0.0040990 > > secs] > > >>>>> 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03 > > sys=0.00, > > >>>>> real=0.01 secs] > > >>>>> > > >>>>> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000 > > secs] > > >>>>> 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04 > > sys=0.01, > > >>>>> real=0.00 secs] > > >>>>> > > >>>>> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K), 137.6353620 > > >>>>> secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times: > user=0.08 > > >>>>> sys=137.62, real=137.62 secs] > > >>>>> > > >>>>> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960 > > secs] > > >>>>> 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05 > > sys=0.00, > > >>>>> real=0.01 secs] > > >>>>> > > >>>>> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650 > > secs] > > >>>>> 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06 > > sys=0.00, > > >>>>> real=0.01 secs] > > >>>>> > > >>>>> > > >>>>> I really appreciate you guys helping me to find out what is wrong. > > >>>>> > > >>>>> Thanks, > > >>>>> Pablo > > >>>>> > > >>>>> > > >>>>> > > >>>>> On 03/08/2013 02:11 PM, Stack wrote: > > >>>>> > > >>>>> What RAM says. > > >>>>> > > >>>>>> 2013-03-07 17:24:57,887 INFO > > org.apache.zookeeper.********ClientCnxn: > > >>>>>> Client > > >>>>>> > > >>>>>> session timed out, have not heard from server in 159348ms for > > >>>>>> sessionid > > >>>>>> 0x13d3c4bcba600a7, closing socket connection and attempting > > reconnect > > >>>>>> > > >>>>>> You Full GC'ing around this time? > > >>>>>> > > >>>>>> Put up your configs in a place where we can take a look? > > >>>>>> > > >>>>>> St.Ack > > >>>>>> > > >>>>>> > > >>>>>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan < > > >>>>>> ramkrishna.s.vasudevan@gmail.******com > > > >>>>>> **** > > >>>>>> com > ramkrishna.s.vasudevan@gmail.com> > > >>>>>> >>> > > >>>>>> wrote: > > >>>>>> > > >>>>>> I think it is with your GC config. What is your heap size? > What > > >>>>>> is > > >>>>>> the > > >>>>>> > > >>>>>> data that you pump in and how much is the block cache size? > > >>>>>>> > > >>>>>>> Regards > > >>>>>>> Ram > > >>>>>>> > > >>>>>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu > > wrote: > > >>>>>>> > > >>>>>>> 0.94 currently doesn't support hadoop 2.0 > > >>>>>>> > > >>>>>>> Can you deploy hadoop 1.1.1 instead ? > > >>>>>>>> > > >>>>>>>> Are you using 0.94.5 ? > > >>>>>>>> > > >>>>>>>> Thanks > > >>>>>>>> > > >>>>>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa > > wrote: > > >>>>>>>> > > >>>>>>>> Hey guys, > > >>>>>>>> > > >>>>>>>> as I sent in an email a long time ago, the RSs in my cluster > did > > >>>>>>>>> not > > >>>>>>>>> > > >>>>>>>>> get > > >>>>>>>>> > > >>>>>>>> along > > >>>>>>>> > > >>>>>>>> and crashed 3 times a day. I tried a lot of options we > discussed > > in > > >>>>>>>>> the > > >>>>>>>>> emails, but it not solved the problem. As I used an old version > > of > > >>>>>>>>> > > >>>>>>>>> hadoop I > > >>>>>>>>> > > >>>>>>>> thought this was the problem. > > >>>>>>>> > > >>>>>>>>> So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper 3.3.5 > to > > >>>>>>>>> > > >>>>>>>>> hadoop > > >>>>>>>>> > > >>>>>>>> 2.0.0 > > >>>>>>>> > > >>>>>>>> - hbase 0.94 - zookeeper 3.4.5. > > >>>>>>>>> > > >>>>>>>>> Unfortunately the RSs did not stop crashing, and worst! Now > they > > >>>>>>>>> crash > > >>>>>>>>> every > > >>>>>>>>> hour and some times when the RS that holds the .ROOT. crashes > all > > >>>>>>>>> > > >>>>>>>>> cluster > > >>>>>>>>> > > >>>>>>>> get > > >>>>>>>> > > >>>>>>>> stuck in transition and everything stops working. > > >>>>>>>>> In this case I need to clean zookeeper znodes, restart the > master > > >>>>>>>>> and > > >>>>>>>>> > > >>>>>>>>> the > > >>>>>>>>> > > >>>>>>>> RSs. > > >>>>>>>> > > >>>>>>>> To avoid this case I am running on production with only ONE RS > > and > > >>>>>>>>> a > > >>>>>>>>> monitoring > > >>>>>>>>> script that check every minute, if the RS is ok. If not, > restart > > >>>>>>>>> it. > > >>>>>>>>> * This case does not get the cluster stuck. > > >>>>>>>>> > > >>>>>>>>> This is driving me crazy, but I really cant find a solution for > > the > > >>>>>>>>> cluster. > > >>>>>>>>> I tracked all logs from the start time 16:49 from all > interesting > > >>>>>>>>> nodes > > >>>>>>>>> (zoo, > > >>>>>>>>> namenode, master, rs, dn2, dn9, dn10) and copied here what I > > think > > >>>>>>>>> is > > >>>>>>>>> usefull. > > >>>>>>>>> > > >>>>>>>>> There are some strange errors in the DATANODE2, as an error > > >>>>>>>>> copiyng a > > >>>>>>>>> > > >>>>>>>>> block > > >>>>>>>>> > > >>>>>>>> to itself. > > >>>>>>>> > > >>>>>>>>> The gc log points to GC timeout. However it is very weird that > > the > > >>>>>>>>> RS > > >>>>>>>>> > > >>>>>>>>> spend > > >>>>>>>>> > > >>>>>>>> so much time in GC while in the other cases it takes 0.001sec. > > >>>>>>>> > > >>>>>>>>> Besides, > > >>>>>>>>> the time > > >>>>>>>>> spent, is in sys which makes me think that might be a problem > in > > >>>>>>>>> > > >>>>>>>>> another > > >>>>>>>>> > > >>>>>>>> place. > > >>>>>>>> > > >>>>>>>> I know that it is a bunch of logs, and that it is very > difficult > > to > > >>>>>>>>> > > >>>>>>>>> find > > >>>>>>>>> > > >>>>>>>> the > > >>>>>>>> > > >>>>>>>> problem without much context. But I REALLY need some help. If > it > > is > > >>>>>>>>> not > > >>>>>>>>> > > >>>>>>>>> the > > >>>>>>>>> > > >>>>>>>> solution, at least what I should read, where I should look, or > > >>>>>>>> which > > >>>>>>>> > > >>>>>>>>> cases > > >>>>>>>>> > > >>>>>>>> I > > >>>>>>>> > > >>>>>>>>> should monitor. > > >>>>>>>>> > > >>>>>>>>> Thank you very much, > > >>>>>>>>> Pablo Musa > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> > > >> -- > > >> *Regards,* > > >> --- *Sreepathi * > > >> > > > > > > > > > -- Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) --bcaec5016233bc809604d79cdf96--