Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of yuzhihong@gmail.com
 designates 209.85.215.52 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <515C900D.8050602@psafe.com>
References: <513A0758.3090507@psafe.com>
	<CALte62wk7h2Mg8hj2-pc-+6gnvXJx-A6o6S4EP3MVdjOgozqgg@mail.gmail.com>
	<CAAT7MkoTwPJt4=YerAMaCCTok9QZXTuHXyWKisMZDuNO-cydKg@mail.gmail.com>
	<CADcMMgER-mtEW9V00PaApneY-M+yZ6V-Fa5SNXr+CCTYCFqi7A@mail.gmail.com>
	<513A34D0.80507@psafe.com>
	<CADcMMgFrsdVCBwoXdKY-i21RVxwJpVeJpP_-hJciOwYJwDF5mA@mail.gmail.com>
	<513CD815.9070005@psafe.com>
	<CAPt-PJtrOL4a8vNaaPZX5vUE-LC1rbWxKt504d7bcdO4_nkk=A@mail.gmail.com>
	<513D092F.1020106@psafe.com>
	<CADcMMgEx=r793Gf3Sj9iw1ymbNkY7DKjdG=SYfCd5mtR1u+Wfg@mail.gmail.com>
	<CALr1C9pPbECYRON79Pk65j439RZK5GUWT1GBcfXhA=SuZitA1A@mail.gmail.com>
	<CA+RK=_Dq7kDG00JGW0tQ6jhjQ2pazX1WA3FCaaSk-57ZgrmzXg@mail.gmail.com>
	<513F4D08.3040403@psafe.com>
	<515C7336.9060505@psafe.com>
	<CALte62yroP==BodGLB13ampdMgo2HeXDt7XqtiZi9U1ya1pTkw@mail.gmail.com>
	<515C900D.8050602@psafe.com>
Date: Wed, 3 Apr 2013 14:40:52 -0700
Message-ID: 
 <CALte62zwhhfC8wfdbyityw=HBaK2--CuFDySrUX6rixc4Bc4=A@mail.gmail.com>
Subject: Re: RegionServers Crashing every hour in production env
From: Ted Yu <yuzhihong@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=14dae94734b346bd6a04d97bb51d

--14dae94734b346bd6a04d97bb51d
Content-Type: text/plain; charset=ISO-8859-1

I went over related emails in my Inbox.
One previous case was that other task was running on the region server node
which consumed good portion of CPU. In that case I observed a gap of
activities in region server log. I can send that snippet, after
anonymization since there were some IP addresses at my previous employer.

FYI


On Wed, Apr 3, 2013 at 1:24 PM, Pablo Musa <pablo@psafe.com> wrote:

> Have you looked at http://hbase.apache.org/book.**html#trouble.rs.runtime.
>> **zkexpired<http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired>, suggested below ?
>>
>
> Yes I did, but GC is not the issue here.
>
>
>  I think zookeeper timeout should be more closely watched.
>>
>
> What do you mean?
> My ZK timeout today is 150 secs, which is very big. However, my
> problem is finding where this timeout come from?! Any ideas?
>
> Thanks,
> Pablo
>
>
> On 04/03/2013 03:36 PM, Ted Yu wrote:
>
>> Thanks for the sharing.
>>
>> Have you looked at
>> http://hbase.apache.org/book.**html#trouble.rs.runtime.**zkexpired<http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired>, suggested
>> below ?
>>
>> I think zookeeper timeout should be more closely watched.
>>
>>
>> On Wed, Apr 3, 2013 at 11:21 AM, Pablo Musa <pablo@psafe.com> wrote:
>>
>>  Hello guys,
>>> I stopped my research on HBase ZK timeout for while due to
>>> other issues I had to do, but I am back.
>>>
>>> A very weird behavior that I would like your comments is that my
>>> RegionServers perform better (less crashes) under heavy load instead
>>> of light load.
>>> There is, if I let HBase alone with 50 requestsPerSecond along the
>>> whole day the crashes are higher than if I run a mapred Job every hour.
>>>
>>>
>>> Another weird thing is the following:
>>>
>>> RS startTime = Mon Apr 01 13:22:35 BRT 2013
>>>
>>> [...]$ grep slept /var/log/hbase/hbase-hbase-****
>>> regionserver-PSLBHDN00*.log
>>> 2013-04-01 20:09:21,135 WARN org.apache.hadoop.hbase.util.****Sleeper:
>>> We
>>>
>>> slept 45491ms instead of 3000ms, this is likely due to a long garbage
>>> collecting pause and it's usually bad, see
>>> http://hbase.apache.org/book.**** <http://hbase.apache.org/book.**>
>>> html#trouble.rs.runtime.****zkexpired<http://hbase.apache.**
>>> org/book.html#trouble.rs.**runtime.zkexpired<http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired>
>>> >
>>> 2013-04-01 22:45:59,407 WARN org.apache.hadoop.hbase.util.****Sleeper:
>>> We
>>>
>>> slept 101271ms instead of 3000ms, this is likely due to a long garbage
>>> collecting pause and it's usually bad, see
>>> http://hbase.apache.org/book.**** <http://hbase.apache.org/book.**>
>>> html#trouble.rs.runtime.****zkexpired<http://hbase.apache.**
>>> org/book.html#trouble.rs.**runtime.zkexpired<http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired>
>>> >
>>>
>>> [...]$ egrep 'real=[1-9][0-9][0-9][0-9]*\.[****0-9][0-9]'
>>> /var/log/hbase/hbase-hbase-****regionserver-PSLBHDN00*.out
>>>
>>> * the below report is the above command for each time range.
>>> 0.0 - 0.1  secs GCs = 5084
>>> 0.1 - 0.5  secs GCs = 9
>>> 0.5 - 1.0  secs GCs = 3
>>> 1.0 - 010  secs GCs = 0
>>> 010 - 100  secs GCs = 0
>>> 100 - 1000 secs GCs = 0
>>>
>>> So, my script for getting every gc time ("real=... secs") says that
>>> there is no gc that took longer than 1 second.
>>> However the RS log says twice that the RS slept more than 40 seconds
>>> instead of 3.
>>>
>>> "this is likely due to a long garbage collecting pause", yes
>>> this is likely but I dont think it is the case.
>>>
>>> The machine is a huge machine with 70GB RAM, 32 procs, light load,
>>> no swap or iowait.
>>>
>>> Any ideas?
>>>
>>> Thanks,
>>> Pablo
>>>
>>>
>>> On 03/12/2013 12:43 PM, Pablo Musa wrote:
>>>
>>>  Guys,
>>>> thank you very much for the help.
>>>>
>>>> Yesterday I spent 14 hours trying to tune the whole cluster.
>>>> The cluster is not ready yet needs a lot of tunning, but at least is
>>>> working.
>>>>
>>>> My first big problem was namenode + datanode GC. They were not using
>>>> CMS and thus were taking "incremental" time to run. Ii started in 0.01
>>>> ms and
>>>> in 20 minutes was taking 150 secs.
>>>> After setting CMSGC this time is much smaller taking a maximum of 70
>>>> secs,
>>>> which is VERY HIGH, but for now does not stop HBase.
>>>>
>>>> With this issue solved, it was clear that the RS was doing a long pause
>>>> GC,
>>>> taking up to 220 secs. Zookeeper expired the RS and it shutdown.
>>>> I tried a lot of different flags configuration (MORE than 20), and could
>>>> not
>>>> get small gcs. Eventually it would take more than 150 secs (zookeeper
>>>> timeout)
>>>> and shutdown.
>>>>
>>>> Finally I tried a config that so far, 12 hours, is working with a
>>>> maximum
>>>> GC
>>>> time of 90 secs. Which of course is a terrible problem since HBase is a
>>>> database, but at least the cluster is stable while I can tune it a
>>>> little more.
>>>>
>>>> In my opinion, my biggest problem is to have a few "monster" machines in
>>>> the
>>>> cluster instead of a bunch of commodities machines. I don't know if
>>>> there are
>>>> a lot companies using this kind of machines inside a hadoop cluster, but
>>>> a fast search on google could not find a lot of tunes for big heap GCs.
>>>>
>>>> I guess my next step will be search for big heap gc tuning.
>>>>
>>>> Back to some questions ;)
>>>>
>>>>    > You have ganglia or tsdb running?
>>>>
>>>> I use zabbix for now, and no there is nothing going on when the big
>>>> pause happens.
>>>>
>>>>    > When you see the big pause above, can you see anything going on on
>>>> the
>>>>    > machine? (swap, iowait, concurrent fat mapreduce job?)
>>>>    > what are you doing during long GC happened? read or write? if
>>>> reading, what
>>>>    > the block cache size?
>>>>
>>>> The cpu for the RS process goes to 100% and the logs "pause", until it
>>>> gets out.
>>>> Ex: [NewPar
>>>>
>>>> IO and SWAP are normal. There is no MR running, just normal database
>>>> load, which is
>>>> very low. I am probably doing reads AND writes to the database with
>>>> default block
>>>> cache size.
>>>> One problem in this moment might be the big number of regions (1252)
>>>> since I am
>>>> using only one RS to be able to track the problem.
>>>>
>>>> The links and ideas were very helpful. Thank you very much guys.
>>>>
>>>> I will post my future researches as I find a solution ;)
>>>>
>>>> If you have more ideas or info (links, flag suggestions, etc.), please
>>>> post it :)
>>>>
>>>> Abs,
>>>> Pablo
>>>>
>>>> On 03/10/2013 11:24 PM, Andrew Purtell wrote:
>>>>
>>>>  Be careful with GC tuning, throwing changes at an application without
>>>>> analysis of what is going on with the heap is shooting in the dark. One
>>>>> particular good treatment of the subject is here:
>>>>> http://java.dzone.com/****articles/how-tame-java-gc-****pauses<http://java.dzone.com/**articles/how-tame-java-gc-**pauses>
>>>>> <http://java.dzone.com/**articles/how-tame-java-gc-**pauses<http://java.dzone.com/articles/how-tame-java-gc-pauses>
>>>>> >
>>>>>
>>>>>
>>>>> If you have made custom changes to blockcache or memstore
>>>>> configurations,
>>>>> back them out until you're sure everything else is ok.
>>>>>
>>>>> Watch carefully for swapping. Set the vm.swappiness sysctl to 0.
>>>>> Monitor
>>>>> for spikes in page scanning or any swap activity. Nothing brings on
>>>>> "Juliette" pauses better than a JVM partially swapped out. The Java GC
>>>>> starts collection by examining the oldest pages, and those are the
>>>>> first
>>>>> pages the OS swaps out...
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu <azuryyyu@gmail.com>
>>>>> wrote:
>>>>>
>>>>>   Hi Pablo,
>>>>>
>>>>>> It'a terrible for a long minor GC. I don't think there are swaping
>>>>>> from
>>>>>> your vmstat log.
>>>>>> but I just suggest you
>>>>>> 1) add following JVM options:
>>>>>> -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19
>>>>>> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2
>>>>>> -XX:MaxTenuringThreshold=3 -XX:+UseFastAccessorMethods
>>>>>>
>>>>>> 2) -Xmn is two small, your total Mem is 74GB, just make -Xmn2g
>>>>>> 3) what are you doing during long GC happened? read or write? if
>>>>>> reading,
>>>>>> what the block cache size?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 11, 2013 at 6:41 AM, Stack <stack@duboce.net> wrote:
>>>>>>
>>>>>>   You could increase your zookeeper session timeout to 5 minutes while
>>>>>>
>>>>>>> you
>>>>>>> are figuring why these long pauses.
>>>>>>> http://hbase.apache.org/book.****html#zookeeper.session.**timeout<http://hbase.apache.org/book.**html#zookeeper.session.timeout>
>>>>>>> <http://hbase.apache.**org/book.html#zookeeper.**session.timeout<http://hbase.apache.org/book.html#zookeeper.session.timeout>
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>> Above, there is an outage for almost 5 minutes:
>>>>>>>
>>>>>>>   We slept 225100ms instead of 3000ms, this is likely due to a long
>>>>>>>
>>>>>>>> You have ganglia or tsdb running?  When you see the big pause above,
>>>>>>>>
>>>>>>> can
>>>>>>> you see anything going on on the machine?  (swap, iowait, concurrent
>>>>>>> fat
>>>>>>> mapreduce job?)
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Mar 10, 2013 at 3:29 PM, Pablo Musa <pablo@psafe.com> wrote:
>>>>>>>
>>>>>>>   Hi Sreepathi,
>>>>>>>
>>>>>>>> they say in the book (or the site), we could try it to see if it is
>>>>>>>>
>>>>>>>>  really
>>>>>>>
>>>>>>>  a timeout error
>>>>>>>> or there is something more. But it is not recomended for production
>>>>>>>> environments.
>>>>>>>>
>>>>>>>> I could give it a try if five minutes will ensure to us that the
>>>>>>>>
>>>>>>>>  problem
>>>>>>> is the GC or
>>>>>>>
>>>>>>>> elsewhere!! Anyway, I think it is hard to beleive a GC is taking
>>>>>>>> 2:30
>>>>>>>> minutes.
>>>>>>>>
>>>>>>>> Abs,
>>>>>>>> Pablo
>>>>>>>>
>>>>>>>>
>>>>>>>> On 03/10/2013 04:06 PM, Sreepathi wrote:
>>>>>>>>
>>>>>>>>   Hi Stack/Ted/Pablo,
>>>>>>>>
>>>>>>>>> Should we increase the hbase.rpc.timeout property to 5 minutes (
>>>>>>>>>
>>>>>>>>>  300000
>>>>>>>>
>>>>>>> ms
>>>>>>>
>>>>>>>  )  ?
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> - Sreepathi
>>>>>>>>>
>>>>>>>>> On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa <pablo@psafe.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>     That combo should be fine.
>>>>>>>>>
>>>>>>>>>  Great!!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     If JVM is full GC'ing, the application is stopped.
>>>>>>>>>>
>>>>>>>>>>  The below does not look like a full GC but that is a long pause
>>>>>>>>>>> in
>>>>>>>>>>> system
>>>>>>>>>>> time, enough to kill your zk session.
>>>>>>>>>>>
>>>>>>>>>>>   Exactly. This pause is really making the zk expire the RS which
>>>>>>>>>>>
>>>>>>>>>> shutsdown
>>>>>>>>>
>>>>>>>> (logs
>>>>>>>>
>>>>>>>>> in the end of the email).
>>>>>>>>>> But the question is: what is causing this pause??!!
>>>>>>>>>>
>>>>>>>>>>     You swapping?
>>>>>>>>>> I don't think so (stats below).
>>>>>>>>>>
>>>>>>>>>>     Hardware is good?
>>>>>>>>>> Yes, it is a 16 processor machine with 74GB of RAM and plenty disk
>>>>>>>>>>
>>>>>>>>>>  space.
>>>>>>>>>
>>>>>>>> Below are some metrics I have heard about. Hope it helps.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ** I am having some problems with the datanodes[1] which are
>>>>>>>>>> having
>>>>>>>>>> trouble to
>>>>>>>>>> write. I really think the issues are related, but cannot solve any
>>>>>>>>>> of
>>>>>>>>>> them
>>>>>>>>>> :(
>>>>>>>>>>
>>>>>>>>>> Thanks again,
>>>>>>>>>> Pablo
>>>>>>>>>>
>>>>>>>>>> [1] http://mail-archives.apache.
>>>>>>>>>>
>>>>>>>>>>  ****org/mod_mbox/hadoop-hdfs-****user/****
>>>>>>>>>
>>>>>>>> 201303.mbox/%3CCAJzooYfS-F1KS+**************
>>>>>> jGOPUt15PwFjcCSzigE0APeM9FXaCr***
>>>>>> *****
>>>>>>
>>>>>>  QfVbyQ@mail.gmail.com%3E<http:******//mail-archives.apache.**
>>>>>>> org/**** <http://mail-archives.apache.org/****><
>>>>>>> http://mail-archives.**apache.org/**<http://mail-archives.apache.org/**>
>>>>>>> >
>>>>>>>
>>>>>>>> mod_mbox/hadoop-hdfs-user/******201303.mbox/%3CCAJzooYfS-F1KS+**
>>>>>>>>>> ****
>>>>>>>>>> jGOPUt15PwFjcCSzigE0APeM9FXaCr******QfVbyQ@mail.gmail.com%3E<
>>>>>>>>>>
>>>>>>>>>>  http://mail-archives.apache.****org/mod_mbox/hadoop-hdfs-user/**
>>>>>>>>> **
>>>>>>>>>
>>>>>>>> 201303.mbox/%3CCAJzooYfS-F1KS+******jGOPUt15PwFjcCSzigE0APeM9FXaCr*
>>>>>> ***
>>>>>> QfVbyQ@mail.gmail.com%3E<http:**//mail-archives.apache.org/**
>>>>>> mod_mbox/hadoop-hdfs-user/**201303.mbox/%3CCAJzooYfS-F1KS+**
>>>>>> jGOPUt15PwFjcCSzigE0APeM9FXaCr**QfVbyQ@mail.gmail.com%3E<http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+jGOPUt15PwFjcCSzigE0APeM9FXaCrQfVbyQ@mail.gmail.com%3E>
>>>>>> >
>>>>>>
>>>>>>  top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34,
>>>>>>>
>>>>>>>> 2.55,
>>>>>>>>>
>>>>>>>> 1.28
>>>>>>>
>>>>>>>> Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
>>>>>>>>>> Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,
>>>>>>>>>> 0.0%si,
>>>>>>>>>>      0.0%st
>>>>>>>>>> Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k
>>>>>>>>>>
>>>>>>>>>>  buffers
>>>>>>>>>
>>>>>>>> Swap: 51609592k total,   128312k used, 51481280k free,  1353400k
>>>>>>>>
>>>>>>>>> cached
>>>>>>>>>
>>>>>>>> ]$ vmstat -w
>>>>>>>
>>>>>>>> procs -------------------memory-----********------------- ---swap--
>>>>>>>>>>
>>>>>>>>>> -----io----
>>>>>>>>>> --system-- -----cpu-------
>>>>>>>>>>      r  b       swpd       free       buff      cache   si   so
>>>>>>>>>>   bi bo
>>>>>>>>>> in
>>>>>>>>>>       cs  us sy  id wa st
>>>>>>>>>>      2  0     128312   32416928    5838288    5043560    0    0
>>>>>>>>>> 202 53
>>>>>>>>>>     0
>>>>>>>>>>        0   2  1  96  1  0
>>>>>>>>>>
>>>>>>>>>> ]$ sar
>>>>>>>>>> 02:20:01 PM     all     26.18      0.00      2.90      0.63 0.00
>>>>>>>>>> 70.29
>>>>>>>>>> 02:30:01 PM     all      1.66      0.00      1.25      1.05 0.00
>>>>>>>>>> 96.04
>>>>>>>>>> 02:40:01 PM     all     10.01      0.00      2.14      0.75 0.00
>>>>>>>>>> 87.11
>>>>>>>>>> 02:50:01 PM     all      0.76      0.00      0.80      1.03 0.00
>>>>>>>>>> 97.40
>>>>>>>>>> 03:00:01 PM     all      0.23      0.00      0.30      0.71 0.00
>>>>>>>>>> 98.76
>>>>>>>>>> 03:10:01 PM     all      0.22      0.00      0.30      0.66 0.00
>>>>>>>>>> 98.82
>>>>>>>>>> 03:20:01 PM     all      0.22      0.00      0.31      0.76 0.00
>>>>>>>>>> 98.71
>>>>>>>>>> 03:30:01 PM     all      0.24      0.00      0.31      0.64 0.00
>>>>>>>>>> 98.81
>>>>>>>>>> 03:40:01 PM     all      1.13      0.00      2.97      1.18 0.00
>>>>>>>>>> 94.73
>>>>>>>>>> Average:        all      3.86      0.00      1.38      0.88 0.00
>>>>>>>>>> 93.87
>>>>>>>>>>
>>>>>>>>>> ]$ iostat
>>>>>>>>>> Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002)     03/10/2013
>>>>>>>>>> _x86_64_
>>>>>>>>>>      (16 CPU)
>>>>>>>>>>
>>>>>>>>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>>>>>>>>                1.86    0.00    0.96    0.78    0.00   96.41
>>>>>>>>>>
>>>>>>>>>> Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read
>>>>>>>>>> Blk_wrtn
>>>>>>>>>> sda               1.23        20.26        23.53  521533196
>>>>>>>>>> 605566924
>>>>>>>>>> sdb               6.51       921.55       241.90 23717850730
>>>>>>>>>>
>>>>>>>>>>  6225863488
>>>>>>>>>
>>>>>>>> sdc               6.22       921.83       236.41 23725181162
>>>>>>>
>>>>>>>> 6084471192
>>>>>>>>>
>>>>>>>> sdd               6.25       925.13       237.26 23810004970
>>>>>>>
>>>>>>>> 6106357880
>>>>>>>>>
>>>>>>>> sde               6.19       913.90       235.60 23521108818
>>>>>>>
>>>>>>>> 6063722504
>>>>>>>>>
>>>>>>>> sdh               6.26       933.08       237.77 24014594546
>>>>>>>
>>>>>>>> 6119511376
>>>>>>>>>
>>>>>>>> sdg               6.18       914.36       235.31 23532747378
>>>>>>>
>>>>>>>> 6056257016
>>>>>>>>>
>>>>>>>> sdf               6.24       923.66       235.33 23772251810
>>>>>>>
>>>>>>>> 6056604008
>>>>>>>>>
>>>>>>>> Some more logging which reinforce that the RS crash is happening
>>>>>>>
>>>>>>>> because
>>>>>>>>>
>>>>>>>> of
>>>>>>>>
>>>>>>>>> timeout. However this time the GC log is not accusing a big time.
>>>>>>>>>>
>>>>>>>>>> #####RS LOG#####
>>>>>>>>>> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.********
>>>>>>>>>> ClientCnxn:
>>>>>>>>>>
>>>>>>>>>>  Client
>>>>>>>>>
>>>>>>>> session timed out, have not heard from server in 257739ms for
>>>>>>>>
>>>>>>>>> sessionid
>>>>>>>>>
>>>>>>>> 0x13d3c4bcba6014a, closing socket connection and attempting
>>>>>>> reconnect
>>>>>>>
>>>>>>>> 2013-03-10 15:37:46,712 INFO org.apache.zookeeper.********
>>>>>>>>>> ClientCnxn:
>>>>>>>>>>
>>>>>>>>>>  Client
>>>>>>>>>
>>>>>>>> session timed out, have not heard from server in 226785ms for
>>>>>>>>
>>>>>>>>> sessionid
>>>>>>>>>
>>>>>>>> 0x13d3c4bcba60149, closing socket connection and attempting
>>>>>>> reconnect
>>>>>>>
>>>>>>>> 2013-03-10 15:37:46,712 DEBUG org.apache.hadoop.hbase.io.********
>>>>>>>>>>
>>>>>>>>>> hfile.LruBlockCache:
>>>>>>>>>> Stats: total=61.91 MB, free=1.94 GB, max=2 GB, blocks=1254,
>>>>>>>>>> accesses=60087,
>>>>>>>>>> hits=58811, hitRatio=97.87%, , cachingAccesses=60069,
>>>>>>>>>>
>>>>>>>>>>  cachingHits=58811,
>>>>>>>>>
>>>>>>>> cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN
>>>>>>>>
>>>>>>>>> 2013-03-10 15:37:46,712 WARN
>>>>>>>>>>
>>>>>>>>>>  org.apache.hadoop.hbase.util.********Sleeper:
>>>>>>>>>
>>>>>>>> We
>>>>>>>
>>>>>>>> slept 225100ms instead of 3000ms, this is likely due to a long
>>>>>>>>>>
>>>>>>>>>>  garbage
>>>>>>>>>
>>>>>>>> collecting pause and it's usually bad, see
>>>>>>>
>>>>>>>> http://hbase.apache.org/book.********<http://hbase.apache.org/book.******>
>>>>>>>>>> <http://hbase.apache.org/**book.****<http://hbase.apache.org/book.****>
>>>>>>>>>> ><
>>>>>>>>>> http://hbase.apache.org/book.******<http://hbase.apache.org/book.****><
>>>>>>>>>> http://hbase.apache.org/book.****<http://hbase.apache.org/book.**>
>>>>>>>>>> >
>>>>>>>>>> html#trouble.rs.runtime.********zkexpired<http://hbase.apache.**
>>>>>>>>>> ****
>>>>>>>>>> org/book.html#trouble.rs.******runtime.zkexpired<
>>>>>>>>>>
>>>>>>>>>>  http://hbase.apache.org/book.****html#trouble.rs.runtime.****
>>>>>>>>> zkexpired<http://hbase.apache.org/book.**html#trouble.rs.runtime.**zkexpired>
>>>>>>>>> <http://hbase.apache.**org/book.html#trouble.rs.**
>>>>>>>>> runtime.zkexpired<http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>> 2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.********
>>>>>>>> DFSClient:
>>>>>>>>
>>>>>>>>> DFSOutputStream ResponseProcessor exception  for block
>>>>>>>>>> BP-43236042-172.17.2.10-********1362490844340:blk_-**
>>>>>>>>>>
>>>>>>>>>> 6834190810033122569_25150229
>>>>>>>>>>
>>>>>>>>>> java.io.EOFException: Premature EOF: no length prefix available
>>>>>>>>>>             at org.apache.hadoop.hdfs.******
>>>>>>>>>> protocol.HdfsProtoUtil.**
>>>>>>>>>> vintPrefixed(HdfsProtoUtil.********java:171)
>>>>>>>>>>             at org.apache.hadoop.hdfs.********
>>>>>>>>>> protocol.datatransfer.**
>>>>>>>>>> PipelineAck.readFields(********PipelineAck.java:114)
>>>>>>>>>>             at
>>>>>>>>>>
>>>>>>>>>>  org.apache.hadoop.hdfs.********DFSOutputStream$DataStreamer$***
>>>>>>>>> *****
>>>>>>>>>
>>>>>>>> ResponseProcessor.run(********DFSOutputStream.java:670)
>>>>>>>>
>>>>>>>>> 2013-03-10 15:37:46,716 ERROR org.apache.hadoop.hbase.**
>>>>>>>>>> regionserver.HRegionServer:
>>>>>>>>>> org.apache.hadoop.hbase.ipc.**********
>>>>>>>>>> CallerDisconnectedException:
>>>>>>>>>> Aborting
>>>>>>>>>> call
>>>>>>>>>> get([B@7caf69ed, {"timeRange":[0,********9223372036854775807],"**
>>>>>>>>>> totalColumns":1,"cacheBlocks":********true,"families":{"**
>>>>>>>>>> details"**
>>>>>>>>>> :[**"**
>>>>>>>>>> ALL"]},"maxVersions":1,"row":"********\\x00\\x00\\x00\\x00\\**
>>>>>>>>>> x00\**
>>>>>>>>>>
>>>>>>>>>> \***
>>>>>>>>>> *x12\\x93@"}),
>>>>>>>>>> rpc version=1, client version=29, methodsFingerPrint=1891768260
>>>>>>>>>> from
>>>>>>>>>> 172.17.1.71:51294 after 0 ms, since caller disconnected
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> #####GC LOG#####
>>>>>>>>>> 2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.00 secs]
>>>>>>>
>>>>>>>> 2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.00 secs]
>>>>>>>
>>>>>>>> 2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 2984.239: [GC 2984.239: [ParNew: 58680K->5282K(59008K), 0.0048070
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 362921K->310977K(1152704K), 0.0049180 secs] [Times: user=0.03
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 3050.620: [GC 3050.620: [ParNew: 57524K->5292K(59008K), 0.0055510
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 363219K->313384K(1152704K), 0.0056520 secs] [Times: user=0.04
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 3111.331: [GC 3111.331: [ParNew: 57542K->5637K(59008K), 0.0056900
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 365634K->318079K(1152704K), 0.0057950 secs] [Times: user=0.03
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 3163.510: [GC 3163.510: [ParNew: 58117K->1138K(59008K), 0.0046540
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 370559K->316086K(1152704K), 0.0047470 secs] [Times: user=0.03
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 3217.642: [GC 3217.643: [ParNew: 53618K->1001K(59008K), 0.0038530
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 368566K->315949K(1152704K), 0.0039450 secs] [Times: user=0.03
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.00 secs]
>>>>>>>
>>>>>>>> 3464.067: [GC 3464.067: [ParNew: 53481K->5348K(59008K), 0.0068370
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 368429K->322343K(1152704K), 0.0069170 secs] [Times: user=0.07
>>>>>>>
>>>>>>>> sys=0.01,
>>>>>>>>>
>>>>>>>> real=0.00 secs]
>>>>>>>
>>>>>>>> 3465.648: [GC 3465.648: [ParNew: 57828K->1609K(59008K), 0.0111520
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 374823K->321496K(1152704K), 0.0112170 secs] [Times: user=0.07
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> 3466.291: [GC 3466.291: [ParNew: 54089K->2186K(59008K), 0.0093190
>>>>>>>>>>
>>>>>>>>>>  secs]
>>>>>>>>>
>>>>>>>> 373976K->322073K(1152704K), 0.0093800 secs] [Times: user=0.06
>>>>>>>
>>>>>>>> sys=0.00,
>>>>>>>>>
>>>>>>>> real=0.01 secs]
>>>>>>>
>>>>>>>> Heap
>>>>>>>>>>      par new generation   total 59008K, used 15497K
>>>>>>>>>> [0x00000005fae00000,
>>>>>>>>>> 0x00000005fee00000, 0x00000005fee00000)
>>>>>>>>>>       eden space 52480K,  25% used [0x00000005fae00000,
>>>>>>>>>>
>>>>>>>>>>  0x00000005fbaff8c8,
>>>>>>>>>
>>>>>>>> 0x00000005fe140000)
>>>>>>>>
>>>>>>>>>       from space 6528K,  33% used [0x00000005fe140000,
>>>>>>>>>>
>>>>>>>>>>  0x00000005fe362b70,
>>>>>>>>>
>>>>>>>> 0x00000005fe7a0000)
>>>>>>>
>>>>>>>>       to   space 6528K,   0% used [0x00000005fe7a0000,
>>>>>>>>>>
>>>>>>>>>>  0x00000005fe7a0000,
>>>>>>>>>
>>>>>>>> 0x00000005fee00000)
>>>>>>>
>>>>>>>>      concurrent mark-sweep generation total 1093696K, used 319887K
>>>>>>>>>> [0x00000005fee00000, 0x0000000641a10000, 0x00000007fae00000)
>>>>>>>>>>      concurrent-mark-sweep perm gen total 30464K, used 30427K
>>>>>>>>>> [0x00000007fae00000, 0x00000007fcbc0000, 0x0000000800000000)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 03/08/2013 07:02 PM, Stack wrote:
>>>>>>>>>>
>>>>>>>>>>     On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa <pablo@psafe.com>
>>>>>>>>>>
>>>>>>>>>>  wrote:
>>>>>>>>>
>>>>>>>>      0.94 currently doesn't support hadoop 2.0
>>>>>>>
>>>>>>>>   Can you deploy hadoop 1.1.1 instead ?
>>>>>>>>>>>
>>>>>>>>>>>>      I am using cdh4.2.0 which uses this version as default
>>>>>>>>>>>>>
>>>>>>>>>>>>>  installation.
>>>>>>>>>>>>
>>>>>>>>>>> I think it will be a problem for me to deploy 1.1.1 because I
>>>>>>>> would
>>>>>>>>
>>>>>>>>> need
>>>>>>>>>>>> to
>>>>>>>>>>>> "upgrade" the whole cluster with 70TB of data (backup
>>>>>>>>>>>> everything,
>>>>>>>>>>>>
>>>>>>>>>>>>  go
>>>>>>>>>>>
>>>>>>>>>> offline, etc.).
>>>>>>>
>>>>>>>> Is there a problem to use cdh4.2.0?
>>>>>>>>>>>> I should send my email to cdh list?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>      That combo should be fine.
>>>>>>>>>>>>
>>>>>>>>>>>>         You Full GC'ing around this time?
>>>>>>>>>>>>
>>>>>>>>>>>   The GC shows it took a long time. However it does not make any
>>>>>>>>>>> sense
>>>>>>>>>>>
>>>>>>>>>> to be it, since the same ammount of data was cleaned before and
>>>>>>>
>>>>>>>> AFTER
>>>>>>>>>>>
>>>>>>>>>> in just 0.01 secs!
>>>>>>>
>>>>>>>>
>>>>>>>>>>>>      If JVM is full GC'ing, the application is stopped.
>>>>>>>>>>>>
>>>>>>>>>>>>       [Times: user=0.08 sys=137.62, real=137.62 secs]
>>>>>>>>>>>>
>>>>>>>>>>>   Besides the whole time was used by system. That is what is
>>>>>>>>>>> bugging
>>>>>>>>>>> me.
>>>>>>>>>>>
>>>>>>>>>>      The below does not look like a full GC but that is a long
>>>>>>>> pause
>>>>>>>>
>>>>>>>>> in
>>>>>>>>>>>
>>>>>>>>>> system
>>>>>>>
>>>>>>>> time, enough to kill your zk session.
>>>>>>>>>>>
>>>>>>>>>>> You swapping?
>>>>>>>>>>>
>>>>>>>>>>> Hardware is good?
>>>>>>>>>>>
>>>>>>>>>>> St.Ack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>        ...
>>>>>>>>>>>
>>>>>>>>>>>   1044.081: [GC 1044.081: [ParNew: 58970K->402K(59008K),
>>>>>>>>>>> 0.0040990
>>>>>>>>>>> secs]
>>>>>>>>>>>
>>>>>>>>>> 275097K->216577K(1152704K), 0.0041820 secs] [Times: user=0.03
>>>>>>>>
>>>>>>>>> sys=0.00,
>>>>>>>>>>>
>>>>>>>>>> real=0.01 secs]
>>>>>>>>
>>>>>>>>> 1087.319: [GC 1087.319: [ParNew: 52873K->6528K(59008K), 0.0055000
>>>>>>>>>>>>
>>>>>>>>>>>>  secs]
>>>>>>>>>>>
>>>>>>>>>> 269048K->223592K(1152704K), 0.0055930 secs] [Times: user=0.04
>>>>>>>>
>>>>>>>>> sys=0.01,
>>>>>>>>>>>
>>>>>>>>>> real=0.00 secs]
>>>>>>>>
>>>>>>>>> 1087.834: [GC 1087.834: [ParNew: 59008K->6527K(59008K),
>>>>>>>>>>>> 137.6353620
>>>>>>>>>>>> secs] 276072K->235097K(1152704K), 137.6354700 secs] [Times:
>>>>>>>>>>>>
>>>>>>>>>>>>  user=0.08
>>>>>>>>>>>
>>>>>>>>>> sys=137.62, real=137.62 secs]
>>>>>>>
>>>>>>>> 1226.638: [GC 1226.638: [ParNew: 59007K->1897K(59008K), 0.0079960
>>>>>>>>>>>>
>>>>>>>>>>>>  secs]
>>>>>>>>>>>
>>>>>>>>>> 287577K->230937K(1152704K), 0.0080770 secs] [Times: user=0.05
>>>>>>>>
>>>>>>>>> sys=0.00,
>>>>>>>>>>>
>>>>>>>>>> real=0.01 secs]
>>>>>>>>
>>>>>>>>> 1227.251: [GC 1227.251: [ParNew: 54377K->2379K(59008K), 0.0095650
>>>>>>>>>>>>
>>>>>>>>>>>>  secs]
>>>>>>>>>>>
>>>>>>>>>> 283417K->231420K(1152704K), 0.0096340 secs] [Times: user=0.06
>>>>>>>>
>>>>>>>>> sys=0.00,
>>>>>>>>>>>
>>>>>>>>>> real=0.01 secs]
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>> I really appreciate you guys helping me to find out what is
>>>>>>>>>>>> wrong.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Pablo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 03/08/2013 02:11 PM, Stack wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>      What RAM says.
>>>>>>>>>>>>
>>>>>>>>>>>>   2013-03-07 17:24:57,887 INFO
>>>>>>>>>>>> org.apache.zookeeper.************ClientCnxn:
>>>>>>>>>>>>
>>>>>>>>>>> Client
>>>>>>>>
>>>>>>>>> session timed out, have not heard from server in 159348ms for
>>>>>>>>>>>>> sessionid
>>>>>>>>>>>>> 0x13d3c4bcba600a7, closing socket connection and attempting
>>>>>>>>>>>>>
>>>>>>>>>>>>>  reconnect
>>>>>>>>>>>>
>>>>>>>>>>> You Full GC'ing around this time?
>>>>>>>>
>>>>>>>>> Put up your configs in a place where we can take a look?
>>>>>>>>>>>>>
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 8, 2013 at 8:32 AM, ramkrishna vasudevan <
>>>>>>>>>>>>> ramkrishna.s.vasudevan@gmail.**********com
>>>>>>>>>>>>>
>>>>>>>>>>>>>  <ramkrishna.s.vasudevan@gmail.
>>>>>>>>>>>>
>>>>>>>>>>> ****
>>>>>>>>
>>>>>>>>> com <ramkrishna.s.vasudevan@gmail.******com<
>>>>>>>>>>>>>
>>>>>>>>>>>>>  ramkrishna.s.vasudevan@gmail.****com<ramkrishna.s.vasudevan@*
>>>>>>>>>>>> *gmail.com <ramkrishna.s.vasudevan@gmail.com>>
>>>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>       I think it is with your GC config.  What is your heap size?
>>>>>>>>>>>>>
>>>>>>>>>>>>>      What
>>>>>>>>>>>>
>>>>>>>>>>> is
>>>>>>>
>>>>>>>> the
>>>>>>>>>>>>>
>>>>>>>>>>>>>     data that you pump in and how much is the block cache size?
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Regards
>>>>>>>>>>>>>> Ram
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu <yuzhihong@gmail.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>       0.94 currently doesn't support hadoop 2.0
>>>>>>>>
>>>>>>>>>     Can you deploy hadoop 1.1.1 instead ?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  Are you using 0.94.5 ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa <pablo@psafe.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>       Hey guys,
>>>>>>>>
>>>>>>>>>     as I sent in an email a long time ago, the RSs in my cluster
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  did
>>>>>>>>>>>>>>
>>>>>>>>>>>>> not
>>>>>>>
>>>>>>>>      get
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   along
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     and crashed 3 times a day. I tried a lot of options we
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  discussed
>>>>>>>>>>>>>>
>>>>>>>>>>>>> in
>>>>>>>
>>>>>>>  the
>>>>>>>>
>>>>>>>>> emails, but it not solved the problem. As I used an old
>>>>>>>>>>>>>>>> version
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  of
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      hadoop I
>>>>>>>>
>>>>>>>>>       thought this was the problem.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   So, I upgraded from hadoop 0.20 - hbase 0.90 - zookeeper
>>>>>>>>>>>>>>> 3.3.5
>>>>>>>>>>>>>>> to
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      hadoop
>>>>>>>
>>>>>>>>   2.0.0
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     - hbase 0.94 - zookeeper 3.4.5.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Unfortunately the RSs did not stop crashing, and worst! Now
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  they
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> crash
>>>>>>>
>>>>>>>> every
>>>>>>>>>>>>>>>> hour and some times when the RS that holds the .ROOT.
>>>>>>>>>>>>>>>> crashes
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  all
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      cluster
>>>>>>>
>>>>>>>>   get
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     stuck in transition and everything stops working.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  In this case I need to clean zookeeper znodes, restart the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  master
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> and
>>>>>>>
>>>>>>>>      the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   RSs.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     To avoid this case I am running on production with only
>>>>>>>>>>>>>>> ONE
>>>>>>>>>>>>>>> RS
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  and
>>>>>>>>>>>>>>
>>>>>>>>>>>>> a
>>>>>>>>
>>>>>>>>> monitoring
>>>>>>>>>>>>>>>> script that check every minute, if the RS is ok. If not,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  restart
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> it.
>>>>>>>
>>>>>>>> * This case does not get the cluster stuck.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This is driving me crazy, but I really cant find a solution
>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cluster.
>>>>>>>>
>>>>>>>>> I tracked all logs from the start time 16:49 from all
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  interesting
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> nodes
>>>>>>>
>>>>>>>> (zoo,
>>>>>>>>>>>>>>>> namenode, master, rs, dn2, dn9, dn10) and copied here what I
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  think
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> is
>>>>>>>>
>>>>>>>>> usefull.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> There are some strange errors in the DATANODE2, as an error
>>>>>>>>>>>>>>>> copiyng a
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>      block
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       to itself.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   The gc log points to GC timeout. However it is very weird
>>>>>>>>>>>>>>> that
>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> RS
>>>>>>>>
>>>>>>>>>      spend
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       so much time in GC while in the other cases it takes
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 0.001sec.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   Besides,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> the time
>>>>>>>>>>>>>>>> spent, is in sys which makes me think that might be a
>>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  in
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      another
>>>>>>>
>>>>>>>>   place.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     I know that it is a bunch of logs, and that it is very
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  difficult
>>>>>>>>>>>>>>
>>>>>>>>>>>>> to
>>>>>>>
>>>>>>>       find
>>>>>>>>
>>>>>>>>>   the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>     problem without much context. But I REALLY need some
>>>>>>>>>>>>>>> help.
>>>>>>>>>>>>>>> If
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  it
>>>>>>>>>>>>>>
>>>>>>>>>>>>> is
>>>>>>>
>>>>>>>  not
>>>>>>>>
>>>>>>>>>      the
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       solution, at least what I should read, where I should
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> look, or
>>>>>>>>>>>>>>> which
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>       cases
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>       I
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   should monitor.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thank you very much,
>>>>>>>>>>>>>>>> Pablo Musa
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   --
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Regards,*
>>>>>>>>> --- *Sreepathi *
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  --
>>>>> Best regards,
>>>>>
>>>>>       - Andy
>>>>>
>>>>> Problems worthy of attack prove their worth by hitting back. - Piet
>>>>> Hein
>>>>> (via Tom White)
>>>>>
>>>>>
>

--14dae94734b346bd6a04d97bb51d--