Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
From: Brian Jeltema <bdjeltema@gmail.com>
Content-Type: multipart/alternative;
 boundary="Apple-Mail=_3080E70F-C03F-48F4-BEF1-360055C25D05"
Message-Id: <A34CE157-E128-4FCE-A177-B95B90CBB4FC@gmail.com>
Mime-Version: 1.0 (Mac OS X Mail 9.1 \(3096.5\))
Subject: Re: regions in transition
Date: Wed, 23 Dec 2015 07:20:28 -0500
References: <CD129FB9-789F-4390-B3E2-B237BCBAC9B3@digitalenvoy.net>
 <CALte62wz_LGehRo3mnxRJhxqLcho0bzQyaqOJguRvtUvnv8uYw@mail.gmail.com>
 <552F68EF-1469-4E23-83F0-0294AADFE521@digitalenvoy.net>
 <512A4548-13CA-4B62-8223-0DA7C9E0AF66@gmail.com>
 <CALte62wQhMppYWOfiKu7-AXEbguaqWmHRyUZCNuFQPj2bUKyWw@mail.gmail.com>
 <CAKrkF=tsDkzd3BCyCyP2okYY7TQ72rzp108Bwa1SmaridmC-yA@mail.gmail.com>
 <5E993C94-2A75-439E-B024-044229F758F0@digitalenvoy.net>
 <CAKrkF=usS6NbLMC1iqhaQvnV6MiSCNTuhz9s+ADC8rQF2pQEZA@mail.gmail.com>
 <94302313-851D-4F1A-9DF6-E2A966BBEBB0@digitalenvoy.net>
To: user@hbase.apache.org
In-Reply-To: <94302313-851D-4F1A-9DF6-E2A966BBEBB0@digitalenvoy.net>

--Apple-Mail=_3080E70F-C03F-48F4-BEF1-360055C25D05
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

Update on this:

Deleting the contents of the /hbase-unsecure/region-in-transition node did
fix my problem with HBase finding my table regions.

I still have a problem though, possibly related. I’m seeing OutOfMemory
errors in the region server logs (modified slightly):

2015-12-23 06:52:37,466 INFO  [RS_LOG_REPLAY_OPS-p7:60020-0] handler.HLogSplitterHandler: worker p7.foo.net,60020,1450871487168 done with task /hbase-unsecure/splitWAL/WALs%2Fp15.foo.net%2C60020%2C1450535337455-splitting%2Fp15.foo.net%252C60020%252C1450535337455.1450535339318 in 68348ms
2015-12-23 06:52:37,466 ERROR [RS_LOG_REPLAY_OPS-p7:60020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:713)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1360)
        at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.close(HLogSplitter.java:1121)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(HLogSplitter.java:1086)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:360)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:220)
        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:143)
        at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:82)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
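
The task name in the INFO line above appears to be a percent-encoded WAL
path used as a znode name. A minimal sketch for making it readable, assuming
standard URL percent-encoding; the helper name decode_task is made up for
illustration and is not part of HBase:

```python
from urllib.parse import unquote


def decode_task(znode_path: str) -> str:
    """Return the percent-decoded last component of a splitWAL task znode path."""
    # A znode child name cannot contain '/', so the last component is the task name.
    task_name = znode_path.rsplit("/", 1)[-1]
    # One decoding pass only: '%2F' -> '/', '%2C' -> ',', '%25' -> '%'.
    return unquote(task_name)


if __name__ == "__main__":
    task = ("/hbase-unsecure/splitWAL/"
            "WALs%2Fp15.foo.net%2C60020%2C1450535337455-splitting"
            "%2Fp15.foo.net%252C60020%252C1450535337455.1450535339318")
    print(decode_task(task))
```

After one pass the %252C sequences become a literal %2C, which suggests the
WAL file name was already encoded once before the znode name was built.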

The region servers are configured with an 8G heap. I initially thought this
might be a ulimit problem, so I bumped the open file limit to about 10K and
the process limit up to 2048, but that did not seem to matter. What other
parameters might be causing an OOM error?
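
For context, "unable to create new native thread" is generally raised when
the JVM cannot start an OS thread, so it tends to point at thread/process
limits or native memory rather than at the Java heap. On Linux,
/proc/<pid>/limits shows the limits a running process actually received,
which can differ from what ulimit reports in a login shell. Below is a
minimal sketch, assuming Linux and a pid supplied by the caller; the helper
names parse_limits and thread_headroom are made up for illustration:

```python
import os
import re
import sys


def parse_limits(text):
    """Parse the content of /proc/<pid>/limits into {limit name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:  # the first line is the column header
        # Columns are padded with runs of spaces; limit names use single spaces.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def thread_headroom(pid):
    """Return (thread count of pid, soft 'Max processes' limit as a string)."""
    with open(f"/proc/{pid}/limits") as fh:
        soft, _hard = parse_limits(fh.read())["Max processes"]
    # Each entry under /proc/<pid>/task is one thread of the process.
    threads = len(os.listdir(f"/proc/{pid}/task"))
    return threads, soft


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = int(sys.argv[1])  # for example, the region server's pid
    threads, soft = thread_headroom(pid)
    # The limit applies per user, so threads in that user's other
    # processes count toward it as well.
    print(f"pid {pid}: {threads} threads, soft 'Max processes' limit {soft}")
```

If the soft limit printed here is lower than the value set with ulimit, the
new setting probably never reached the service's environment.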

Thanks
Brian

> On Dec 22, 2015, at 12:46 PM, Brian Jeltema <bdjeltema@gmail.com> wrote:
>
>>
>> You should really find out where your HMaster UI lives (there is a master UI
>> for every node provided by the Apache project) because it gives you
>> information on the state of your system,
>
> I’m familiar with the HMaster UI. I’m looking at it now. It does not contain
> the information you describe. There is a list of region servers and
> a menu bar that contains: Home, Table Details, Local Logs, Debug Dump,
> Metrics Dump, HBase Configuration
>
> If I click on the Table Details item, I get a list of the tables. If I click
> on a table, there is a Tasks section that says
> No tasks currently running on this node.
>
> The region server logs do not contain any records relating to RITs, or
> really even regions.
> The master UI does not contain any information about RITs.
> Version: HDP 2.2 -> HBase 0.98.4
>
> The ZooKeeper node /hbase-unsecure/region-in-transition contains a long list
> of items that are not removed when I restart the service. I think this is a
> side effect of problems I had when I did the HDP 2.1 -> HDP 2.2 upgrade,
> which did not go well.
>
> I would like to remove or clear the /hbase-unsecure/region-in-transition node
> as an experiment. I’m just looking for guidance on whether that is a safe
> thing to do.
>
> Brian
>
>> but if you want to skip all that, here are the instructions for
>> OfflineMetaRepair. Without knowing what is happening with your system
>> (logs, master UI info), you can try this, but at your own risk.
>>
>> OfflineMetaRepair
>> Description below:
>> This code is used to rebuild meta off line from file system data. If there
>> are any problem detected, it will fail suggesting actions for the user to do
>> to "fix" problems. If it succeeds, it will backup the previous hbase:meta and
>> -ROOT- dirs and write new tables in place.
>>
>> Stop HBase
>> zookeeper-client rmr /hbase
>> HADOOP_USER_NAME=hbase hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
>> Start HBase
>>
>> ^ This has worked for me in some situations where I understood HDFS and
>> ZooKeeper disagreed on region locations, but keep in mind I have tried this
>> on HBase 1.0.0 and your mileage may vary.
>>
>> We don't have your HBase version (you can even find this in the HBase shell)
>> We don't have log messages
>> We don't have the master's view of your RITs
>>
>>
>> On Tue, Dec 22, 2015 at 11:52 AM, Brian Jeltema <bdjeltema@gmail.com> wrote:
>>
>>> I’m running Ambari 2.0.2 and HDP 2.2. I don’t see any of this displayed at
>>> master:60010.
>>>
>>> I really think this problem is the result of cruft in ZooKeeper. Does
>>> anybody know if it’s safe to delete the node?
>>>
>>>
>>>> On Dec 22, 2015, at 11:40 AM, Geovanie Marquez <geovanie.marquez@gmail.com> wrote:
>>>>
>>>> Check hmaster:60010 under TASKS (between Software Attributes and Tables);
>>>> you will see if you have regions in transition. This will tell you which
>>>> regions are transitioning, and you can go to those region server logs and
>>>> check them. I've run into a couple of these, and every time they've told
>>>> me about their problem.
>>>>
>>>> Also, under Software Attributes you can check the HBase version.
>>>>
>>>> On Tue, Dec 22, 2015 at 11:29 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>>
>>>>> From RegionListTmpl.jamon:
>>>>>
>>>>> <%if (onlineRegions != null && onlineRegions.size() > 0) %>
>>>>> ...
>>>>> <%else>
>>>>>  <p>Not serving regions</p>
>>>>> </%if>
>>>>>
>>>>> The message means that there was no region online on the underlying
>>>>> server.
>>>>>
>>>>> FYI
>>>>>
>>>>> On Tue, Dec 22, 2015 at 7:18 AM, Brian Jeltema <bdjeltema@gmail.com> wrote:
>>>>>
>>>>>> Following up, if I look at the HBase Master UI in the Ambari console I
>>>>>> see links to all of the region servers. If I click on those links, the
>>>>>> Region Server page comes up and, in the Regions section, it displays
>>>>>> ‘Not serving regions’. I’m not sure if that means something is disabled,
>>>>>> or it just doesn’t have any regions to serve.
>>>>>>
>>>>>>> On Dec 22, 2015, at 6:19 AM, Brian Jeltema <bdjeltema@gmail.com> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> Can you pick a few regions stuck in transition and check related
>>>>>>>> region server logs to see why they couldn't be assigned?
>>>>>>>
>>>>>>> I don’t see anything in the region server logs relating to any regions.
>>>>>>>
>>>>>>>>
>>>>>>>> Which release were you using previously?
>>>>>>>
>>>>>>> HDP 2.1 -> HDP 2.2
>>>>>>>
>>>>>>> So is it safe to stop HBase and delete the ZK node?
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Mon, Dec 21, 2015 at 3:54 PM, Brian Jeltema <bdjeltema@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I am doing a cluster upgrade to the HDP 2.2 stack. For some reason,
>>>>>>>>> after the upgrade HBase cannot find any regions for existing tables.
>>>>>>>>> I believe the HDFS file system is OK. But looking at the ZooKeeper
>>>>>>>>> nodes, I noticed that many (maybe all) of the regions were listed in
>>>>>>>>> the ZooKeeper /hbase-unsecure/region-in-transition node. I suspect
>>>>>>>>> this could be causing a problem. Is it safe to stop HBase and delete
>>>>>>>>> that node?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Brian


--Apple-Mail=_3080E70F-C03F-48F4-BEF1-360055C25D05--