Subject: Re: Region gets stuck in transition state
From: Stack <saint.ack@gmail.com>
To: hbase-user@hadoop.apache.org
Date: Wed, 27 Jan 2010 14:51:48 -0800

On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari wrote:
>
> First we shut down the master and all region servers and then manually
> removed the /hbase root through hadoop/HDFS. One of my colleagues
> increased some timeout values (I think they were ZooKeeper timeouts).

ticktime?
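For context on the timeout change James describes: the ZooKeeper-related knobs
are typically the HBase session timeout and, when HBase manages its own
ZooKeeper, the tick time (presumably why Stack asks about it), since ZooKeeper
only grants sessions between 2x and 20x its tickTime. A minimal sketch of the
hbase-site.xml entries involved, with purely illustrative values:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>60000</value>
  </property>
  <property>
    <!-- only relevant when HBase manages ZooKeeper itself -->
    <name>hbase.zookeeper.property.tickTime</name>
    <value>3000</value>
  </property>

With the default tickTime of 2000 ms, sessions are capped at 40 seconds, so a
larger session timeout only takes effect if the tick time is raised alongside it.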
> Another change was that I recreated the table without LZO compression
> and without setting the IN_MEMORY flag. I learned that we did not have
> the LZO libraries installed, and the table had been created originally
> with compression set to LZO, so I imagine that would cause problems. I
> didn't see any errors about it in the logs, however. Maybe this
> explains why we lost data during our initial testing after shutting down
> HBase. Perhaps it was unable to write the data to HDFS because the LZO
> libraries were not available?

If LZO is enabled and the libs are not in place, no data is written, IIRC.
It's a problem.

> Anyway, everything seems to be ok for now. We can restart HBase without
> data loss or errors, and we can truncate the table without any problems.
> If any other issues crop up we plan on upgrading to 0.20.3, but our
> preference is to stay with the Cloudera distro if we can. We're doing
> additional testing tonight with a larger dataset, so I'll keep an eye on
> it and post back if we learn anything new.

Avoid truncating tables if you are not on 0.20.3. It's flaky and may put
you back in the spot you complained of originally.

St.Ack

> Thanks again for your help.
>
> -James
>
> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari wrote:
>> >
>> > After running a map/reduce job which inserted around 180,000 rows into
>> > HBase, HBase appeared to be fine. We could do a count on our table, and
>> > no errors were reported. We then tried to truncate the table in
>> > preparation for another test but were unable to do so because the region
>> > became stuck in a transition state.
>>
>> Yes. In older HBase, truncate of small tables was flaky. It's better in
>> 0.20.3 (I wrote our brothers over at Cloudera about updating the version
>> they bundle, especially since 0.20.3 just went out).
>>
>> > I restarted each region server
>> > individually, but it did not fix the problem. I tried the
>> > disable_region and close_region commands from the hbase shell, but that
>> > didn't work either. After doing all of that, a status 'detailed' showed
>> > this:
>> >
>> > 1 regionsInTransition
>> >     name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>> >
>> > Then I restarted the master and all region servers, and it looked like this:
>> >
>> > 1 regionsInTransition
>> >     name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>
>> Even after a master restart? The above is a dump of a master-internal
>> data structure that is kept in memory. Strange that it would pick up the
>> same exact state on restart (as Ryan says, a restart of the master alone
>> is usually a radical but sufficient fix).
>>
>> I was going to say that you could try onlining the individual region in
>> the shell, but I don't think that'll work either, not unless you update
>> to 0.20.3-era HBase.
>>
>> > I noticed messages in some of the region server logs indicating that
>> > their zookeeper sessions had expired. I'm not sure if this has anything
>> > to do with the problem.
>>
>> It could. The regionservers will restart if their session with ZooKeeper
>> expires. What's your HBase schema like? How are you doing your upload?
>>
>> > I should mention that this scenario is quite
>> > repeatable, and the last few times it has happened we had to shut down
>> > HBase and manually remove the /hbase root from HDFS, then start HBase
>> > and recreate the table.
>>
>> For sure you've upped the file descriptor and xceiver params as per the
>> Getting Started?
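For anyone hitting the same thing, the Getting Started settings Stack is
referring to are the open-file limit for the user running HDFS/HBase and the
datanode xceiver count. The values below are commonly used examples, not
prescribed numbers, and the "hadoop" user name is a placeholder:

  # /etc/security/limits.conf
  hadoop  -  nofile  32768

  <!-- hdfs-site.xml on each datanode; note the historical spelling -->
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

The xceiver change needs a datanode restart, and the new ulimit only applies
to processes started after it is in place.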
>> > I was also wondering whether it was normal for there to be only one
>> > region with 180,000+ rows. Shouldn't this region be split into several
>> > regions and distributed among the region servers? I'm new to HBase, so
>> > maybe my understanding of how it's supposed to work is wrong.
>>
>> Get the region's size on the filesystem: ./bin/hadoop fs -dus
>> /hbase/table/regionname. A region splits when it is above a size
>> threshold, 256M usually.
>>
>> St.Ack
>>
>> > Thanks,
>> > James
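A footnote on the split threshold Stack mentions: the 256M figure is the
default maximum store file size, and it is configurable. A minimal sketch of
the size check and the setting, with the region name below as a placeholder:

  # how big is the region on HDFS?
  ./bin/hadoop fs -dus /hbase/retargeting/<region-name>

  <!-- hbase-site.xml; 268435456 bytes (256 MB) is the default -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>268435456</value>
  </property>

With only around 180,000 rows it is quite plausible the region simply has not
reached that size yet, which would explain why it has not split.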