hbase-user mailing list archives

From "George P. Stathis" <gstat...@traackr.com>
Subject Re: 0.89 Regionserver outage behaviors
Date Fri, 22 Oct 2010 16:49:40 GMT
No, the master cannot recover after the node outage without a full cluster
restart; the whole system comes to a halt because of the missing
regionserver. But that seems to be an hbase-trx bug from what I can tell
(see the EOFException in the master log stacktrace here:
http://gist.github.com/640885). HDFS is CDH3 beta (version 0.20.2+320,
r9b72d268a0b590b4fd7d13aca17c1c453f8bc957) from June 29th.
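
For anyone following along, these are the two settings St.Ack pointed me
to. Here is a minimal sketch of overriding them through the client config
(the fully-qualified property names are my best reading of the 0.89-era
hbase-default.xml, and the values shown are just the documented defaults,
not anything we have tuned):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class WalSyncDefaults {
      public static void main(String[] args) {
        Configuration conf = new HBaseConfiguration();
        // Sync the WAL to HDFS after this many appended entries
        // (default 1, i.e. each individual edit is synced as it lands).
        conf.setInt("hbase.regionserver.flushlogentries", 1);
        // Failing that, a background flusher syncs at this interval in
        // milliseconds (default 1000, i.e. at most ~1s of edits at risk).
        conf.setLong("hbase.regionserver.optionallogflushinterval", 1000L);
        System.out.println(conf.get("hbase.regionserver.flushlogentries"));
      }
    }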

-GS
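
P.S. For completeness, the single edit in the test quoted below went in
through the plain Java client, essentially like this. This is a sketch
only; the table and column names are placeholders, not our real schema:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SingleEditTest {
      public static void main(String[] args) throws Exception {
        // 0.89-era client API; autoFlush is on by default, so put()
        // goes straight to the regionserver hosting the row's region.
        HTable table = new HTable(new HBaseConfiguration(), "testtable");
        Put put = new Put(Bytes.toBytes("row-1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"),
                Bytes.toBytes("v1"));
        table.put(put);
        // Explicit flush for good measure; after this we kill -9 the
        // regionserver that served the write.
        table.flushCommits();
      }
    }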

On Fri, Oct 22, 2010 at 11:23 AM, Stack <stack@duboce.net> wrote:

> On Thu, Oct 21, 2010 at 1:25 PM, George P. Stathis <gstathis@traackr.com>
> wrote:
> > Thanks St.Ack. So, looking at these configs, the default behavior should
> > be to sync each individual entry directly to HDFS, or at least sync to
> > HDFS every second. Right?
> >
>
> Yes.
>
> > I ran this quick test:
> >
> >   - 1 Master, 4 Regionservers, all 5 on separate EC2 instances, with
> >     the default flushlogentries and optionallogflushinterval settings
> >     on all boxes
> >   - Tailed all 4 regionserver logs while making a single edit through
> >     our application
> >   - Edit was submitted via the hbase Java client API, and I was able
> >     to see which regionserver was hit through the logs
> >   - Waited for about 30 seconds and then did a kill -9 on the
> >     regionserver process that received the edit
> >   - Restarted the hbase cluster (left DFS running)
> >   - Edit was lost
> >
> > I'll try some different config values to see if it makes a difference,
> > but I'm wondering if you agree this is a valid test. Disclaimer: I'm
> > using http://github.com/jameskennedy/hbase/tree/HLogSplit_0.89.20100726,
> > so there could be a bug in that fork for all I know. If that's the case,
> > then don't let me waste your time here; I'll have to look into it with
> > James.
> >
>
> That looks right. A single edit into a fresh cluster is an interesting
> test. Which HDFS version are you using? Do you see log recovery going on
> in the master after you kill the server (after its lease in zk expires),
> or on restart of the cluster?
>
> St.Ack
>
