hbase-user mailing list archives

From: Stack <st...@duboce.net>
Subject: Re: hbck -fix
Date: Mon, 04 Jul 2011 17:09:24 GMT
On Sun, Jul 3, 2011 at 10:12 AM, Wayne <wav100@gmail.com> wrote:
> HBase needs to evolve a little more before organizations
> like ours can just "use it" without having to become experts.

I'd agree with this.  In its current state, at least a part-time,
seasoned operations engineer (per Andrew's description) is necessary
for a substantial production deploy.  I don't think that's an onerous
expectation for a critical piece of infrastructure.  It'd certainly
broaden our appeal though if we could get into the MySQL calibre of
ease-of-use....

That said, the issue you ran into, where an 'incident' made it so a
'smart' fellow was unable to reconstitute his store, needs addressing.
We'll work on this.

St.Ack


> I have to say the community behind HBase is fantastic and goes above and
> beyond to help greenies like ourselves be successful. With just a little
> more polish around the edges I think it can and will really
> become successful for a much wider audience. Thanks for everyone's help.
>
>
> On Sun, Jul 3, 2011 at 4:08 AM, Andrew Purtell <apurtell@apache.org> wrote:
>
>> I shorthanded this a bit:
>>
>> > Certainly a seasoned operations engineer would be a good investment
>> > for anyone.
>>
>>
>> Let's try instead:
>>
>> Certainly a seasoned operations engineer [with Java experience] would be a
>> good investment for anyone [running Hadoop based systems].
>>
>> I'm not sure what I wrote earlier adequately conveyed the thought.
>>
>>
>>   - Andy
>>
>>
>>
>>
>> > From: Andrew Purtell <apurtell@apache.org>
>> > To: "user@hbase.apache.org" <user@hbase.apache.org>
>> > Cc:
>> > Sent: Sunday, July 3, 2011 12:39 AM
>> > Subject: Re: hbck -fix
>> >
>> > Wayne,
>> >
>> > Did you by chance have your NameNode configured to write the edit log
>> > to only one disk, and in this case only the root volume of the NameNode
>> > host? As I'm sure you are now aware, the NameNode's edit log was
>> > corrupted, at least the tail of it anyway, when the volume upon which
>> > it was being written was filled by an errant process. The HDFS NameNode
>> > has a special critical role and it really must be treated with the
>> > utmost care. It can and should be configured to write the fsimage and
>> > edit log to multiple local dedicated disks. And, user processes should
>> > never run on it.
>> >
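For readers following along: the multi-directory setup Andrew describes
is controlled in hdfs-site.xml by the dfs.name.dir property
(dfs.namenode.name.dir on later Hadoop releases), which takes a
comma-separated list of directories the NameNode writes its fsimage and
edit log to in parallel. A minimal sketch; the mount points here are
examples only:

  <property>
    <name>dfs.name.dir</name>
    <!-- two dedicated local disks plus, commonly, one NFS mount -->
    <value>/disk1/hdfs/name,/disk2/hdfs/name,/mnt/nfs/hdfs/name</value>
  </property>
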
>> >
>> >>  Hope has long since flown out the window. I just changed my opinion
>> >>  of what it takes to manage hbase. A Java engineer is required on
>> >>  staff.
>> >
>> > Perhaps.
>> >
>> > Certainly a seasoned operations engineer would be a good investment
>> > for anyone.
>> >
>> >>  Having RF=3 in HDFS offers no insurance against hbase losing its
>> >>  shirt and .META. getting corrupted.
>> >
>> > This is a valid point. If HDFS loses track of blocks containing META
>> > table data due to fsimage corruption on the NameNode, having those
>> > blocks on 3 DataNodes is of no use.
>> >
>> >
>> > I've done exercises in the past like delete META on disk and recreate it
>> > with the earlier set of utilities (add_table.rb). This always "worked for
>> > me" when I've tried it.
>> >
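A note on add_table.rb: it ships under the HBase bin/ directory in this
era of releases and is run through the HBase JRuby runner; the invocation
is roughly as below, with /hbase/mytable standing in for the table's
directory on HDFS:

  # rebuild the .META. rows for a table from its region directories on HDFS
  # (the table path here is a made-up example)
  ./bin/hbase org.jruby.Main bin/add_table.rb /hbase/mytable
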
>> >
>> > Results from torture tests that HBase was subjected to in the
>> > timeframe leading up to 0.90 also resulted in better handling of
>> > .META. table related errors. They are fortunately demonstrably now
>> > rare.
>> >
>> >
>> > Clearly however there is room for further improvement here. I will
>> > work on https://issues.apache.org/jira/browse/HBASE-4058 and hopefully
>> > produce a unit test that fully exercises the ability of HBCK to
>> > reconstitute META and gives reliable results that can be incorporated
>> > into the test suite. My concern here is that getting repeatable
>> > results demonstrating HBCK weaknesses will be challenging.
>> >
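For anyone following along, the HBCK tool being discussed is run through
the hbase launcher; a plain run only reports, while -fix attempts the
repairs it knows how to make (which, as this thread shows, are still
limited in 0.90.x):

  # report inconsistencies between .META., HDFS and the deployed regions
  ./bin/hbase hbck

  # attempt automatic repair of the problems hbck can handle
  ./bin/hbase hbck -fix
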
>> >
>> > Best regards,
>> >
>> >
>> >        - Andy
>> >
>> > Problems worthy of attack prove their worth by hitting back. - Piet
>> > Hein (via Tom White)
>> >
>> >
>> > ----- Original Message -----
>> >>  From: Wayne <wav100@gmail.com>
>> >>  To: user@hbase.apache.org
>> >>  Cc:
>> >>  Sent: Saturday, July 2, 2011 9:55 AM
>> >>  Subject: Re: hbck -fix
>> >>
>> >>  It just returns a ton of errors (import: command not found). Our
>> >>  cluster is hosed anyway. I am waiting to get it completely
>> >>  re-installed from scratch. Hope has long since flown out the window.
>> >>  I just changed my opinion of what it takes to manage hbase. A Java
>> >>  engineer is required on staff. I also now realize a backup strategy
>> >>  is more important than for an RDBMS. Having RF=3 in HDFS offers no
>> >>  insurance against hbase losing its shirt and .META. getting
>> >>  corrupted. I think I just found the Achilles heel.
>> >>
>> >>
>> >>  On Sat, Jul 2, 2011 at 12:40 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> >>
>> >>>   Have you tried running check_meta.rb with --fix ?
>> >>>
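A note on running it: check_meta.rb, like the other Ruby tools shipped
with HBase, has to be launched through the HBase JRuby runner; invoking
it straight from the shell tends to produce errors like the "import:
command not found" Wayne mentions above. Assuming the script is present
in your release's bin/ directory, the invocation is roughly:

  # scan .META. for holes and, with --fix, try to plug them
  ./bin/hbase org.jruby.Main check_meta.rb --fix
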
>> >>>   On Sat, Jul 2, 2011 at 9:19 AM, Wayne <wav100@gmail.com> wrote:
>> >>>
>> >>>   > We are running 0.90.3. We were testing the table export, not
>> >>>   > realizing the data goes to the root drive and not HDFS. The
>> >>>   > export filled the master's root partition. The logger had issues
>> >>>   > and HDFS got corrupted ("java.io.IOException: Incorrect data
>> >>>   > format. logVersion is -18 but writables.length is 0"). We had to
>> >>>   > run hadoop fsck -move to fix the corrupted hdfs files. We were
>> >>>   > able to get hdfs running without issues but hbase ended up with
>> >>>   > the region issues.
>> >>>   >
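For context, the fsck step Wayne describes is the stock HDFS checker;
without options it only reports, and -move does not repair anything: it
moves files with missing blocks aside into /lost+found, so their data is
effectively gone. Roughly:

  # report on the health of everything under /
  hadoop fsck /

  # move corrupt files (those with missing blocks) into /lost+found
  hadoop fsck / -move
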
>> >>>   > We also had another issue with Ganglia making it worse. We had
>> >>>   > moved the Ganglia host to the master server and Ganglia took up
>> >>>   > so many resources that it actually caused timeouts talking to
>> >>>   > the master and most nodes ended up shutting down. I guess
>> >>>   > Ganglia is a pig in terms of resources...
>> >>>   >
>> >>>   > I just tried to manually edit the .META. table removing the
>> >>>   > remnants of the old table but the shell went haywire on me and
>> >>>   > turned to control characters..??...I ended up corrupting the
>> >>>   > whole thing and had to delete all tables...we have just not had
>> >>>   > a good week.
>> >>>   >
>> >>>   > I will add comments to HBASE-3695 in terms of suggestions.
>> >>>   >
>> >>>   > Thanks.
>> >>>   >
>> >>>   > On Fri, Jul 1, 2011 at 4:55 PM, Stack <stack@duboce.net> wrote:
>> >>>   >
>> >>>   > > What version of hbase are you on Wayne?
>> >>>   > >
>> >>>   > > On Fri, Jul 1, 2011 at 8:32 AM, Wayne <wav100@gmail.com> wrote:
>> >>>   > > > I ran the hbck command and found 14 inconsistencies. There
>> >>>   > > > were files in hdfs not used for region
>> >>>   > >
>> >>>   > > These are usually harmless.  Bad accounting on our part.  Need
>> >>>   > > to plug the hole.
>> >>>   > >
>> >>>   > > >, regions with the same start key, a hole in the region
>> >>>   > > > chain, and a missing start region with an empty key.
>> >>>   > >
>> >>>   > > These are pretty serious.
>> >>>   > >
>> >>>   > > How'd the master running out of root partition do this?  I'd
>> >>>   > > be interested to know.
>> >>>   > >
>> >>>   > > > We are not in production so we have the luxury to start
>> >>>   > > > again, but the damage to our confidence is severe. Is there
>> >>>   > > > work going on to improve hbck -fix to actually be able to
>> >>>   > > > resolve these types of issues? Do we need to expect, in
>> >>>   > > > order to run a production hbase cluster, to be able to move
>> >>>   > > > around and rebuild the region definitions and the .META.
>> >>>   > > > table by hand? Things just got a lot scarier fast for us,
>> >>>   > > > especially since we were hoping to go into production next
>> >>>   > > > month. Running out of disk space on the master's root
>> >>>   > > > partition can bring down the entire cluster? This is
>> >>>   > > > scary...
>> >>>   > > >
>> >>>   > >
>> >>>   > > Understood.
>> >>>   > >
>> >>>   > > St.Ack
>> >>>   > >
>> >>>   >
>> >>>
>> >>
>>
>>
>
