hbase-user mailing list archives

From Bradford Stephens <bradfordsteph...@gmail.com>
Subject Re: Story of my HBase Bugs / Feature Suggestions
Date Wed, 26 Aug 2009 02:14:06 GMT
As a side note, we've been beating on RC2 for a week solid, and it's very
stable. We're really only limited by our RAM and GC, now :)

On Sat, Aug 22, 2009 at 6:59 AM, Andrew Purtell <apurtell@apache.org> wrote:

> Jon,
>
> Cool. I suspected as much. I'm really glad to see those bugs were found and
> fixed...
>
>   - Andy
>
>
>
>
> ________________________________
> From: Jonathan Gray <jlist@streamy.com>
> To: hbase-user@hadoop.apache.org
> Sent: Saturday, August 22, 2009 12:24:51 AM
> Subject: Re: Story of my HBase Bugs / Feature Suggestions
>
> Andy,
>
> Bradford ran his imports when there was both a Scanner bug related to
> snapshotting that opened up a race condition, as well as the nasty bugs in
> getClosestBefore used to look things up in META.
>
> It was most likely a combination of both of these things making for some
> rather nasty behavior.
>
> JG
>
> Andrew Purtell wrote:
> > There are plans to host live region assignments in ZK and keep only an
> > up-to-date copy of this state in META for use on cold boot. This is on
> > the roadmap for 0.21 but perhaps could be considered for 0.20.1 also.
> > This may help here.
> > A TM development group saw the same behavior on a 0.19 cluster. We
> > postponed looking at this because 0.20 has a significant rewrite of
> > region assignment. However, it is interesting to hear such a similar
> > description. I worry the underlying cause may be scanners getting stale
> > data on the RS, which would be a more pervasive problem, as opposed to
> > some master problem which could be solved by the above. Bradford, any
> > chance you kept around logs or similar which may provide clues?
> >
> >    - Andy
> >
> >
> >
> >
> > ________________________________
> > From: Bradford Stephens <bradfordstephens@gmail.com>
> > To: hbase-user@hadoop.apache.org
> > Sent: Friday, August 21, 2009 6:48:17 AM
> > Subject: Story of my HBase Bugs / Feature Suggestions
> >
> > Hey there,
> >
> > I'm sending out this summary of how I diagnosed what was wrong with my
> > cluster in hopes that you can glean some knowledge/suggestions from it :)
> > Thanks for the diagnostic footwork.
> >
> > A few days ago, I noticed that simple MR jobs I was running against
> > .20-RC2 were failing. Scanners were reaching the end of a region, and
> > then simply freezing. The only indication I had of this was the Mapper
> > timing out after 1000 seconds -- there were no error messages in the
> > logs for either Hadoop or HBase.
> >
> > It turns out that my table was corrupt:
> >
> > 1. Doing a 'GET' from the shell on a row near the end of a region
> > resulted in an error: "Row not in expected region", or something to
> > that effect. It re-appeared several times, and I never got the row
> > content.
> > 2. What the Master UI indicated for the region distribution was totally
> > different from what the RS reported. Row key ranges were on different
> > servers than the UI knew about, and the nodes reported different start
> > and end keys for a region than the UI.
> >
> > I'm not sure how this arose: I noticed after a heavy insert job that
> > when we tried to shut down our cluster, it took 30 dots and more -- so
> > we manually killed master. Would that lead to corruption?
> >
> > I finally resolved the problem by dropping the table and re-loading the
> > data.
> >
> > A few suggestions going forward:
> > 1. More useful scanner error messages: GET reported that there was a
> > problem finding a certain row, so why couldn't Scanner? There wasn't
> > even a timeout or anything -- it just sat there.
> > 2. A fsck / restore would be useful for HBase. I imagine you can recreate
> > .META. using .regioninfo and scanning blocks out of HDFS. This would play
> > nice with the HBase bulk loader story, I suppose.
> >
> > I'll be happy to work on these in my spare time, if I ever get any ;)
> >
> > Cheers,
> > Bradford
> >
>
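
The mismatch described in the quoted message (the Master UI and the region
servers disagreeing about which server holds a region, and about its start
and end keys) is the kind of thing an fsck-style check could flag. Below is a
minimal sketch in Python, not HBase code: the Region tuple, check_chain, and
diff_views are hypothetical names, and the two region lists are assumed to
have been collected separately from the master's view and from the region
servers. It only illustrates the consistency rules (key ranges should chain
without holes or overlaps, and both views should agree); it does not read
.META. or .regioninfo.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """One region as seen by a single observer (hypothetical structure)."""
    name: str
    start: bytes   # inclusive; b"" means "from the start of the table"
    end: bytes     # exclusive; b"" means "to the end of the table"
    server: str


def check_chain(regions: List[Region]) -> List[str]:
    """Report holes and overlaps in a table's region key ranges."""
    if not regions:
        return ["no regions"]
    problems = []
    ordered = sorted(regions, key=lambda r: r.start)
    if ordered[0].start != b"":
        problems.append(f"{ordered[0].name} does not start at the table start")
    for prev, cur in zip(ordered, ordered[1:]):
        # An empty end key means "open ended", so anything after it overlaps.
        if prev.end == b"" or prev.end > cur.start:
            problems.append(f"overlap between {prev.name} and {cur.name}")
        elif prev.end < cur.start:
            problems.append(f"hole between {prev.name} and {cur.name}")
    if ordered[-1].end != b"":
        problems.append(f"{ordered[-1].name} does not reach the table end")
    return problems


def diff_views(master: List[Region], servers: List[Region]) -> List[str]:
    """Report regions where the master's view and the servers' view differ."""
    problems = []
    by_name = {r.name: r for r in servers}
    for m in master:
        s = by_name.get(m.name)
        if s is None:
            problems.append(f"{m.name}: known to master, not reported by any server")
        elif (m.start, m.end, m.server) != (s.start, s.end, s.server):
            problems.append(f"{m.name}: master and server disagree")
    master_names = {r.name for r in master}
    for s in servers:
        if s.name not in master_names:
            problems.append(f"{s.name}: reported by {s.server}, unknown to master")
    return problems
```

Running check_chain on each view and diff_views across the two would have
surfaced both symptoms from the original report without waiting for a
scanner to hang at a region boundary.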



-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social Media,
and Computer Science
