hbase-user mailing list archives

From James Baldassari <jbaldass...@gmail.com>
Subject Re: Region server request throughput drops to zero
Date Wed, 06 Oct 2010 14:55:41 GMT
After removing IHBase, the cluster has been stable for 24 hours with no
issues whatsoever.

So now that I can't use IHBase's filtered index scan, I was hoping someone
could clear up a question I have about the standard scan filters.  I have a
column for ICVs, and I'd like to do a full table scan filtering out all rows
in which the long value of that ICV column is less than a fixed threshold.
It looks like SingleColumnValueFilter will do what I want, but the docs seem
to indicate that the default BinaryComparator won't be sufficient for
comparing ICV (long) values:

"If this is not sufficient (eg you want to deserialize a long and then
compare it to a fixed long value), then you can pass in your own comparator
instead"

That's exactly what I want to do (deserialize a long and compare it to a
fixed long value).  Has this been done already?  I don't see any
WritableByteArrayComparable subclasses for comparing longs.  If there is no
existing comparator for this, and I have to write my own, that means that
I'll have to push a jar with that comparator out to the region servers,
right?  I'd really like to avoid that and stick with the included
comparators if possible.
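
For reference, here's roughly what I think that custom comparator would
look like if I end up writing it myself.  This is just an untested sketch
against the 0.20 filter API, and LongComparator is my own name for it, not
an existing HBase class:

    import org.apache.hadoop.hbase.filter.WritableByteArrayComparable;
    import org.apache.hadoop.hbase.util.Bytes;

    // Deserializes the cell value as a long and compares it to a fixed
    // threshold, rather than comparing raw bytes lexicographically the
    // way BinaryComparator does.
    public class LongComparator extends WritableByteArrayComparable {

        // Nullary constructor required so the region server can
        // deserialize the comparator via Writable.
        public LongComparator() {
            super();
        }

        public LongComparator(long threshold) {
            super(Bytes.toBytes(threshold));
        }

        public int compareTo(byte[] value) {
            // Same orientation as BinaryComparator: compare this
            // comparator's stored value against the cell's value.
            long threshold = Bytes.toLong(getValue());
            long actual = Bytes.toLong(value);
            return threshold < actual ? -1 : (threshold > actual ? 1 : 0);
        }
    }

It would then plug into the scan along these lines (the family and
qualifier names are made up, and "threshold" is whatever fixed long I'm
comparing against):

    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("stats"),    // column family holding the ICV column
        Bytes.toBytes("counter"),  // qualifier of the ICV column
        CompareFilter.CompareOp.GREATER_OR_EQUAL,
        new LongComparator(threshold)));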

Thanks,
James


On Mon, Oct 4, 2010 at 2:34 PM, James Baldassari <jbaldassari@gmail.com> wrote:

> Hey Stack.  Here's the region server log from this morning's crash:
> http://pastebin.com/b7cEUT3U
>
> Not much happening there.  I also found the log from last night's crash,
> which appears to be more interesting: http://pastebin.com/8VqpUYSV
>
> It looks like it's having some problems doing ICVs, and there was this
> weird error:
>
> 2010-10-04 00:25:59,876 WARN org.apache.hadoop.hbase.regionserver.Store:
> Failed open of
> hdfs://rts-nn01.sldc.[domain].net:50001/hbase/users/1958649137/data/5261945116444723281;
> presumption is that file was corrupted at flush and lost edits picked up by
> commit log replay. Verify!
> java.io.IOException: Trailer 'header' is wrong; does the trailer size match
> content?
>
> You can see that I had to kill the RS and restart it near the end of the
> snippet.  I wonder if this problem has anything to do with IHBase because an
> index scan was running around the time of the crash.  Everything was stable
> before our release about a week ago, which included introducing IHBase.  We
> also added a couple new region servers and a new client app, so that wasn't
> the only change.  Still, I think I might try removing IHBase temporarily to
> see if that improves things.
>
> -James
>
>
>
> On Mon, Oct 4, 2010 at 1:26 PM, Stack <stack@duboce.net> wrote:
>
>> And a log snippet from the regionserver at that time would help James...
>> thanks.
>> St.Ack
>>
>> On Mon, Oct 4, 2010 at 8:53 AM, James Baldassari <jbaldassari@gmail.com>
>> wrote:
>> > It happened again this morning, and this time I have full jstacks.  I
>> > didn't realize jstack had to be run as the same user that owns the
>> > process.
>> >
>> > Here's one of the region servers: http://pastebin.com/VeWXDQcu
>> > And the master: http://pastebin.com/pk1eAszJ
>> >
>> > These seem to indicate that most threads are waiting on take(), which I
>> > guess means they're idle waiting for requests to come in?  That sounds
>> > strange to me because I know the clients are trying to send requests.
>> >
>> > -James
>> >
>> >
>> > On Mon, Oct 4, 2010 at 10:18 AM, James Baldassari <jbaldassari@gmail.com>
>> > wrote:
>> >
>> >> Thanks for the tip, Ryan.  The cluster got into that weird state again
>> >> last night, and I tried to jstack everything.  I did have some trouble,
>> >> though.  It only worked with the -F flag, and even then I couldn't get
>> >> any stack traces.  According to the docs, the fact that I needed to use
>> >> -F means that the JVM was hung for some reason.  I'm not really sure
>> >> what could cause that.  Like I mentioned before, I don't see any long
>> >> GC pauses in the logs.
>> >>
>> >> Here is the jstack output I was able to get for one of the region
>> >> servers:
>> >> http://pastebin.com/A9W1ti5S
>> >> And the master: http://pastebin.com/jb2cvmFC
>> >>
>> >> Both indicate that all the threads are blocked except one.  I also got
>> >> a thread dump on a couple of the region servers.  Here's one:
>> >> http://pastebin.com/KkWcY5mf
>> >>
>> >> It looks like most of the threads are blocked in
>> >> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get or
>> >> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.release.  Is
>> >> that normal?
>> >>
>> >> Thanks,
>> >> James
>> >>
>> >>
>> >>
>> >> On Sun, Oct 3, 2010 at 11:55 PM, Ryan Rawson <ryanobjc@gmail.com>
>> >> wrote:
>> >>
>> >>> During the event try jstack'ing the affected regionservers. That is
>> >>> usually extremely illuminating.
>> >>>
>> >>> On Oct 3, 2010 8:06 PM, "James Baldassari" <jbaldassari@gmail.com>
>> >>> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > We've been having a strange problem with our HBase cluster recently
>> >>> > (0.20.5 + HBASE-2599 + IHBase-0.20.5). Everything will be working
>> >>> > fine, doing mostly gets at 5-10k/sec and an hourly bulk insert
>> >>> > (using HTable puts) that can spike the total throughput up to 15-50k
>> >>> > ops/sec, but at some point the cluster gets into this state where
>> >>> > the request throughput (gets and puts) drops to zero across 5 of our
>> >>> > 6 region servers. Restarting the whole cluster is the only way to
>> >>> > fix the problem, but it gets back into that bad state again after
>> >>> > 4-12 hours.
>> >>> >
>> >>> > Nothing in the region server or master logs indicates any errors
>> >>> > except occasional DFS client timeouts. The logs look exactly like
>> >>> > they do during normal operation, even with debug logging on. I have
>> >>> > GC logging on as well, and there are no long GC pauses (the region
>> >>> > servers have 11G of heap). When the request rate drops the load is
>> >>> > low on the region servers, there is little to no I/O wait, and there
>> >>> > are no messages in the region server logs indicating that the region
>> >>> > servers are busy doing anything like a compaction. It seems like the
>> >>> > region servers just decided to stop processing requests. We have
>> >>> > three different client applications sending requests to HBase, and
>> >>> > they all drop to zero requests/second at the same time, so I don't
>> >>> > think it's an issue on the client side. There are no errors in our
>> >>> > client logs either.
>> >>> >
>> >>> > Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W
>> >>> >
>> >>> > Any ideas what could be causing the cluster to freeze up? I guess my
>> >>> > next plan is to get thread dumps on the region servers and the
>> >>> > clients the next time it happens. Is there somewhere else I should
>> >>> > look other than the master and region server logs?
>> >>> >
>> >>> > Thanks,
>> >>> > James
>> >>>
>> >>
>> >>
>> >
>>
>
>
