hbase-user mailing list archives

From feedly team <feedly...@gmail.com>
Subject Re: sporadic hbase "outages"
Date Tue, 29 Mar 2016 20:15:10 GMT
The region server stack traces taken during the mapper executions didn't
really show much activity, so we are now focusing on capturing a region
server stack trace during a slowdown under regular application activity.
That's still in progress, but we have captured a few traces where all the
IPC handler threads are blocked in HLog.sync().

I am not 100% convinced this is the issue, because the slow log write metric
is zero on all our servers. As I understand it, this metric is incremented
when a log write takes longer than 10 seconds. I assume I would need a stack
trace from the HDFS node to debug this further?


"IPC Server handler 4 on 60020" daemon prio=10 tid=0x00007fb87c133800 nid=0xa911 in Object.wait() [0x00007fb861860000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    at java.lang.Object.wait(Object.java:503)
    - locked <0x00000004ab3e5758> (a java.util.LinkedList)
    at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
    at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:995)
    at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.syncer(HLog.java:1361)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:1476)
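
To check whether "all the IPC threads are blocked" really holds across several dumps, something like the following could be used. This is only a minimal sketch in Python, not part of HBase: it assumes the dump is plain jstack text in which each thread starts with a line beginning with a double quote, and the function names (split_threads, handlers_blocked_in, summarize) are made up for illustration.

```python
from typing import Dict, List, Tuple


def split_threads(dump: str) -> Dict[str, List[str]]:
    """Map each quoted thread name to the non-empty lines that follow its header."""
    threads: Dict[str, List[str]] = {}
    current = None
    for line in dump.splitlines():
        if line.startswith('"'):
            # New thread header, e.g. "IPC Server handler 4 on 60020" daemon ...
            end = line.find('"', 1)
            current = line[1:end] if end > 0 else line[1:]
            threads[current] = []
        elif current is not None and line.strip():
            threads[current].append(line.strip())
    return threads


def handlers_blocked_in(dump: str, frame: str = "HLog.sync(") -> Dict[str, bool]:
    """For every IPC handler thread, report whether `frame` appears in its stack."""
    return {
        name: any(frame in entry for entry in stack)
        for name, stack in split_threads(dump).items()
        if name.startswith("IPC Server handler")
    }


def summarize(dump: str) -> Tuple[int, int]:
    """Return (handlers with the frame on their stack, total handlers seen)."""
    status = handlers_blocked_in(dump)
    return sum(status.values()), len(status)
```

Calling summarize() on the text of each captured dump gives a (blocked, total) pair, so dumps taken during a slowdown can be compared with dumps taken during normal operation.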

On Tue, Mar 22, 2016 at 11:07 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> bq. a small number will take 20 minutes or more
> Were these mappers performing selective scans on big regions?
> Can you pastebin the stack trace of the region server(s) which served such
> regions during the slow mapper operation?
> A pastebin of the region server log would also give us more clues.
>
> On Tue, Mar 22, 2016 at 10:57 AM, feedly team <feedlydev@gmail.com> wrote:
> > Recently we have been experiencing short downtimes (~2-5 minutes) in our
> > hbase cluster and are trying to understand why. Many times we have HLog
> > write spikes around the down times, but not always. Not sure if this is a
> > red herring.
> >
> > We have looked a bit farther back in time and have noticed many metrics
> > deteriorating over the past few months:
> >
> > The compaction queue size seems to be growing.
> >
> > The flushQueueSize and flushSizeAvgTime are growing.
> >
> > Some map reduce tasks run extremely slowly. Maybe 90% will complete
> > within a couple minutes, but a small number will take 20 minutes or
> > more. If I look at the slow mappers, there is a high value for the
> > MILLIS_BETWEEN_NEXTS counter (these mappers didn't run data local).
> >
> > We have seen application performance worsening; during slowdowns, threads
> > are usually blocked on hbase connection operations
> > (HConnectionManager$HConnectionImplementation.processBatch).
> >
> >
> > This is a bit puzzling as our data nodes' OS load values are really low.
> > In the past, we had performance issues when load got too high. The region
> > server log doesn't have anything interesting; the only messages we get
> > are a handful of responseTooSlow messages.
> >
> > Do these symptoms point to anything or is there something else we should
> > look at? We are (still) running 0.94.20. We are going to upgrade soon,
> > but we want to diagnose this issue first.
> >
