hbase-user mailing list archives

From Otis Gospodnetic <otis.gospodne...@gmail.com>
Subject Re: Is this a long GC pause, or something else?
Date Tue, 10 Jun 2014 22:31:28 GMT
Here are some graphs:
JVM GC: https://apps.sematext.com/spm-reports/s/mYKcXNXBMl
JVM threads: https://apps.sematext.com/spm-reports/s/eJAVT8TUoB (so you can
see threads just "disappear" for blocks of time)

Meanwhile, the server is not dead. Here is the CPU graph, which shows it is
neither 100% idle nor 100% busy:
https://apps.sematext.com/spm-reports/s/Ess9S9JnYF

We just noticed this today when we switched from OpenJDK to Oracle JVM
update 60.
This is actually from a cluster running on R3 instances on EC2.

These lockups come and go, as you can see, and appear on all nodes in the
cluster, just not at the same time.
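
One way to tell a whole-process stall from ordinary slowness is to run a
tiny watchdog inside the JVM: sleep for a short, fixed interval and measure
how much longer than requested the sleep actually took. The following is
only a minimal sketch of that idea, not what SPM or HBase does; the class
name PauseDetector, its method names, and the thresholds in the usage note
are all hypothetical and chosen for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog: reports when a short sleep overshoots badly. */
public class PauseDetector {

    /** Extra time beyond the requested sleep, in milliseconds; never negative. */
    static long overshootMillis(long startNanos, long endNanos, long sleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - sleepMs);
    }

    /**
     * Sleeps sleepMs at a time for the given number of iterations and
     * returns how many sleeps overshot by more than warnMs.
     */
    static int monitor(long sleepMs, long warnMs, int iterations)
            throws InterruptedException {
        int stalls = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMs);
            long extra = overshootMillis(start, System.nanoTime(), sleepMs);
            if (extra > warnMs) {
                stalls++;
                // The stall could be GC, a frozen VM, swapping, or something else.
                System.err.println("Stall detected: about " + extra + " ms");
            }
        }
        return stalls;
    }
}
```

For example, calling PauseDetector.monitor(500, 1000, Integer.MAX_VALUE)
from a daemon thread would report any stall longer than about a second.
A reported stall only says that the thread did not get to run; it does not
say why. Comparing the timestamps with GC logs (for instance with
-XX:+PrintGCDetails and -XX:+PrintGCApplicationStoppedTime enabled) should
help separate a stop-the-world collection from a pause caused outside the
JVM, such as the hypervisor.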

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Jun 10, 2014 at 5:43 PM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Does it repeat?
> We are seeing this with the Oracle JVM u60 too! SPM shows the whole JVM
> blocking for about 16 minutes every M minutes.
>
> Otis
>
>
>
> > On Jun 10, 2014, at 2:05 PM, Tom Brown <tombrown52@gmail.com> wrote:
> >
> > Last night a regionserver in my cluster stopped responding in a timely
> > manner for about 20 minutes. I know that stop-the-world GC can cause this
> > type of behavior, but 20 minutes seems excessive.
> >
> > The server is a 2-core VM with 16GB of RAM (HBase max heap is 12GB). We
> > are using the latest Java 7 from Oracle. HDFS is provided by an Isilon
> > cluster.
> >
> > The server workload is read/write: the writing process reads all rows it
> > is about to write, updates them if they exist, and then writes all the
> > rows (replacing ones that were updated).
> >
> > The last messages before the pause were regarding an HLog roll:
> >
> > DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
> > INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support
> > getDefaultReplication
> > INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support
> > getDefaultBlockSize
> >
> > During the next 20 minutes there were a handful of sporadic LruBlockCache
> > stats messages but nothing else. After 20 minutes, normal operation
> > resumed.
> >
> > Is 20 minutes for a GC pause expected given the operational load and
> > machine specs? Could a GC pause include periodic log messages? If it
> > wasn't a GC pause, what else could it be?
> >
> > --Tom
>
