hbase-user mailing list archives

From Bharath Vissapragada <bhara...@cloudera.com>
Subject Re: how to tell there is a OOM in regionserver
Date Tue, 02 Dec 2014 05:59:54 GMT
I agree with Otis' response. Adding a few more details: there is a ".out"
file in the logs/ directory, which is the stdout for each of these daemons,
and in case of an OOM crash it prints something like this:

# java.lang.OutOfMemoryError: Java heap space

# -XX:OnOutOfMemoryError="kill -9 %p"

#   Executing /bin/sh -c "kill -9 <pid>"...
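
A minimal sketch of automating that check, assuming Python is available on the node. It scans the daemons' ".out" files for the JVM's OutOfMemoryError line shown above. The function names and the "*.out" glob are illustrative assumptions, not part of HBase; adjust the directory to wherever your logs/ directory lives.

```python
import re
from pathlib import Path

# Matches the JVM line from the sample output above, e.g.
# "# java.lang.OutOfMemoryError: Java heap space".
# The -XX:OnOutOfMemoryError option line is deliberately not matched.
OOM_PATTERN = re.compile(r"java\.lang\.OutOfMemoryError(?::\s*(?P<reason>.*))?")


def find_oom_lines(text):
    """Return (line_number, reason) for each line reporting an OutOfMemoryError."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = OOM_PATTERN.search(line)
        if match:
            hits.append((number, (match.group("reason") or "").strip()))
    return hits


def scan_log_dir(log_dir):
    """Scan every *.out file in log_dir; return {path: hits} for files with hits.

    Assumption: the daemon stdout files end in ".out", as described above.
    """
    results = {}
    for path in sorted(Path(log_dir).glob("*.out")):
        hits = find_oom_lines(path.read_text(errors="replace"))
        if hits:
            results[str(path)] = hits
    return results
```

Calling scan_log_dir("logs") on a node would return the files that recorded an OOM along with the reason the JVM printed, for example "Java heap space". An empty result does not rule out other causes of death, so this complements rather than replaces the monitoring Otis describes below.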



On Tue, Dec 2, 2014 at 11:06 AM, Otis Gospodnetic <
otis.gospodnetic@gmail.com> wrote:

> Hi Ming,
>
> 1) There typically is an OOM message from the JVM itself
>
> 2) I would monitor the server instead of relying on log messages mentioning
> OOMs.  For example, in SPM <http://sematext.com/spm/> we have "heartbeat
> alerts" that tell us when we stop hearing from RegionServers and other
> types of servers.  It also helps when servers simply die for reasons other
> than OOM.
>
> 3) You could (should?) monitor individual memory pools and possibly set
> alerts or anomaly detection on those.  With that in place, if there was an
> OOM, you will typically see one of the memory pools approach 100%
> utilization.  I personally really like this report in SPM because it gives
> a bit more insight than just "heap size/utilization".  So I'd point the
> admin to this sort of monitoring report.
>
> 4) High GC counts/time, or a jump in those metrics, typically followed by a
> jump in CPU usage, is what often precedes OOMs.
>
> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Tue, Dec 2, 2014 at 12:22 AM, Liu, Ming (HPIT-GADSC) <ming.liu2@hp.com>
> wrote:
>
> > Hi, all,
> >
> > Recently, one of our HBase 0.98.5 instances ran into an issue: when running
> > some specific workload, all region servers suddenly shut down at the same
> > time, but the master keeps running. When I check the master log, I can see
> > messages like
> > 2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager: Added=n008.cluster,60020,1417413986550 to dead servers, submitted shutdown handler to be executed meta=false
> > And on n008, the regionserver log file has no ERROR message; the last
> > log entry looks very much like a ZooKeeper startup message. The log just
> > stopped at that ZooKeeper startup message, and the Region Server process
> > was gone when we checked with 'jps'.
> >
> > We then increased the heap size of the regionserver, and it works fine.
> > RegionServers no longer disappear. So we suspect there was an Out Of Memory
> > issue that got the region server processes killed. But my questions are:
> >
> > 1.       What log message indicates that there was an OOM? Since the region
> > server is killed with 'kill -9', I think no message can tell us this.
> >
> > 2.       If there is no typical log message about OOM, then how can an
> > admin confirm that a region server OOM happened? We can only guess, but
> > cannot be sure. We hope there is a method to tell for sure that an OOM
> > occurred.
> >
> > 3.       Does the ZooKeeper message appear every time a RegionServer hits
> > an OOM (if it is an OOM)? Or is it just a random event in our system?
> >
> > So in sum, I want to know the typical clue that lets people confirm
> > there is an OOM issue in an HBase region server.
> >
> > Thank you,
> > Ming
> >
>



-- 
Bharath Vissapragada
<http://www.cloudera.com>
