hbase-dev mailing list archives

From N Dm <nid...@gmail.com>
Subject [Question] better way to deal with Out of Memory on Region Server?
Date Fri, 26 Apr 2013 20:07:13 GMT
Hi folks,

I'm pretty sure this question has been discussed a few times before and
addressed to some degree. I am wondering whether there is an active JIRA
or a best practice to improve this. I'd appreciate a few pointers.

Currently, if a Region Server runs out of memory, checkOOM() is called
and the Region Server is killed to protect the Master.

For example, assuming each row of 'usertable' is ~1K and HBASE_HEAPSIZE is
1GB (the default):
@hbase shell> count 'usertable', INTERVAL=>2000000,CACHE =>1000000
This count will bring down one of the Region Servers.

The above problem can be fixed by either using a smaller CACHE or
increasing HEAPSIZE. A 1GB heap is small, and a 1M-row cache is rather
large anyway, so this particular example doesn't concern me too much, and
the Region Server can be restarted within a minute.
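
To put rough numbers on this example, here is a back-of-the-envelope
sketch in Java. It assumes a batch of cached rows costs about rows x row
size and ignores per-row overhead, so the real footprint will be higher.
The class and method names are made up for illustration; they are not
HBase APIs.

```java
// Rough estimate only: per-row overhead (object headers, key/value
// metadata, RPC buffers) is ignored, so real usage is higher than this.
final class ScanCacheEstimate {

    /** Bytes needed to hold one batch of cached rows, ignoring overhead. */
    static long batchBytes(long cachedRows, long bytesPerRow) {
        return Math.multiplyExact(cachedRows, bytesPerRow);
    }

    /** Largest row cache whose batch fits in the given fraction of the heap. */
    static long maxCachedRows(long heapBytes, long bytesPerRow, double heapFraction) {
        return (long) (heapBytes * heapFraction) / bytesPerRow;
    }

    public static void main(String[] args) {
        long heap = 1L << 30;     // HBASE_HEAPSIZE of 1GB
        long rowSize = 1024;      // ~1K per row of 'usertable'
        long cache = 1_000_000;   // CACHE => 1000000

        long batch = batchBytes(cache, rowSize);
        System.out.println("one batch:  " + batch + " bytes");
        System.out.println("heap:       " + heap + " bytes");
        // Hypothetical budget: let one scan batch use at most 10% of the heap.
        System.out.println("rows at 10% of heap: " + maxCachedRows(heap, rowSize, 0.10));
    }
}
```

With these assumptions a single batch is about 1,024,000,000 bytes
against a heap of 1,073,741,824 bytes, so one count request alone takes
nearly the whole heap.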

What worries me is this example:
1) A production system with 20 RegionServers, each with a reasonable
heap size (8~16GB). Increasing the heap dynamically is not a good idea
without new physical memory.
2) A few hundred client threads, each running a reasonable application,
but together adding up to a large amount of requested memory. At some
point the HEAPSIZE limit is reached on one of the RegionServers, which
brings it down. This is not too bad, as we still have 19 up. However, the
problem is that the clients can (and most likely will) resubmit their
jobs, just as I can resubmit the count command with two keystrokes, which
brings down the next RegionServer.
In this case, I can't stop client requests, and I can't add new hardware
immediately (at least not within minutes). The only thing I can do is
watch the whole cluster be brought down by the domino effect.

With that, I am wondering:
1) Is there an active item to prevent the first RegionServer from going
down? For example, using 90% of HEAPSIZE as a threshold?
2) Or is there a way to prevent clients from resubmitting jobs if the
system is unhealthy? For example, queueing the jobs if a few
RegionServers are down?
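
To make the 90% idea in 1) concrete, here is a minimal sketch of the kind
of check I have in mind, using only the standard JVM MemoryMXBean. It is
not how HBase handles this today; HeapGuard and its methods are
hypothetical names, and a real version would also have to cope with
garbage that has not been collected yet.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical guard: refuse new work once heap usage crosses a threshold,
// instead of letting the JVM run into OutOfMemoryError.
final class HeapGuard {

    /** True if used/max is at or above the threshold (e.g. 0.90). */
    static boolean overThreshold(long usedBytes, long maxBytes, double threshold) {
        if (maxBytes <= 0) {
            return false; // max heap is undefined (-1), so we cannot judge
        }
        return (double) usedBytes / maxBytes >= threshold;
    }

    /** Applies the check to the current JVM's heap. */
    static boolean heapOverThreshold(double threshold) {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return overThreshold(heap.getUsed(), heap.getMax(), threshold);
    }

    public static void main(String[] args) {
        // A server could call this before accepting a large request.
        System.out.println(heapOverThreshold(0.90) ? "reject or queue" : "accept");
    }
}
```

Note that getUsed() includes unreclaimed garbage, so a check like this
would tend to trip early; I am assuming that is acceptable for a safety
valve.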

I was able to find some discussions from 2009 and 2011 in the email
archive. Is there anything active or new? I am new to this community and
would really appreciate any input.


