hbase-user mailing list archives

From Wayne <wav...@gmail.com>
Subject Re: JVM OOM
Date Wed, 05 Jan 2011 17:13:52 GMT
It was carrying ~9k writes/sec and had been for the 24+ hours before the crash. There
are 500+ regions on that node. I could not find the heap dump (location?), but we do
have some errant big rows that have caused crashes before. When we query those big
rows, Thrift has been crashing. Maybe major compaction kicked in for those rows (see
the last log entry below)? Those rows have roughly 30 million columns; the cell values
are all small, but 30 million columns is definitely too much.
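A side note on the heap dump location: with a HotSpot JVM, a dump on OOME is only
written when -XX:+HeapDumpOnOutOfMemoryError is enabled, and unless -XX:HeapDumpPath
is set it lands in the process working directory as java_pid<pid>.hprof. A minimal
sketch of the two flags (the path below is just a placeholder, not our actual config):

        -XX:+HeapDumpOnOutOfMemoryError
        -XX:HeapDumpPath=/var/log/hbase/heapdumps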

Here are some errors from the Hadoop log. It looks like it kept getting
stuck on something, which may point to the data being too big. The error
below occurred 12 times in a row:

org.apache.hadoop.ipc.RemoteException: java.io.IOException:
blk_2176268114489978801_654636 is already commited, storedBlock == null.

Here is the corresponding entry from the HBase log:

15:26:44,946 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
java.io.IOException: Broken pipe
15:26:44,946 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
Aborting region server serverName=sacdnb08.dataraker.net,60020,1294089592450,
load=(requests=0, regions=552, usedHeap=7977, maxHeap=7987): Uncaught
exception in service thread regionserver60020.compactor
java.lang.OutOfMemoryError: Java heap space

Thanks.

On Wed, Jan 5, 2011 at 11:45 AM, Stack <stack@duboce.net> wrote:

> What was the server carrying?  How many regions?  What kind of
> loading was on the cluster?  We should not be OOME'ing.  Do you have
> the heap dump lying around?  (We dump heap on OOME... it's named *.hprof
> or something.  If you have it, want to put it somewhere for me to pull
> it so I can take a look?)  Any chance of errant big cells?  Lots of
> them?  What JVM version?
>
> St.Ack
>
> On Wed, Jan 5, 2011 at 8:10 AM, Wayne <wav100@gmail.com> wrote:
> > I am still struggling with the JVM. We just had a hard OOM crash of a
> > region server after only running for 36 hours. Any help would be greatly
> > appreciated. Do we need to restart nodes every 24 hours under load?  GC
> > pauses are something we are trying to plan for, but outright OOM crashes
> > are a new problem.
> >
> > The message below seems to be where it starts going bad. It is followed
> > by no less than 63 concurrent mode failure (CMF) errors over a 16-minute period.
> >
> > *GC locker: Trying a full collection because scavenge failed*
> >
> > Lastly, here is the tail end of the GC log (after the 63 CMF errors).
> >
> > Heap
> >  par new generation   total 1887488K, used 303212K [0x00000005fae00000, 0x000000067ae00000, 0x000000067ae00000)
> >  eden space 1677824K,  18% used [0x00000005fae00000, 0x000000060d61b078, 0x0000000661480000)
> >  from space 209664K,   0% used [0x000000066e140000, 0x000000066e140000, 0x000000067ae00000)
> >  to   space 209664K,   0% used [0x0000000661480000, 0x0000000661480000, 0x000000066e140000)
> >  concurrent mark-sweep generation total 6291456K, used 2440155K [0x000000067ae00000, 0x00000007fae00000, 0x00000007fae00000)
> >  concurrent-mark-sweep perm gen total 31704K, used 18999K [0x00000007fae00000, 0x00000007fccf6000, 0x0000000800000000)
> >
> > Here again are our custom settings, in case there are suggestions out
> > there. Are we making things worse with these settings? What should we
> > try next?
> >
> >        -XX:+UseCMSInitiatingOccupancyOnly
> >        -XX:CMSInitiatingOccupancyFraction=60
> >        -XX:+CMSParallelRemarkEnabled
> >        -XX:SurvivorRatio=8
> >        -XX:NewRatio=3
> >        -XX:MaxTenuringThreshold=1
> >
> >
> > Thanks!
> >
>
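For reference, one common way to wire up settings like those quoted above is via
HBASE_OPTS in conf/hbase-env.sh. The sketch below is an illustrative example rather
than the exact config from this thread: the -Xmx8g value is inferred from the ~8 GB
of generation space in the GC output above, -XX:+UseConcMarkSweepGC is implied by the
CMS output, and the heap dump and GC log paths are placeholders.

        # conf/hbase-env.sh -- illustrative sketch only; heap size and paths are assumptions
        export HBASE_OPTS="$HBASE_OPTS -Xmx8g -Xms8g \
            -XX:+UseConcMarkSweepGC \
            -XX:+UseCMSInitiatingOccupancyOnly \
            -XX:CMSInitiatingOccupancyFraction=60 \
            -XX:+CMSParallelRemarkEnabled \
            -XX:SurvivorRatio=8 \
            -XX:NewRatio=3 \
            -XX:MaxTenuringThreshold=1 \
            -XX:+HeapDumpOnOutOfMemoryError \
            -XX:HeapDumpPath=/var/log/hbase/heapdumps \
            -verbose:gc -XX:+PrintGCDetails -Xloggc:/var/log/hbase/gc-regionserver.log"

For what it's worth, the heap summary quoted above is consistent with the flags taking
effect: eden/survivor = 1677824K / 209664K ≈ 8 (SurvivorRatio=8), and
old/young = 6291456K / (1677824K + 2 * 209664K) = 6291456K / 2097152K = 3 (NewRatio=3),
for a total of about 8 GB, roughly matching the reported maxHeap of 7987 MB.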
