hbase-user mailing list archives

From Stuart Smith <stu24m...@yahoo.com>
Subject Re: Avoiding OutOfMemory Java heap space in region servers
Date Thu, 12 Aug 2010 18:56:12 GMT

Hello Stack,

> Rather than let the damaged server continue, HBase is conservative

Ah. I see. That does make sense.

> Are you doing large multiputs?  Do you have lots of handlers running?

Actually this was an M/R task doing lots of reads. But I do have automation (a standalone Java
HBase client, not M/R) that runs every hour doing lots of puts. I think the two could have
overlapped and caused issues.

> What size heap are you running with?

HBase has 4 GB, Hadoop has 2 GB (on the regionserver/datanode/tasktracker computers).

What I actually ended up doing was catching the OOMEs in my M/R tasks and looking at the
cell size. One of the cells was 500 MB :|. So that was bad. I've taken to avoiding large cells
in the M/R task, and things have smoothed out.
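
Roughly, the reading mapper now looks something like this. Just a sketch: the
family/qualifier names, the 64 MB cap, and the processCell() stub are stand-ins
for what the real job does.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class CellSizeAwareMapper extends TableMapper<NullWritable, NullWritable> {

  // Placeholders: the real job uses its own family/qualifier and cap.
  private static final byte[] FAMILY = Bytes.toBytes("content");
  private static final byte[] QUALIFIER = Bytes.toBytes("data");
  private static final int MAX_CELL_BYTES = 64 * 1024 * 1024; // 64 MB

  @Override
  protected void map(ImmutableBytesWritable row, Result result, Context context)
      throws IOException, InterruptedException {
    byte[] value = result.getValue(FAMILY, QUALIFIER);
    if (value == null || value.length > MAX_CELL_BYTES) {
      // Don't even try to process oversized cells; just count and skip them.
      context.getCounter("cells", "skipped-oversized").increment(1);
      return;
    }
    try {
      processCell(row.get(), value);
    } catch (OutOfMemoryError oome) {
      // This is how the 500 MB cell turned up: log the row key and value size,
      // then rethrow so the task still fails visibly.
      System.err.println("OOME on row " + Bytes.toStringBinary(row.get())
          + ", value bytes = " + value.length);
      throw oome;
    }
  }

  private void processCell(byte[] rowKey, byte[] value) {
    // whatever the task actually does with the cell goes here
  }
}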

It looks like I should just be a little more circumspect with how much data I cram into a cell.
Mostly I limit cells to 64 MB, but for one particular task I limited it to 512 MB... and I'm
getting a decent amount of data now, so inevitably I hit the limit...
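
And on the write side (the hourly loader), the guard is basically just a length
check before the Put. Again only a sketch, with made-up names:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class GuardedPut {

  // Same made-up family/qualifier and cap as in the mapper sketch above.
  private static final byte[] FAMILY = Bytes.toBytes("content");
  private static final byte[] QUALIFIER = Bytes.toBytes("data");
  private static final int MAX_CELL_BYTES = 64 * 1024 * 1024; // 64 MB

  public static void store(HTable table, byte[] rowKey, byte[] data)
      throws IOException {
    if (data.length > MAX_CELL_BYTES) {
      // Too big for one cell: chunk it or park it somewhere else (e.g. HDFS)
      // rather than let it blow up a region server handler later.
      System.err.println("Skipping oversized value for row "
          + Bytes.toStringBinary(rowKey) + ": " + data.length + " bytes");
      return;
    }
    Put put = new Put(rowKey);
    put.add(FAMILY, QUALIFIER, data);
    table.put(put);
  }
}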

Thanks!

Take care,
  -stu




--- On Tue, 8/10/10, Stack <stack@duboce.net> wrote:

> From: Stack <stack@duboce.net>
> Subject: Re: Avoiding OutOfMemory Java heap space in region servers
> To: user@hbase.apache.org
> Date: Tuesday, August 10, 2010, 6:40 PM
> OOME may manifest in one place but be caused by some other behavior
> altogether.  It's an Error.  You can't tell for sure what damage it's
> done to the running process (though, in your stack trace, an OOME
> during the array copy could likely be because of very large cells).
> Rather than let the damaged server continue, HBase is conservative and
> shuts itself down to minimize possible data loss whenever it gets an
> OOME (it has kept aside an emergency memory supply that it releases on
> OOME so the shutdown can 'complete' successfully).
> 
> Are you doing large multiputs?  Do you have lots of handlers running?
> If the multiputs are held up because things are running slow, memory
> used out on the handlers could throw you over, especially if your heap
> is small.
> 
> What size heap are you running with?
> 
> St.Ack
> 
> 
> 
> On Tue, Aug 10, 2010 at 3:26 PM, Stuart Smith <stu24mail@yahoo.com>
> wrote:
> > Hello,
> >
> >   I'm seeing errors like so:
> >
> > 2010-08-10 12:58:38,938 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: Got ZooKeeper event, state: Disconnected, type: None, path: null
> > 2010-08-10 12:58:38,939 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: Disconnected, type: None, path: null
> >
> > 2010-08-10 12:58:38,941 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError, aborting.
> > java.lang.OutOfMemoryError: Java heap space
> >        at java.util.Arrays.copyOf(Arrays.java:2786)
> >        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:133)
> >        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:942)
> >
> > Then I see:
> >
> > 2010-08-10 12:58:39,408 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 79 on 60020, call close(-2793534857581898004) from 192.168.195.88:41233: error: java.io.IOException: Server not running, aborting
> > java.io.IOException: Server not running, aborting
> >
> > And finally:
> >
> > 2010-08-10 12:58:39,514 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Stop requested, clearing toDo despite exception
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60020
> > 2010-08-10 12:58:39,515 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60020: exiting
> >
> > And the server begins to shut down.
> >
> > Now, it's very likely these are due to retrieving unusually large cells - in fact,
> > that's my current assumption. I'm seeing M/R tasks fail intermittently with the
> > same issue on the read of cell data.
> >
> > My question is why does this bring the whole regionserver down? I would think
> > the regionserver would just fail the Get(), and move on...
> >
> > Am I misdiagnosing the error? Or is it the case that if I want different
> > behavior, I should pony up with some code? :)
> >
> > Take care,
> >  -stu
> >
> >
> >
> >
> >
> 


      
