hbase-user mailing list archives

From Jamal B <jm151...@gmail.com>
Subject Re: HBase issues since upgrade from 0.92.4 to 0.94.6
Date Mon, 15 Jul 2013 11:28:14 GMT
I believe that your workload after the upgrade caused the process to exceed
its 1 GB memory allocation, and your JVM flag -XX:OnOutOfMemoryError=kill
-9 %p worked as expected and killed it.  I would remove the kill hook, or
at least write some sort of entry to syslog before it kills the pid;
otherwise you have no log entry to point back to when the pid abruptly
dies, as in this case.
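For illustration, a minimal sketch of such a hook (the function name, log tag,
and log path are assumptions, not from this thread); it would be wired up as
-XX:OnOutOfMemoryError="/path/to/oom_kill.sh %p":

```shell
# oom_kill: illustrative OnOutOfMemoryError hook that leaves a trace
# before killing the process, so the abrupt death is explained somewhere.
oom_kill() {
  pid="$1"
  # Record the event durably first: logger(1) sends it to syslog, and the
  # echo fallback keeps a trace even where syslog is unavailable.
  logger -t hbase-oom "killing pid $pid after OutOfMemoryError" 2>/dev/null
  echo "$(date '+%Y-%m-%d %H:%M:%S') OOM: killing pid $pid" >> /tmp/hbase_oom.log
  # Only then terminate the process, as the original flag did.
  kill -9 "$pid"
}
```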

Also, someone please correct me if I'm wrong, but I thought that the
hbase.hregion.max.filesize config property does not enforce a hard maximum
region size; it is only the threshold at which a region split is triggered,
so a region can temporarily grow past it.
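For reference, a sketch of how that property is typically set in
hbase-site.xml, using the 1 GB value quoted further down this thread (the
comment restates the split-threshold behaviour):

```xml
<!-- hbase-site.xml: hbase.hregion.max.filesize is a split threshold,
     not a hard cap; a region splits once a store file grows past this
     size, so a region can temporarily exceed it. -->
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value><!-- 1 GB -->
</property>
```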


On Fri, Jul 12, 2013 at 12:15 PM, David Koch <ogdude@googlemail.com> wrote:

> Hello,
>
> This is the command that is used to launch the region servers:
>
> /usr/java/jdk1.7.0_25/bin/java -XX:OnOutOfMemoryError=kill -9 %p -Xmx1000m
> -Djava.net.preferIPv4Stack=true -Xmx1073741824 -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:-CMSConcurrentMTEnabled
> -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled
> -Dhbase.log.dir=/var/log/hbase
> -Dhbase.log.file=hbase-cmf-hbase1-REGIONSERVER-big-4.ezakus.net.log.out
> -Dhbase.home.dir=/opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hbase
> -Dhbase.id.str= -Dhbase.root.logger=INFO,RFA -Djava.library.path=<... libs
> ...>
>
> so it seems garbage collection logging is not activated. I can try to
> re-launch with the -verbose:gc flag
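If it helps, GC logging for a JDK 7 region server is typically enabled with
flags along these lines (the variable name and log path here are
illustrative, not taken from the thread):

```shell
# Illustrative JDK 7 GC-logging flags for the region server JVM, appended
# to whatever options variable the deployment uses (name assumed here).
HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -verbose:gc \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
```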
>
> All HBase settings are left at their (CDH 4.3) default for example:
> hfile.block.cache.size=0.25
> hbase.hregion.max.filesize=1GB
>
> except:
> hbase.hregion.majorcompaction=0
>
> speculative execution is off.
>
> The only solution we have found so far is lowering the workload by running
> less jobs in parallel.
>
> /David
>
>
> On Fri, Jul 12, 2013 at 1:48 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>
> > I do think your JVM on the RS crashed. Do you have a GC log?
> >
> > Do you set mapred.map.tasks.speculative.execution=false when you
> > use map jobs to read or write HBase?
> >
> > And if you have a heavy read/write load, how did you tune HBase, e.g.
> > block cache size, compaction, memstore?
> >
> >
> > On Fri, Jul 12, 2013 at 7:42 PM, David Koch <ogdude@googlemail.com>
> wrote:
> >
> > > Thank you for your responses. With respect to the version of Java, I
> > > found that Cloudera recommend 1.7.x for CDH 4.3:
> > > http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Requirements-and-Supported-Versions/cdhrsv_topic_3.html
> > >
> > >
> > > On Fri, Jul 12, 2013 at 1:32 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > > > Might want to run memtest also, just to be sure there is no memory
> > > > issue. There shouldn't be, since it was working fine with 0.92.4,
> > > > but it costs nothing...
> > > >
> > > > The last update of Java 6 is 45... It might also be worth a try if
> > > > you are running 1.6.
> > > >
> > > > 2013/7/12 Asaf Mesika <asaf.mesika@gmail.com>
> > > >
> > > > > You need to look at the JVM crash in the .out log file and see
> > > > > whether it is the native Hadoop .so code that is causing the
> > > > > problem. In our case we downgraded from JVM 1.6.0-37 to 1.6.0-33
> > > > > and it solved the issue.
> > > > >
> > > > >
> > > > > On Friday, July 12, 2013, David Koch wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > NOTE: I posted the same message in the Cloudera group.
> > > > > >
> > > > > > Since upgrading from CDH 4.0.1 (HBase 0.92.4) to 4.3.0 (HBase
> > > > > > 0.94.6) we systematically experience problems with region
> > > > > > servers crashing silently under workloads which used to pass
> > > > > > without problems. More specifically, we run about 30 mapper jobs
> > > > > > in parallel which read from HDFS and insert into HBase.
> > > > > >
> > > > > > region server log
> > > > > > NOTE: no trace of a crash, but the server is down and shows up
> > > > > > as such in Cloudera Manager.
> > > > > >
> > > > > > 2013-07-12 10:22:12,050 WARN
> > > > > > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File
> > > > > > hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286
> > > > > > might be still open, length is 0
> > > > > > 2013-07-12 10:22:12,051 INFO
> > > > > > org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file
> > > > > > hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXXt%2C60020%2C1373616547696.1373617004286
> > > > > > 2013-07-12 10:22:13,064 INFO
> > > > > > org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover
> > > > > > attempt for
> > > > > > hdfs://XXXXXXX:8020/hbase/.logs/XXXXXXX,60020,1373616547696-splitting/XXXXXXX%2C60020%2C1373616547696.1373617004286
> > > > > > 2013-07-12 10:22:14,819 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > 2013-07-12 10:22:14,824 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > ...
> > > > > > 2013-07-12 10:22:14,850 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > 2013-07-12 10:22:15,530 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.deflate]
> > > > > > < -- last log entry, region server is down here -- >
> > > > > >
> > > > > >
> > > > > > datanode log, same machine
> > > > > >
> > > > > > 2013-07-12 10:22:04,811 ERROR
> > > > > > org.apache.hadoop.hdfs.server.datanode.DataNode:
> > > > > > XXXXXXX:50010:DataXceiver error processing WRITE_BLOCK operation
> > > > > > src: /YYY.YY.YYY.YY:36024 dest: /XXX.XX.XXX.XX:50010
> > > > > > java.io.IOException: Premature EOF from inputStream
> > > > > >   at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
> > > > > >   at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
> > > > > >   at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
> > > > > >   at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
> > > > > >   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
> > > > > >   at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
> > > > > >   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
> > > > > >   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
> > > > > >   at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
> > > > > >   at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
> > > > > >   at java.lang.Thread.run(Thread.java:724)
> > > > > > < -- many repetitions of this -- >
> > > > > >
> > > > > > What could have caused this difference in stability?
> > > > > >
> > > > > > We did not change any configuration settings with respect to
> > > > > > the previous CDH 4.0.1 setup. In particular, we left ulimit and
> > > > > > dfs.datanode.max.xcievers at 32k. If need be, I can provide more
> > > > > > complete log/configuration information.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > /David
> > > > > >
> > > > >
> > > >
> > >
> >
>
