hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zhenyu Zhong <zhongresea...@gmail.com>
Subject Re: regionserver disconnection
Date Wed, 11 Nov 2009 20:29:02 GMT
Thanks a lot for the help.

I understand that the change of disk configurations should be done
progressively such that HDFS can balance and recover from a disk going away.

The data we are trying to import into HBase is about 200million rows, each
row is about hundreds of bytes. Basically, we read in a text file from HDFS,
then  for each row in the text file, we fetch data from HBase table, we
compare the column value and update the latest value back into HBase. I was
suggested to use CheckAndPut.  Anyway this is the common usecase on my side.


Many thanks
zhenyu



On Wed, Nov 11, 2009 at 3:03 PM, Jean-Daniel Cryans <jdcryans@apache.org>wrote:

> 6 disks, 19 machines and you see those kind of issues? I'm quite
> surprised and I want to learn more about what you trying to to.
>
> But first, if hbase is managing ZK for you then you can change the
> value of hbase.zookeeper.property.dataDir so that it points to its own
> mount. Make sure you also change dfs.data.dir in conf/hdfs-site.xml
> (hadoop stuff) so that it stops pointing there (you should have 6
> values there at the moment if you configured it correctly). Doing this
> could result in data loss if you do it all at once so do the change
> progressively. This also means that you may have to keep a seperate
> HDFS configuration for those particular nodes. Please ask more
> questions if it's not very clear.
>
> WRT to dedicating nodes to Zookeeeper, it makes sense only if you have
> a lot of nodes but in your case, unless there's also CPU contention,
> you're good with only a dedicated disk.
>
> Now for what you are trying to do, is this a massive import that goes
> over many hours? How big is each row? I'd love to get a lot of details
> to help you optimize your cluster for your particular problem.
>
> Thx,
>
> J-D
>
> On Wed, Nov 11, 2009 at 11:51 AM, Zhenyu Zhong <zhongresearch@gmail.com>
> wrote:
> > J-D,
> >
> > Thanks a lot for the insightful comments.
> > We are running 5 zookeeper quorums now. On these 5 quorum servers, we
> also
> > run datanode, tasktracker and regionserver.
> > Now we have 6 disks per machine.
> > If it is the disk IO saturation, is there anyway to configure zookeeper
> to
> > write to a seperate disk which is not the same one used by HDFS?
> > Or shall we run zookeeper quorums on different nodes other than HDFS?
> >
> > Are HLogs generated by Regionservers? Would it alleviate the issue if we
> use
> > the first 14 for regionserver, and the rest of 5 for zookeeper?(In that
> > sense, the node running zookeeper quorum won't run regionserver)
> >
> > I am very appreciated for your comments.
> >
> > Best,
> > zhenyu
> >
> >
> >
> > On Wed, Nov 11, 2009 at 2:30 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> The problem seems to be that the IO is saturated. When you see this:
> >>
> >> #
> >> 2009-11-11 04:28:10,308 WARN
> >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 9 on
> >> 60021 took 6819ms appending an edit to hlog; editcount=370
> >> #
> >> 2009-11-11 04:28:10,308 WARN
> >> org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler 2 on
> >> 60021 took 4412ms appending an edit to hlog; editcount=371
> >>
> >> It's a sign that HDFS is slow. I guess that the zookeeper quorum
> >> members are on those same nodes using the same disks? If so, there's
> >> the problem because Zookeeper needs to be able to commit transactions
> >> very fast and if your disks are already hogged then they have to wait
> >> and you can get disconnections. Do you still have 19 quorum members?
> >> How many disks do you have per machine?
> >>
> >> Thx
> >>
> >> J-D
> >>
> >> On Wed, Nov 11, 2009 at 11:22 AM, Zhenyu Zhong <zhongresearch@gmail.com
> >
> >> wrote:
> >> > Please take a look at the following regionserver log.
> >> > The error started to come out from 4:24am UTC to 4:29am UTC.
> >> >
> >> > http://pastebin.com/m3717d064
> >> >
> >> > Thanks in advance.
> >> >
> >> > Best,
> >> > zhenyu
> >> >
> >> >
> >> > On Wed, Nov 11, 2009 at 2:01 PM, Jean-Daniel Cryans <
> jdcryans@apache.org
> >> >wrote:
> >> >
> >> >> Could you paste a bigger log trace (interested to see at least all
> 300
> >> >> lines before that) in a pastebin?
> >> >>
> >> >> Thx,
> >> >>
> >> >> J-D
> >> >>
> >> >> On Wed, Nov 11, 2009 at 10:56 AM, Zhenyu Zhong <
> zhongresearch@gmail.com
> >> >
> >> >> wrote:
> >> >> > J-D,
> >> >> >
> >> >> > Thanks for the reply.
> >> >> >
> >> >> > The problem I have is one or more of the regionservers disconnect
> from
> >> >> the
> >> >> > cluster.
> >> >> >
> >> >> > I checked the resource usage including memory Heap usage for
> >> >> regionserver,
> >> >> > and found nothing suspicious. The regionserver heap usage is way
> below
> >> >> the
> >> >> > maxHeap defined in jvm.
> >> >> >
> >> >> > So I checked zookeeper log, then I found the following
> >> >> > BufferUnderflowException, eventually it leads to SocketException:
> >> >> Transport
> >> >> > endpoint is not connected.
> >> >> >
> >> >> >
> >> >> > I also checked the regionserver log. It looks like the regionserver
> >> tried
> >> >> to
> >> >> > connect to other zookeeper quorum server. It connects successfully,
> >> but
> >> >> then
> >> >> > it complained about SocketException: transport endpoint is not
> >> connected.
> >> >> >
> >> >> > So both zookeeper log and regionserver log told me that the
> transport
> >> >> > endpoint is not connected. I don't know why.
> >> >> >
> >> >> > I would be very appreciated if you could point me to the right
> >> direction.
> >> >> >
> >> >> > Best,
> >> >> >
> >> >> >
> >> >> > Regionserver log:
> >> >> > 2009-11-11 04:24:53,718 INFO org.apache.zookeeper.ClientCnxn:
> >> Attempting
> >> >> > connection to server superpyxis0018/192.168.100.132:2181
> >> >> > 2009-11-11 04:24:53,718 INFO org.apache.zookeeper.ClientCnxn:
> Priming
> >> >> > connection to java.nio.channels.SocketChannel[connected local=/
> >> >> > 192.168.100.131:43065 remote=superpyxis0018/192.168.100.132:2181]
> >> >> > 2009-11-11 04:24:53,718 INFO org.apache.zookeeper.ClientCnxn:
> Server
> >> >> > connection successful
> >> >> > 2009-11-11 04:24:53,768 INFO
> >> >> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper
> >> event,
> >> >> > state: Disconnected, type: None, path: null
> >> >> > 2009-11-11 04:24:54,118 INFO org.apache.zookeeper.ClientCnxn:
> >> Attempting
> >> >> > connection to server superpyxis0017/192.168.100.131:2181
> >> >> > 2009-11-11 04:24:54,118 INFO org.apache.zookeeper.ClientCnxn:
> Priming
> >> >> > connection to java.nio.channels.SocketChannel[connected local=/
> >> >> > 192.168.100.131:46903 remote=superpyxis0017/192.168.100.131:2181]
> >> >> > 2009-11-11 04:24:54,118 INFO org.apache.zookeeper.ClientCnxn:
> Server
> >> >> > connection successful
> >> >> > 2009-11-11 04:24:54,118 WARN org.apache.zookeeper.ClientCnxn:
> >> Exception
> >> >> > closing session 0x24dfd90c810002 to
> >> sun.nio.ch.SelectionKeyImpl@2dd78ccf
> >> >> > java.io.IOException: Read error rc = -1
> >> java.nio.DirectByteBuffer[pos=0
> >> >> > lim=4 cap=4]
> >> >> >        at
> >> >> >
> org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:701)
> >> >> >        at
> >> >> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:945)
> >> >> > 2009-11-11 04:24:54,118 WARN org.apache.zookeeper.ClientCnxn:
> Ignoring
> >> >> > exception during shutdown input
> >> >> > java.net.SocketException: Transport endpoint is not connected
> >> >> >        at sun.nio.ch.SocketChannelImpl.shutdown(Native Method)
> >> >> >        at
> >> >> >
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> >> >> >        at
> >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> >> >> >        at
> >> >> >
> >> org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:999)
> >> >> >        at
> >> >> > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:970)
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Wed, Nov 11, 2009 at 1:40 PM, Jean-Daniel Cryans <
> >> jdcryans@apache.org
> >> >> >wrote:
> >> >> >
> >> >> >> I'm really not sure exactly what the situation is with the
> >> >> >> java.nio.BufferUnderflowException, but what do you mean by
> >> >> >> "regionserver disconnection"? Could you give more details?
> >> >> >>
> >> >> >> Thx,
> >> >> >>
> >> >> >> J-D
> >> >> >>
> >> >> >> On Wed, Nov 11, 2009 at 10:34 AM, Zhenyu Zhong <
> >> zhongresearch@gmail.com
> >> >> >
> >> >> >> wrote:
> >> >> >> > I searched around and found HBASE-1645 "HBase could be
a little
> >> more
> >> >> >> dainty
> >> >> >> > when zk is pulled out from under it"
> >> >> >> >
> >> >> >> > I am wondering whether my problem is related to this
HBase-1645.
> Or
> >> >> >> probably
> >> >> >> > has nothing to do with it?
> >> >> >> >
> >> >> >> > thanks
> >> >> >> >
> >> >> >> >
> >> >> >> > On Wed, Nov 11, 2009 at 12:02 PM, Zhenyu Zhong <
> >> >> zhongresearch@gmail.com
> >> >> >> >wrote:
> >> >> >> >
> >> >> >> >> Hi,
> >> >> >> >>
> >> >> >> >> I have been monitoring the system resources such
as memory Heap
> >> usage
> >> >> >> for
> >> >> >> >> Regionserver, load average, network bandwidth, etc.
> >> >> >> >>
> >> >> >> >> I experienced a disconnection last night. When I
checked the
> >> memory
> >> >> Heap
> >> >> >> >> for Regionserver, it was 2336 which is way below
the max Heap
> >> 4096. I
> >> >> >> don't
> >> >> >> >> think it is Regionserver problem. The requests that
the
> >> regionserver
> >> >> got
> >> >> >> was
> >> >> >> >> about 126 which doesn't seem to be high. The load
average is
> kind
> >> of
> >> >> >> high
> >> >> >> >> >100. (but I saw disconnections while the load
average is
> around
> >> 30).
> >> >> >> The
> >> >> >> >> bandwidth is not a problem.
> >> >> >> >>
> >> >> >> >> However, I found error messages from the zookeeper.
Please see
> the
> >> >> >> >> following error messages. There are some
> >> >> >> >> BufferUnderflowException error, IOException error,etc.
> >> >> >> >> May I ask what might be the causes that generate
these error
> >> >> messages?
> >> >> >> >>
> >> >> >> >> Many thanks
> >> >> >> >> zhenyu
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> 2009-11-11 04:29:00,810 WARN
> org.apache.zookeeper.server.Request:
> >> >> >> Ignoring
> >> >> >> >> exception during toString
> >> >> >> >> java.nio.BufferUnderflowException
> >> >> >> >>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
> >> >> >> >>         at java.nio.ByteBuffer.get(ByteBuffer.java:675)
> >> >> >> >>         at
> >> >> >> org.apache.zookeeper.server.Request.toString(Request.java:199)
> >> >> >> >>         at java.lang.String.valueOf(String.java:2827)
> >> >> >> >>         at
> java.lang.StringBuilder.append(StringBuilder.java:115)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.quorum.CommitProcessor.processRequest(CommitProcessor.java:167)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:68)
> >> >> >> >> 2009-11-11 04:29:00,810 WARN
> >> >> org.apache.zookeeper.server.NIOServerCnxn:
> >> >> >> >> Exception causing close of session 0x424dfd908c50009
> >> >> >> >> due to java.io.IOException: Read error
> >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
> >> >> org.apache.zookeeper.server.NIOServerCnxn:
> >> >> >> >> IOException stack trace
> >> >> >> >> java.io.IOException: Read error
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:372)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221)
> >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
> >> >> >> >> org.apache.zookeeper.server.quorum.CommitProcessor:
Processing
> >> >> request::
> >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches
> >> cxid:0xfffffffffffffff8
> >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a
> >> >> >> >> 2009-11-11 04:29:00,810 WARN
> org.apache.zookeeper.server.Request:
> >> >> >> Ignoring
> >> >> >> >> exception during toString
> >> >> >> >> java.nio.BufferUnderflowException
> >> >> >> >>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
> >> >> >> >>         at java.nio.ByteBuffer.get(ByteBuffer.java:675)
> >> >> >> >>         at
> >> >> >> org.apache.zookeeper.server.Request.toString(Request.java:199)
> >> >> >> >>         at java.lang.String.valueOf(String.java:2827)
> >> >> >> >>         at
> java.lang.StringBuilder.append(StringBuilder.java:115)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:74)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
> >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
> >> >> >> >> org.apache.zookeeper.server.FinalRequestProcessor:
Processing
> >> >> request::
> >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches
> >> cxid:0xfffffffffffffff8
> >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a
> >> >> >> >> 2009-11-11 04:29:00,810 WARN
> org.apache.zookeeper.server.Request:
> >> >> >> Ignoring
> >> >> >> >> exception during toString
> >> >> >> >> java.nio.BufferUnderflowException
> >> >> >> >>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:127)
> >> >> >> >>         at java.nio.ByteBuffer.get(ByteBuffer.java:675)
> >> >> >> >>         at
> >> >> >> org.apache.zookeeper.server.Request.toString(Request.java:199)
> >> >> >> >>         at
> >> >> >> >>
> >> org.apache.log4j.or.DefaultRenderer.doRender(DefaultRenderer.java:36)
> >> >> >> >>         at
> >> >> >> >>
> org.apache.log4j.or.RendererMap.findAndRender(RendererMap.java:80)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.log4j.spi.LoggingEvent.getRenderedMessage(LoggingEvent.java:362)
> >> >> >> >> at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.log4j.helpers.PatternParser$BasicPatternConverter.convert(PatternParser.java:403)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.log4j.helpers.PatternConverter.format(PatternConverter.java:65)
> >> >> >> >>         at
> >> >> org.apache.log4j.PatternLayout.format(PatternLayout.java:502)
> >> >> >> >>         at
> >> >> >> >>
> org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:302)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:359)
> >> >> >> >>         at
> >> >> >> org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
> >> >> >> >>         at
> >> >> >> >>
> >> org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
> >> >> >> >>         at
> >> org.apache.log4j.Category.callAppenders(Category.java:206)
> >> >> >> >>         at
> org.apache.log4j.Category.forcedLog(Category.java:391)
> >> >> >> >>         at org.apache.log4j.Category.debug(Category.java:260)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:147)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
> >> >> >> >> 2009-11-11 04:29:00,810 WARN
> org.apache.zookeeper.server.Request:
> >> >> >> Ignoring
> >> >> >> >> exception during toString
> >> >> >> >> sessionid:0x424dfd908c50009 type:setWatches
> >> cxid:0xfffffffffffffff8
> >> >> >> >> zxid:0xfffffffffffffffe txntype:unknown n/a
> >> >> >> >> 2009-11-11 04:29:00,810 INFO
> >> >> org.apache.zookeeper.server.NIOServerCnxn:
> >> >> >> >> closing session:0x424dfd908c50009 NIOServerCnxn:
> >> >> >> >> java.nio.channels.SocketChannel[connected local=/
> >> >> 192.168.100.131:2181
> >> >> >> remote=/
> >> >> >> >> 192.168.100.131:35500]
> >> >> >> >> 2009-11-11 04:29:00,810 DEBUG
> >> >> org.apache.zookeeper.server.NIOServerCnxn:
> >> >> >> >> ignoring exception during input shutdown
> >> >> >> >> java.net.SocketException: Transport endpoint is not
connected
> >> >> >> >>         at sun.nio.ch.SocketChannelImpl.shutdown(Native
Method)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChannelImpl.java:640)
> >> >> >> >>         at
> >> >> >> sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.java:360)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.zookeeper.server.NIOServerCnxn.close(NIOServerCnxn.java:848)
> >> >> >> >>         at
> >> >> >> >>
> >> >>
> org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:498)
> >> >> >> >>         at
> >> >> >> >>
> >> >> >>
> >> >>
> >>
> org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:221)
> >> >> >> >>
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message