hbase-user mailing list archives

From Stuart Smith <stu24m...@yahoo.com>
Subject Re: Trying to write too much to stdout destabilizes cluster across reboots
Date Wed, 14 Jul 2010 01:38:07 GMT
Last follow-up:

It appears that just rebooting, babysitting the regionservers, watching them die.. and repeating the process eventually recovered the cluster.

I still haven't gone through all the errors very carefully. I did have hundreds of hbase client threads uploading data while I was running the M/R task that dumped the output. Based on the logs, I'm wondering if it was just getting swamped trying to recover & rebuild all those writes? Not sure.

Well, it worked eventually, so I'm going to just upgrade to 0.20.5 ASAP, and try to be more
careful about how much I write to stdout.

Take care,
  -stu

--- On Tue, 7/13/10, Stuart Smith <stu24mail@yahoo.com> wrote:

> From: Stuart Smith <stu24mail@yahoo.com>
> Subject: Re: Trying to write too much to stdout destabilizes cluster across reboots
> To: user@hbase.apache.org
> Date: Tuesday, July 13, 2010, 9:10 PM
> One follow up.
> 
> I noticed the master was saying this as well:
> 
> 2010-07-13 17:48:17,595 WARN org.apache.hadoop.hbase.master.HMaster: Processing pending operations: ProcessRegionClose of filestore,5c4ee356835d82de844600895a7df1595c8b53bd378436f6e8d2a1a99acb550c,1279043794914, false, reassign: true
> java.lang.RuntimeException: java.lang.NullPointerException
>     at org.apache.hadoop.hbase.master.RetryableMetaOperation.doWithRetries(RetryableMetaOperation.java:96)
>     at org.apache.hadoop.hbase.master.ProcessRegionClose.process(ProcessRegionClose.java:63)
>     at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:494)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:429)
> Caused by: java.lang.NullPointerException
> 
> Which led me to this:
> 
> https://issues.apache.org/jira/browse/HBASE-2428
> 
> I guess I shouldn't have put off upgrading to 0.20.5 :(
> I'm curious whether, by this comment:
> 
> "Since delayqueue is processed first, all we did was try to process the close even though in the other more live todo queue, we might have the process of the regionserver shutdown waiting, the process of the regionserver that was carrying .meta. We'd never recover."
> 
> they mean never as in _never_, or just until the next reboot. Or is my cluster (and my data) dead and gone now?
> 
> Take care,
>   -stu
> 
> --- On Tue, 7/13/10, Stuart Smith <stu24mail@yahoo.com> wrote:
> 
> > From: Stuart Smith <stu24mail@yahoo.com>
> > Subject: Trying to write too much to stdout destabilizes cluster across reboots
> > To: user@hbase.apache.org
> > Date: Tuesday, July 13, 2010, 8:58 PM
> > Hello,
> > 
> > I have an M/R task running over hbase tables. I added one line of output in my reduce tasks running over a set of tens of millions of records. This was a little silly, but I was kind of curious what would happen.
> > 
> > Well, the reduce ran very slowly, and eventually I had to kill the tasks and re-start the whole cluster, as other clients were slowing down considerably and the M/R job started dumping all the stdout directly to the console it was running on.
> >  
> > I've re-started it twice, but I couldn't get hbase to come back up (fully). And when I shut down, the regionservers will go down when I tell them, but the master needs to be killed with -s 9 (SIGKILL? I'm a little rusty). It eventually seemed to hobble towards coming back up.. it took a looong time, with messages like:
> >  
> >  
> > 2010-07-13 16:32:47,722 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-9203944767743886537_123926
> > 2010-07-13 16:32:48,207 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Waiting for hlog writers to terminate, iteration #93
> > 2010-07-13 16:32:52,205 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
> > 2010-07-13 16:32:52,205 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-7742458689391305082_123926
> > 2010-07-13 16:32:53,072 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.EOFException
> > 2010-07-13 16:32:53,072 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-281711885322381263_123926
> > 2010-07-13 16:32:53,207 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Waiting for hlog writers to terminate, iteration #94
> > 2010-07-13 16:32:53,726 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.193.155:50010
> > 2010-07-13 16:32:53,726 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_-2730636850915812524_123926
> >  
> > This just seems odd.. dumping that much standard output was kind of silly, but I wouldn't think it would wedge the cluster like this... or did something else happen? Everything had been fine for a while until I did this.
> > 
> > I let the cluster sit around for a while, and it did come back for a little while, after printing this out over and over (on the master):
> >  
> > da7738a3ad9,1279039241464 from ubuntu-hadoop-2,60020,1279066792598; 1 of 1
> > 2010-07-13 17:28:30,897 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_PROCESS_OPEN: filemetadata,unknown_23bf6f189f81218a877a41d1ccd82751527abf19beb61281b8a0f6475e682764,1278533813300 from ubuntu-hadoop-4,60020,1279066792615; 1 of 3
> > 2010-07-13 17:28:30,897 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_PROCESS_OPEN: filestore,7f45c6a2f5c7d09ae8618bf6d0db5b04f346c583a1dbf24d9db135f2c8da9b57,1279050466715 from ubuntu-hadoop-4,60020,1279066792615; 2 of 3
> > 2010-07-13 17:28:30,897 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_PROCESS_OPEN: .META.,,1 from ubuntu-hadoop-4,60020,1279066792615; 3 of 3
> >  
> > But regionservers kept coming down, and bringing down the .META. region.. eventually the master seemed to recover, and I would bring the regionservers back up individually.. one of them would soon crash again, repeating the cycle. And the master would die with this on the web page:
> > 
> > Trying to contact region server 192.168.193.155:60020 for region .META.,,1, row '', but failed after 3 attempts.
> > Exceptions:
> > java.net.ConnectException: Connection refused
> > java.net.ConnectException: Connection refused
> > java.net.ConnectException: Connection refused
> > 
> > RequestURI=/master.jsp
> > 
> > The regionservers appear to start dying with this statement:
> >  
> > 
> > 2010-07-13 02:12:59,257 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for region filestore,5a1ec52137fb8bb38affb5ca034ec3817c6611adc2f5a2b399a4ab92effcdedd,1278970890717/1436533481 because: regionserver/192.168.193.145:60020.cacheFlusher
> > 2010-07-13 02:13:19,336 WARN org.apache.hadoop.hdfs.DFSClient: Failed to connect to /192.168.193.145:50010 for file /hbase/.META./1028785192/info/3836661676936096933 for block -649923918996291451:java.io.EOFException
> >         at java.io.DataInputStream.readShort(DataInputStream.java:298)
> >         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1384)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchBlockByteRange(DFSClient.java:1853)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1922)
> >         at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:46)
> >         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:101)
> >  
> >  
> > Note that it is failing to contact 192.168.193.145 ... this is the IP of the node itself (the dying one).
> > 
> > There's a long sequence of exceptions, and the regionserver's death ends with this:
> >  
> > 2010-07-13 17:36:57,068 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> > 2010-07-13 17:36:57,069 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/filestore/compaction.dir/1408434857/4172852130952668434 : java.io.EOFException
> > java.io.EOFException
> >         at java.io.DataInputStream.readByte(DataInputStream.java:250)
> >         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
> >         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
> >         at org.apache.hadoop.io.Text.readString(Text.java:400)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream
> >  
> > It just seems like a lot of trouble from one little System.out.println() statement..
> > 
> > Restarting hbase just repeats the cycle: "Bad connect ack with firstBadLink", eventual recovery, regionservers start crashing, cluster becomes unusable..
> > 
> > In summary, it would seem printing too much data to stdout can render the whole cluster unstable - across reboots - and I haven't figured out how to stabilize it again.
> > 
> > Take care,
> >   -stu
> >  
> > PS. The address for the mailing list on the hbase page is wrong. It shows:
> > 
> > user@hhbase.apache.org
> > 
> > But it should be user@hbase.apache.org. I hope! (If this mail goes through, I got it right.)