incubator-hama-dev mailing list archives

From "Samuel Guo" <guosi...@gmail.com>
Subject Re: Frequent downs of region server
Date Wed, 14 Jan 2009 09:33:37 GMT
A 3-node cluster of Hadoop?
A 3-node cluster of HBase?

Can you attach the logs of the Hadoop namenode, the datanodes, the HBase
master, and the HBase regionservers? Thanks in advance.

I suspect that too many open files caused the datanodes to use up all of
their xceivers, so the DFSClient could not create a new block.
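
If that is what happened, the usual remedy from the HBase troubleshooting
wiki is to raise the datanode xceiver limit in hdfs-site.xml and restart
the datanodes. A minimal sketch, assuming the 0.18-era property name (note
the historical misspelling "xcievers"; the value here is illustrative):

----
<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- default is 256, which a busy HBase cluster can easily exhaust -->
  <value>2048</value>
</property>
----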

On Wed, Jan 14, 2009 at 5:20 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:

> I tried a 10,000 by 10,000 matrix-matrix multiplication on 3 nodes.
>
> - random matrices were successfully generated.
> - the collecting jobs completed successfully.
> - the multiplication succeeded in the map phase.
>
> Then, during the reduce job (the sum and data-insert operations), the
> following happened.
>
> ---------- Forwarded message ----------
> From: stack <stack@duboce.net>
> Date: Wed, Jan 14, 2009 at 3:50 PM
> Subject: Re: Frequent downs of region server
> To: hbase-user@hadoop.apache.org
>
>
> Edward J. Yoon wrote:
> > During the write operation in the reduce phase, the region servers are
> > killed. (64,000 rows with 10,000 columns, 3 nodes)
>
> 10k columns is probably beyond what hbase is currently able to do
> (HBASE-867).
>
> You've seen the notes at the end of the
> http://wiki.apache.org/hadoop/Hbase/Troubleshooting page?
>
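> In particular, the wiki's note about raising the open-file limit for the
> user that runs the daemons is likely relevant; the DFSClient errors below
> are a classic symptom of descriptor or xceiver exhaustion. A sketch
> (values are illustrative, not quoted from the wiki):
>
> ----
> # check the limit in the shell that starts hadoop/hbase
> ulimit -n
> # raise it persistently, e.g. in /etc/security/limits.conf:
> #   hadoop  soft  nofile  32768
> #   hadoop  hard  nofile  32768
> ----
>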
> See other notes below:
>
> > ----
> > 09/01/14 13:07:59 INFO mapred.JobClient:  map 100% reduce 36%
> > 09/01/14 13:11:38 INFO mapred.JobClient:  map 100% reduce 33%
> > 09/01/14 13:11:38 INFO mapred.JobClient: Task Id :
> > attempt_200901140952_0010_r_000017_1, Status : FAILED
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> > contact region server 61.247.201.163:60020 for region
> > DenseMatrix_randgnegu,,1231905480938, row '000000000000287', but
> > failed after 10 attempts.
> > Exceptions:
> > java.io.IOException: java.io.IOException: Server not running, aborting
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2103)
> >         at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1611)
> > ----
> >
> You upped the hbase client timeouts?
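>
> If not, the client-side knobs live in hbase-site.xml. A sketch, assuming
> the 0.18-era property names (the "failed after 10 attempts" above matches
> the default retry count; values and defaults are from memory, so verify):
>
> ----
> <property>
>   <name>hbase.client.pause</name>
>   <value>10000</value> <!-- ms to sleep between retries -->
> </property>
> <property>
>   <name>hbase.client.retries.number</name>
>   <value>20</value> <!-- default is 10 -->
> </property>
> ----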
>
> > And I can't stop HBase.
> >
> > [d8g053:/root]# hbase-trunk/bin/stop-hbase.sh
> > stopping master.............................................................................................................................................................................................................
> >
> > Can it be recovered?
>
> What does the master log say?  Why ain't it going down?  On the tail of
> the log it'll usually say why it's staying up.  Probably a particular
> HRegionServer?
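>
> Something like this will show it (log path and file naming assumed from
> the default $HBASE_HOME/logs layout; adjust the user and hostname):
>
> ----
> tail -f $HBASE_HOME/logs/hbase-root-master-d8g053.log
> ----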
>
> >
> > ----
> > Region server log:
> >
> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
> > DataStreamer Exception: java.io.IOException: Unable to create new
> > block.
> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
> These look like the issue that the config noted on the troubleshooting
> page might address (check your datanode logs).  You are using hbase
> 0.18.0?
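>
> Grepping the datanode logs for the xceiver message will confirm whether
> the limit was hit. A sketch (the message wording is from memory and the
> log path is assumed):
>
> ----
> grep -i "exceeds the limit of concurrent xcievers" \
>     $HADOOP_HOME/logs/hadoop-*-datanode-*.log
> ----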
>
> St.Ack
>
>
>
> On Tue, Jan 13, 2009 at 8:42 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>
> > During the write operation in the reduce phase, the region servers are
> > killed. (64,000 rows with 10,000 columns, 3 nodes)
> >
> > ----
> > 09/01/14 13:07:59 INFO mapred.JobClient:  map 100% reduce 36%
> > 09/01/14 13:11:38 INFO mapred.JobClient:  map 100% reduce 33%
> > 09/01/14 13:11:38 INFO mapred.JobClient: Task Id :
> > attempt_200901140952_0010_r_000017_1, Status : FAILED
> > org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> > contact region server 61.247.201.163:60020 for region
> > DenseMatrix_randgnegu,,1231905480938, row '000000000000287', but
> > failed after 10 attempts.
> > Exceptions:
> > java.io.IOException: java.io.IOException: Server not running, aborting
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2103)
> >        at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1611)
> > ----
> >
> > And I can't stop HBase.
> >
> > [d8g053:/root]# hbase-trunk/bin/stop-hbase.sh
> > stopping master.............................................................................................................................................................................................................
> >
> > Can it be recovered?
> >
> > ----
> > Region server log:
> >
> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
> > DataStreamer Exception: java.io.IOException: Unable to create new
> > block.
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-4005955194083205373_14543 bad datanode[0]
> > nodes == null
> > 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient: Could
> > not get block locations. Aborting...
> > 2009-01-14 13:03:56,629 ERROR
> > org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> > Compaction/Split failed for region
> > DenseMatrix_randllnma,000000000000,18,7-29116,1231898419257
> > java.io.IOException: Could not read from stream
> >        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
> >        at java.io.DataInputStream.readByte(DataInputStream.java:248)
> >        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
> >        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
> >        at org.apache.hadoop.io.Text.readString(Text.java:400)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > 2009-01-14 13:03:56,631 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion: starting  compaction on
> > region DenseMatrix_randllnma,00000000000,16,19-26373,1231898311583
> > 2009-01-14 13:03:56,692 INFO org.apache.hadoop.io.compress.CodecPool:
> > Got brand-new decompressor
> > 2009-01-14 13:03:56,692 INFO org.apache.hadoop.io.compress.CodecPool:
> > Got brand-new decompressor
> > 2009-01-14 13:03:56,693 INFO org.apache.hadoop.io.compress.CodecPool:
> > Got brand-new decompressor
> > 2009-01-14 13:03:56,693 INFO org.apache.hadoop.io.compress.CodecPool:
> > Got brand-new decompressor
> > 2009-01-14 13:03:57,521 INFO org.apache.hadoop.io.compress.CodecPool:
> > Got brand-new compressor
> > 2009-01-14 13:03:57,810 INFO org.apache.hadoop.hdfs.DFSClient:
> > Exception in createBlockOutputStream java.io.IOException: Could not
> > read from stream
> > 2009-01-14 13:03:57,810 INFO org.apache.hadoop.hdfs.DFSClient:
> > Abandoning block blk_-2612702056484946948_14554
> > 2009-01-14 13:03:59,343 WARN org.apache.hadoop.hdfs.DFSClient:
> > DataStreamer Exception: java.io.IOException: Unable to create new
> > block.
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> >
> > 2009-01-14 13:03:59,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> > Recovery for block blk_-5255885897790790367_14543 bad datanode[0]
> > nodes == null
> > 2009-01-14 13:03:59,344 WARN org.apache.hadoop.hdfs.DFSClient: Could
> > not get block locations. Aborting...
> > 2009-01-14 13:03:59,344 FATAL
> > org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog
> > required. Forcing server shutdown
> > org.apache.hadoop.hbase.DroppedSnapshotException: region:
> > DenseMatrix_randgnegu,,1231905480938
> >        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:896)
> >        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:789)
> >        at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:227)
> >        at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:137)
> > Caused by: java.io.IOException: Could not read from stream
> >        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
> >        at java.io.DataInputStream.readByte(DataInputStream.java:248)
> >        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
> >        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
> >        at org.apache.hadoop.io.Text.readString(Text.java:400)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
> >        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> > 2009-01-14 13:03:59,359 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> > request=15, regions=48, stores=192, storefiles=756,
> > storefileIndexSize=6, memcacheSize=338, usedHeap=395, maxHeap=971
> > 2009-01-14 13:03:59,359 INFO
> > org.apache.hadoop.hbase.regionserver.MemcacheFlusher:
> > regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher exiting
> > 2009-01-14 13:03:59,368 INFO
> > org.apache.hadoop.hbase.regionserver.HLog: Closed
> > hdfs://dev3.nm2.naver.com:9000/hbase/log_61.247.201.165_1231894400437_60020/hlog.dat.1231905813472,
> > entries=896500. New log writer:
> > /hbase/log_61.247.201.165_1231894400437_60020/hlog.dat.1231905839367
> >
> > 2009-01-14 13:03:59,368 INFO
> > org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon @ NHN, corp.
> > edwardyoon@apache.org
> > http://blog.udanax.org
> >
>
>
>
> --
> Best Regards, Edward J. Yoon @ NHN, corp.
> edwardyoon@apache.org
> http://blog.udanax.org
>
