hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Fwd: Frequent downs of region server
Date Wed, 14 Jan 2009 09:20:19 GMT
I tried a 10,000 by 10,000 matrix-matrix multiplication on 3 nodes.

- Random matrices were successfully generated.
- The collecting jobs completed successfully.
- The multiplication in the map phase succeeded.

Then, during the reduce job (the sum and data-insert operations), the
following happened.
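The reduce step boils down to summing the partial products for each row and
committing them to HBase as one wide update. A minimal sketch of that write,
assuming the 0.18/0.19-era client API (the table name and row key are taken
from the log below; everything else is illustrative, not Hama's actual code):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.io.BatchUpdate;

    public class WideRowWrite {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "DenseMatrix_randgnegu");
        // One row of the 10,000-column result; in the real job this runs
        // inside reduce() after the sum operation.
        BatchUpdate update = new BatchUpdate("000000000000287");
        for (int j = 0; j < 10000; j++) {
          double sum = 0.0;  // stand-in for the summed partial products
          update.put("column:" + j, Double.toString(sum).getBytes());
        }
        // A single commit ships all 10,000 cells in one batchUpdates RPC,
        // the call visible in the stack trace below.
        table.commit(update);
      }
    }

Every reduce task follows this pattern, so each region server is asked to
absorb rows of 10,000 cells at a time.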

---------- Forwarded message ----------
From: stack <stack@duboce.net>
Date: Wed, Jan 14, 2009 at 3:50 PM
Subject: Re: Frequent downs of region server
To: hbase-user@hadoop.apache.org


Edward J. Yoon wrote:
> During the write operation in the reduce phase, region servers are getting killed.
> (64,000 rows with 10,000 columns, 3 nodes)

10k columns is probably more than HBase is currently able to handle (HBASE-867).

Have you seen the notes at the end of the
http://wiki.apache.org/hadoop/Hbase/Troubleshooting page?
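The usual first fix from that page is raising the file-descriptor limit for
the user running HBase and HDFS; region servers and datanodes hold a lot of
files open under load. For example (the value is a common suggestion, adjust
for your boxes):

    # Check the current open-file limit; the stock 1024 is usually too low.
    ulimit -n
    # Raise it for the session (make it permanent in /etc/security/limits.conf).
    ulimit -n 32768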

See other notes below:

> ----
> 09/01/14 13:07:59 INFO mapred.JobClient:  map 100% reduce 36%
> 09/01/14 13:11:38 INFO mapred.JobClient:  map 100% reduce 33%
> 09/01/14 13:11:38 INFO mapred.JobClient: Task Id :
> attempt_200901140952_0010_r_000017_1, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server 61.247.201.163:60020 for region
> DenseMatrix_randgnegu,,1231905480938, row '000000000000287', but
> failed after 10 attempts.
> Exceptions:
> java.io.IOException: java.io.IOException: Server not running, aborting
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2103)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1611)
> ----
>
Have you upped the HBase client timeouts?
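If not, the client-side knobs live in hbase-site.xml. The "failed after 10
attempts" above matches the default retry count, so something like this
(values illustrative) gives the cluster more room to recover between retries:

    <property>
      <name>hbase.client.retries.number</name>
      <value>20</value> <!-- default is 10 -->
    </property>
    <property>
      <name>hbase.client.pause</name>
      <value>5000</value> <!-- milliseconds between retries -->
    </property>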

> And I can't stop HBase.
>
> [d8g053:/root]# hbase-trunk/bin/stop-hbase.sh
> stopping master.................................................. [dots continue]
>
> Can it be recovered?

What does the master log say?  Why isn't it going down?  The tail of the log
will usually say why it's staying up.  Probably a particular HRegionServer?
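For example, on the master host (log names follow the hbase-daemon.sh
convention):

    tail -100 $HBASE_HOME/logs/hbase-*-master-*.log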

>
> ----
> Region server log:
>
> 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception: java.io.IOException: Unable to create new
> block.
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

These look like issues that the configuration on the troubleshooting page
might address (check your datanode logs).  Are you using HBase 0.18.0?
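In particular, if the datanode logs show the xceiver limit being exceeded,
one commonly suggested fix goes in hadoop-site.xml on every datanode (the
property really is spelled "xcievers"; the value is illustrative):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>2047</value> <!-- the era's default of 256 is easily exhausted -->
    </property>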

St.Ack



On Tue, Jan 13, 2009 at 8:42 PM, Edward J. Yoon <edwardyoon@apache.org>wrote:

> During the write operation in the reduce phase, region servers are getting killed.
> (64,000 rows with 10,000 columns, 3 nodes)
>
> ----
> 09/01/14 13:07:59 INFO mapred.JobClient:  map 100% reduce 36%
> 09/01/14 13:11:38 INFO mapred.JobClient:  map 100% reduce 33%
> 09/01/14 13:11:38 INFO mapred.JobClient: Task Id :
> attempt_200901140952_0010_r_000017_1, Status : FAILED
> org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to
> contact region server 61.247.201.163:60020 for region
> DenseMatrix_randgnegu,,1231905480938, row '000000000000287', but
> failed after 10 attempts.
> Exceptions:
> java.io.IOException: java.io.IOException: Server not running, aborting
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2103)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.batchUpdates(HRegionServer.java:1611)
> ----
>
> And I can't stop HBase.
>
> [d8g053:/root]# hbase-trunk/bin/stop-hbase.sh
> stopping master.................................................. [dots continue]
>
> Can it be recovered?
>
> ----
> Region server log:
>
> 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception: java.io.IOException: Unable to create new
> block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-4005955194083205373_14543 bad datanode[0]
> nodes == null
> 2009-01-14 13:03:56,591 WARN org.apache.hadoop.hdfs.DFSClient: Could
> not get block locations. Aborting...
> 2009-01-14 13:03:56,629 ERROR
> org.apache.hadoop.hbase.regionserver.CompactSplitThread:
> Compaction/Split failed for region
> DenseMatrix_randllnma,000000000000,18,7-29116,1231898419257
> java.io.IOException: Could not read from stream
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
>        at java.io.DataInputStream.readByte(DataInputStream.java:248)
>        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
>        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
>        at org.apache.hadoop.io.Text.readString(Text.java:400)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-01-14 13:03:56,631 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: starting  compaction on
> region DenseMatrix_randllnma,00000000000,16,19-26373,1231898311583
> 2009-01-14 13:03:56,692 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
> 2009-01-14 13:03:56,692 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
> 2009-01-14 13:03:56,693 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
> 2009-01-14 13:03:56,693 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
> 2009-01-14 13:03:57,521 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new compressor
> 2009-01-14 13:03:57,810 INFO org.apache.hadoop.hdfs.DFSClient:
> Exception in createBlockOutputStream java.io.IOException: Could not
> read from stream
> 2009-01-14 13:03:57,810 INFO org.apache.hadoop.hdfs.DFSClient:
> Abandoning block blk_-2612702056484946948_14554
> 2009-01-14 13:03:59,343 WARN org.apache.hadoop.hdfs.DFSClient:
> DataStreamer Exception: java.io.IOException: Unable to create new
> block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2723)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
>
> 2009-01-14 13:03:59,344 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-5255885897790790367_14543 bad datanode[0]
> nodes == null
> 2009-01-14 13:03:59,344 WARN org.apache.hadoop.hdfs.DFSClient: Could
> not get block locations. Aborting...
> 2009-01-14 13:03:59,344 FATAL
> org.apache.hadoop.hbase.regionserver.MemcacheFlusher: Replay of hlog
> required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region:
> DenseMatrix_randgnegu,,1231905480938
>        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:896)
>        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:789)
>        at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.flushRegion(MemcacheFlusher.java:227)
>        at org.apache.hadoop.hbase.regionserver.MemcacheFlusher.run(MemcacheFlusher.java:137)
> Caused by: java.io.IOException: Could not read from stream
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:119)
>        at java.io.DataInputStream.readByte(DataInputStream.java:248)
>        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:325)
>        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:346)
>        at org.apache.hadoop.io.Text.readString(Text.java:400)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2779)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2704)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
> 2009-01-14 13:03:59,359 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
> request=15, regions=48, stores=192, storefiles=756,
> storefileIndexSize=6, memcacheSize=338, usedHeap=395, maxHeap=971
> 2009-01-14 13:03:59,359 INFO
> org.apache.hadoop.hbase.regionserver.MemcacheFlusher:
> regionserver/0:0:0:0:0:0:0:0:60020.cacheFlusher exiting
> 2009-01-14 13:03:59,368 INFO
> org.apache.hadoop.hbase.regionserver.HLog: Closed
> hdfs://dev3.nm2.naver.com:9000/hbase/log_61.247.201.165_1231894400437_60020/hlog.dat.1231905813472,
> entries=896500. New log writer:
> /hbase/log_61.247.201.165_1231894400437_60020/hlog.dat.1231905839367
>
> 2009-01-14 13:03:59,368 INFO
> org.apache.hadoop.hbase.regionserver.LogRoller: LogRoller exiting.
>
>
>
> --
> Best Regards, Edward J. Yoon @ NHN, corp.
> edwardyoon@apache.org
> http://blog.udanax.org
>



-- 
Best Regards, Edward J. Yoon @ NHN, corp.
edwardyoon@apache.org
http://blog.udanax.org
