hbase-user mailing list archives

From Michel Segel <michael_se...@hotmail.com>
Subject Re: Occasional regionserver crashes following socket errors writing to HDFS
Date Thu, 10 May 2012 11:53:39 GMT
Silly question...
Why are you using a reducer when working with HBase?

Second silly question... What is the max file size of the table you are writing to?

Third silly question... How many regions are on each of your region servers?

Fourth silly question... There is this bandwidth setting... Default is 10MB... Did you modify it?
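
(For reference, purely as a sketch: the settings these questions likely point at are the per-table/region max file size and an HDFS bandwidth throttle. The property names below are my best guess from context, not something Mike specified, and the values are illustrative rather than recommendations.)

 <!-- hbase-site.xml: maximum store file size before a region splits -->
 <property>
   <name>hbase.hregion.max.filesize</name>
   <value>1073741824</value>
 </property>
 <!-- hdfs-site.xml: balancer bandwidth cap, possibly the "bandwidth setting" meant here -->
 <property>
   <name>dfs.balance.bandwidthPerSec</name>
   <value>10485760</value>
 </property>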



Sent from a remote device. Please excuse any typos...

Mike Segel

On May 10, 2012, at 6:33 AM, Eran Kutner <eran@gigya.com> wrote:

> Thanks Igal, but we already have that setting. These are the relevant
> setting from hdfs-site.xml :
>  <property>
>    <name>dfs.datanode.max.xcievers</name>
>    <value>65536</value>
>  </property>
>  <property>
>    <name>dfs.datanode.handler.count</name>
>    <value>10</value>
>  </property>
>  <property>
>    <name>dfs.datanode.socket.write.timeout</name>
>    <value>0</value>
>  </property>
> 
> Other ideas?
> 
> -eran
> 
> 
> 
> On Thu, May 10, 2012 at 12:25 PM, Igal Shilman <igals@wix.com> wrote:
> 
>> Hi Eran,
>> Do you have: dfs.datanode.socket.write.timeout set in hdfs-site.xml ?
>> (We have set this to zero in our cluster, which means waiting as long as
>> necessary for the write to complete)
>> 
>> Igal.
>> 
>> On Thu, May 10, 2012 at 11:17 AM, Eran Kutner <eran@gigya.com> wrote:
>> 
>>> Hi,
>>> We're seeing occasional regionserver crashes during heavy write
>> operations
>>> to Hbase (at the reduce phase of large M/R jobs). I have increased the
>> file
>>> descriptors, HDFS xceivers, HDFS threads to the recommended settings and
>>> actually way above.
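
(For context: "increased the file descriptors" normally means raising the OS open-file limit for the users running the DataNode and RegionServer processes. A minimal sketch of the usual /etc/security/limits.conf approach follows; the user names and values are assumptions, since the actual limits on this cluster are not stated.)

 # /etc/security/limits.conf: raise the open-file (nofile) limit for the
 # HDFS and HBase service users (names and values are assumptions)
 hdfs   -   nofile   32768
 hbase  -   nofile   32768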
>>> 
>>> Here is an example of the HBase log (showing only errors):
>>> 
>>> 2012-05-10 03:34:54,291 WARN org.apache.hadoop.hdfs.DFSClient:
>>> DFSOutputStream ResponseProcessor exception  for block
>>> blk_-8928911185099340956_5189425java.io.IOException: Bad response 1 for
>>> block blk_-8928911185099340956_5189425 from datanode 10.1.104.6:50010
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:
>>> 2986)
>>> 
>>> 2012-05-10 03:34:54,494 WARN org.apache.hadoop.hdfs.DFSClient:
>> DataStreamer
>>> Exception: java.io.InterruptedIOException: Interruped while waiting for
>> IO
>>> on channel java.nio.channels.SocketChannel[connected
>>> local=/10.1.104.9:59642 remote=/10.1.104.9:50010]. 0 millis timeout left.
>>>       at
>>> 
>>> 
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:349)
>>>       at
>>> 
>>> 
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
>>>       at
>>> 
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
>>>       at
>>> 
>> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
>>>       at
>> java.io.BufferedOutputStream.write(BufferedOutputStream.java:105)
>>>       at java.io.DataOutputStream.write(DataOutputStream.java:90)
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:
>>> 2848)
>>> 
>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_-8928911185099340956_5189425 bad datanode[2]
>>> 10.1.104.6:50010
>>> 2012-05-10 03:34:54,531 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> Recovery for block blk_-8928911185099340956_5189425 in pipeline
>>> 10.1.104.9:50010, 10.1.104.8:50010, 10.1.104.6:50010: bad datanode
>>> 10.1.104.6:50010
>>> 2012-05-10 03:48:30,174 FATAL
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> server
>>> serverName=hadoop1-s09.farm-ny.gigya.com,60020,1336476100422,
>>> load=(requests=15741, regions=789, usedHeap=6822, maxHeap=7983):
>>> regionserver:60020-0x2372c0e8a2f0008 regionserver:60020-0x2372c0e8a2f0008
>>> received expired from ZooKeeper, aborting
>>> org.apache.zookeeper.KeeperException$SessionExpiredException:
>>> KeeperErrorCode = Session expired
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>>>       at
>>> 
>>> 
>> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>>>       at
>>> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
>>> java.io.InterruptedIOException: Aborting compaction of store properties
>> in
>>> region
>>> 
>>> 
>> gs_users,6155551|QoCW/euBIKuMat/nRC5Xtw==,1334983658004.878522ea91f41cd76b903ea06ccd17f9.
>>> because user requested stop.
>>>       at
>>> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:998)
>>>       at
>>> org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:779)
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:776)
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:721)
>>>       at
>>> 
>>> 
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
>>> 
>>> 
>>> This is from 10.1.104.9 (same machine running the region server that
>>> crashed):
>>> 2012-05-10 03:31:16,785 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_-8928911185099340956_5189425 src: /10.1.104.9:59642 dest: /
>>> 10.1.104.9:50010
>>> 2012-05-10 03:35:39,000 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Connection reset
>>> 2012-05-10 03:35:39,052 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>> receiveBlock
>>> for block blk_-8928911185099340956_5189425
>>> java.nio.channels.ClosedByInterruptException
>>> 2012-05-10 03:35:39,053 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>> blk_-8928911185099340956_5189425 received exception java.io.IOException:
>>> Interrupted receiveBlock
>>> 2012-05-10 03:35:39,055 ERROR
>>> org.apache.hadoop.security.UserGroupInformation:
>> PriviledgedActionException
>>> as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
>>> blk_-8928911185099340956_5189425 length is 24384000 does not match block
>>> file length 24449024
>>> 2012-05-10 03:35:39,055 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 3 on 50020, call
>>> startBlockRecovery(blk_-8928911185099340956_5189425) from
>> 10.1.104.8:50251
>>> :
>>> error: java.io.IOException: Block blk_-8928911185099340956_5189425 length
>>> is 24384000 does not match block file length 24449024
>>> java.io.IOException: Block blk_-8928911185099340956_5189425 length is
>>> 24384000 does not match block file length 24449024
>>> 2012-05-10 03:35:39,077 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Broken pipe
>>> 2012-05-10 03:35:39,077 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,108 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,136 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,165 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,196 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,221 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,246 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,271 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 2012-05-10 03:35:39,296 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder
>>> blk_-8928911185099340956_5189425 2 Exception java.net.SocketException:
>>> Socket closed
>>> 
>>> This is the log from 10.1.104.6 datanode for
>>> "blk_-8928911185099340956_5189425":
>>> 2012-05-10 03:31:16,772 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
>>> blk_-8928911185099340956_5189425 src: /10.1.104.8:43828 dest: /
>>> 10.1.104.6:50010
>>> 2012-05-10 03:35:39,041 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
>> receiveBlock
>>> for block blk_-8928911185099340956_5189425 java.net.SocketException:
>>> Connection reset
>>> 2012-05-10 03:35:39,043 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>>> block blk_-8928911185099340956_5189425 Interrupted.
>>> 2012-05-10 03:35:39,043 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for
>>> block blk_-8928911185099340956_5189425 terminating
>>> 2012-05-10 03:35:39,043 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock
>>> blk_-8928911185099340956_5189425 received exception
>>> java.net.SocketException: Connection reset
>>> 
>>> 
>>> Any idea why is this happening?
>>> 
>>> Thanks.
>>> 
>>> -eran
>>> 
>> 
