hadoop-hdfs-user mailing list archives

From: Alex Kozlov <ale...@cloudera.com>
Subject: Re: Lots of Different Kind of Datanode Errors
Date: Fri, 04 Jun 2010 19:18:03 GMT
Hi Jeff,

Can you also check what your machines' swappiness is set to by running
'/sbin/sysctl vm.swappiness'?  HBase recommends setting it very low (0 or 5).
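
If it comes back high (many distros default to 60), a rough sketch of lowering
it looks like the following; the exact target value is your call:

  /sbin/sysctl vm.swappiness                                 # check the current value
  sudo /sbin/sysctl -w vm.swappiness=0                       # apply immediately
  echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf    # persist across reboots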

Alex K

On Fri, Jun 4, 2010 at 12:03 PM, Todd Lipcon <todd@cloudera.com> wrote:

> Hi Jeff,
>
> That seems like a reasonable config, but the error message you pasted
> indicated xceivers was set to 2048 instead of 4096.
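>
> If the hdfs-site.xml really does say 4096 on every datanode, it may simply be
> that the datanodes haven't been restarted since the change; with the stock
> scripts that's roughly (paths depend on your install):
>
>   $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode
>   $HADOOP_HOME/bin/hadoop-daemon.sh start datanode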
>
> Also, in my experience SocketTimeoutExceptions are usually due to swapping.
> Verify that your machines aren't swapping when you're under load.
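>
> A quick sanity check (standard Linux tools, nothing HBase-specific) is to watch
> swap activity while the cluster is under load, along these lines:
>
>   free -m      # the "used" figure on the Swap line should stay near zero
>   vmstat 5     # the si/so columns should stay at or near 0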
>
> BTW since this is hbase-related, may be better to move this to the hbase
> user list.
>
> -Todd
>
> On Fri, Jun 4, 2010 at 9:37 AM, Jeff Whiting <jeffw@qualtrics.com> wrote:
>
>>  I've tried to follow it as best I can.  I already increased the ulimit
>> to 32768.  This is what I now have in my hdfs-site.xml.  Am I missing
>> anything?
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>> <property>
>>   <name>dfs.data.dir</name>
>>   <value>/media/sdb,/media/sdc,/media/sdd</value>
>> </property>
>>
>>   <property>
>>     <name>dfs.replication</name>
>>     <value>3</value>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.max.xcievers</name>
>>     <value>4096</value>
>>   </property>
>>   <property>
>>     <name>dfs.datanode.handler.count</name>
>>     <value>10</value>
>>   </property>
>> </configuration>
>>
>>
>>
>> Todd Lipcon wrote:
>>
>> Hi Jeff,
>>
>>  Have you followed the HDFS configuration guide from the HBase wiki? You
>> need to bump up the transceiver count and probably the ulimit as well. It looks
>> like you already tuned it to 2048, but that isn't high enough if you're still
>> getting the "exceeds the limit" message.
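>>
>>  If the ulimit side is in doubt, one way to confirm what the running datanode
>> actually got is to look at its limits directly (Linux-specific; the pgrep
>> pattern here is just a guess at your process command line):
>>
>>   cat /proc/$(pgrep -f DataNode | head -1)/limits | grep 'open files'
>>
>> The usual way to raise it persistently is an /etc/security/limits.conf entry
>> for whichever user runs the datanode and regionserver, for example:
>>
>>   hadoop  soft  nofile  32768
>>   hadoop  hard  nofile  32768
>>
>> (the daemon user needs a fresh login, or the daemons a restart, before the new
>> limit takes effect).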
>>
>>  The EOF and Connection Reset messages occur when DFS clients disconnect
>> prematurely from a stream (probably due to xceiver errors on other streams).
>>
>>  -Todd
>>
>> On Fri, Jun 4, 2010 at 8:56 AM, jeff whiting <jeffw@qualtrics.com> wrote:
>>
>>> I had my HRegionServers go down due to an HDFS exception.  In the datanode
>>> logs I'm seeing a lot of different and varied exceptions.  I've now increased
>>> the data xceiver count, but the other errors don't make a lot of sense.
>>>
>>> Among them are:
>>>
>>> :2010-06-04 07:41:56,917 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.io.EOFException
>>> -       at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>> -       at
>>> org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>>> -       at
>>> org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>>> -       at org.apache.hadoop.io.Text.readString(Text.java:400)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>>> -       at java.lang.Thread.run(Thread.java:619)
>>>
>>>
>>> :2010-06-04 08:49:56,389 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.io.IOException: Connection reset by peer
>>> -       at sun.nio.ch.FileDispatcher.read0(Native Method)
>>> -       at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>> -       at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>>> -       at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>>> -       at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>> -       at
>>> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>> -       at
>>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>> -       at
>>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>
>>>
>>> :2010-06-04 05:36:54,840 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent
>>> xcievers 2047
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
>>> -       at java.lang.Thread.run(Thread.java:619)
>>>
>>> :2010-06-04 05:36:48,848 ERROR datanode.DataNode
>>> (DataXceiver.java:run(131)) - DatanodeRegistration(192.168.1.184:50010,
>>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>>> ipcPort=50020):DataXceiver
>>> -java.net.SocketTimeoutException: 480000 millis timeout while waiting for
>>> channel to be ready for write. ch :
>>> java.nio.channels.SocketChannel[connected local=/192.168.1.184:50010
>>> remote=/192.168.1.184:55349]
>>> -       at
>>> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>> -       at
>>> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>> -       at
>>> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>>> -       at
>>> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>>> -       at java.lang.Thread.run(Thread.java:619)
>>> --
>>>
>>> The EOFException is the most common one I get.  I'm also unsure how I
>>> would get a connection reset by peer when I'm connecting locally.  Why is
>>> the file prematurely ending? Any idea of what is going on?
>>>
>>> Thanks,
>>> ~Jeff
>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> jeffw@qualtrics.com
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>>
>> --
>> Jeff Whiting
>> Qualtrics Senior Software Engineer
>> jeffw@qualtrics.com
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
