hbase-user mailing list archives

From Jeremy Carroll <phobos...@gmail.com>
Subject Re: Terribly long HDFS timeouts while appending to HLog
Date Wed, 07 Nov 2012 15:25:26 GMT
Sorry. It's early in the morning here. Did not see the 'read timeout'. +1
to Nicolas's response.

On Wed, Nov 7, 2012 at 7:22 AM, Jeremy Carroll <phobos182@gmail.com> wrote:

> One trick I have used for a while is to set
> dfs.datanode.socket.write.timeout to 0 (disabled) in hdfs-site.xml.
> It is not going to solve the underlying IOPS capacity issue on your
> servers; basically it hides the real problem, but it can help through
> short bursty periods.
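
A minimal hdfs-site.xml sketch of the tweak described above; only the
property name and the value 0 come from the thread, the surrounding XML
is boilerplate:

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- 0 disables the datanode write timeout; it hides slow-pipeline
         stalls rather than fixing the underlying IOPS shortage -->
    <value>0</value>
  </property>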
>
>
> On Wed, Nov 7, 2012 at 1:43 AM, Varun Sharma <varun@pinterest.com> wrote:
>
>> Hi,
>>
>> I am seeing extremely long HDFS timeouts - and this seems to be associated
>> with the loss of a DataNode. Here is the RS log:
>>
>> 12/11/07 02:17:45 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor
>> exception  for block blk_2813460962462751946_78454java.io.IOException: Bad
>> response 1 for block blk_2813460962462751946_78454 from datanode
>> 10.31.190.107:9200
>>         at
>>
>> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3084)
>>
>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>> blk_2813460962462751946_78454 bad datanode[1] 10.31.190.107:9200
>> 12/11/07 02:17:45 WARN hdfs.DFSClient: Error Recovery for block
>> blk_2813460962462751946_78454 in pipeline 10.31.138.245:9200,
>> 10.31.190.107:9200, 10.159.19.90:9200: bad datanode 10.31.190.107:9200
>> 12/11/07 02:17:45 WARN wal.HLog: IPC Server handler 35 on 60020 took 65955
>> ms appending an edit to hlog; editcount=476686, len~=76.0
>> 12/11/07 02:17:45 WARN wal.HLog: HDFS pipeline error detected. Found 2
>> replicas but expecting no less than 3 replicas.  Requesting close of hlog.
>>
>> The corresponding DN log looks like this:
>>
>> 2012-11-07 02:17:45,142 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode (PacketResponder 2 for
>> Block blk_2813460962462751946_78454): PacketResponder
>> blk_2813460962462751946_78454 2 Exception java.net.SocketTimeoutException:
>> 66000 millis timeout while waiting for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.31.138.245:33965
>> remote=/
>> 10.31.190.107:9200]
>>         at
>>
>> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>         at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>         at
>> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>>         at
>>
>> org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>>         at
>>
>> org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:806)
>>         at java.lang.Thread.run(Thread.java:662)
>>
>> It seems like the DataNode local to the region server is waiting on
>> the ack from another DN in the write pipeline, and that wait is timing
>> out because the other datanode is bad. All in all this makes response
>> times terribly poor. Is there a way around this, or am I missing
>> something?
>>
>> Varun
>>
>
>
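
For reference, the 66000 ms waits in the DN log above are socket read
timeouts, which is the "read timeout" Jeremy mentions at the top of the
thread. Assuming a stock Hadoop 1.x setup (Nicolas's reply is not quoted
on this page, so this is an assumption), that value is governed by
dfs.socket.timeout in hdfs-site.xml, plus a small extension the DataNode
adds per downstream node in the pipeline:

  <property>
    <name>dfs.socket.timeout</name>
    <!-- default 60000 ms; the per-node pipeline extension is what turns
         this into the 66000 ms seen in the DN log -->
    <value>60000</value>
  </property>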
