hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: DFSClient write error when DN down
Date Fri, 04 Dec 2009 17:04:23 GMT
On Fri, Dec 4, 2009 at 12:01 PM, Arvind Sharma <arvind321@yahoo.com> wrote:
> Thanks Todd !
>
> Just wanted another confirmation I guess :-)
>
> Arvind
>
>
>
>
> ________________________________
> From: Todd Lipcon <todd@cloudera.com>
> To: common-user@hadoop.apache.org
> Sent: Fri, December 4, 2009 8:35:56 AM
> Subject: Re: DFSClient write error when DN down
>
> Hi Arvind,
>
> Looks to me like you've identified the JIRAs that are causing this.
> Hopefully they will be fixed soon.
>
> -Todd
>
> On Fri, Dec 4, 2009 at 4:43 AM, Arvind Sharma <arvind321@yahoo.com> wrote:
>
>> Any suggestions would be welcome :-)
>>
>> Arvind
>>
>> ________________________________
>> From: Arvind Sharma <arvind321@yahoo.com>
>> To: common-user@hadoop.apache.org
>> Sent: Wed, December 2, 2009 8:02:39 AM
>> Subject: DFSClient write error when DN down
>>
>>
>>
>> I have seen similar error logs in the Hadoop JIRA (HADOOP-2691, HDFS-795),
>> but I am not sure this one is exactly the same scenario.
>>
>> Hadoop - 0.19.2
>>
>> The client-side DFSClient fails to write when a few of the DataNodes in the
>> grid go down.  I see this error:
>>
>> ***************************
>>
>> 2009-11-13 13:45:27,815 WARN DFSClient | DFSOutputStream ResponseProcessor exception for block blk_3028932254678171367_1462691 java.io.IOException: Bad response 1 for block blk_3028932254678171367_1462691 from datanode 10.201.9.225:50010
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2341)
>> 2009-11-13 13:45:27,815 WARN DFSClient | Error Recovery for block blk_3028932254678171367_1462691 bad datanode[2] 10.201.9.225:50010
>> 2009-11-13 13:45:27,815 WARN DFSClient | Error Recovery for block blk_3028932254678171367_1462691 in pipeline 10.201.9.218:50010, 10.201.9.220:50010, 10.201.9.225:50010: bad datanode 10.201.9.225:50010
>> 2009-11-13 13:45:37,433 WARN DFSClient | DFSOutputStream ResponseProcessor exception for block blk_-6619123912237837733_1462799 java.io.IOException: Bad response 1 for block blk_-6619123912237837733_1462799 from datanode 10.201.9.225:50010
>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2341)
>> 2009-11-13 13:45:37,433 WARN DFSClient | Error Recovery for block blk_-6619123912237837733_1462799 bad datanode[1] 10.201.9.225:50010
>> 2009-11-13 13:45:37,433 WARN DFSClient | Error Recovery for block blk_-6619123912237837733_1462799 in pipeline 10.201.9.218:50010, 10.201.9.225:50010: bad datanode 10.201.9.225:50010
>>
>>
>> ***************************
>>
>> The only way I could get my client program to write successfully to DFS
>> was to restart it.
>>
>> Any suggestions on how to get around this problem on the client side?  As I
>> understood it, the DFSClient API takes care of situations like this, so
>> clients should not need to worry when some of the DataNodes go down.
>>
>> Also, the replication factor is 3 in my setup and there are 10 DataNodes
>> (out of which TWO went down).
>>
>>
>> Thanks!
>> Arvind
>>
>

I will give you another confirmation...

This has happened on my dev cluster (5 nodes), which was running 0.18.3 at
the time. Replication was set to 3, and 2 nodes went down. I did not look
into this very deeply. My hunch was that new files had been created by a
map/reduce program and were replicated only to the two nodes that went down.
This caused the job to die, and the filesystem was not 'right' until I
brought the two DataNodes back online. fsck did not think anything was wrong
with the filesystem, and everything that did not touch the parent Paths the
files were in was fine. However, M/R jobs that tried to use the parent Path
failed. I tried restarting the NameNode and all the DataNodes, to no avail.
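
For what it's worth, if you want to see where a file's blocks actually
landed, something like the following should work (off the top of my head and
untested, so treat it as a sketch; the class name and argument handling are
just placeholders). It prints roughly what "hadoop fsck <path> -files
-blocks -locations" would tell you:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    // Assumes the config on the classpath points fs.default.name at your NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path p = new Path(args[0]);                       // file to inspect
    FileStatus status = fs.getFileStatus(p);
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    for (int i = 0; i < blocks.length; i++) {
      StringBuilder hosts = new StringBuilder();
      for (String h : blocks[i].getHosts()) {
        hosts.append(h).append(' ');
      }
      // If a block only lists the dead nodes here, that block is unreadable
      // until one of those DataNodes comes back.
      System.out.println("block " + i + " on: " + hosts);
    }
    fs.close();
  }
}

If every replica of some block sits on the dead nodes, that block is what
makes the parent Path unusable until those nodes come back.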

In that case, I just did whatever I could to bring those DataNodes back up.
Even if you can only bring them up without much storage, the act of bringing
them up cleared the issue. Sorry for the off-the-cuff, unconfirmed
description; this only happened to me once, so I never looked into it again.
If it is any consolation, signs point to it not happening very often.
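
On Arvind's question about handling this on the client side without
restarting the whole process: until those JIRAs are fixed, the only
workaround I can think of is to give up on the broken stream and re-create
the file from scratch. A rough, untested sketch (it assumes your writer can
regenerate the file's contents; the class name, retry count, and backoff
numbers are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RetryingWriter {
  // Re-create and re-write the whole file on failure instead of
  // restarting the client process. MAX_TRIES and the sleep are arbitrary.
  private static final int MAX_TRIES = 3;

  public static void writeWithRetry(FileSystem fs, Path p, byte[] data)
      throws IOException, InterruptedException {
    IOException last = null;
    for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
      FSDataOutputStream out = null;
      try {
        out = fs.create(p, true);    // overwrite any partial file
        out.write(data);
        out.close();                 // close() can also fail on pipeline errors
        return;
      } catch (IOException e) {
        last = e;
        if (out != null) {
          try { out.close(); } catch (IOException ignored) {}
        }
        Thread.sleep(5000L * attempt);   // back off before retrying
      }
    }
    throw last;
  }

  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    writeWithRetry(fs, new Path(args[0]), "hello".getBytes());
    fs.close();
  }
}

That obviously only helps when the whole file can be rewritten; it will not
resurrect a write pipeline that has already lost too many DataNodes.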

Edward
