Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Message-ID: <4A819D49.5010504@yahoo-inc.com>
Date: Tue, 11 Aug 2009 09:33:13 -0700
From: Raghu Angadi <rangadi@yahoo-inc.com>
User-Agent: Thunderbird 2.0.0.22 (X11/20090608)
MIME-Version: 1.0
To: common-user@hadoop.apache.org
Subject: Re: corrupt filesystem
References: <4A809A0D.7060704@casalemedia.com>
 <4A80A10E.8010700@yahoo-inc.com> <4A80A9F5.9000603@casalemedia.com>
In-Reply-To: <4A80A9F5.9000603@casalemedia.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit


Note that there are multiple log files (one for each day). Make sure you
search all the relevant days. You can also check the datanode logs for
this block.
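
If it helps, here is a minimal sketch for searching every daily log file
in one pass. The function name, the "*.log*" glob pattern and the log
directory are assumptions for illustration only; adjust them to however
your logs are actually named and rotated.

```python
import glob
import os


def find_block_lines(log_dir, block_id, pattern="*.log*"):
    """Return (file name, line number, line) for each log line mentioning block_id.

    The default pattern is an assumption about how the logs are named.
    """
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        if not os.path.isfile(path):
            continue
        # Logs may contain stray bytes; replace them rather than fail.
        with open(path, "r", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if block_id in line:
                    hits.append((os.path.basename(path), number, line.rstrip("\n")))
    return hits
```

For example, find_block_lines("/path/to/logs", "blk_1697509332927954816")
would list every matching line across all days. Leaving the trailing
"_8724" off the block name should also catch lines where the suffix
differs. The same function can be pointed at a datanode's log directory.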

HDFS writes to all three datanodes at the time you write the data. It is
possible that the other two datanodes also encountered errors.

That would have resulted in an error when you tried to copy, and such a
corrupt block should not even appear in HDFS. Did you restart the cluster
after copying? 0.18.3 has various fixes related to handling block
replication correctly.

Please include the complete log lines (at the end of your response); that
makes them simpler to interpret. Alternatively, you can file a JIRA and
attach the log files there.

Raghu.

Mayuran Yogarajah wrote:
> Hello,
> 
>> If you are interested, you could try to trace one of these block ids in
>> NameNode log to see what happened to it. We are always eager to hear about
>> irrecoverable errors. Please mention hadoop version you are using.
>>
>>   
> I'm using Hadoop 0.18.3.  I just checked namenode log for one of the bad 
> blocks. I see entries from Saturday saying:
> ask 1.1.1.6:50010 to replicate blk_1697509332927954816_8724 to 
> datanode(s) < all other data nodes >
> 
> I only loaded this data Saturday, and the .6 data node became full at 
> some point.
> When data is first loaded into the cluster, does the name node send the 
> data to as many nodes as
> it can to satisfy the replication factor, or does it send it to one node 
> and ask that node send it to others?
> 
> If it's the latter then it's possible that the block became corrupt when I 
> first loaded it to .6 (since it was full),
> and since it was designated to send the block to other nodes none of the 
> nodes would have a non-corrupt
> copy.
> 
> Raghu, please let me know what you think.
> 
> thanks,
> 
> M