From: Konstantin Shvachko
Date: Wed, 07 Jun 2006 15:38:40 -0700
To: hadoop-dev@lucene.apache.org
Subject: Re: dfs incompatibility .3 and .4-dev?
Message-ID: <44875570.10104@yahoo-inc.com>
In-Reply-To: <44875182.8000201@dragonflymc.com>

That might be the same problem. Related changes to hadoop were committed
just one hour before your initial email, so they are probably not in nutch
yet. Although "exactly one block missing in each file" looks suspicious.

Try bin/hadoop dfs -report to see how many data nodes you have now. If all
of them are reported, then this is a different problem.
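For example, something along these lines gives a quick count (just a
sketch: it assumes each data node shows up as a "Name:" line in the
-report output, so adjust the pattern to whatever your version actually
prints):

    #!/bin/sh
    # Compare the number of data nodes the name node currently reports
    # against the number you expect in the cluster.
    EXPECTED=11    # the number of data nodes you started (11 in Dennis' case)
    REPORTED=`bin/hadoop dfs -report | grep -c '^Name:'`
    echo "data nodes reported: $REPORTED (expected $EXPECTED)"
    if [ "$REPORTED" -lt "$EXPECTED" ]; then
        echo "some data nodes are missing; check their logs for registration errors"
    fi

If the count comes back short, the missing nodes are the ones to look at
first.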
--Konstantin

Dennis Kubes wrote:

> Another interesting thing is that every single file is corrupt and
> missing exactly one block.
>
> Dennis Kubes wrote:
>
>> I don't know if this is the same problem or not, but here is what I am
>> experiencing.
>>
>> I have an 11-node cluster and deployed a fresh nutch install with .3.1.
>> Startup completed fine. Filesystem healthy. Performed 1st inject,
>> generate, fetch for 1000 urls. Filesystem intact. Performed 2nd
>> inject, generate, fetch for 1000 urls. Filesystem healthy. Merged
>> crawldbs. Filesystem healthy. Merged segments. Filesystem healthy.
>> Inverted links. Healthy. Indexed. Healthy. Performed searches.
>> Healthy. Now here is where it gets interesting. Shut down all servers
>> via stop-all.sh. Started all servers via start-all.sh. Filesystem
>> reports healthy. Performed inject and generate of 1000 urls.
>> Filesystem reports healthy. Performed a fetch of the new segments and
>> got the errors below and a fully corrupted filesystem (both new
>> segments and old data).
>>
>> java.io.IOException: Could not obtain block: blk_6625125900957460239
>> file=/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006
>> offset=0
>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:529)
>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:638)
>>     at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:84)
>>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:159)
>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:176)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:263)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:247)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:237)
>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:36)
>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:53)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:105)
>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:847)
>>
>> Hope this helps in tracking down the problem, if it is even the same
>> problem.
>>
>> Dennis
>>
>> Konstantin Shvachko wrote:
>>
>>> Thanks Stefan.
>>>
>>> I spent some time investigating the problem. There are actually three
>>> of them.
>>>
>>> 1) At startup, data nodes now register with the name node. If
>>> registration doesn't work, because the name node is busy at the
>>> moment, which can easily be the case if it is loading a two-week-long
>>> log, then the data node just fails and won't start at all.
>>> See HADOOP-282.
>>>
>>> 2) When the cluster is running and the name node gets busy, and the
>>> data node as a result fails to connect to it, the data node falls
>>> into an infinite loop, doing nothing but throwing an exception. So
>>> for the name node it is dead, since it is not sending any heartbeats.
>>> See HADOOP-285.
>>>
>>> 3) People say that they have seen loss of recent data while the old
>>> data is still present, and that this happened when the cluster was
>>> brought down (for the upgrade) and restarted. We know from HADOOP-227
>>> that logs/edits accumulate as long as the cluster is running, so if
>>> it was up for two weeks then the edits file is most probably huge. If
>>> it is corrupted, then the data is lost. I could not reproduce that; I
>>> just don't have any two-week-old edits files yet. I thoroughly
>>> examined one cluster and found missing blocks on the nodes that
>>> pretended to be up, as in (2) above. I didn't see any data loss at
>>> all. I think large edits files should be investigated further.
>>>
>>> There are patches fixing HADOOP-282 and HADOOP-285. We do not have a
>>> patch for HADOOP-227 yet, so people need to restart the name node
>>> (just the name node) periodically, depending on the activity on the
>>> cluster, namely on the size of the edits file.
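As a rough illustration of that HADOOP-227 workaround, something like the
script below could watch the edits log and bounce just the name node when
it gets too big. This is only a sketch: the location and name of the edits
file under dfs.name.dir, the 512 MB threshold, and the use of
bin/hadoop-daemon.sh are assumptions to adapt to your own setup.

    #!/bin/sh
    # Restart only the name node once its edits log grows past a threshold,
    # so the accumulated edits get folded back into the image at startup.
    NAME_DIR=/path/to/dfs/name          # set this to your dfs.name.dir
    LIMIT=536870912                     # about 512 MB; pick what fits your cluster
    SIZE=`ls -l "$NAME_DIR/edits" 2>/dev/null | awk '{print $5}'`
    [ -z "$SIZE" ] && SIZE=0
    if [ "$SIZE" -gt "$LIMIT" ]; then
        bin/hadoop-daemon.sh stop namenode
        bin/hadoop-daemon.sh start namenode
    fi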
>>> Stefan Groschupf wrote:
>>>
>>>> Hi Konstantin,
>>>>
>>>>> Could you give some more information about what happened to you?
>>>>> - what is your cluster size
>>>>
>>>> 9 datanodes, 1 namenode.
>>>>
>>>>> - amount of data
>>>>
>>>> Total raw bytes: 6023680622592 (5609.98 Gb)
>>>> Used raw bytes: 2357053984804 (2195.17 Gb)
>>>>
>>>>> - how long did dfs run without restarting the name node before
>>>>> upgrading
>>>>
>>>> I would say 2 weeks.
>>>>
>>>>>> I would love to figure out what was my problem today. :)
>>>>>
>>>>> we discussed the three kinds of data losses: hardware, software, or
>>>>> human errors.
>>>>>
>>>>> Looks like you are not alone :-(
>>>>
>>>> Too bad that the others didn't report it earlier. :)
>>>
>>> Everything was happening at the same time.
>>>
>>>>>> + updated from hadoop .2.1 to .4.
>>>>>> + problems getting all datanodes started
>>>>>
>>>>> what was the problem with the datanodes?
>>>>
>>>> Scenario:
>>>> I don't think there was a real problem. I noticed that the datanodes
>>>> were not able to connect to the namenode. Later on I just added a
>>>> "sleep 5" to the dfs start script after starting the name node, and
>>>> that solved the problem.
>>>
>>> That is right, we did the same.

(A sketch of that 'sleep 5' workaround is at the end of this message.)

>>>> However, at the time I updated, I noticed that problem, thought "ok,
>>>> not working yet, let's wait another week", and downgraded.
>>>
>>>>>> + downgrade to hadoop .3.1
>>>>>> + error message of incompatible dfs (I guess .4 had already
>>>>>> started to write to the log)
>>>>>
>>>>> What is the message?
>>>>
>>>> Sorry, I cannot find the exception in the logs anymore. :-(
>>>> Something like "version conflict -1 vs -2" :-o Sorry, I don't
>>>> remember exactly.
>>>
>>> Yes. You are running the old-version (-1) code, which will not accept
>>> the "future" version (-2) images. The image was converted to v. -2
>>> when you tried to run the upgraded hadoop.
>>>
>>> Regards,
>>> Konstantin
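To make the "sleep 5" workaround Stefan describes above concrete, here is
roughly the idea; only a sketch, and the daemon script names
(bin/hadoop-daemon.sh, bin/hadoop-daemons.sh) may differ between versions,
so adapt it to whatever your start-all.sh already does:

    #!/bin/sh
    # Start the name node first, give it a moment to finish loading the
    # image and edits, then start the data nodes so their registration
    # does not race a busy name node (the HADOOP-282 symptom).
    bin/hadoop-daemon.sh start namenode
    sleep 5
    bin/hadoop-daemons.sh start datanode   # runs on the hosts listed in conf/slaves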