hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: dfs incompatibility .3 and .4-dev?
Date Wed, 07 Jun 2006 22:47:23 GMT
Sorry, that should have read "now reporting healthy."  Everything is 
working after the restart.

Dennis Kubes wrote:
> All of the data nodes are there via bin/hadoop dfs -report.  One 
> interesting thing: I shut down via stop-all.sh again and restarted 
> via start-all.sh, and everything seems to be working.  I reran fsck and 
> everything is not reporting healthy.  I have not tried another fetch 
> yet, but a generate was successful, as were an updatedb and a readdb.  I 
> am seeing a lot of the errors below in the log, but I think those are 
> fixed by some of the recent patches.
>
> 2006-06-07 17:44:19,916 INFO org.apache.hadoop.dfs.DataNode: Lost 
> connection to namenode.  Retrying...
> 2006-06-07 17:44:24,920 INFO org.apache.hadoop.dfs.DataNode: 
> Exception: java.lang.IllegalThreadStateException
> 2006-06-07 17:44:24,921 INFO org.apache.hadoop.dfs.DataNode: Lost 
> connection to namenode.  Retrying...
> 2006-06-07 17:44:29,925 INFO org.apache.hadoop.dfs.DataNode: 
> Exception: java.lang.IllegalThreadStateException
> 2006-06-07 17:44:29,925 INFO org.apache.hadoop.dfs.DataNode: Lost 
> connection to namenode.  Retrying...
> 2006-06-07 17:44:34,929 INFO org.apache.hadoop.dfs.DataNode: 
> Exception: java.lang.IllegalThreadStateException
> 2006-06-07 17:44:34,929 INFO org.apache.hadoop.dfs.DataNode: Lost 
> connection to namenode.  Retrying...
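>
> For what it's worth, java.lang.IllegalThreadStateException is the 
> exception the JDK throws when Thread.start() is called on a thread that 
> has already been started.  Whether that is what happens inside the 
> DataNode retry path here is only a guess on my part; the tiny class 
> below just reproduces the exception at the JDK level and is not 
> DataNode code.
>
> public class RestartDemo {
>     public static void main(String[] args) throws InterruptedException {
>         Thread worker = new Thread(new Runnable() {
>             public void run() { System.out.println("connected"); }
>         });
>         worker.start();   // first start is fine
>         worker.join();    // the thread terminates
>         try {
>             // starting the SAME Thread object again is illegal
>             worker.start();
>         } catch (IllegalThreadStateException e) {
>             System.out.println("caught: " + e);
>         }
>     }
> }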
>
> Dennis
>
> Konstantin Shvachko wrote:
>> That might be the same problem.
>> Related changes to hadoop were committed just an hour before your 
>> initial email, so they are probably not in nutch yet.
>> Although "exactly one block missing in each file" looks suspicious.
>> Try
>> bin/hadoop dfs -report
>> to see how many data nodes you have now.
>> If all of them are reported then this is different.
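>>
>> If it helps to script that check, the head count can also be taken 
>> programmatically.  The sketch below is only an illustration; the 
>> package names and method signatures are from memory and may not match 
>> this exact release, so treat it as the idea rather than code that is 
>> guaranteed to compile against .3/.4.
>>
>> import java.io.IOException;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.hdfs.DistributedFileSystem;
>> import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
>>
>> public class DatanodeCount {
>>     public static void main(String[] args) throws IOException {
>>         // assumes the cluster configuration is on the classpath
>>         Configuration conf = new Configuration();
>>         FileSystem fs = FileSystem.get(conf);
>>         DistributedFileSystem dfs = (DistributedFileSystem) fs;
>>         // one entry per data node known to the name node
>>         DatanodeInfo[] nodes = dfs.getDataNodeStats();
>>         System.out.println(nodes.length + " data nodes reporting:");
>>         for (int i = 0; i < nodes.length; i++) {
>>             System.out.println("  " + nodes[i].getHostName());
>>         }
>>     }
>> }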
>>
>> --Konstantin
>>
>> Dennis Kubes wrote:
>>
>>> Another interesting thing is that every single file is corrupt and 
>>> missing exactly one block.
>>>
>>> Dennis Kubes wrote:
>>>
>>>> I don't know if this is the same problem or not, but here is what I 
>>>> am experiencing.
>>>>
>>>> I have an 11-node cluster with a fresh nutch install on hadoop .3.1.  
>>>> Startup completed fine.  Filesystem healthy.  Performed 1st inject, 
>>>> generate, fetch for 1000 urls.  Filesystem intact.  Performed 2nd 
>>>> inject, generate, fetch for 1000 urls.  Filesystem healthy.  Merged 
>>>> crawldbs.  Filesystem healthy.  Merged segments.  Filesystem 
>>>> healthy.  Inverted links.  Healthy.  Indexed.  Healthy.  Performed 
>>>> searches.  Healthy.  Now here is where it gets interesting.  Shut down 
>>>> all servers via stop-all.sh.  Started all servers via start-all.sh.  
>>>> Filesystem reports healthy.  Performed inject and generate of 1000 
>>>> urls.  Filesystem reports healthy.  Performed fetch of the new 
>>>> segments and got the errors below and a fully corrupted filesystem 
>>>> (both new segments and old data).
>>>>
>>>> java.io.IOException: Could not obtain block: blk_6625125900957460239 file=/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006 offset=0
>>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:529)
>>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:638)
>>>>     at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:84)
>>>>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:159)
>>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>>>>     at java.io.DataInputStream.readFully(DataInputStream.java:176)
>>>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:263)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:247)
>>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:237)
>>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:36)
>>>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:53)
>>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:105)
>>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:847)
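>>>>
>>>> If someone wants to poke at one of those part files outside of a 
>>>> fetch job, the read path in the trace boils down to opening the file 
>>>> as a SequenceFile, roughly like the sketch below.  The class is 
>>>> illustrative only, and the constructor and package details may 
>>>> differ a bit from this release; the point is that the "Could not 
>>>> obtain block" IOException surfaces as soon as the reader touches the 
>>>> missing block.
>>>>
>>>> import java.io.IOException;
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.FileSystem;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.SequenceFile;
>>>> import org.apache.hadoop.io.Writable;
>>>> import org.apache.hadoop.util.ReflectionUtils;
>>>>
>>>> public class PartFileCheck {
>>>>     public static void main(String[] args) throws IOException {
>>>>         Configuration conf = new Configuration();
>>>>         FileSystem fs = FileSystem.get(conf);
>>>>         Path part = new Path(
>>>>             "/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006");
>>>>         // the constructor already reads the file header, so a missing
>>>>         // first block fails right here with "Could not obtain block"
>>>>         SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
>>>>         try {
>>>>             Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
>>>>             Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
>>>>             long records = 0;
>>>>             while (reader.next(key, value)) {
>>>>                 records++;
>>>>             }
>>>>             System.out.println(records + " records read");
>>>>         } finally {
>>>>             reader.close();
>>>>         }
>>>>     }
>>>> }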
>>>>
>>>> Hope this helps in tracking down the problem if it is even the same 
>>>> problem.
>>>>
>>>> Dennis
>>>>
>>>> Konstantin Shvachko wrote:
>>>>
>>>>> Thanks Stefan.
>>>>>
>>>>> I spent some time investigating the problem.
>>>>> There are actually three of them.
>>>>> 1). At startup, data nodes now register with the name node.  If 
>>>>> registering doesn't work, because the name node is busy at the 
>>>>> moment, which could easily be the case if it is loading a two-week 
>>>>> log, then the data node just fails and won't start at all.
>>>>> See HADOOP-282.
>>>>> 2). When the cluster is running and the name node gets busy, and 
>>>>> the data node as a result fails to connect to it, the data node 
>>>>> falls into an infinite loop doing nothing but throwing an 
>>>>> exception.  So for the name node it is dead, since it is not 
>>>>> sending any heartbeats.
>>>>> See HADOOP-285; a hypothetical sketch of a bounded alternative is 
>>>>> included below.
>>>>> 3). People say that they have seen loss of recent data, while the 
>>>>> old data is still present, and this happens when the cluster is 
>>>>> brought down (for the upgrade) and restarted.
>>>>> We know from HADOOP-227 that the logs/edits accumulate as long as 
>>>>> the cluster is running.  So if it was up for 2 weeks then the edits 
>>>>> file is most probably huge, and if it is corrupted then the data is 
>>>>> lost.
>>>>> I could not reproduce that; I just don't have any 2-week-old edits 
>>>>> files yet.
>>>>> I thoroughly examined one cluster and found missing blocks on the 
>>>>> nodes that pretended to be up as in (2) above.  I didn't see any 
>>>>> data loss at all.  I think large edits files should be investigated 
>>>>> further.
>>>>>
>>>>> There are patches fixing HADOOP-282 and HADOOP-285.  We do not have 
>>>>> a patch for HADOOP-227 yet, so people need to restart the name node 
>>>>> (just the name node) depending on the activity on the cluster, 
>>>>> namely depending on the size of the edits file.
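>>>>>
>>>>> To make (2) concrete, the shape of the fix is roughly: bound the 
>>>>> retries, wait between attempts, build a completely fresh connection 
>>>>> (and worker thread, where one is involved) on every attempt, and 
>>>>> give up eventually so the node dies visibly instead of spinning.  
>>>>> The sketch below is hypothetical (none of the names are real 
>>>>> DataNode code, and it is not the HADOOP-285 patch); it only 
>>>>> illustrates the idea.
>>>>>
>>>>> import java.io.IOException;
>>>>>
>>>>> public class ReconnectLoop {
>>>>>
>>>>>     // placeholder types for this sketch only
>>>>>     public interface Connection {}
>>>>>     public interface ConnectionFactory {
>>>>>         Connection create() throws IOException;
>>>>>     }
>>>>>
>>>>>     private static final int MAX_RETRIES = 10;
>>>>>     private static final long RETRY_WAIT_MSEC = 5000;
>>>>>
>>>>>     /** Try to (re)establish the namenode connection a bounded number of times. */
>>>>>     public static Connection reconnect(ConnectionFactory factory) throws IOException {
>>>>>         IOException last = null;
>>>>>         for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
>>>>>             try {
>>>>>                 // build a brand-new connection each time; re-starting an
>>>>>                 // old Thread is what produces IllegalThreadStateException
>>>>>                 return factory.create();
>>>>>             } catch (IOException e) {
>>>>>                 last = e;
>>>>>                 System.out.println("Lost connection to namenode.  Retrying... (attempt "
>>>>>                         + attempt + ")");
>>>>>                 try {
>>>>>                     Thread.sleep(RETRY_WAIT_MSEC);
>>>>>                 } catch (InterruptedException ie) {
>>>>>                     break;
>>>>>                 }
>>>>>             }
>>>>>         }
>>>>>         // give up instead of looping forever while sending no heartbeats
>>>>>         if (last == null) {
>>>>>             last = new IOException("could not reconnect to namenode");
>>>>>         }
>>>>>         throw last;
>>>>>     }
>>>>> }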
>>>>>
>>>>>
>>>>> Stefan Groschupf wrote:
>>>>>
>>>>>> Hi Konstantin,
>>>>>>
>>>>>>> Could you give some more information about what happened to you.
>>>>>>> - what is your cluster size
>>>>>>
>>>>>>
>>>>>> 9 datanodes, 1 namenode.
>>>>>>
>>>>>>> - amount of data
>>>>>>
>>>>>>
>>>>>> Total raw bytes: 6023680622592 (5609.98 Gb)
>>>>>> Used raw bytes: 2357053984804 (2195.17 Gb)
>>>>>>
>>>>>>> - how long did dfs run without restarting the name node before 
>>>>>>> upgrading
>>>>>>
>>>>>>
>>>>>> I would say 2 weeks.
>>>>>>
>>>>>>>> I would love to figure out what was my problem today. :)
>>>>>>>
>>>>>>>
>>>>>>> we discussed the three kinds of data losses: hardware, software, 
>>>>>>> or human errors.
>>>>>>>
>>>>>>> Looks like you are not alone :-(
>>>>>>
>>>>>>
>>>>>> Too bad the others didn't report it earlier. :)
>>>>>
>>>>>
>>>>> Everything was happening at the same time.
>>>>>
>>>>>>>> + updated from hadoop .2.1 to .4.
>>>>>>>> + problems to get all datanodes started
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> what was the problem with datanodes?
>>>>>>
>>>>>>
>>>>>> Scenario:
>>>>>> I don't think there was a real problem.  I noticed that the 
>>>>>> datanodes were not able to connect to the namenode.
>>>>>> Later on I just added a "sleep 5" to the dfs start script right 
>>>>>> after starting the name node, and that solved the problem.
>>>>>
>>>>>
>>>>> That is right, we did the same.
>>>>>
>>>>>> However, at the time I updated, I noticed that problem, thought 
>>>>>> "ok, not working yet, let's wait another week", and downgraded.
>>>>>
>>>>>
>>>>>>>> + downgrade to hadoop .3.1
>>>>>>>> + error message of incompatible dfs (I guess . already had 
>>>>>>>> started to write to the log)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> What is the message?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sorry, I cannot find the exception in the logs anymore. :-( 
>>>>>> Something like "version conflict -1 vs -2" :-o Sorry, I don't 
>>>>>> remember exactly.
>>>>>
>>>>>
>>>>> Yes. You are running the old version (-1) code that would not 
>>>>> accept the "future" version (-2) images.
>>>>> The image was converted to v. -2 when you tried to run the 
>>>>> upgraded hadoop.
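>>>>>
>>>>> In spirit, the check that produces a message like that is just a 
>>>>> comparison of the layout version stored in the image against the 
>>>>> version the running code understands.  The sketch below uses 
>>>>> made-up names (it is not the actual DFS code) and only shows the 
>>>>> shape of the check; as in the message above, versions are negative 
>>>>> and newer formats get smaller numbers.
>>>>>
>>>>> import java.io.IOException;
>>>>>
>>>>> public class ImageVersionCheck {
>>>>>     // what the old (.3) code understands
>>>>>     static final int SUPPORTED_LAYOUT_VERSION = -1;
>>>>>
>>>>>     static void checkVersion(int storedLayoutVersion) throws IOException {
>>>>>         if (storedLayoutVersion < SUPPORTED_LAYOUT_VERSION) {
>>>>>             // a .3 name node reading an image already converted by .4 lands here
>>>>>             throw new IOException("Incompatible DFS image: version conflict "
>>>>>                     + SUPPORTED_LAYOUT_VERSION + " vs " + storedLayoutVersion);
>>>>>         }
>>>>>     }
>>>>>
>>>>>     public static void main(String[] args) throws IOException {
>>>>>         checkVersion(-2);   // throws: the old code refuses the "future" -2 image
>>>>>     }
>>>>> }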
>>>>>
>>>>> Regards,
>>>>> Konstantin
>>>>
>>>>
>>>
>>>
>>>
>>
