hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: dfs incompatibility .3 and .4-dev?
Date Wed, 07 Jun 2006 22:21:54 GMT
Another interesting thing is that every single file is corrupt and 
missing exactly one block.
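
A quick way to check for that pattern is to walk the filesystem and try to 
read every file end to end. Here is a minimal sketch of that, written 
against the later FileSystem API (listStatus/isDirectory are assumptions 
here; the .3/.4 method names differed), so treat it as an illustration 
rather than code that compiles against these releases:

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ScanForBadFiles {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    scan(fs, new Path(args.length > 0 ? args[0] : "/"));
  }

  // Recurse through the DFS and report every file that cannot be read
  // all the way through (a missing block surfaces as an IOException).
  static void scan(FileSystem fs, Path dir) throws IOException {
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDirectory()) {
        scan(fs, stat.getPath());
        continue;
      }
      byte[] buf = new byte[64 * 1024];
      try (InputStream in = fs.open(stat.getPath())) {
        while (in.read(buf) != -1) {
          // drain the file; a bad block throws mid-read
        }
      } catch (IOException e) {
        System.out.println("corrupt: " + stat.getPath()
            + " (" + e.getMessage() + ")");
      }
    }
  }
}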

Dennis Kubes wrote:
> I don't know if this is the same problem or not but here is what I am 
> experiencing.
>
> I have an 11 node cluster with a fresh nutch install on .3.1.  
> Startup completed fine.  Filesystem healthy.  Performed 1st inject, 
> generate, fetch for 1000 urls.  Filesystem intact.  Performed 2nd 
> inject, generate, fetch for 1000 urls.  Filesystem healthy.  Merged 
> crawldbs.  Filesystem healthy.  Merged segments.  Filesystem healthy.  
> Inverted links.  Healthy.  Indexed.  Healthy.  Performed searches. 
> Healthy.  Now here is where it gets interesting.  Shut down all servers 
> via stop-all.sh.  Started all servers via start-all.sh.  Filesystem 
> reports healthy.  Performed inject and generate of 1000 urls.  
> Filesystem reports healthy.  Performed fetch of the new segments and 
> got the errors below and a fully corrupted filesystem (both the new 
> segments and the old data).
>
> java.io.IOException: Could not obtain block: blk_6625125900957460239 file=/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006 offset=0
>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:529)
>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:638)
>     at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:84)
>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:159)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>     at java.io.DataInputStream.readFully(DataInputStream.java:176)
>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:263)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:247)
>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:237)
>     at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:36)
>     at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:53)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:105)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:847)
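
One way to confirm the failure is in DFS itself rather than in the fetch 
job is to open the same part file directly with SequenceFile.Reader; the 
init() at the bottom of the trace runs inside the constructor, so a 
missing block fails right there. A minimal sketch, using the path from 
the trace and the old (fs, path, conf) constructor form:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class OpenPart {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path part = new Path(
        "/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006");
    // The reader parses the file header in its constructor, so
    // "Could not obtain block" is thrown here if the block is gone.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
    System.out.println("opened OK, key class = " + reader.getKeyClassName());
    reader.close();
  }
}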
>
> Hope this helps in tracking down the problem if it is even the same 
> problem.
>
> Dennis
>
> Konstantin Shvachko wrote:
>> Thanks Stefan.
>>
>> I spent some time investigating the problem.
>> There are actually 3 of them.
>> 1). At startup, data nodes now register with the name node. If 
>> registration doesn't work, because the name node is busy at the 
>> moment, which could easily be the case if it is loading a 
>> two-week-long log, then the data node just fails and won't start at all.
>> See HADOOP-282.
>> 2). When the cluster is running and the name node gets busy, and the 
>> data node as a result fails to connect to it, the data node falls 
>> into an infinite loop doing nothing but throwing an exception. So to 
>> the name node it is dead, since it is not sending any heartbeats.
>> See HADOOP-285.
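
The fix that HADOOP-285 points at amounts to replacing the tight rethrow 
loop with a bounded retry and backoff. A rough sketch of the idea; 
connectToNameNode() here is a stand-in for the real registration/heartbeat 
RPC, not an actual Hadoop method:

import java.io.IOException;

public class RetryConnect {
  // Retry the connection with exponential backoff instead of spinning,
  // so a temporarily busy name node doesn't look like a dead data node.
  static void connectWithBackoff(int maxRetries)
      throws IOException, InterruptedException {
    long backoffMs = 1000;
    for (int attempt = 1; ; attempt++) {
      try {
        connectToNameNode();      // hypothetical RPC to the name node
        return;                   // connected; heartbeats can resume
      } catch (IOException e) {
        if (attempt >= maxRetries) {
          throw e;                // give up only after maxRetries tries
        }
        Thread.sleep(backoffMs);  // wait instead of looping on the error
        backoffMs = Math.min(backoffMs * 2, 60000L);
      }
    }
  }

  // Placeholder for the data node's registration call.
  static void connectToNameNode() throws IOException {
    throw new IOException("name node busy");
  }
}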
>> 3). People say that they have seen loss of recent data, while the old 
>> data is still present. And this happens when the cluster is brought 
>> down (for the upgrade) and restarted.
>> We know from HADOOP-227 that the edits log accumulates as long as the 
>> cluster is running.
>> So if it was up for 2 weeks then the edits file is most probably 
>> huge. If it is corrupted then the data is lost.
>> I could not reproduce that; I just don't have any 2-week-old edits 
>> files yet.
>> I thoroughly examined one cluster and found missing blocks on the 
>> nodes that pretended to be up as in (2) above. I didn't see any data 
>> loss at all. I think large edits files should be investigated 
>> further.
>>
>> There are patches fixing HADOOP-282 and HADOOP-285. We do not have a 
>> patch for HADOOP-227 yet, so people need to restart the name node 
>> (just the name node) depending on the activity on the cluster, 
>> namely on the size of the edits file.
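
Until there is a HADOOP-227 patch, a crude watchdog on the edits file can 
tell you when that restart is due. A sketch, assuming the edits file sits 
directly under dfs.name.dir as in this era's layout; the default path and 
the 100 MB threshold below are made up for illustration:

import java.io.File;

public class EditsSizeCheck {
  public static void main(String[] args) {
    // First argument: the value of dfs.name.dir (default here is invented).
    File nameDir = new File(args.length > 0 ? args[0]
        : "/tmp/hadoop/dfs/name");
    File edits = new File(nameDir, "edits");
    long limit = 100L * 1024 * 1024;  // arbitrary 100 MB threshold
    if (edits.length() > limit) {
      System.out.println("edits is " + edits.length()
          + " bytes; consider restarting the name node");
    }
  }
}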
>>
>>
>> Stefan Groschupf wrote:
>>
>>> Hi Konstantin,
>>>
>>>> Could you give some more information about what happened to you.
>>>> - what is your cluster size
>>>
>>> 9 datanodes, 1 namenode.
>>>
>>>> - amount of data
>>>
>>> Total raw bytes: 6023680622592 (5609.98 Gb)
>>> Used raw bytes: 2357053984804 (2195.17 Gb)
>>>
>>>> - how long did dfs run without restarting the name node before  
>>>> upgrading
>>>
>>> I would say 2 weeks.
>>>
>>>>> I would love to figure out what my problem was today. :)
>>>>
>>>> we discussed the three kinds of data losses: hardware, software  
>>>> or human errors.
>>>>
>>>> Looks like you are not alone :-(
>>>
>>> Too bad that the others didn't report it earlier. :)
>>
>> Everything was happening at the same time.
>>
>>>>> + updated from hadoop .2.1 to .4.
>>>>> + problems to get all datanodes started
>>>>
>>>>
>>>> what was the problem with datanodes?
>>>
>>> Scenario:
>>> I don't think there was a real problem. I noticed that the datanodes  
>>> were not able to connect to the namenode.
>>> Later on I just added a "sleep 5" to the dfs start script after  
>>> starting the name node and that solved the problem.
>>
>> That is right, we did the same.
>>
>>> However, at the time I upgraded I noticed that problem, thought  
>>> "ok, not working yet, let's wait another week", and downgraded.
>>
>>>>> + downgrade to hadoop .3.1
>>>>> + error message of incompatible dfs (I guess .4 already had 
>>>>> started to write to the log)
>>>>
>>>>
>>>> What is the message?
>>>
>>>
>>> Sorry I can not find the exception anymore in the logs. :-(
>>> Something like "version conflict -1 vs -2" :-o Sorry, I don't 
>>> remember exactly.
>>
>> Yes. You are running the old version (-1) code that would not accept 
>> the "future" version (-2) images.
>> The image was converted to v. -2 when you tried to run the upgraded 
>> hadoop.
>>
>> Regards,
>> Konstantin
>
