hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: dfs incompatibility .3 and .4-dev?
Date Wed, 07 Jun 2006 22:45:17 GMT
All of the data nodes are there via bin/hadoop dfs -report.  One 
interesting thing: I shut down via stop-all.sh again, restarted via 
start-all.sh, and everything seems to be working.  I reran fsck and 
everything is now reporting healthy.  I have not tried another fetch 
yet, but a generate was successful, as were updatedb and readdb.  I am 
seeing a lot of the errors below in the log, but I think those are 
fixed by some of the recent patches.

2006-06-07 17:44:19,916 INFO org.apache.hadoop.dfs.DataNode: Lost 
connection to namenode.  Retrying...
2006-06-07 17:44:24,920 INFO org.apache.hadoop.dfs.DataNode: Exception: 
java.lang.IllegalThreadStateException
2006-06-07 17:44:24,921 INFO org.apache.hadoop.dfs.DataNode: Lost 
connection to namenode.  Retrying...
2006-06-07 17:44:29,925 INFO org.apache.hadoop.dfs.DataNode: Exception: 
java.lang.IllegalThreadStateException
2006-06-07 17:44:29,925 INFO org.apache.hadoop.dfs.DataNode: Lost 
connection to namenode.  Retrying...
2006-06-07 17:44:34,929 INFO org.apache.hadoop.dfs.DataNode: Exception: 
java.lang.IllegalThreadStateException
2006-06-07 17:44:34,929 INFO org.apache.hadoop.dfs.DataNode: Lost 
connection to namenode.  Retrying...
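
For reference, the restart-and-verify sequence above boils down to 
something like this (commands as on my setup; the crawldb/segment 
paths are just examples):

  bin/stop-all.sh
  bin/start-all.sh
  bin/hadoop dfs -report   # confirm all data nodes registered
  bin/hadoop fsck /        # check filesystem health
  bin/nutch generate crawl/crawldb crawl/segments
  bin/nutch updatedb crawl/crawldb crawl/segments/<new segment>
  bin/nutch readdb crawl/crawldb -stats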

Dennis

Konstantin Shvachko wrote:
> That might be the same problem.
> Related changes to hadoop were committed just an hour before your 
> initial email, so they are probably not in nutch yet.
> Although "exactly one block missing in each file" looks suspicious.
> Try
> bin/hadoop dfs -report
> to see how many data nodes you have now.
> If all of them are reported then this is different.
>
> --Konstantin
>
> Dennis Kubes wrote:
>
>> Another interesting thing is that every single file is corrupt and 
>> missing exactly one block.
>>
>> Dennis Kubes wrote:
>>
>>> I don't know if this is the same problem or not, but here is what 
>>> I am experiencing.
>>>
>>> I have an 11-node cluster with a fresh nutch install on .3.1.  
>>> Startup completed fine.  Filesystem healthy.  Performed 1st inject, 
>>> generate, fetch for 1000 urls.  Filesystem intact.  Performed 2nd 
>>> inject, generate, fetch for 1000 urls.  Filesystem healthy.  Merged 
>>> crawldbs.  Filesystem healthy.  Merged segments.  Filesystem 
>>> healthy.  Inverted links.  Healthy.  Indexed.  Healthy.  Performed 
>>> searches.  Healthy.  Now here is where it gets interesting.  Shut 
>>> down all servers via stop-all.sh.  Started all servers via 
>>> start-all.sh.  Filesystem reports healthy.  Performed inject and 
>>> generate of 1000 urls.  Filesystem reports healthy.  Performed a 
>>> fetch of the new segments and got the errors below and a fully 
>>> corrupted filesystem (both new segments and old data).
>>>
>>> java.io.IOException: Could not obtain block: blk_6625125900957460239 file=/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006 offset=0
>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:529)
>>>     at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:638)
>>>     at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:84)
>>>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:159)
>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
>>>     at java.io.DataInputStream.readFully(DataInputStream.java:176)
>>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:263)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:247)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:237)
>>>     at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:36)
>>>     at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:53)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:105)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:847)
>>>
>>> Hope this helps in tracking down the problem if it is even the same 
>>> problem.
>>>
>>> Dennis
>>>
>>> Konstantin Shvachko wrote:
>>>
>>>> Thanks Stefan.
>>>>
>>>> I spent some time investigating the problem.
>>>> There are actually 3 of them.
>>>> 1) At startup, data nodes now register with the name node. If 
>>>> registration doesn't work, because the name node is busy at the 
>>>> moment, which could easily be the case if it is loading a 
>>>> two-week-long log, then the data node simply fails and won't 
>>>> start at all.
>>>> See HADOOP-282.
>>>> 2) When the cluster is running and the name node gets busy, and 
>>>> the data node as a result fails to connect to it, then the data 
>>>> node falls into an infinite loop, doing nothing but throwing an 
>>>> exception. So for the name node it is dead, since it is not 
>>>> sending any heartbeats.
>>>> See HADOOP-285.
>>>> 3) People say that they have seen loss of recent data, while the 
>>>> old data is still present, and this happens when the cluster is 
>>>> brought down (for an upgrade) and restarted.
>>>> We know from HADOOP-227 that the edits log keeps accumulating as 
>>>> long as the cluster is running. So if it was up for 2 weeks, the 
>>>> edits file is most probably huge. If it is corrupted, then the 
>>>> data is lost.
>>>> I could not reproduce that; I just don't have any 2-week-old 
>>>> edits files yet.
>>>> I thoroughly examined one cluster and found missing blocks on the 
>>>> nodes that pretended to be up as in (2) above. I didn't see any 
>>>> data loss at all. I think large edits files should be 
>>>> investigated further.
>>>>
>>>> There are patches fixing HADOOP-282 and HADOOP-285. We do not 
>>>> have a patch for HADOOP-227 yet, so people need to restart the 
>>>> name node (just the name node) depending on the activity on the 
>>>> cluster, namely on the size of the edits file.
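>>>>
>>>> As a rough sketch only (the edits file lives under the directory 
>>>> configured as dfs.name.dir; the path below is just an example, 
>>>> and the daemon script arguments may differ between versions), 
>>>> the workaround amounts to something like:
>>>>
>>>>   # check how large the edits log has grown (example path)
>>>>   ls -lh /data/dfs/name/edits
>>>>   # if it is large, bounce only the name node so the edits get
>>>>   # merged back into the image at startup
>>>>   bin/hadoop-daemon.sh stop namenode
>>>>   bin/hadoop-daemon.sh start namenode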
>>>>
>>>>
>>>> Stefan Groschupf wrote:
>>>>
>>>>> Hi Konstantin,
>>>>>
>>>>>> Could you give some more information about what happened to you.
>>>>>> - what is your cluster size
>>>>>
>>>>>
>>>>> 9 datanode, 1 namenode.
>>>>>
>>>>>> - amount of data
>>>>>
>>>>>
>>>>> Total raw bytes: 6023680622592 (5609.98 Gb)
>>>>> Used raw bytes: 2357053984804 (2195.17 Gb)
>>>>>
>>>>>> - how long did dfs run without restarting the name node before 
>>>>>> upgrading
>>>>>
>>>>>
>>>>> I would say 2 weeks.
>>>>>
>>>>>>> I would love to figure out what my problem was today. :)
>>>>>>
>>>>>>
>>>>>> we discussed the three kinds of data losses: hardware, software, 
>>>>>> or human errors.
>>>>>>
>>>>>> Looks like you are not alone :-(
>>>>>
>>>>>
>>>>> Too bad that the others didn't report it earlier. :)
>>>>
>>>>
>>>> Everything was happening at the same time.
>>>>
>>>>>>> + updated from hadoop .2.1 to .4.
>>>>>>> + problems getting all datanodes started
>>>>>>
>>>>>>
>>>>>>
>>>>>> what was the problem with datanodes?
>>>>>
>>>>>
>>>>> Scenario:
>>>>> I don't think there was a real problem. I noticed that the 
>>>>> datanodes were not able to connect to the namenode.
>>>>> Later on I just added a "sleep 5" to the dfs start script after 
>>>>> starting the name node and that solved the problem.
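>>>>>
>>>>> Roughly, the tweak looked like this (the exact lines in 
>>>>> bin/start-all.sh may differ between versions; this is just the 
>>>>> idea):
>>>>>
>>>>>   "$bin"/hadoop-daemon.sh start namenode
>>>>>   sleep 5   # give the name node time to come up first
>>>>>   "$bin"/hadoop-daemons.sh start datanode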
>>>>
>>>>
>>>> That is right, we did the same.
>>>>
>>>>> However, at the time I updated I noticed that problem, thought 
>>>>> "ok, not working yet, let's wait another week", and downgraded.
>>>>
>>>>
>>>>>>> + downgrade to hadoop .3.1
>>>>>>> + error message of incompatible dfs (I guess .4 had already 
>>>>>>> started to write to the log)
>>>>>>
>>>>>>
>>>>>>
>>>>>> What is the message?
>>>>>
>>>>>
>>>>>
>>>>> Sorry, I can't find the exception in the logs anymore. :-(
>>>>> Something like "version conflict -1 vs -2" :-o Sorry, I don't 
>>>>> remember exactly.
>>>>
>>>>
>>>> Yes. You are running the old version (-1) code that would not 
>>>> accept the "future" version (-2) images.
>>>> The image was converted to v. -2 when you tried to run the upgraded 
>>>> hadoop.
>>>>
>>>> Regards,
>>>> Konstantin
>>>
>>>
>>
>>
>>
>
