hadoop-common-dev mailing list archives

From Dennis Kubes <nutch-...@dragonflymc.com>
Subject Re: dfs incompatibility .3 and .4-dev?
Date Wed, 07 Jun 2006 22:09:26 GMT
I don't know if this is the same problem or not, but here is what I am 
experiencing.

I have an 11 node cluster with a fresh Nutch install deployed on 0.3.1.  
Startup completed fine.  Filesystem healthy.  Performed a 1st inject, 
generate, and fetch for 1000 urls.  Filesystem intact.  Performed a 2nd 
inject, generate, and fetch for 1000 urls.  Filesystem healthy.  Merged 
crawldbs.  Filesystem healthy.  Merged segments.  Filesystem healthy.  
Inverted links.  Healthy.  Indexed.  Healthy.  Performed searches.  
Healthy.  Now here is where it gets interesting.  Shut down all servers 
via stop-all.sh.  Started all servers via start-all.sh.  Filesystem 
reports healthy.  Performed an inject and generate of 1000 urls.  
Filesystem reports healthy.  Performed a fetch of the new segments and 
got the errors below and a fully corrupted filesystem (both the new 
segments and the old data).
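
A rough sketch of that cycle, from memory (paths, option names, and URL 
counts are illustrative, not the exact invocations I ran):

    # illustrative Nutch 0.8-dev crawl cycle; adjust paths for your setup
    bin/nutch inject crawl/crawldb urls/
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s=`ls -d crawl/segments/2* | tail -1`   # newest segment
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s
    # crawldb/segment merge steps omitted -- the exact merge commands
    # varied in the 0.8-dev tree
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*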

java.io.IOException: Could not obtain block: blk_6625125900957460239 file=/user/phoenix/temp/segments1/20060607165425/crawl_generate/part-00006
offset=0
	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:529)
	at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:638)
	at org.apache.hadoop.fs.FSDataInputStream$Checker.read(FSDataInputStream.java:84)
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:159)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
	at java.io.DataInputStream.readFully(DataInputStream.java:176)
	at java.io.DataInputStream.readFully(DataInputStream.java:152)
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:263)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:247)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:237)
	at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:36)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:53)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:105)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:847)

Hope this helps in tracking down the problem, if it is indeed the same one.

Dennis

Konstantin Shvachko wrote:
> Thanks Stefan.
>
> I spent some time investigating the problem.
> There are actually 3 of them.
> 1) At startup, data nodes now register with the name node. If 
> registering doesn't work, because the name node is busy at the moment, 
> which could easily be the case if it is loading a two-week-long log, 
> then the data node just fails and won't start at all.
> See HADOOP-282.
> 2) When the cluster is running and the name node gets busy, and the 
> data node as a result fails to connect to it, then the data node falls 
> into an infinite loop, doing nothing but throwing an exception. So to 
> the name node it is dead, since it is not sending any heartbeats.
> See HADOOP-285.
> 3) People say that they have seen loss of recent data, while the old 
> data is still present. And this is happening when the cluster was 
> brought down (for the upgrade) and restarted.
> We know from HADOOP-227 that the edits log accumulates as long as the 
> cluster is running. So if it was up for 2 weeks then the edits file is 
> most probably huge. If it is corrupted then the data is lost.
> I could not reproduce that; I just don't have any 2-week-old edits 
> files yet.
> I thoroughly examined one cluster and found missing blocks on the 
> nodes that pretended to be up, as in (2) above. I didn't see any data 
> loss at all. I think large edits files should be investigated further.
>
> There are patches fixing HADOOP-282 and HADOOP-285. We do not have a 
> patch for HADOOP-227 yet, so people need to restart the name node 
> (just the name node) periodically, depending on the activity on the 
> cluster, namely on the size of the edits file.
>
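Until a HADOOP-227 fix lands, that restart could be scripted; this is 
only a sketch, assuming dfs.name.dir is /data/dfs/name (adjust for your 
config):

    # restart just the namenode once the edits log gets large;
    # replaying the edits at startup folds them back into the image
    EDITS=/data/dfs/name/edits
    SIZE=`ls -l $EDITS | awk '{print $5}'`
    if [ "$SIZE" -gt 104857600 ]; then   # e.g. over 100 MB
        bin/hadoop-daemon.sh stop namenode
        bin/hadoop-daemon.sh start namenode
    fi
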
>
> Stefan Groschupf wrote:
>
>> Hi Konstantin,
>>
>>> Could you give some more information about what happened to you.
>>> - what is your cluster size
>>
>> 9 datanodes, 1 namenode.
>>
>>> - amount of data
>>
>> Total raw bytes: 6023680622592 (5609.98 Gb)
>> Used raw bytes: 2357053984804 (2195.17 Gb)
>>
>>> - how long did dfs run without restarting the name node before  
>>> upgrading
>>
>> I would say 2 weeks.
>>
>>>> I would love to figure out what was my problem today. :)
>>>
>>> we discussed the three kinds of data losses: hardware, software,  
>>> or human errors.
>>>
>>> Looks like you are not alone :-(
>>
>> Too bad the others didn't report it earlier. :)
>
> Everything was happening at the same time.
>
>>>> + updated from hadoop .2.1 to .4.
>>>> + problems getting all datanodes started
>>>
>>>
>>> what was the problem with datanodes?
>>
>> Scenario:
>> I don't think there was a real problem. I noticed that the datanodes  
>> were not able to connect to the namenode.
>> Later on I just added a "sleep 5" to the dfs starting script after  
>> starting the name node, and that solved the problem.
>
> That is right, we did the same.
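
For reference, the change would look roughly like this in the start 
script (a sketch; the exact lines depend on the hadoop version):

    # in bin/start-all.sh, after launching the namenode:
    "$bin"/hadoop-daemon.sh start namenode
    sleep 5   # give the namenode time to come up before datanodes register
    "$bin"/hadoop-daemons.sh start datanode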
>
>> However, at the time I updated, I noticed that problem, thought  
>> "OK, not working yet, let's wait another week", and downgraded.
>
>>>> + downgrade to hadoop .3.1
>>>> + error message of incompatible dfs (I guess .4 already had  
>>>> started to write to the log)
>>>
>>>
>>> What is the message?
>>
>>
>> Sorry, I can't find the exception anymore in the logs. :-(
>> Something like "version conflict -1 vs -2". :-o Sorry, I don't  
>> remember exactly.
>
> Yes. You were running the old version (-1) code, which would not accept 
> the "future" version (-2) images.
> The image was converted to v. -2 when you tried to run the upgraded 
> hadoop.
>
> Regards,
> Konstantin
