hadoop-common-user mailing list archives

From Abhijit Bagri <aba...@yahoo-inc.com>
Subject Re: The Case of a Long Running Hadoop System
Date Mon, 17 Nov 2008 06:49:15 GMT
We do not have a secondary namenode because 0.15.3 has a serious bug
that truncates the namenode image if a failure occurs while the
namenode fetches the image from the secondary namenode. See HADOOP-3069.

I have a patched version of 0.15.3 for this issue. The HADOOP-3069
patch changes both the namenode _and_ the secondary namenode,
which means I can't just fire up a secondary namenode.

- Bagri

On Nov 15, 2008, at 11:36 PM, Billy Pearson wrote:

> If I understand correctly, the secondary namenode merges the edits
> log into the fsimage and reduces the edit log size.
> The edit log size is likely the root of your problems: 8.5G seems
> large and is likely putting a strain on your master server's memory
> and I/O bandwidth.
> Why do you not have a secondary namenode?
> If you do not have the memory on the master, I would look into
> stopping a datanode/tasktracker on a server and loading the
> secondary namenode on it.
> Let it run for a while and watch your log for the secondary namenode;
> you should see your edit log get smaller.
> I am not an expert, but that would be my first action.
> Billy
> "Abhijit Bagri" <abagri@yahoo-inc.com> wrote in message news:4C1984EA-74B3-40D1-B9BE-AB2442537217@yahoo-inc.com

> ...
>> Hi,
>> This is a long mail as I have tried to put in as much detail as
>> might help any of the Hadoop devs/users to help us out. The gist is
>> this: We have a long running Hadoop system (masters not restarted
>> for about 3 months). We have recently started seeing the DFS
>> responding very slowly, which has resulted in failures on a system
>> which depends on Hadoop. Further, the DFS seems to be in an unstable
>> state (i.e., if fsck is a good representation, which I believe it
>> is). The edits file
>> These are the details (skip/return here later and jump to the
>> questions at the end of the mail for a quicker read) :
>> Hadoop Version: 0.15.3 on 32 bit systems.
>> Number of slaves: 12
>> Slaves heap size: 1G
>> Namenode heap: 2G
>> Jobtracker heap: 2G
>> The namenode and jobtracker have not been restarted for about 3
>> months. We did restart the slaves (all of them within a few hours) a
>> few times for some maintenance in between, though. We do not have a
>> secondary namenode in place.
>> There is another system X which talks to this Hadoop cluster. X
>> writes to the Hadoop DFS and submits jobs to the Jobtracker. The
>> number of jobs submitted to Hadoop so far is over 650,000 (I am
>> using the job id for this). Each job may read/write multiple files
>> and has several dependent libraries which it loads from the
>> Distributed Cache.
>> Recently, we started seeing several timeouts while X tries to
>> read/write to the DFS. This in turn results in the DFS becoming very
>> slow in response. The writes are especially slow. The trace we get
>> in the logs is:
>> java.net.SocketTimeoutException: Read timed out
>>        at java.net.SocketInputStream.socketRead0(Native Method)
>>        at java.net.SocketInputStream.read(SocketInputStream.java:129)
>>        at java.net.SocketInputStream.read(SocketInputStream.java:182)
>>        at java.io.DataInputStream.readShort(DataInputStream.java:284)
>>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.endBlock(DFSClient.java:1660)
>>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1733)
>>        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
>>        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
>>        ...
>> Also, datanode logs show a lot of traces like these:
>> 2008-11-14 21:21:49,429 ERROR org.apache.hadoop.dfs.DataNode:
>> DataXceiver: java.io.IOException: Block blk_-1310124865741110666 is
>> valid, and cannot be written to.
>>        at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:551)
>>        at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1257)
>>        at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:901)
>>        at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:804)
>>        at java.lang.Thread.run(Thread.java:595)
>> and these
>> 2008-11-14 21:21:50,695 WARN org.apache.hadoop.dfs.DataNode:
>> java.io.IOException: Error in deleting blocks.
>>        at org.apache.hadoop.dfs.FSDataset.invalidate(FSDataset.java:719)
>>        at org.apache.hadoop.
>> The first one seems to be the same as HADOOP-1845. We have been
>> seeing this for a long time, but as HADOOP-1845 says, it hasn't been
>> creating any problems. We have thus ignored it for a while, but
>> these recent problems have again raised the question of whether this
>> really is a non-issue.
>> We also saw a spike in the number of TCP connections on system X.
>> Also, an lsof on system X shows a lot of connections to datanodes
>> in the CLOSE_WAIT state.
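
In TCP, a socket sits in CLOSE_WAIT when the remote end has closed the connection but the local process has not yet closed its own side. A minimal Python sketch of how such connections could be tallied per remote endpoint from `lsof -i` style output follows. The sample lines, host names, port number, and the helper `count_close_wait` are made up for illustration and are not taken from system X.

```python
from collections import Counter

# Hypothetical sample in the shape of `lsof -i` output (not real data).
SAMPLE = """\
java 1234 xuser 45u IPv4 100001 0t0 TCP xhost:41234->datanode01:50010 (CLOSE_WAIT)
java 1234 xuser 46u IPv4 100002 0t0 TCP xhost:41235->datanode02:50010 (ESTABLISHED)
java 1234 xuser 47u IPv4 100003 0t0 TCP xhost:41236->datanode01:50010 (CLOSE_WAIT)
java 1234 xuser 48u IPv4 100004 0t0 TCP xhost:41237->datanode03:50010 (CLOSE_WAIT)
"""


def count_close_wait(lsof_text):
    """Count CLOSE_WAIT connections per remote host:port."""
    counts = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # The last field is the TCP state; the one before it is local->remote.
        if len(fields) < 2 or fields[-1] != "(CLOSE_WAIT)":
            continue
        _local, sep, remote = fields[-2].partition("->")
        if sep:
            counts[remote] += 1
    return counts


# Print remote endpoints, most affected first.
for remote, n in count_close_wait(SAMPLE).most_common():
    print(remote, n)
```

If one datanode dominates the tally, that node is worth a closer look; an even spread would point more toward the client not closing its sockets.
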
>> While investigating the issue, the fsck output started saying that
>> the DFS is corrupt. The output is:
>> Total size:    152873519485 B
>> Total blocks:  1963655 (avg. block size 77851 B)
>> Total dirs:    1390820
>> Total files:   1962680
>> Over-replicated blocks:        111 (0.0061619785 %)
>> Under-replicated blocks:       13 (6.6203077E-4 %)
>> Target replication factor:     3
>> Real replication factor:       3.0000503
>> The filesystem under path '/' is CORRUPT
>> After running a fsck -move, the system went back to a healthy state.
>> However, after some time it again started showing corrupt, and
>> fsck -move restored it to a healthy state.
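
The fsck totals quoted above can be cross-checked with a little arithmetic. The Python sketch below is an illustration, not part of the original message. It recomputes the average block size from the reported totals and adds up the directories, files, and blocks, all of which the namenode has to track. The variable names are made up for this example.

```python
# Totals copied from the fsck output quoted above.
total_size_bytes = 152_873_519_485
total_blocks = 1_963_655
total_dirs = 1_390_820
total_files = 1_962_680

# Integer division reproduces the "avg. block size" figure fsck printed.
avg_block_size = total_size_bytes // total_blocks

# A ratio close to 1 means nearly every file fits in a single block.
blocks_per_file = total_blocks / total_files

# Directories, files, and blocks together: the objects in the namespace.
namespace_objects = total_dirs + total_files + total_blocks

print(f"avg block size:    {avg_block_size} B")
print(f"blocks per file:   {blocks_per_file:.4f}")
print(f"namespace objects: {namespace_objects}")
```

The numbers describe a namespace made up of several million small objects (average block under 80 KB), which is consistent with a workload of many small files.
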
>> We are considering restarting the masters as a possible solution. I
>> have earlier had issues restarting the master with a 2G heap and an
>> edits file of about 2.5G. So, we decided to look up the size of the
>> edits file. The edits file is now 8.5G! Since we are on a 32 bit
>> system, we cannot push the JVM heap beyond about 2.5G. So, I am not
>> sure that if we bring down the master, we will be able to restart it
>> at all. Please note that we do not have a secondary namenode.
>> Questions for hadoop-users and hadoop-dev:
>> About Hadoop restart:
>> 1. Does the namenode try to load the edits file into memory on startup?
>> 2. Will we be able to restart the namenode without a format at all?
>> 3. Apart from taking a backup, is there a safe way to restart the
>> system and retain the DFS data?
>> About DFS being slow:
>> 4. Is HADOOP-1845 a non-issue? If it is, what may cause the DFS to
>> become slow and lead to SocketTimeoutException on the client?
>> 5. What may cause a spike in TCP connections and several connection
>> fds open in CLOSE_WAIT state? Is HADOOP-3071 relevant?
>> 6. Is a restart of the master a solution at all?
>> About the health of the DFS:
>> 7. What are the implications of the DFS going into CORRUPT state
>> every once in a while? Should this increase my caffeine intake? More
>> importantly, is the Hadoop system on the verge of a collapse?
>> 8. Would restarting all slaves within a few hours without restarting
>> the masters have an implication on DFS health? Note that, this was
>> done to prevent any downtime for system X.
>> Our target is to return to a state where the DFS functions normally
>> (like it was doing till some time back) and of course to maintain a
>> healthy DFS with its current data. Please provide any insights.
>> To add to the above,
>> - Due to the kind of dependency X has on this Hadoop system (its
>> API, DFS data, and internal data structures), it is non-trivial for
>> us to jump to higher versions of Hadoop in the very near future. We
>> could apply some patches which could go with 0.15.3.
>> - Loss of DFS data will be fatal. I believe once I format, I will get
>> a namespace id mismatch unless I clear the hadoop data directories
>> from the slaves. Is this correct?
>> - We did not start a secondary namenode due to HADOOP-3069. I have a
>> locally patched 0.15.3 with this JIRA. Will it help, and is it safe,
>> if we add the secondary namenode now?
>> We have tried to look into JIRA for more leads, but are not sure
>> which ones fit our case specifically. Pointers to JIRA issues are
>> also welcome. Of course, please feel free to address any part of the
>> problem :)
>> Thanks
>> Bagri
