Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Mon, 4 May 2015 18:14:09 +0000 (UTC)
From: "Ming Ma (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12767389.1421224699000.72383.1430763249391@Atlassian.JIRA>
In-Reply-To: <JIRA.12767389.1421224699000@Atlassian.JIRA>
References: <JIRA.12767389.1421224699000@Atlassian.JIRA>
 <JIRA.12767389.1421224699832@arcas>
Subject: [jira] [Updated] (HDFS-7609) startup used too much time to load
 edits
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7609:
--------------------------
    Attachment: HDFS-7609.patch

Here is the draft patch that prevents client from polluting the retry cache when standby is being transitioned to active. It doesn't cover other possible optimization ideas discussed above. Appreciate any input on this.

> startup used too much time to load edits
> ----------------------------------------
>
>                 Key: HDFS-7609
>                 URL: https://issues.apache.org/jira/browse/HDFS-7609
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.2.0
>            Reporter: Carrey Zhan
>         Attachments: HDFS-7609-CreateEditsLogWithRPCIDs.patch, HDFS-7609.patch, recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because of two journal node timed out at the same time under very high load, leaving behind about 100 million transactions in edits log.(I still have no idea why they were not rolled into fsimage.)
> I tryed to restart namenode, but it showed that almost 20 hours would be needed before finish, and it was loading fsedits most of the time. I also tryed to restart namenode in recover mode, the loading speed had no different.
> I looked into the stack trace, judged that it is caused by the retry cache. So I set dfs.namenode.enable.retrycache to false, the restart process finished in half an hour.
> I think the retry cached is useless during startup, at least during recover process.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)