Date: Fri, 1 Apr 2011 01:06:06 +0000 (UTC)
From: "Matt Foley (JIRA)"
To: hdfs-issues@hadoop.apache.org
Message-ID: <1370059641.26426.1301619966024.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <414068084.6768.1300905845846.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (HDFS-1780) reduce need to rewrite fsimage on statrtup

[ 
https://issues.apache.org/jira/browse/HDFS-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014335#comment-13014335 ]

Matt Foley commented on HDFS-1780:
----------------------------------

I reviewed the logic currently in place that decides whether it is necessary to write out a new FSImage. Basically, within FSImage.recoverTransitionRead(), one of three methods is called:

* doUpgrade() - which always writes out a new FSImage, before renaming "tmp" to "previous"
* doImportCheckpoint() - which always writes out a new FSImage, while using the imported checkpointTime
* loadFSImage() - which will request saveNamespace under any of these conditions:
** if the version file is missing, which indicates the directory was just formatted
** if checkpointTime <= 0, which indicates an invalid or missing checkpoint
** if more than one checkpointTime was recorded
** if a previously interrupted checkpoint is detected
** if the read-in ImageVersion != the current LAYOUT_VERSION for this code base
** if latestNameCheckpointTime > latestEditsCheckpointTime, which indicates we should discard the edits by saving a new image
** if loadFSEdits() > 0, which indicates either "edits" or "edits.new" existed and had ANY edit records, or had logVersion != the current LAYOUT_VERSION for this code base

It seems to me that only the last item is a problem. Just because there were SOME edit records doesn't mean it is worth delaying startup while a new checkpoint is written. However, it appears the current code will tolerate only a single roll-over of edits logs (from "edits" to "edits.new") and cannot combine two edit logs into one, so we can't just accumulate edits files over multiple startups.
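To make the loadFSImage() conditions listed in the comment easier to scan, here is a minimal sketch that models them as a set of reasons. It is not the actual FSImage code: the class SaveNamespaceSketch, the LoadState holder, its field names, and the Reason enum are all hypothetical names introduced for illustration, and the defaults in LoadState simply represent a load in which none of the conditions fires.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative model only; every name here is hypothetical, not a real FSImage member. */
final class SaveNamespaceSketch {

    /** One entry per condition in the list above. */
    enum Reason {
        MISSING_VERSION_FILE,
        INVALID_CHECKPOINT_TIME,
        MULTIPLE_CHECKPOINT_TIMES,
        INTERRUPTED_CHECKPOINT,
        LAYOUT_VERSION_MISMATCH,
        IMAGE_NEWER_THAN_EDITS,
        EDITS_LOADED
    }

    /** Assumed snapshot of what was observed while loading; defaults describe a clean load. */
    static final class LoadState {
        boolean versionFileMissing = false;
        long checkpointTime = 1L;
        int distinctCheckpointTimes = 1;
        boolean interruptedCheckpoint = false;
        int imageVersion = 0;               // placeholder value, not a real layout version
        int layoutVersion = 0;              // placeholder value, not a real layout version
        long latestNameCheckpointTime = 1L;
        long latestEditsCheckpointTime = 1L;
        int loadFSEditsResult = 0;          // stands in for the return value of loadFSEdits()
    }

    /** Collects every condition that would request saveNamespace for the given state. */
    static Set<Reason> reasonsToSave(LoadState s) {
        Set<Reason> reasons = EnumSet.noneOf(Reason.class);
        if (s.versionFileMissing) reasons.add(Reason.MISSING_VERSION_FILE);
        if (s.checkpointTime <= 0) reasons.add(Reason.INVALID_CHECKPOINT_TIME);
        if (s.distinctCheckpointTimes > 1) reasons.add(Reason.MULTIPLE_CHECKPOINT_TIMES);
        if (s.interruptedCheckpoint) reasons.add(Reason.INTERRUPTED_CHECKPOINT);
        if (s.imageVersion != s.layoutVersion) reasons.add(Reason.LAYOUT_VERSION_MISMATCH);
        if (s.latestNameCheckpointTime > s.latestEditsCheckpointTime) {
            reasons.add(Reason.IMAGE_NEWER_THAN_EDITS);
        }
        if (s.loadFSEditsResult > 0) reasons.add(Reason.EDITS_LOADED);
        return reasons;
    }

    /** True if at least one condition fires. */
    static boolean needToSave(LoadState s) {
        return !reasonsToSave(s).isEmpty();
    }
}
```

Returning the individual reasons rather than a single boolean would make it straightforward to treat EDITS_LOADED differently from the others, which is the only case the comment identifies as a candidate for relaxing.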
> reduce need to rewrite fsimage on statrtup
> ------------------------------------------
>
> Key: HDFS-1780
> URL: https://issues.apache.org/jira/browse/HDFS-1780
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Daryn Sharp
>
> On startup, the namenode will read the fs image, apply edits, then rewrite the fs image. This requires a non-trivial amount of time for very large directory structures. Perhaps the namenode should employ some logic to decide that the edits are simple enough that they don't warrant rewriting the image back out to disk.
> A few ideas:
> Use the size of the edit logs: if the size is below a threshold, assume it's cheaper to reprocess the edit log instead of writing the image back out.
> Time the processing of the edits, and if the time is below a defined threshold, don't rewrite the image.
> Time the reading of the image and the processing of the edits. Base the decision on the time it would take to write the image (a multiplier is applied to the read time?) versus the time it would take to reprocess the edits. If a certain threshold (perhaps a percentage or an expected time to rewrite) is exceeded, rewrite the image.
> Something along the lines of the last suggestion may allow for defaults that adapt to any size cluster, thus eliminating the need to keep tweaking a cluster's settings based on its size.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
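The third idea in the quoted issue description (compare an estimated image write time against the time to reprocess the edits) could look roughly like the sketch below. This is only one possible reading of the proposal: the class RewriteHeuristicSketch, the writeMultiplier and threshold parameters, and the use of milliseconds are all assumptions made for illustration, and nothing like this exists in the code base as far as this thread shows.

```java
/** Hypothetical sketch of the timing-based heuristic; not part of HDFS. */
final class RewriteHeuristicSketch {

    private final double writeMultiplier; // assumed ratio of image write time to image read time
    private final double threshold;       // assumed fraction of the estimated write time

    RewriteHeuristicSketch(double writeMultiplier, double threshold) {
        if (writeMultiplier <= 0 || threshold <= 0) {
            throw new IllegalArgumentException("multiplier and threshold must be positive");
        }
        this.writeMultiplier = writeMultiplier;
        this.threshold = threshold;
    }

    /** Estimates the image write time by scaling the measured read time. */
    double estimatedWriteMillis(long imageReadMillis) {
        return imageReadMillis * writeMultiplier;
    }

    /**
     * Rewrites the image only when replaying the edits took longer than the
     * configured fraction of the estimated write time.
     */
    boolean shouldRewrite(long imageReadMillis, long editsReplayMillis) {
        return editsReplayMillis > estimatedWriteMillis(imageReadMillis) * threshold;
    }
}
```

For example, with a multiplier of 2.0 and a threshold of 0.1, an image that took 60 seconds to read gives an estimated write time of 120 seconds, so the image would be rewritten only if replaying the edits took more than 12 seconds. Because both inputs are measured on the running cluster, the cut-off scales with image size, which is the adaptive behaviour the description asks for.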