hadoop-hdfs-issues mailing list archives

From "Daryn Sharp (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4477) Secondary namenode may retain old tokens
Date Fri, 08 Feb 2013 00:03:13 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daryn Sharp updated HDFS-4477:
------------------------------

             Priority: Critical  (was: Major)
     Target Version/s: 3.0.0, 0.23.7, 2.0.4-beta
    Affects Version/s:     (was: 0.23.7)
                       3.0.0
                       0.23.0
                       2.0.0-alpha

I haven't had time to check 1.x, but I'm almost certain the bug is there.  This is one of
those "I can't believe this went unnoticed for so long" bugs.

Expired, i.e. non-cancelled, tokens are discarded by a background thread, and that thread
is only started after leaving safemode.  Unlike explicitly cancelled tokens, these removals
produce no edits.  So the 2NN loads the image, applies the edits, and dumps out a new image --
the 2NN never learns to discard the expired tokens, only the cancelled ones.  The NN then loads
the mountain of tokens from the image, but discards the expired ones only after leaving safemode.
The fsimage just keeps bloating.  Forever.
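To make the lifecycle concrete, here's a minimal simulation of the behavior described above.  The class and token names are illustrative only, not the actual Hadoop classes: cancellation writes an edit, expiry does not, so a checkpoint (load image + apply edits) can never shed expired tokens.

```python
# Hypothetical, simplified model of delegation-token handling;
# names are illustrative, not the real Hadoop classes.

class Namenode:
    def __init__(self, image=None, edits=None):
        # fsimage contents: token id -> expiry timestamp
        self.tokens = dict(image or {})
        for op, tok in (edits or []):
            if op == "cancel":
                self.tokens.pop(tok, None)  # explicit cancels ARE in the edit log

    def cancel(self, tok, edits):
        self.tokens.pop(tok, None)
        edits.append(("cancel", tok))       # cancellation produces an edit

    def purge_expired(self, now):
        # background thread: silently drops expired tokens, NO edit logged
        self.tokens = {t: exp for t, exp in self.tokens.items() if exp > now}

def checkpoint(image, edits):
    # 2NN: load image, apply edits, write a new image. It never runs the
    # expiry purge, so expired tokens survive every checkpoint.
    return Namenode(image, edits).tokens

now = 100
image = {"tok-expired": 50, "tok-live": 200, "tok-cancelled": 300}
edits = []

nn = Namenode(image)
nn.cancel("tok-cancelled", edits)
nn.purge_expired(now)
assert set(nn.tokens) == {"tok-live"}   # active NN ends up clean

new_image = checkpoint(image, edits)
assert "tok-cancelled" not in new_image # cancel was replayed from edits
assert "tok-expired" in new_image       # expired token bloats the image forever
```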

+Long Term Impact+
The severity of the problem: a 2NN was consuming ~15% more memory than the active NN and
thrashing in GC.  Since the 2NN holds less state, that was surprising.  The image was discovered
to contain ~42 MILLION tokens dating back to mid-2011.  Not 2012; yes, 2011.  Loading all
the tokens added a significant ~8 min to the startup time.

+Aggravating Factors+
The main contributor to uncancelled tokens is a JT/RM conf hack that prevents cancellation of
tokens after a job completes.  Without it, an oozie job would have its tokens cancelled after the
first sub-job completes.  It's not oozie/pig's fault that tokens aren't cancelled; the hack just
aggravates the bug in the namenode.

I'll dust off an old patch that reference counts tokens against jobs, so tokens will be cancelled,
but only once no other running jobs hold those tokens.
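A reference-counting approach along those lines might look like the following sketch.  This is an assumption about the patch's shape, not its actual code: each token tracks the set of running jobs using it, and cancellation is deferred until the last such job finishes.

```python
from collections import defaultdict

# Illustrative sketch of reference-counted token cancellation
# (hypothetical; not the actual HDFS-4477 patch).

class TokenTracker:
    def __init__(self):
        self.refs = defaultdict(set)   # token -> set of job ids holding it

    def job_started(self, job, tokens):
        for tok in tokens:
            self.refs[tok].add(job)

    def job_finished(self, job):
        cancellable = []
        for tok, jobs in list(self.refs.items()):
            jobs.discard(job)
            if not jobs:               # no other running job holds this token
                del self.refs[tok]
                cancellable.append(tok)  # now safe to cancel on the NN
        return cancellable

tracker = TokenTracker()
tracker.job_started("oozie-launcher", ["hdfs-tok"])
tracker.job_started("sub-job-1", ["hdfs-tok"])
assert tracker.job_finished("sub-job-1") == []               # token still in use
assert tracker.job_finished("oozie-launcher") == ["hdfs-tok"]  # last holder done
```

This keeps an oozie launcher's token alive across its sub-jobs while still guaranteeing eventual cancellation, so the conf hack that suppresses cancellation entirely would no longer be needed.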
                
> Secondary namenode may retain old tokens
> ----------------------------------------
>
>                 Key: HDFS-4477
>                 URL: https://issues.apache.org/jira/browse/HDFS-4477
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: security
>    Affects Versions: 0.23.0, 2.0.0-alpha, 3.0.0
>            Reporter: Kihwal Lee
>            Assignee: Daryn Sharp
>            Priority: Critical
>         Attachments: HDFS-4477.patch
>
>
> Upon inspection of a fsimage created by a secondary namenode, we've discovered that it
> contains very old tokens.  These are probably the ones that were not explicitly cancelled.
> It may be related to the optimization that avoids reloading the fsimage from scratch on
> every checkpoint.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
