hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5651) remove dfs.namenode.caching.enabled
Date Fri, 20 Dec 2013 07:43:07 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853758#comment-13853758 ]

Colin Patrick McCabe commented on HDFS-5651:

So, there were some synchronization issues here that needed to be cleaned up.  The biggest
one was stopping and starting the CRM thread.  Previously, this was prone to deadlock: if some
other thread (or the thread stopping the CRM) was holding the FSN write lock while the CRM
thread itself was waiting to acquire that lock, joining the CRM thread would block forever.
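To make the hazard concrete, here is a minimal, self-contained sketch (not HDFS code; the names are made up) of why joining a thread that is waiting for a lock you currently hold can never return.  A timed join is used here only so the demo terminates:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class JoinDeadlockSketch {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

        // Stands in for the CRM thread: it needs the "FSN" write lock to proceed.
        Thread crm = new Thread(() -> {
            fsnLock.writeLock().lock();   // blocks while main holds the lock
            fsnLock.writeLock().unlock();
        });

        fsnLock.writeLock().lock();       // the stopping thread holds the write lock
        crm.start();
        crm.join(500);                    // an untimed join() here would block forever
        boolean stuck = crm.isAlive();
        fsnLock.writeLock().unlock();     // releasing the lock lets the worker finish
        crm.join();
        System.out.println("worker was stuck while we held the lock: " + stuck);
    }
}
```

The printed line shows the worker was still alive after the timed join, i.e. the untimed equivalent would deadlock.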

I tried to get around that by interrupting the CRM thread (which it would see as an
{{InterruptedException}}), but it turns out that {{Condition#await}} is not actually required
to wake up in response to an interrupt (although it "may").  The JDK 6 javadoc documents this
explicitly, and it seems that the Linux HotSpot implementation may be one of those where a
thread blocked on a condition variable cannot be interrupted.

The solution here is to *not* join the CRM thread when transitioning to the standby state,
but simply to set {{shutdown = true}} in the CRM thread, and have the CRM thread check that
variable after grabbing the {{FSNamesystem}} lock.  So we may have an old CRM thread hanging
around for a while, but it will never mutate {{CacheManager}} state, since {{CRM#shutdown
= true}}.
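A minimal sketch of that shutdown-flag pattern (again, hypothetical stand-in names, not the actual CRM code): the stopping thread only sets a volatile flag, and the worker checks it under the lock, so a lingering thread can never mutate shared state after shutdown:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ShutdownFlagSketch {
    // Hypothetical stand-ins for CRM#shutdown and CacheManager state.
    private volatile boolean shutdown = false;
    private int rescans = 0;
    private final ReentrantReadWriteLock fsnLock = new ReentrantReadWriteLock();

    private final Thread crm = new Thread(() -> {
        while (true) {
            fsnLock.writeLock().lock();
            try {
                // Check the flag *after* grabbing the lock: a stale CRM thread
                // may hang around, but it never mutates state once shutdown is set.
                if (shutdown) {
                    return;
                }
                rescans++;  // "rescan" work done under the lock
            } finally {
                fsnLock.writeLock().unlock();
            }
        }
    });

    void stop() {
        // No join while holding any lock: just set the flag and move on.
        shutdown = true;
    }

    public static void main(String[] args) throws InterruptedException {
        ShutdownFlagSketch s = new ShutdownFlagSketch();
        s.crm.start();
        Thread.sleep(100);
        s.stop();
        s.crm.join();   // safe here: main holds no lock, so the thread exits promptly
        System.out.println("stopped after " + s.rescans + " rescans");
    }
}
```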

Along the way, I discovered that our strategy of doing {{writeUnlock}} in some places in CRM
was not working very well.  The problem is that since the FSN write lock is a reentrant lock,
a thread that calls {{ReentrantLock#unlock}} may still hold that lock.  You may need to unlock
multiple times to really release!  In general, having random "unlock some of the caller's
locks" sections sprinkled throughout the code seems like a recipe for problems, since the
caller may not be expecting it.  I think it's better to ask the top-level caller in {{FSNamesystem}}
to handle these locks.  So I moved the {{waitForRescanIfNeeded}} calls in {{FSNamesystem}}
to a point before the FSN lock was even taken in those functions.
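The reentrancy pitfall is easy to demonstrate in isolation with a plain {{ReentrantLock}} (this is just the JDK behavior, not HDFS code): each {{lock()}} increments a hold count, and one {{unlock()}} only decrements it.

```java
import java.util.concurrent.locks.ReentrantLock;

public class HoldCountSketch {
    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        lock.lock();     // reentrant: the same thread acquires a second time
        lock.unlock();   // one unlock does NOT release the lock...
        System.out.println("held after one unlock: " + lock.isHeldByCurrentThread());
        lock.unlock();   // ...you must unlock once per lock() call
        System.out.println("held after two unlocks: " + lock.isHeldByCurrentThread());
    }
}
```

So a helper that "helpfully" calls unlock once on the caller's behalf may leave the lock held, or release it earlier than the caller expects.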

Minor: we don't need to do anything with CacheManager in {{FSNamesystem#stopCommonServices}},
since we do it in {{FSNamesystem#stopActiveServices}}.  I also fixed a few cases where we
had more lock blocks than needed.

> remove dfs.namenode.caching.enabled
> -----------------------------------
>                 Key: HDFS-5651
>                 URL: https://issues.apache.org/jira/browse/HDFS-5651
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>    Affects Versions: 3.0.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5651.001.patch, HDFS-5651.002.patch, HDFS-5651.003.patch, HDFS-5651.004.patch,
> We can remove dfs.namenode.caching.enabled and simply always enable caching, similar
> to how we do with snapshots and other features.  The main overhead is the size of the cachedBlocks
> GSet.  However, we can simply make the size of this GSet configurable, and people who don't
> want caching can set it to a very small value.

This message was sent by Atlassian JIRA
