hadoop-common-issues mailing list archives

From "Tsz Wo (Nicholas), SZE (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-6640) FileSystem.get() does RPC retries within a static synchronized block
Date Fri, 19 Mar 2010 18:02:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847478#action_12847478 ]

Tsz Wo (Nicholas), SZE commented on HADOOP-6640:

When the FileSystem cache is enabled, FileSystem.get(..) calls FileSystem.Cache.get(..), which
is a synchronized method. If the lookup fails, a new instance is initialized while the lock is
still held. Depending on the FileSystem subclass implementation, the initialization may take a
long time. In that case, the FileSystem.Cache lock is held for the whole initialization and all
calls to FileSystem.get(..) by other threads are blocked until it finishes.
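
The blocking behavior described above can be reproduced with a small stand-in. This is only a sketch: BlockingCache and BlockingCacheDemo are hypothetical names, not the Hadoop source, and the "slow" key simply simulates an initialization that is stuck in retries.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical stand-in for FileSystem.Cache; not the Hadoop implementation. */
class BlockingCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final Function<K, V> initializer;

    BlockingCache(Function<K, V> initializer) {
        this.initializer = initializer;
    }

    // The lock is held for the whole call, including the initialization.
    synchronized V get(K key) {
        V value = map.get(key);
        if (value == null) {
            value = initializer.apply(key); // may be slow (e.g. connection retries)
            map.put(key, value);
        }
        return value;
    }
}

class BlockingCacheDemo {
    /** Returns true if a lookup for "fast" had to wait behind the "slow" initialization. */
    static boolean fastLookupIsBlocked() {
        CountDownLatch slowInitStarted = new CountDownLatch(1);
        CountDownLatch releaseSlowInit = new CountDownLatch(1);
        BlockingCache<String, String> cache = new BlockingCache<String, String>(key -> {
            if (key.equals("slow")) {
                slowInitStarted.countDown();
                try {
                    releaseSlowInit.await(); // simulates a hanging initialization
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
            return "fs-for-" + key;
        });
        try {
            Thread slow = new Thread(() -> cache.get("slow"));
            slow.start();
            slowInitStarted.await(); // the slow thread now holds the cache lock

            CountDownLatch fastDone = new CountDownLatch(1);
            Thread fast = new Thread(() -> {
                cache.get("fast");
                fastDone.countDown();
            });
            fast.start();
            // The "fast" key needs no slow work, yet it cannot get past the lock.
            boolean finishedEarly = fastDone.await(200, TimeUnit.MILLISECONDS);

            releaseSlowInit.countDown(); // let the slow initialization finish
            slow.join();
            fast.join();
            return !finishedEarly;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling BlockingCacheDemo.fastLookupIsBlocked() shows that the lookup for the unrelated "fast" key does not complete until the "slow" initialization is released, which is the same pattern as one unresponsive NameNode stalling get() calls for other file systems.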

In particular, the DistributedFileSystem initialization may take a long time since there are
retries. It is even worse if the socket timeout is set to a large value.

There are two possible fixes for the problem:

# (by Sanjay) Change FileSystem.Cache.get(..) so that if the lookup fails, it first releases
the lock, initializes a FileSystem instance, acquires the lock again, and then adds the instance
to the cache.  One problem is that if a user application keeps calling FileSystem.get(..)
for the same FileSystem within a short period of time, many instances will be initialized.

# Change DistributedFileSystem so that it connects lazily: it defers connecting to
the server until the first RPC is made.  A drawback is that this only fixes DistributedFileSystem
but not other FileSystem subclasses.
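
Both options can be sketched roughly as follows. This is an illustration under assumptions, not a proposed patch: UnlockedInitCache, LazyClient, the initializations counter, and the connector are hypothetical names introduced here, and a real fix would also have to decide what to do with the extra instance that loses the race in option 1.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Supplier;

/** Option 1 (hypothetical sketch): initialize outside the lock, re-check before inserting. */
class UnlockedInitCache<K, V> {
    private final Map<K, V> map = new HashMap<>();
    private final Function<K, V> initializer;
    /** Counts how many instances were created, to make the drawback visible. */
    final AtomicInteger initializations = new AtomicInteger();

    UnlockedInitCache(Function<K, V> initializer) {
        this.initializer = initializer;
    }

    V get(K key) {
        synchronized (this) {
            V cached = map.get(key);
            if (cached != null) {
                return cached;
            }
        }
        // Lock released here: a slow initialization no longer blocks other lookups.
        V created = initializer.apply(key);
        initializations.incrementAndGet();
        synchronized (this) {
            V winner = map.get(key);
            if (winner != null) {
                // Another thread inserted first; 'created' is a wasted instance.
                return winner;
            }
            map.put(key, created);
            return created;
        }
    }
}

/** Option 2 (hypothetical sketch): construction is cheap, the connection is made on first use. */
class LazyClient {
    private final Supplier<String> connector; // stands in for the real connection setup
    private String connection;

    LazyClient(Supplier<String> connector) {
        this.connector = connector; // no network activity in the constructor
    }

    synchronized boolean isConnected() {
        return connection != null;
    }

    synchronized String call(String request) {
        if (connection == null) {
            connection = connector.get(); // slow work is deferred to the first call
        }
        return connection + ":" + request;
    }
}
```

In the first class, several threads that miss the cache for the same key at the same time each run the initializer, which is the drawback noted for option 1. In the second, the slow work moves out of construction, so a cache lock held during construction is held only briefly, but every other subclass would need the same change.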

> FileSystem.get() does RPC retries within a static synchronized block
> --------------------------------------------------------------------
>                 Key: HADOOP-6640
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6640
>             Project: Hadoop Common
>          Issue Type: Bug
>         Environment: all
>            Reporter: Alejandro Abdelnur
>            Priority: Critical
> If using FileSystem.get() in a multithreaded environment, and one get() blocks because
the NN URI is too slow or not responding and retries are in progress, all other get() calls (for
different users or NNs) are blocked.
> The synchronized block is in the static instance of the Cache inner class.

