hadoop-hdfs-issues mailing list archives

From "Manoj Govindassamy (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-10480) Add an admin command to list currently open files
Date Fri, 26 May 2017 08:17:05 GMT

     [ https://issues.apache.org/jira/browse/HDFS-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manoj Govindassamy updated HDFS-10480:
--------------------------------------
    Attachment: HDFS-10480.05.patch

bq. One high-level question first, what do we envision as the usecases for this command? I
figured it was for: Debugging lease manager state
That's right. The primary use of this fix is to provide an admin command to debug LeaseManager
state and a diagnostics platform for issues around open files. There have been several
cases where stale files stayed open for a very long time without any data being actively
written to them. Finding open files with fsck is very time consuming and degrades
cluster performance. The proposed admin command is very lightweight and lists all open files
along with client details. The admin can then decide whether to run lease recovery if needed.

bq. Finding open files that are blocking decommission
Yes. The plan is to extend the above admin command to help diagnose decommissioning and maintenance
state issues arising from open files. HDFS-11847 will take care of this.

bq. We probably shouldn't skip erroneous leases:
True. Files that have a valid lease but are not in the under-construction state might be useful
for diagnosis. But the client name and machine details are part of the UnderConstruction feature
in the INode. So for non-UC files with leases, shall we instead show a warning or error message
in place of the client name and machine?

bq. For the second, the admin is wondering why some DN hasn't finished decomming yet, and
wants to find the UC blocks and the client and path. It looks like HDFS-11847 will make this
easy, without needing to resort to fsck. Nice. But what's the workflow where we need HDFS-11848?
This new command is much lighter weight than fsck -openforwrite, so I'd like to encourage
users to use the new command instead. Just wondering, before we add some new functionality.
This is an enhancement to the first use case, to make the dfsadmin -listOpenFiles command even
lighter weight and easier to use. When the number of open files is huge, listing them all with
the dfsadmin command, though lightweight, might take several iterations to report the entire
list. If the admin is interested only in specific paths, listing open files under a path could
be much faster and produce a response list that is easier to read. Anyway, I'm open to
discussion on the need for this enhancement.
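
As a rough illustration of the path-scoped listing discussed above, the sketch below filters an
already collected list of open file paths down to those under a given directory. The class and
method names are hypothetical and are not taken from any patch; a real implementation would
presumably apply the filter on the NameNode side rather than on the client.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OpenFilesPathFilter {
    /** True if path equals dir or lies below it; "/a/bc" is not under "/a/b". */
    static boolean isUnder(String path, String dir) {
        if (dir.equals("/")) {
            return path.startsWith("/");
        }
        // Normalize a trailing slash so "/a/b/" and "/a/b" behave the same.
        String d = dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
        return path.equals(d) || path.startsWith(d + "/");
    }

    /** Returns the open files that live under dir, preserving input order. */
    static List<String> filter(List<String> openFiles, String dir) {
        List<String> out = new ArrayList<>();
        for (String p : openFiles) {
            if (isUnder(p, dir)) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample paths, for illustration only.
        List<String> open = Arrays.asList("/user/a/f1", "/user/ab/f2", "/tmp/f3");
        System.out.println(filter(open, "/user/a"));
    }
}
```

Matching on a whole path component, rather than a raw string prefix, keeps /user/ab out of a
listing for /user/a.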

bq. Maybe bump the NUM_RESPONSES limit to 1000, to match DFS_LIST_LIMIT?
Done.

bq. Should the precondition check for NUM_RESPONSES check for > 0 rather than >= 0 ?
FWIW, 0 is also not a positive integer.
That's right. Zero response entries don't make sense. Changed it to > 0.
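
For illustration, a strictly positive precondition could look like the sketch below. It uses
plain Java rather than any particular preconditions library, and the helper name checkPositive
is hypothetical; only the NUM_RESPONSES name and the value 1000 come from this thread.

```java
final class BatchSizeCheck {
    // The limit discussed in this thread; where it is defined is not shown here.
    static final int NUM_RESPONSES = 1000;

    /** Returns value unchanged, or throws if it is zero or negative. */
    static int checkPositive(int value, String name) {
        if (value <= 0) {
            throw new IllegalArgumentException(name + " must be > 0, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(checkPositive(NUM_RESPONSES, "NUM_RESPONSES"));
    }
}
```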

bq. Based on HDFS-9395, we should only generate an audit event when the op is successful or
fails due to an ACE. Notably, it should not log for things like an IOE.
Done. Followed the usual pattern.
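
A minimal sketch of that audit rule, under stated assumptions: SketchAccessControlException
stands in for Hadoop's AccessControlException, and the in-memory auditLog list stands in for
the real audit logger. None of these names come from the patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class AuditPatternSketch {
    /** Hypothetical stand-in for an access control failure. */
    static class SketchAccessControlException extends IOException {
        SketchAccessControlException(String msg) {
            super(msg);
        }
    }

    /** An operation that may fail with an IOException. */
    interface Op {
        void run() throws IOException;
    }

    // Hypothetical in-memory audit sink, for illustration only.
    static final List<String> auditLog = new ArrayList<>();

    /** Audits success and access-control failures; other IOExceptions are not audited. */
    static void runAudited(String cmd, Op op) throws IOException {
        try {
            op.run();
        } catch (SketchAccessControlException e) {
            auditLog.add("allowed=false cmd=" + cmd);
            throw e;
        }
        // Reached only on success; any other IOException has already propagated.
        auditLog.add("allowed=true cmd=" + cmd);
    }

    public static void main(String[] args) throws IOException {
        runAudited("listOpenFiles", () -> { });
        System.out.println(auditLog);
    }
}
```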

bq. LeaseManager#getUnderConstructionFiles makes a new TreeMap out of leasesById. This is
potentially a lot of garbage. Can we make leasesById a TreeMap instead to avoid this? TreeMaps
still have pretty good performance.
Done. I was worried about the performance of the LeaseManager with the HashMap switched to a
TreeMap, since HashMap has better put/get performance than TreeMap. But if the difference is
not significant for the predominant use case of, say, open files on the order of thousands,
then we should be OK.
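
To show why a sorted map helps here, the sketch below pages through a TreeMap keyed by id
using tailMap, so each batch resumes after the last id returned without copying or re-sorting
the whole map. The map contents, the nextBatch name and the use of String paths as values are
assumptions for illustration, not the LeaseManager's actual data structures.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class SortedLeaseBatches {
    /** Returns up to batchSize entries whose id is strictly greater than prevId. */
    static List<Map.Entry<Long, String>> nextBatch(
            NavigableMap<Long, String> leasesById, long prevId, int batchSize) {
        List<Map.Entry<Long, String>> batch = new ArrayList<>();
        // tailMap is a view over the existing tree, so no new map is built.
        for (Map.Entry<Long, String> e : leasesById.tailMap(prevId, false).entrySet()) {
            if (batch.size() >= batchSize) {
                break;
            }
            batch.add(new AbstractMap.SimpleImmutableEntry<>(e));
        }
        return batch;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> leasesById = new TreeMap<>();
        for (long id = 1; id <= 5; id++) {
            leasesById.put(id, "/file" + id); // hypothetical ids and paths
        }
        long cursor = 0;
        List<Map.Entry<Long, String>> batch;
        // Keep fetching until a batch comes back empty.
        while (!(batch = nextBatch(leasesById, cursor, 2)).isEmpty()) {
            System.out.println(batch);
            cursor = batch.get(batch.size() - 1).getKey();
        }
    }
}
```

The trade-off is the one raised above: TreeMap put/get cost O(log n) instead of HashMap's
expected O(1), in exchange for ordered iteration from an arbitrary starting id.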


bq. Can we also add an assert that the FSN read lock is held?
Done.

bq. Testing:
bq. I like the step-up/step-down with the open and closed file sets. Could we take the verification
one step further, and do it in a for-loop? This way we test all the way from 0..numOpenFiles
rather than just at numOpenFiles and numOpenFiles/2
Done. Also moved the utils to DFSTestUtil to reduce code duplication.

bq. testListOpenFilesInHA, it'd be nice to see what happens when there's a failover between
batches while iterating. I also suggest perhaps moving this into TestListOpenFiles since it
doesn't really relate to append.
Moved the test to TestListOpenFiles. We will need some kind of delay simulator during listing
to effectively test listing and failover in parallel. I will take this up as part
of HDFS-11847, if that is OK with you.

bq. Do we have any tests for the HdfsAdmin API? It'd be better to test against this than the
one in DistributedFileSystem, since our end users will be programming against HdfsAdmin.
Done. Added a test in TestHdfsAdmin.


> Add an admin command to list currently open files
> -------------------------------------------------
>
>                 Key: HDFS-10480
>                 URL: https://issues.apache.org/jira/browse/HDFS-10480
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kihwal Lee
>            Assignee: Manoj Govindassamy
>         Attachments: HDFS-10480.02.patch, HDFS-10480.03.patch, HDFS-10480.04.patch, HDFS-10480.05.patch,
HDFS-10480-trunk-1.patch, HDFS-10480-trunk.patch
>
>
> Currently there is no easy way to obtain the list of active leases or files being written.
> It will be nice if we have an admin command to list open files and their lease holders.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

