hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6093) Expose more caching information for debugging by users
Date Sat, 15 Mar 2014 08:20:43 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936087#comment-13936087
] 

Colin Patrick McCabe commented on HDFS-6093:
--------------------------------------------

bq. Arpit said: In addition to reducing the timeout as you suggested, can we add some explanation
to the command output, or update CentralizedCacheManagement.html in the docs?

I agree, we could add a short comment to the docs about this.  Now that the timeout has been
reduced, there should be much less discrepancy between the output of the two commands, of
course.

Taking a more detailed look at the patch now.

{code}
+  public FsStatus getCacheStatus() throws IOException {
{code}

I know it seems clever to reuse the same class for getStatus and getCacheStatus, but it could
become a problem if someone later adds more fields to getStatus that don't apply to getCacheStatus.
 I think we need our own type for this, to maintain sanity in the future.  It's not that much
code.

{code}
   public long getMissingBlocksCount() throws IOException {
+    statistics.incrementReadOps(1);
     return dfs.getMissingBlocksCount();
{code}

Can we put the non-caching-related incrementReadOps changes in their own JIRA?  It may seem
like a trivial change, but it's kind of distracting from this JIRA.  Also I'm not sure I understand
when we're "supposed" to increment this...

{code}
  /**
   * Number of replicas pending caching.
   */
  private long numPendingCaching;
  /**
   * Number of replicas pending uncaching.
   */
  private long numPendingUncaching;
{code}

Could use a linebreak after {{numPendingCaching}} for consistency.

Like I said earlier, I'd prefer to decouple the counter(s) that can be read from the CRM from
the counters that the CRM uses internally during the scan.  Using the same variable for both
just invites bugs like the one Arpit pointed out, where rescan zeroes the counter outside
the lock.

{code}
[CacheManager#processCacheReportImpl changes]
{code}

Incrementally updating the pendingUncached list and stats is a nice idea, but it seems too
ambitious for 2.4 at this point.  Now that the CRM interval is 30 seconds, it shouldn't be
too bad to just wait for the CRM to update its stats and the lists.  Additionally, we don't
even know that monitor is non-null at this point, so there is an NPE here, I think.  Let's
leave this out and revisit it later.

> Expose more caching information for debugging by users
> ------------------------------------------------------
>
>                 Key: HDFS-6093
>                 URL: https://issues.apache.org/jira/browse/HDFS-6093
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: caching
>    Affects Versions: 2.4.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-6093-1.patch
>
>
> When users submit a new cache directive, it's unclear if the NN has recognized it and
is actively trying to cache it, or if it's hung for some other reason. It'd be nice to expose
a "pending caching/uncaching" count the same way we expose pending replication work.
> It'd also be nice to display the aggregate cache capacity and usage in dfsadmin -report,
since we already have have it as a metric and expose it per-DN in report output.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Mime
View raw message