Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Sat, 15 Mar 2014 08:20:43 +0000 (UTC)
From: "Colin Patrick McCabe (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12700868.1394585913116.75517.1394871643424@arcas>
In-Reply-To: <JIRA.12700868.1394585913116@arcas>
References: <JIRA.12700868.1394585913116@arcas>
Subject: [jira] [Commented] (HDFS-6093) Expose more caching information for
 debugging by users
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HDFS-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13936087#comment-13936087 ] 

Colin Patrick McCabe commented on HDFS-6093:
--------------------------------------------

bq. Arpit said: In addition to reducing the timeout as you suggested, can we add some explanation to the command output, or update CentralizedCacheManagement.html in the docs?

I agree, we could add a short comment to the docs about this.  Now that the timeout has been reduced, there should be much less discrepancy between the output of the two commands, of course.

Taking a more detailed look at the patch now.

{code}
+  public FsStatus getCacheStatus() throws IOException {
{code}

I know it seems clever to reuse the same class for getStatus and getCacheStatus, but it could become a problem if someone later adds more fields to getStatus that don't apply to getCacheStatus.  I think we need our own type for this, to maintain sanity in the future.  It's not that much code.

{code}
   public long getMissingBlocksCount() throws IOException {
+    statistics.incrementReadOps(1);
     return dfs.getMissingBlocksCount();
{code}

Can we put the non-caching-related incrementReadOps changes in their own JIRA?  It may seem like a trivial change, but it's kind of distracting from this JIRA.  Also I'm not sure I understand when we're "supposed" to increment this...

{code}
  /**
   * Number of replicas pending caching.
   */
  private long numPendingCaching;
  /**
   * Number of replicas pending uncaching.
   */
  private long numPendingUncaching;
{code}

Could use a linebreak after {{numPendingCaching}} for consistency.

Like I said earlier, I'd prefer to decouple the counter(s) that can be read from the CRM from the counters that the CRM uses internally during the scan.  Using the same variable for both just invites bugs like the one Arpit pointed out, where rescan zeroes the counter outside the lock.

{code}
[CacheManager#processCacheReportImpl changes]
{code}

Incrementally updating the pendingUncached list and stats is a nice idea, but it seems too ambitious for 2.4 at this point.  Now that the CRM interval is 30 seconds, it shouldn't be too bad to just wait for the CRM to update its stats and the lists.  Additionally, we don't even know that monitor is non-null at this point, so there is an NPE here, I think.  Let's leave this out and revisit it later.

> Expose more caching information for debugging by users
> ------------------------------------------------------
>
>                 Key: HDFS-6093
>                 URL: https://issues.apache.org/jira/browse/HDFS-6093
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: caching
>    Affects Versions: 2.4.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: hdfs-6093-1.patch
>
>
> When users submit a new cache directive, it's unclear if the NN has recognized it and is actively trying to cache it, or if it's hung for some other reason. It'd be nice to expose a "pending caching/uncaching" count the same way we expose pending replication work.
> It'd also be nice to display the aggregate cache capacity and usage in dfsadmin -report, since we already have have it as a metric and expose it per-DN in report output.


--
This message was sent by Atlassian JIRA
(v6.2#6252)