hadoop-hdfs-issues mailing list archives

From "Andrew Wang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5096) Automatically cache new data added to a cached path
Date Sat, 12 Oct 2013 02:01:43 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13793207#comment-13793207 ]

Andrew Wang commented on HDFS-5096:

This is only a partial review, but I wanted to dump my review comments thus far before the
weekend. Mostly small comments, still wrapping my head around all the changes. Overall it
feels pretty good though.

* Looks like the {{TODO: support loading and storing}} snuck back in during a rebase
* I'd lower the {{cachedBlocks}} GSet size to something like 1 or 2%. This makes sense considering
the blocksMap is set to 50% of heap, and there should be at least an order of magnitude fewer
cached blocks.
* Is adding the iterator to LightWeightCache necessary? Seems like that should go on trunk
instead, and there are no callers.
* Still some uncapitalized log/exception messages. I used this to check:
git diff | egrep "\+(.*)(Exception|LOG)(.*)\(\"[a-z]"

* Seems like a generally useful class, could we put this in common?
* It'd be nice to reuse Java's {{List}} interface as much as possible for consistency.
** Instead of {{addToFront}} and {{addToBack}}, it looks like we're only using {{addToBack}}
(thus {{add}}). {{List}} suggests {{add(int, Element)}} for a positional add, which is optional
** {{removeElement}} returns the next element which isn't part of the normal {{remove}} contract.
Could also be named simply {{remove}}
** {{getLength}} to {{size}}
** etc etc
* Actually, the absence of {{get}}, {{set}}, and {{indexOf}} methods means this isn't really
a list, right? It feels more like a double-ended queue ({{Deque}}). If that interface matches
your intent better, maybe copy it instead.
* Need javadoc on all public methods
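
To make the renaming concrete, here's a minimal sketch of an intrusive doubly-linked list with {{Deque}}-style names ({{add}}/{{remove}}/{{size}}) in place of {{addToBack}}/{{removeElement}}/{{getLength}}. All class and method names here are illustrative, not identifiers from the patch:

```java
// Illustrative only: each Node embeds its own prev/next pointers
// (intrusive), and the list exposes Deque-style naming.
class Node {
  Node prev, next;
  final long id;
  Node(long id) { this.id = id; }
}

public class IntrusiveDeque {
  private Node head, tail;
  private int size;

  public void add(Node n) {            // was addToBack
    n.prev = tail;
    n.next = null;
    if (tail != null) tail.next = n; else head = n;
    tail = n;
    size++;
  }

  // was removeElement; note it still returns the next node, which is
  // handy for iteration but departs from java.util's remove() contract.
  public Node remove(Node n) {
    Node following = n.next;
    if (n.prev != null) n.prev.next = n.next; else head = n.next;
    if (n.next != null) n.next.prev = n.prev; else tail = n.prev;
    n.prev = n.next = null;
    size--;
    return following;
  }

  public int size() { return size; }   // was getLength
  public Node peekFirst() { return head; }

  public static void main(String[] args) {
    IntrusiveDeque d = new IntrusiveDeque();
    Node a = new Node(1), b = new Node(2), c = new Node(3);
    d.add(a); d.add(b); d.add(c);
    System.out.println(d.size());          // 3
    System.out.println(d.remove(b).id);    // 3 (the node after b)
    System.out.println(d.peekFirst().id);  // 1
  }
}
```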

* Should probably be named {{TestCachedBlocksLists}} and moved to the same package as {{CachedBlocksList}}.
Or you could test a more generic implementation.
* Rather than pseudo-randomly choosing between removing via the iterator and removing directly,
let's just test each case deterministically.
* The test lists aren't big enough to make the random testing effective. I'd expect to see
O(1000) elements, not 4.
* How about some more elaborate random tests that do things like addition, removal, and traversal
O(1000) times? Can compare against a {{LinkedList}} or something.
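
A randomized differential test could look something like the sketch below: drive the implementation under test and a reference {{LinkedList}} with the same random operations and check they agree. {{ArrayDeque}} stands in here for the intrusive list; in the real test it would be the class under review:

```java
import java.util.ArrayDeque;
import java.util.LinkedList;
import java.util.Random;

public class RandomDequeTest {
  public static void main(String[] args) {
    Random rand = new Random(0xCAFE);       // fixed seed for reproducibility
    ArrayDeque<Integer> impl = new ArrayDeque<>();  // stand-in for the list under test
    LinkedList<Integer> ref = new LinkedList<>();   // trusted reference
    for (int i = 0; i < 1000; i++) {
      if (ref.isEmpty() || rand.nextBoolean()) {
        int v = rand.nextInt();
        impl.addLast(v);                    // same operation on both...
        ref.addLast(v);
      } else {
        int got = impl.removeFirst();       // ...and results must match
        int want = ref.removeFirst();
        if (got != want) throw new AssertionError(got + " != " + want);
      }
      if (impl.size() != ref.size()) throw new AssertionError("size mismatch");
    }
    System.out.println("OK: " + impl.size() + " elements remain");
  }
}
```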

* Although it has a {{triplets}} array, it's formatted differently from the triplets array
in {{BlockInfo}}. CachedBlock goes {{prev, item, next}}, while {{BlockInfo}} goes {{item,
prev, next}}. This is pretty confusing; let's make it like {{BlockInfo}}.
* The format and contents of {{triplets}} should also be explained in a comment.
* Having the Types for a cached block is also unfortunate
* Javadoc for {{getDatanodes}} mentions cached or planning to be cached; should also mention
planning to be uncached, or just say "of a certain type".
* A little comment before {{setNext}} and {{getNext}} saying they're for LightWeightGSet would
be nice.
* Class javadoc should explain how GSet and IntrusiveList are used, took me a while to understand
the backreferences to the containing list. It's tricky code.
* The overrides for Element accidentally increased visibility from package-private to public
* We should check for setting an invalid replication factor, especially since we're re-using
the top bit for the mark.
* I still think we should code share somehow with BlockInfo, having all this triplets logic
twice sucks. This could go into a trunk refactor patch.
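
For reference, a sketch of the {{item, prev, next}} convention as {{BlockInfo}} lays it out: entry {{i}} occupies {{triplets[3*i .. 3*i+2]}}, with the item first. Aligning {{CachedBlock}} to the same order would avoid the confusion. Names and indices below are illustrative, not copied from the patch:

```java
public class TripletsDemo {
  // For entry i: triplets[3*i] = item, [3*i+1] = prev, [3*i+2] = next,
  // matching BlockInfo's ordering.
  private final Object[] triplets;

  public TripletsDemo(int capacity) {
    triplets = new Object[3 * capacity];
  }

  Object getItem(int i) { return triplets[3 * i]; }
  Object getPrev(int i) { return triplets[3 * i + 1]; }
  Object getNext(int i) { return triplets[3 * i + 2]; }

  void set(int i, Object item, Object prev, Object next) {
    triplets[3 * i] = item;
    triplets[3 * i + 1] = prev;
    triplets[3 * i + 2] = next;
  }

  public static void main(String[] args) {
    TripletsDemo t = new TripletsDemo(2);
    t.set(0, "dn1", null, "blockB");
    System.out.println(t.getItem(0)); // dn1
    System.out.println(t.getNext(0)); // blockB
  }
}
```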

* I think we should be taking the write lock, not the read lock in {{#rescan}}, since in {{rescanCachedBlockMap}}
we're modifying things, e.g. {{removeElement}}, {{addNewPendingCached}}.
* Javadoc incorrect on {{#rescanCachedBlockMap}}, those parameters don't exist
* Rename {{neededReplication}} to {{neededCached}} for uniformity
* The comment for {{neededReplication >= numCached}} needs to be updated (it's just copy-pasted
from above)
* The {{neededCached}} case is passing in {{pendingUncached}} instead of {{pendingCached}}
* I think it's okay to populate the pending queues and do all the tracking even in standby
mode, since it'd be good to take over these duties immediately on failover. We just shouldn't
send any datanode operations.
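
On the locking point, the pattern is the usual one sketched below with a plain {{ReentrantReadWriteLock}} (the NN of course uses the FSNamesystem lock; this is just to illustrate why a mutating rescan needs the write lock, and the names are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RescanLocking {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  final Queue<Long> pendingCached = new ArrayDeque<>();

  void rescan() {
    // Write lock, not read lock: the rescan adds/removes queue entries,
    // and concurrent readers must not observe the mutation mid-flight.
    lock.writeLock().lock();
    try {
      pendingCached.add(42L);
    } finally {
      lock.writeLock().unlock();
    }
  }

  public static void main(String[] args) {
    RescanLocking r = new RescanLocking();
    r.rescan();
    System.out.println(r.pendingCached.size()); // 1
  }
}
```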

* In {{processCacheReport}}, {{shouldReply}} is unused.
* This merged class no longer uses the ReportProcessor class we refactored out from BlockManager.
I like ReportProcessor, but it's of dubious utility now. We should either punt this change
out to trunk, or revert it to keep our diff down.
* Is there some way to enforce that a block is never present on more than one of a datanode's {{CachedBlocksList}}s?

    CachedBlock cachedBlock =
        new CachedBlock(block.getBlockId(), (short)0, false);
    CachedBlock prevCachedBlock = cachedBlocks.get(cachedBlock);
    if (prevCachedBlock != null) {
      cachedBlock = prevCachedBlock;
    }
* This would be clearer without {{prevCachedBlock}}
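
Something along these lines, i.e. do the lookup first and keep a single variable, falling back to the freshly built candidate on a miss. In the sketch below {{Block}} stands in for {{CachedBlock}} and {{HashMap}} for the {{LightWeightGSet}}; all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

class Block {
  final long id;
  Block(long id) { this.id = id; }
  @Override public boolean equals(Object o) {
    return o instanceof Block && ((Block) o).id == id;
  }
  @Override public int hashCode() { return Long.valueOf(id).hashCode(); }
}

public class LookupDemo {
  static Map<Block, Block> cachedBlocks = new HashMap<>();

  static Block getOrCreate(long blockId) {
    Block candidate = new Block(blockId);
    Block cachedBlock = cachedBlocks.get(candidate);
    if (cachedBlock == null) {
      cachedBlock = candidate;              // miss: start tracking the new block
      cachedBlocks.put(cachedBlock, cachedBlock);
    }
    return cachedBlock;                     // hit: reuse the existing entry
  }

  public static void main(String[] args) {
    Block first = getOrCreate(7);
    Block again = getOrCreate(7);
    System.out.println(first == again);     // true: same tracked object
  }
}
```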

* Test for adding a PCE for a path with repl 1, verify, new PCE with repl 3 for same path,
verify, remove PCE repl 3, verify.
* Different namespace operations for paranoia: create, delete, rename
* Writing more tests in general would be a good way of staying busy, since I still need more
time to review thoroughly.

> Automatically cache new data added to a cached path
> ---------------------------------------------------
>                 Key: HDFS-5096
>                 URL: https://issues.apache.org/jira/browse/HDFS-5096
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: datanode, namenode
>            Reporter: Andrew Wang
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5096-caching.005.patch, HDFS-5096-caching.006.patch
> For some applications, it's convenient to specify a path to cache and have HDFS automatically
> cache new data added to that path, without requiring a new caching request or a manual refresh.
> One example is new data appended to a cached file. It would be nice to re-cache a block
> at the new appended length, and cache new blocks added to the file.
> Another example is a cached Hive partition directory, where a user can drop new files
> directly into the partition. It would be nice if these new files were cached.
> In both cases, this automatic caching would happen after the file is closed, i.e., after
> the block replica is finalized.

This message was sent by Atlassian JIRA
