hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
Date Fri, 12 Jul 2013 23:33:49 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13707546#comment-13707546 ]

Todd Lipcon commented on HDFS-4949:

Hey folks. I agree that HSM is a much bigger task than what we're talking about here, and
I'm not certain they can fit into the same framework. During our early internal design discussions
I'd suggested the same thing, but after an hour or two of throwing the idea around, we discounted
it for the reasons Colin mentioned above (partial caching and revocation).

Though partial caching isn't referenced in the doc, it's a straightforward extension that
we plan to tackle down the road. For example, we can take each block, subdivide it into 1MB chunks,
and then report a bitmap indicating which chunks are cached. Taking advantage of the kernel
lets us do this relatively easily by calling mlock/munlock -- and the revocation problem is again
simple because a misbehaving client won't be able to pin memory.
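
The chunk bitmap described above could look roughly like the following. This is only an illustrative sketch, not code from HDFS or from the design doc: the names CHUNK_SIZE, build_bitmap, and is_cached are hypothetical, and it assumes a chunk is reported as cached only when it is fully covered by a locked byte range (with the final short chunk counted when the range reaches the end of the block). The actual mlock/munlock calls are left out.

```python
CHUNK_SIZE = 1 << 20  # 1MB chunks, as suggested in the comment (assumed constant)


def chunk_count(block_len: int) -> int:
    """Number of 1MB chunks needed to cover a block of block_len bytes."""
    return (block_len + CHUNK_SIZE - 1) // CHUNK_SIZE


def build_bitmap(block_len: int, cached_ranges) -> bytearray:
    """Return one bit per chunk; a bit is set only if the whole chunk
    lies inside one cached (start, end) byte range (end is exclusive)."""
    n = chunk_count(block_len)
    bitmap = bytearray((n + 7) // 8)
    for start, end in cached_ranges:
        end = min(end, block_len)
        # First chunk that begins at or after `start`.
        first = (start + CHUNK_SIZE - 1) // CHUNK_SIZE
        # The final (possibly short) chunk counts if the range reaches block end.
        last = n if end >= block_len else end // CHUNK_SIZE
        for i in range(first, last):
            bitmap[i // 8] |= 1 << (i % 8)
    return bitmap


def is_cached(bitmap: bytearray, offset: int) -> bool:
    """True if the chunk containing byte `offset` is marked as cached."""
    chunk = offset // CHUNK_SIZE
    return bool(bitmap[chunk // 8] & (1 << (chunk % 8)))
```

A reader could then check is_cached(bitmap, offset) before deciding whether a read at that offset is likely to be served from memory.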

I don't think this work precludes later work on the idea of memory-only storages/replicas.
That has other advantages, particularly on the *write* side for temporary data, etc., but it is
somewhat tricky to get right. When we do that, we should certainly look at it in a generalized
way (RAM, SSD, Disk as a hierarchy).
> Centralized cache management in HDFS
> ------------------------------------
>                 Key: HDFS-4949
>                 URL: https://issues.apache.org/jira/browse/HDFS-4949
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: caching-design-doc-2013-07-02.pdf
> HDFS currently has no support for managing or exposing in-memory caches at datanodes.
This makes it harder for higher level application frameworks like Hive, Pig, and Impala to
effectively use cluster memory, because they cannot explicitly cache important datasets or
place their tasks for memory locality.

