hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
Date Thu, 18 Jul 2013 18:06:49 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712579#comment-13712579 ]

Colin Patrick McCabe commented on HDFS-4949:
--------------------------------------------

As Todd, Andrew, and I said before, all of the designs we considered that treated what was
in the cache as replicas suffered from an inability to revoke the client's access to this
memory.  If you pass the client a file descriptor to a file in {{/dev/shm}}, you cannot revoke
access to that later on.  The client can hold on to that memory forever.  That alone is enough
to throw out that design.
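
To make the revocation problem concrete, here is a minimal sketch, assuming a POSIX system. It is not HDFS code: the function name is hypothetical, and a temporary file stands in for a cache segment that would live in {{/dev/shm}}. It shows that deleting the file does not take away a descriptor that has already been handed out.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Open a file, remove its name, then read through the still-open descriptor."""
    # Hypothetical stand-in for a cache segment; a real one would live in /dev/shm.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        # The owner tries to "revoke" access by deleting the file (POSIX semantics).
        os.unlink(path)
        assert not os.path.exists(path)
        # Whoever holds the descriptor can still read every byte.
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, len(payload))
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(read_after_unlink(b"cached block data"))
```

The memory behind the descriptor is only released once the last holder closes it, which is the behavior described above.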

To avoid this, we have to use mmap of a file on disk.  And when you do that, it can no longer
be abstracted as a replica, because the on-disk copy has to exist.  It is, at best, a property
of an existing replica.
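
As a rough illustration of mapping a file on disk, the sketch below maps a file read-only and reads from the mapping. The function names and the temporary file (a stand-in for a replica's block file) are assumptions for illustration only, not the datanode implementation. The mapped pages are backed by the on-disk copy, so the mapping only makes sense while that copy exists.

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str, length: int) -> bytes:
    """Map an on-disk file read-only and return its first `length` bytes."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the file must be non-empty.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return bytes(view[:length])


def demo() -> bytes:
    # Hypothetical stand-in for a replica's block file on a datanode disk.
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"block file contents")
        os.close(fd)
        return read_via_mmap(path, 5)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```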

Just as important, caching decisions also have to be made on a different timescale than decisions
about hierarchical storage management.  HSM decisions can be made over the course of minutes
or hours; caching decisions have to be made in seconds to be relevant.

Memory is not a storage tier.  It doesn't store anything; rather, it caches.  Does it make
sense to fsck memory?  That is silly.
                
> Centralized cache management in HDFS
> ------------------------------------
>
>                 Key: HDFS-4949
>                 URL: https://issues.apache.org/jira/browse/HDFS-4949
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: caching-design-doc-2013-07-02.pdf
>
>
> HDFS currently has no support for managing or exposing in-memory caches at datanodes.
> This makes it harder for higher level application frameworks like Hive, Pig, and Impala to
> effectively use cluster memory, because they cannot explicitly cache important datasets or
> place their tasks for memory locality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
