hadoop-hdfs-issues mailing list archives

From "Sanjay Radia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
Date Thu, 18 Jul 2013 21:08:52 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712856#comment-13712856

Sanjay Radia commented on HDFS-4949:

bq.  we have to use mmap of a file on disk.
Please look at my comments: I have not objected to mmap and mlock.
I am fine with having RAM replicas backed by a disk replica; indeed, I see this as an important
advantage over ramfs, where the data is copied. The replication abstraction allows for a more
general view where they are not, but our implementation restricts the memory replicas to be
backed by disk replicas.
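
As a rough illustration of a memory view backed by a file on disk (rather than a separate copy, as with ramfs), here is a minimal sketch using Python's standard mmap module. The temporary file stands in for an on-disk block replica, and the helper name map_block_readonly is hypothetical; this is not the HDFS implementation, and pinning the pages with mlock is not shown.

```python
import mmap
import os
import tempfile


def map_block_readonly(path):
    """Map a (hypothetical) block file read-only; returns (file, mapping)."""
    f = open(path, "rb")
    # Length 0 maps the whole file; pages come from the file itself
    # instead of being copied into a separate in-memory file system.
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return f, mapped


# A temporary file stands in for the disk replica of a block.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hdfs block bytes")
    block_path = tmp.name

block_file, block_view = map_block_readonly(block_path)
first_four = block_view[:4]   # read through the mapping
mapped_len = len(block_view)  # mapping spans the whole file

# Release the mapping and the file before deleting the stand-in replica.
block_view.close()
block_file.close()
os.remove(block_path)
```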

bq. In general, tiered storage management happens over a longer period of time than cache
The term "tiered storage" is unfortunate (I misused it in my original comment). In HDFS-2832,
we consciously used the term "heterogeneous storage" and not "tiered storage". Tiering, as
in "moving things based on their hotness", is policy. (BTW, I envision using SSDs initially
not for moving hot blocks but as storage for *one* of 3 replicas. I have discussed this use
case with a few of the HBase folks.) Caching is a use case that applies well to disk vs. RAM.
Both use cases fit the abstraction of replicas stored on different kinds of
storage devices.
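
To make the abstraction concrete, here is a minimal sketch of replicas tagged with the kind of storage they live on, with RAM replicas restricted to being backed by a disk replica. All names (StorageKind, Replica, add_ram_replica) are hypothetical and chosen only for illustration; they are not HDFS classes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class StorageKind(Enum):
    DISK = "disk"
    SSD = "ssd"
    RAM = "ram"


@dataclass(frozen=True)
class Replica:
    block_id: int
    node: str
    kind: StorageKind
    # For a RAM replica: the disk replica it is backed by (assumed restriction).
    backing: Optional["Replica"] = None


def add_ram_replica(replicas: List[Replica], disk: Replica) -> Replica:
    """Add a RAM replica of a block, backed by an existing disk replica."""
    if disk.kind is not StorageKind.DISK:
        raise ValueError("a RAM replica must be backed by a disk replica")
    ram = Replica(disk.block_id, disk.node, StorageKind.RAM, backing=disk)
    replicas.append(ram)
    return ram


# One block with 3 replicas, one of them on SSD, plus a cached RAM replica.
replicas = [
    Replica(1, "dn1", StorageKind.DISK),
    Replica(1, "dn2", StorageKind.SSD),
    Replica(1, "dn3", StorageKind.DISK),
]
ram = add_ram_replica(replicas, replicas[0])
```

The same list holds both the SSD use case and the caching use case; which replicas get created or dropped, and when, is left to policy.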

bq. Memory is not a storage tier. It doesn't store anything; rather, it caches. Does it make
sense to fsck memory? That is silly.
Memory and disks both store data, but one is far more durable. Fsck is a bad example: you run fsck
on a file system, not on a disk. Here we are talking about entities that store HDFS block
data. But this debate over the similarities and differences between RAM and disk is a longer
one that we should have over beer. I am not blind to the differences between disks and RAM.
Further, using the same abstraction to model RAM copies and disk copies does not mean that
I am going to always treat them as exactly the same and ignore the differences.

> Centralized cache management in HDFS
> ------------------------------------
>                 Key: HDFS-4949
>                 URL: https://issues.apache.org/jira/browse/HDFS-4949
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: caching-design-doc-2013-07-02.pdf
> HDFS currently has no support for managing or exposing in-memory caches at datanodes.
This makes it harder for higher level application frameworks like Hive, Pig, and Impala to
effectively use cluster memory, because they cannot explicitly cache important datasets or
place their tasks for memory locality.

