hadoop-hdfs-issues mailing list archives

From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4949) Centralized cache management in HDFS
Date Sat, 10 Aug 2013 20:18:48 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736019#comment-13736019
] 

Arun C Murthy commented on HDFS-4949:
-------------------------------------

bq. 1. The main reason we added auto-caching of new files was actually for Hive. My understanding
is that Hive users can drop new files into a Hive partition directory without notifying the
Hive metastore, e.g. via the fs shell. 

Usually partitions in Hive are new directories. So every 5 or 10 or 15 mins a new directory
is added along with new data. Hence, the ability to automatically cache new files seems redundant.

bq. 2. We were planning on extending the existing getFileBlockLocations API (which takes a
Path, offset, and length) to also indicate which replicas of the returned blocks are cached.
This should satisfy the needs of framework schedulers like MR or Impala. 

[~andrew.wang] Agree that the enhancement to getFileBlockLocations suffices for the scheduler.
However, at read time it will be very useful to get an indicator on whether it's cached or
not during open. The RecordReader needs this API to decide whether to do stream-based reads
(when data isn't cached in RAM) or mmap the file (when it's cached). It would be unfortunate
to have to do another call to getFileBlockLocations to validate during read time.

For example, SequenceFileRecordReader.initialize would look something like:

{code:title=SequenceFileRecordReader.java}
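  public void initialize(InputSplit split,
                         TaskAttemptContext context
                         ) throws IOException, InterruptedException {

    // ...

    FileSplit fileSplit = (FileSplit) split;

    // Proposed open() variant: reports at open time whether the range is cached.
    StreamOrCached splitData =
        fileSplit.getPath().open(fileSplit.getStart(), fileSplit.getLength());
    InputStream in = null;
    if (splitData.isCached()) {
      // Cached in RAM: read through the mmap'd buffer.
      in = new ByteBufferInputStream(splitData.getByteBuffer());
    } else {
      // Not cached: fall back to stream-based reads.
      in = splitData.getFSDataInputStream();
    }

    // Now use in
    // ...
  }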
{code}

So, having an open API that returns something like StreamOrCached would be useful, as sketched
above.
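
To make the shape of that return type a bit more concrete, here is a rough, self-contained sketch in plain Java with no Hadoop dependencies. Everything in it is hypothetical: StreamOrCached, its factory methods, asInputStream and this ByteBufferInputStream are illustration names only, not existing HDFS APIs, and a plain InputStream stands in for FSDataInputStream.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

/** Hypothetical holder: either a cached (mmap-able) buffer or a plain stream. */
final class StreamOrCached {
  private final ByteBuffer buffer;   // non-null only when the data is cached
  private final InputStream stream;  // non-null only when it is not cached

  private StreamOrCached(ByteBuffer buffer, InputStream stream) {
    this.buffer = buffer;
    this.stream = stream;
  }

  static StreamOrCached cached(ByteBuffer buffer) {
    return new StreamOrCached(buffer, null);
  }

  static StreamOrCached streamed(InputStream stream) {
    return new StreamOrCached(null, stream);
  }

  boolean isCached() {
    return buffer != null;
  }

  ByteBuffer getByteBuffer() {
    if (buffer == null) {
      throw new IllegalStateException("data is not cached");
    }
    return buffer.duplicate(); // independent position, shared contents
  }

  InputStream getStream() {
    if (stream == null) {
      throw new IllegalStateException("data is cached");
    }
    return stream;
  }

  /** What a RecordReader would do: one InputStream regardless of the source. */
  InputStream asInputStream() {
    return isCached() ? new ByteBufferInputStream(getByteBuffer()) : getStream();
  }
}

/** Minimal InputStream view over a ByteBuffer (illustration only). */
final class ByteBufferInputStream extends InputStream {
  private final ByteBuffer buf;

  ByteBufferInputStream(ByteBuffer buf) {
    this.buf = buf;
  }

  @Override
  public int read() {
    // Mask to 0..255 so a byte such as 0xFF is not mistaken for end of stream.
    return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
  }

  @Override
  public int read(byte[] dst, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (!buf.hasRemaining()) {
      return -1;
    }
    int n = Math.min(len, buf.remaining());
    buf.get(dst, off, n);
    return n;
  }
}
```

The point of the single return type is that the cached/not-cached decision is made once, at open time, so the reader does not need a second getFileBlockLocations call to find out which path to take.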

Open to other ideas, but hopefully I put across what I'm looking for.

Thoughts?
                
> Centralized cache management in HDFS
> ------------------------------------
>
>                 Key: HDFS-4949
>                 URL: https://issues.apache.org/jira/browse/HDFS-4949
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Andrew Wang
>            Assignee: Andrew Wang
>         Attachments: caching-design-doc-2013-07-02.pdf, caching-design-doc-2013-08-09.pdf
>
>
> HDFS currently has no support for managing or exposing in-memory caches at datanodes.
This makes it harder for higher level application frameworks like Hive, Pig, and Impala to
effectively use cluster memory, because they cannot explicitly cache important datasets or
place their tasks for memory locality.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
