hadoop-hdfs-issues mailing list archives

From "Virajith Jalaparti (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13069) Enable HDFS to cache data read from external storage systems
Date Tue, 30 Jan 2018 19:45:01 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16345673#comment-16345673 ]

Virajith Jalaparti commented on HDFS-13069:
-------------------------------------------

The design for caching PROVIDED files is as follows:
 1) The onus of caching any file will be on the client. If a particular file has to be cached,
a {{readThrough}} flag will be set to {{true}} as part of the {{CachingStrategy}}.
 2) When a Datanode receives a {{readBlock}} request for a PROVIDED block with the {{readThrough}}
flag set to {{true}}, it creates a {{TEMPORARY}} local replica. As data is streamed back to
the client, it is also written to this temporary replica (synchronously or asynchronously); see the sketch after this list.
 3) When the stream to the client closes, the Datanode attempts to finalize the replica. If
the block was not fully read by the client, the Datanode pages in the remaining data from the
PROVIDED store. The local replica is then finalized and the NN is notified ({{FsDatasetSpi#notifyNamenodeForNewReplica}}
is called).
 4) If any failure occurs between the creation of the temporary replica and the replica
being finalized, the temporary replica will be cleaned up. If the Datanode goes down while this is in
progress, the temporary replica will be handled like any other temporary replica.
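
To make steps (2)-(4) concrete, here is a minimal Java sketch of the Datanode-side read-through path. Everything in it ({{TemporaryReplica}}, {{ProvidedStore}}, {{serveReadThrough}}, the synchronous mirroring) is a hypothetical stand-in for illustration, not existing HDFS code:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class ReadThroughSketch {

  /** Stand-in for the TEMPORARY local replica created in step 2 (hypothetical). */
  interface TemporaryReplica {
    OutputStream getWriter() throws IOException;
    void finalizeReplica() throws IOException;  // step 3: promote to FINALIZED
    void delete();                              // step 4: cleanup on failure
  }

  /** Stand-in for the external (PROVIDED) store the block actually lives in. */
  interface ProvidedStore {
    InputStream open(long blockId) throws IOException;
  }

  /**
   * Streams the first {@code bytesRequested} bytes of a PROVIDED block to the
   * client while mirroring them into a temporary local replica; any bytes the
   * client did not ask for are paged in afterwards so the replica is complete
   * before it is finalized.
   */
  static void serveReadThrough(long blockId, long bytesRequested,
      ProvidedStore store, TemporaryReplica replica,
      OutputStream clientStream) throws IOException {
    byte[] buf = new byte[64 * 1024];
    try (InputStream in = store.open(blockId)) {
      OutputStream local = replica.getWriter();
      long sent = 0;
      int n;
      // Step 2: stream the requested range to the client, mirroring it locally.
      while (sent < bytesRequested
          && (n = in.read(buf, 0, (int) Math.min(buf.length, bytesRequested - sent))) > 0) {
        clientStream.write(buf, 0, n);
        local.write(buf, 0, n);
        sent += n;
      }
      // Step 3: if the client stopped short of the full block, page in the tail
      // from the PROVIDED store so the local replica is complete.
      while ((n = in.read(buf)) > 0) {
        local.write(buf, 0, n);
      }
      local.close();
      replica.finalizeReplica();   // the Datanode would then notify the NN
    } catch (IOException e) {
      replica.delete();            // step 4: any failure before finalization cleans up
      throw e;
    }
  }
}
{code}

The sketch mirrors data synchronously for simplicity; an asynchronous writer would keep the local-disk write off the client's read path, at the cost of more bookkeeping when the stream closes.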

Note that if the Namenode is notified of the new replica in (3) above, it would be deleted
as an excess replica. To prevent this, we will introduce a new kind of per-file replication
factor called "over-replication" with the following semantics: the NN will allow (replication
+ over-replication) replicas to exist for any block. The current semantics of replication
will continue to be enforced (i.e., if the number of replicas of a block is less than the
replication factor, the NN will schedule replications), but no replications will be scheduled
just to meet the over-replication.
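
The proposed check is easy to state in code; the sketch below is purely illustrative (the enum and method names are not existing Namenode code):

{code:java}
public class OverReplicationSketch {

  /** Possible Namenode actions; names are illustrative only. */
  enum ReplicationAction { SCHEDULE_REPLICATION, NO_ACTION, DELETE_EXCESS }

  /**
   * Decide what to do for a block with {@code liveReplicas} replicas, a file
   * replication factor of {@code replication}, and an over-replication
   * allowance of {@code overReplication}.
   */
  static ReplicationAction check(int liveReplicas, int replication, int overReplication) {
    if (liveReplicas < replication) {
      // Existing semantics are preserved: under-replicated blocks get re-replicated.
      return ReplicationAction.SCHEDULE_REPLICATION;
    }
    if (liveReplicas > replication + overReplication) {
      // Only replicas beyond (replication + over-replication) count as excess.
      return ReplicationAction.DELETE_EXCESS;
    }
    // In between: tolerated, but the NN never schedules work just to reach
    // the over-replication target.
    return ReplicationAction.NO_ACTION;
  }

  public static void main(String[] args) {
    // With replication = 3 and over-replication = 1, four replicas are tolerated.
    System.out.println(check(2, 3, 1));  // SCHEDULE_REPLICATION
    System.out.println(check(4, 3, 1));  // NO_ACTION
    System.out.println(check(5, 3, 1));  // DELETE_EXCESS
  }
}
{code}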

So to enable read-through caching, the client has to first set the appropriate over-replication
for the PROVIDED files and then read the data. If needed, the over-replication can also be
set as part of the image generation process for the PROVIDED storage (HDFS-9806/HDFS-10706).
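
For completeness, a hypothetical client-side sequence might look like the following; {{setOverReplication}} and {{openWithReadThrough}} are placeholder names for whatever API this JIRA ends up adding, not existing methods:

{code:java}
import java.io.IOException;
import java.io.InputStream;

public class ReadThroughUsageSketch {

  /** Hypothetical client-facing API; neither method exists today. */
  interface CachingClient {
    /** Allow up to (replication + extra) replicas for the PROVIDED file. */
    void setOverReplication(String path, short extra) throws IOException;
    /** Open the file with readThrough=true in its {@code CachingStrategy}. */
    InputStream openWithReadThrough(String path) throws IOException;
  }

  /** Example sequence: raise over-replication first, then read the data. */
  static void cacheLocally(CachingClient client, String path) throws IOException {
    client.setOverReplication(path, (short) 1);  // make room for the cached replica
    try (InputStream in = client.openWithReadThrough(path)) {
      byte[] buf = new byte[8192];
      while (in.read(buf) != -1) {
        // Reading the stream is what triggers the Datanode-side caching.
      }
    }
  }
}
{code}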

> Enable HDFS to cache data read from external storage systems
> ------------------------------------------------------------
>
>                 Key: HDFS-13069
>                 URL: https://issues.apache.org/jira/browse/HDFS-13069
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Virajith Jalaparti
>            Priority: Major
>
> With {{PROVIDED}} storage (HDFS-9806), HDFS can address data stored in external storage
> systems. Caching this data locally in HDFS can speed up subsequent accesses to the data;
> this feature is especially useful when accesses to the external store have limited bandwidth
> or higher latency. This JIRA is to add this feature in HDFS.


