hadoop-hdfs-issues mailing list archives

From "Zhe Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9806) Allow HDFS block replicas to be provided by an external storage system
Date Fri, 03 Jun 2016 19:01:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314609#comment-15314609
] 

Zhe Zhang commented on HDFS-9806:
---------------------------------

Thanks for posting the design and PoC [~chris.douglas]! It's really exciting to see this work
moving forward.

A few questions / comments about the current design doc:
# Having a {{PROVIDED}} storage type is an interesting idea. There are a few tricky issues
though. How should we update the over-replication logic to work with caching? If the replication
factor is 1, and a {{PROVIDED}} block is cached by a DN, the NN will try to remove the excess
replica, right? If we specify a replication factor > 1, the NN will always try to create DN-local
replicas, which is probably not what we want for opportunistic caching. How should we specify
a preference for caching on SSD vs. HDD? How about {{Mover}} and {{Balancer}}?
# bq. blocks in the PROVIDED storage type are not included by any Datanode as part of its
block report.
So does a DN still report connectivity to the {{PROVIDED}} store to the NN at each BR? I guess
an alternative is for the NN itself to periodically check the connectivity?
# Per section 3.4, I think the NN also needs to have a "PROVIDED store client" anyway, right?

bq. Data and metadata in the external store can change out-of-band (e.g., daily log data uploaded).
This would be a tricky case to handle. How are directories persisted in the external store?
Consider the following case:
#* An empty HDFS cluster is built on WASB (only {{/}})
#* {{mkdir /data}} through HDFS. The metadata should be persisted in WASB in some form, right?
#* {{/data/log1.txt}} is uploaded by some other WASB client (not the HDFS on top of it)
#* {{ls /data}} is done through HDFS. I guess HDFS NN can check the WASB data structure for
{{/data}} and get the update
#* How about when another directory {{/jobs}} is created through another WASB client? Are we
assuming HDFS has created a data structure in WASB to track the root dir {{/}}?
# I think more details can be added to Section 2 for clarification. In particular, per the
above comment, is this work mainly intended for "using a big external store to back a single
smaller HDFS"? Or is the above "out-of-band update" use case also important? Would it be better
to have a phase 1 for the single-HDFS use case (no other updates to the external store)?
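To make question 1 above concrete, here is a minimal sketch (all names are illustrative, not from the design doc or the PoC) of how the NN's excess-replica check might treat DN-local copies of a {{PROVIDED}} block as opportunistic cache entries rather than over-replication:

```java
import java.util.List;

// Hypothetical sketch, not actual HDFS code: excess-replica counting that
// tolerates DN-local cache copies of a block backed by PROVIDED storage.
public class ProvidedReplicaCheck {
    enum StorageType { DISK, SSD, PROVIDED }

    /**
     * Returns how many replicas the NN would consider excess.
     * If the block has a PROVIDED replica and DN-local copies are treated
     * as cache, they are never scheduled for deletion; otherwise the usual
     * (replicas - replication factor) rule applies.
     */
    static int countExcess(List<StorageType> replicas, int replication,
                           boolean treatLocalAsCache) {
        boolean hasProvided = replicas.contains(StorageType.PROVIDED);
        if (hasProvided && treatLocalAsCache) {
            return 0; // cache copies are opportunistic, never "excess"
        }
        return Math.max(0, replicas.size() - replication);
    }

    public static void main(String[] args) {
        // Replication factor 1: one PROVIDED replica plus one DN cache copy.
        List<StorageType> r = List.of(StorageType.PROVIDED, StorageType.DISK);
        System.out.println(countExcess(r, 1, false)); // today's rule would evict
        System.out.println(countExcess(r, 1, true));  // cache copy tolerated
    }
}
```

The point of the sketch is only that the over-replication path needs an explicit policy knob for this case, whatever form it eventually takes.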

> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9806
>                 URL: https://issues.apache.org/jira/browse/HDFS-9806
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Chris Douglas
>         Attachments: HDFS-9806-design.001.pdf
>
>
> In addition to heterogeneous media, many applications work with heterogeneous storage
systems. The guarantees and semantics provided by these systems are often similar, but not
identical to those of [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
Any client accessing multiple storage systems is responsible for reasoning about each system
independently, and must propagate and renew credentials for each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to immutable
file regions, opaque IDs, or other tokens that represent a consistent view of the data. While
correctness for arbitrary operations requires careful coordination between stores, in practice
we can provide workable semantics with weaker guarantees.
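The "immutable file regions" mentioned in the description could be sketched roughly as follows (a hypothetical shape, not the design doc's actual data structure): each HDFS block maps to a byte range of an object in the external store, plus a nonce to detect out-of-band changes.

```java
// Hypothetical sketch of a block-to-external-store mapping token.
public final class FileRegion {
    private final String path;   // object/file in the external store
    private final long offset;   // byte offset of this block's data
    private final long length;   // block length in bytes
    private final String nonce;  // e.g. an ETag, to detect remote mutation

    public FileRegion(String path, long offset, long length, String nonce) {
        this.path = path;
        this.offset = offset;
        this.length = length;
        this.nonce = nonce;
    }

    public String path()  { return path; }
    public long offset()  { return offset; }
    public long length()  { return length; }
    public String nonce() { return nonce; }

    @Override public String toString() {
        return path + "@" + offset + "+" + length + " (" + nonce + ")";
    }

    public static void main(String[] args) {
        FileRegion fr = new FileRegion(
            "wasb://container/data/log1.txt", 0L, 134217728L, "etag-1");
        System.out.println(fr);
    }
}
```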



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
