hadoop-hdfs-issues mailing list archives

From "Virajith Jalaparti (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-12090) Handling writes from HDFS to Provided storages
Date Tue, 09 Jan 2018 23:24:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16319396#comment-16319396 ]

Virajith Jalaparti commented on HDFS-12090:

Thanks for posting the patch [~ehiggs]. Here are some initial thoughts on the high-level design:
# As you note, the current implementation doesn't support ordered operations (e.g., backup
of a directory hierarchy to another instance of HDFS). All the operations in a particular
snapshot diff happen in parallel across (potentially) multiple datanodes. When supporting
ordered operations, I think {{SyncServiceSatisfier}} needs to coordinate them (so that Datanodes
don't need additional coordination among themselves). So, the design should make sure that some
part of it is capable of handling ordered operations. Having an abstract class that performs the
functions handled in {{SyncServiceSatisfier#synchronizeBackupMount}} can be one way to solve
this issue.
# The data backup path is concerning. It bypasses the DN write path, and one Datanode backs
up a whole file (in {{SyncServiceSatisfierWorker#backupFile}}): it copies blocks from other
datanodes in the cluster and then writes them back to the provided store. Compared to the SPS
approach (a DN could be responsible for only 1 block), this approach involves 2 network transfers
instead of 1 (the DN has to copy blocks from other DNs and then write them back to the provided
store), and cannot benefit from the parallelism of each DN handling one or a few blocks for
the file.
# The patch seems to be a completely separate path from the SPS work (HDFS-10285). Given that the
SPS is still in a state of flux, this is OK for now. However, in the future (once SPS converges),
it would be good to look at how this work can plug into/reuse parts of the SPS/refactor parts
of SPS if necessary. I would hate to have two parallel code paths that do something very similar
(satisfy storage policies). That said, I think that shouldn't stop progress on this JIRA.
# We need a throttling mechanism to limit the load on the NN. Although not needed immediately,
this will eventually be required.
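
To make point 1 concrete, here is a minimal sketch of what such an abstract coordinator could look like. All names ({{OrderedSyncCoordinator}}, {{dispatch}}, {{RecordingCoordinator}}) are hypothetical and not taken from the patch. It assumes the operations of a snapshot diff can be grouped into phases: tasks within a phase run in parallel, and the next phase starts only after the previous one has finished, so the ordering logic stays with the coordinator rather than the Datanodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch; names are illustrative and not taken from the patch.
abstract class OrderedSyncCoordinator<T> {
  /** Sends one task to some worker; subclasses decide how (assumption). */
  protected abstract CompletableFuture<Void> dispatch(T task);

  /**
   * Runs phases in order. Tasks inside one phase are dispatched in parallel,
   * and the next phase starts only after the current one has completed.
   */
  public final void run(List<List<T>> phases) {
    for (List<T> phase : phases) {
      List<CompletableFuture<Void>> pending = new ArrayList<>();
      for (T task : phase) {
        pending.add(dispatch(task));
      }
      // Barrier: wait for every task of this phase before moving on.
      CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
    }
  }
}

// Toy subclass that records dispatched tasks instead of contacting Datanodes.
class RecordingCoordinator extends OrderedSyncCoordinator<String> {
  final List<String> log = new CopyOnWriteArrayList<>();

  @Override
  protected CompletableFuture<Void> dispatch(String task) {
    return CompletableFuture.runAsync(() -> log.add(task));
  }
}
```

With this shape, a directory backup could put the directory creation in the first phase and the file creations in the second; a real subclass would replace the toy {{dispatch}} with whatever mechanism hands work to Datanodes.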

Some comments specific to this patch:
* In {{SyncTaskScheduler#schedule}}, why have these two separate paths?
      if (syncTask.operation == SyncTask.Operation.CREATE_FILE) {
      } else {

* Use a builder pattern for creating {{SyncTask}}?
* Why use the term "sync mount" rather than "backup endpoint"? The latter was the terminology
used in the latest functional spec.
* The method names {{createSync}}, {{removeSync}}, though understandable, are confusing. I
think {{createBackupEndPoint}}, {{removeBackupEndPoint}} etc. would be easier to understand
(and adhere to the functional spec).
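
For the builder suggestion above, a rough sketch of the shape I have in mind. Only {{Operation.CREATE_FILE}} comes from the patch excerpt; the class name {{SyncTaskSketch}}, the {{DELETE_FILE}} value, and the {{path}} and {{backupTarget}} fields are assumptions made purely for illustration.

```java
import java.util.Objects;

// Hypothetical sketch; only Operation.CREATE_FILE appears in the patch excerpt.
final class SyncTaskSketch {
  enum Operation { CREATE_FILE, DELETE_FILE } // DELETE_FILE is an assumption

  private final Operation operation;
  private final String path;          // assumed field
  private final String backupTarget;  // assumed field, optional

  private SyncTaskSketch(Builder b) {
    // Required fields are validated once, at build time.
    this.operation = Objects.requireNonNull(b.operation, "operation");
    this.path = Objects.requireNonNull(b.path, "path");
    this.backupTarget = b.backupTarget;
  }

  Operation getOperation() { return operation; }
  String getPath() { return path; }
  String getBackupTarget() { return backupTarget; }

  static Builder builder() { return new Builder(); }

  static final class Builder {
    private Operation operation;
    private String path;
    private String backupTarget;

    Builder operation(Operation op) { this.operation = op; return this; }
    Builder path(String p) { this.path = p; return this; }
    Builder backupTarget(String t) { this.backupTarget = t; return this; }

    SyncTaskSketch build() { return new SyncTaskSketch(this); }
  }
}
```

A call site would then read {{SyncTaskSketch.builder().operation(Operation.CREATE_FILE).path("/a/f1").build()}}, which keeps the task immutable and makes it obvious which fields each operation type sets.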

> Handling writes from HDFS to Provided storages
> ----------------------------------------------
>                 Key: HDFS-12090
>                 URL: https://issues.apache.org/jira/browse/HDFS-12090
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Virajith Jalaparti
>         Attachments: HDFS-12090-Functional-Specification.001.pdf, HDFS-12090-Functional-Specification.002.pdf,
HDFS-12090-Functional-Specification.003.pdf, HDFS-12090-design.001.pdf, HDFS-12090.0000.patch
> HDFS-9806 introduces the concept of {{PROVIDED}} storage, which makes data in external
storage systems accessible through HDFS. However, HDFS-9806 is limited to data being read
through HDFS. This JIRA will deal with how data can be written to such {{PROVIDED}} storages
from HDFS.
