hadoop-hdfs-issues mailing list archives

From "He Tianyi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9075) Multiple datacenter replication inside one HDFS cluster
Date Wed, 16 Sep 2015 02:12:46 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14746675#comment-14746675
] 

He Tianyi commented on HDFS-9075:
---------------------------------

Thanks for pointing that out, [~cnauroth].

Prior discussions mentioned a global namespace model, which I think is the most valuable direction
to work on.

There are consistency choices for the namespace:
1. strongly consistent namespace, which probably requires either a global quorum to ensure consistency
or namespace segmentation (a bit like federation, with only a local block pool)
2. eventually consistent namespace, which can be achieved via snapshots. (What happens

Besides, there are choices about how data is replicated:
1. sync replication: add remote nodes to the pipeline during write,
2. async replication.
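
To make the trade-off between the two replication styles concrete, here is a toy model (not HDFS code). The class name, method names, and the millisecond figures are all hypothetical; the only point is where the acknowledgement happens relative to the remote copy.

```python
from collections import deque


class ToyReplicator:
    """Toy model of when a block write is acknowledged. Not HDFS code."""

    def __init__(self, local_ms, remote_ms):
        self.local_ms = local_ms    # assumed cost to reach replicas in the local DC
        self.remote_ms = remote_ms  # assumed extra cost to reach the remote DC
        self.pending = deque()      # blocks waiting for a background copy (async only)

    def write_sync(self, block_id):
        # The remote node is part of the write pipeline, so the ack waits for it.
        return self.local_ms + self.remote_ms

    def write_async(self, block_id):
        # Ack as soon as local replicas have the block; queue the remote copy.
        self.pending.append(block_id)
        return self.local_ms

    def drain(self):
        # Background step: copy every queued block to the remote DC.
        copied = list(self.pending)
        self.pending.clear()
        return copied


r = ToyReplicator(local_ms=1.0, remote_ms=3.0)
print(r.write_sync("blk_1"))   # ack latency includes the remote hop
print(r.write_async("blk_2"))  # ack latency is local only
print(r.drain())               # blocks that were not yet in the remote DC
```

With sync replication the writer pays the inter-datacenter cost on every write; with async replication the write is fast, but anything still in `pending` exists only in one datacenter until it is drained.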

IMHO a strongly consistent namespace is a must; otherwise it is hard to make global operations
transparent.
E.g., what happens if append operations on different files (or the same file) in the same directory take
place simultaneously in two datacenters?
(Of course a global lease manager would do the trick, but that requires remote communication.)
If we go the strongly consistent way, performance suffers anyway (reads and writes need global
communication). There is no harm in simply using one central active NameNode, with JournalNodes and
standby NameNodes deployed globally.
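
As a rough illustration of the append question above, a single global lease table resolves the conflict by allowing one writer per file. This is a simplified sketch, not the actual HDFS lease manager; the class, the paths, and the client names are hypothetical.

```python
class ToyLeaseManager:
    """Single-writer-per-file lease table. A sketch, not the HDFS implementation."""

    def __init__(self):
        self._holders = {}  # path -> id of the client currently holding the lease

    def acquire(self, path, client):
        holder = self._holders.get(path)
        if holder is not None and holder != client:
            return False  # another client is already writing this file
        self._holders[path] = client
        return True

    def release(self, path, client):
        # Only the current holder can release the lease.
        if self._holders.get(path) == client:
            del self._holders[path]


leases = ToyLeaseManager()
print(leases.acquire("/logs/a", "client-dc1"))  # True: first appender wins
print(leases.acquire("/logs/a", "client-dc2"))  # False: same file, other datacenter
print(leases.acquire("/logs/b", "client-dc2"))  # True: different file, same directory
```

The catch is the one already noted: every `acquire` from the datacenter that does not host the table is a remote call.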

As for replication, I think performance will not be an issue as long as latency is tolerable
and bandwidth is sufficient (see HDFS-8829). We can certainly let the user decide.

We have a real scenario in which communication between two datacenters has a latency of nearly
3 ms, while bandwidth is sufficient.
In this case, we have seen no performance drop so far.

But with high latency, I think that will not hold. Perhaps we need some fresh ideas.
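
A back-of-envelope estimate suggests why 3 ms is harmless while high latency is not. If a writer may have only a fixed amount of unacknowledged data in flight, its throughput is capped at roughly window / round-trip time. The defaults below (64 KB packets, 80 packets in flight) are assumptions that I believe match common HDFS client settings, but they should be checked against the version in use. The function name is hypothetical, and real pipeline acks involve more than the inter-datacenter round trip.

```python
def window_limited_throughput_mb_s(rtt_ms, packet_kb=64, max_in_flight=80):
    """Upper bound on single-stream write throughput when acks take rtt_ms.

    packet_kb and max_in_flight are assumed values, not measured ones.
    """
    window_mb = packet_kb * max_in_flight / 1024.0  # unacked data allowed in flight
    return window_mb / (rtt_ms / 1000.0)            # MB per second


for rtt in (3, 30, 100):
    print(rtt, "ms ->", round(window_limited_throughput_mb_s(rtt)), "MB/s per stream")
```

Under these assumptions the cap at 3 ms is far above what a single disk or network link delivers, so the latency is invisible, while at 100 ms a single stream is limited to about 50 MB/s no matter how much bandwidth is available.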

> Multiple datacenter replication inside one HDFS cluster
> -------------------------------------------------------
>
>                 Key: HDFS-9075
>                 URL: https://issues.apache.org/jira/browse/HDFS-9075
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: He Tianyi
>            Assignee: He Tianyi
>
> It is a common scenario to deploy multiple datacenters for scaling and disaster tolerance.
> In this case we certainly want data to be shared transparently (to the user) across datacenters.
> For example, say we have a raw user action log stored daily; different computations may take place with the log as input. As scale grows, we may want to schedule various kinds of computations in more than one datacenter.
> As far as I know, the current solution is to deploy multiple independent clusters corresponding to datacenters, using {{distcp}} to sync data files between them.
> But in this case, the user needs to know exactly where data is stored, and mistakes may be made during human-intervened operations. After all, it is basically a computer job.
> Based on these facts, it is obvious that a multiple datacenter replication solution may solve the scenario.
> I am working on a prototype that works with 2 datacenters; the goal is to provide data replication between datacenters transparently and minimize inter-DC bandwidth usage. The basic idea is to replicate blocks to both DCs and determine the number of replicas from historical statistics of access behavior for that part of the namespace.
> I will post a design document soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
