hadoop-hdfs-issues mailing list archives

From "Elek, Marton (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDDS-199) Implement ReplicationManager to replicate ClosedContainers
Date Mon, 09 Jul 2018 10:55:00 GMT

    [ https://issues.apache.org/jira/browse/HDDS-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536780#comment-16536780
] 

Elek, Marton edited comment on HDDS-199 at 7/9/18 10:54 AM:
------------------------------------------------------------

Thanks [~ajayydv] for the additional comments.

1. I started to refactor it to use an ExecutorService after your comment, but it became more complex
for me. An ExecutorService is a good fit for handling multiple smaller tasks (executorService.submit),
but in our case we have one long-running thread with only one task. I think it's clearer
to use just a thread.
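For illustration, a minimal sketch of the single long-running-thread approach described above (all names here are hypothetical, not the actual patch code):

```java
// Hypothetical sketch: one dedicated, long-running thread instead of an
// ExecutorService, since there is exactly one task to run.
public class ReplicationLoop implements Runnable {
    private volatile boolean running = true;
    private Thread thread;

    public void start() {
        thread = new Thread(this, "ReplicationMonitor");
        thread.setDaemon(true);
        thread.start();
    }

    public void stop() throws InterruptedException {
        running = false;      // signal the loop to exit
        thread.interrupt();   // wake the thread if it is blocked
        thread.join();
    }

    public boolean isAlive() {
        return thread != null && thread.isAlive();
    }

    @Override
    public void run() {
        while (running) {
            try {
                // placeholder for taking the next request from a blocking queue
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
    }
}
```

With a single task there is nothing for an ExecutorService to schedule; the plain thread keeps the lifecycle (start/stop/join) explicit.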

2. By default the ReplicationManager receives events only for closed containers. But you are right,
it's better to check it. I added a precondition check to validate the state of the container
(as there is a try/catch block inside the main loop, the error will be logged and the loop will
continue).
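A sketch of the kind of precondition check meant above (the enum and method names are illustrative, not the actual patch code):

```java
// Illustrative guard: the ReplicationManager should only ever see CLOSED
// containers; anything else is rejected early with a clear message, and the
// try/catch in the main loop logs the error and continues with the next event.
public class ContainerStateCheck {
    public enum LifeCycleState { OPEN, CLOSING, CLOSED }

    public static void checkClosed(LifeCycleState state) {
        if (state != LifeCycleState.CLOSED) {
            throw new IllegalStateException(
                "Container replication requires a CLOSED container, got: " + state);
        }
    }
}
```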

3. SCMCommonPolicy unit tests: To be honest, I also considered modifying the unit tests. The
only problem is that there are no unit tests for the policies; there is only a higher-level test
(TestContainerPlacement) which checks the distribution of the containers. But you are right,
and your comment convinced me. I created two brand-new unit tests for the two placement implementations,
which include the check of the exclude list.
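The essence of what such a test can assert might look like this (a simplified stand-in, not the real SCMCommonPolicy API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a placement policy: pick up to 'required' nodes,
// never returning one from the exclude list. The new unit tests assert
// exactly this property against the real placement implementations.
public class PlacementSketch {
    public static List<String> chooseDatanodes(List<String> healthyNodes,
                                               List<String> excludedNodes,
                                               int required) {
        List<String> chosen = new ArrayList<>();
        for (String node : healthyNodes) {
            if (chosen.size() >= required) {
                break;
            }
            if (!excludedNodes.contains(node)) {
                chosen.add(node);
            }
        }
        return chosen;
    }
}
```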

4. The other nits are fixed, except the UUID: we can't reuse the UUID of the original replication
request, as there is a one-to-many relationship between the original replication event and
the new tracking events: if multiple replicas are missing, we create multiple DatanodeCommands
and we need to track them one by one. Therefore we need different UUIDs. But thanks for pointing
it out: in that case we don't need the getUUID in the original ReplicationRequest event, as
it can never be used.
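The one-to-many fan-out described above could be sketched like this (types and names are hypothetical):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch: one under-replicated event for a container fans out
// into one command per missing replica, each tracked under its own fresh
// UUID so the command watcher can confirm or retry them independently.
public class TrackingSketch {
    public static Map<UUID, String> createCopyCommands(String containerId,
                                                       int missingReplicas) {
        Map<UUID, String> tracked = new LinkedHashMap<>();
        for (int i = 0; i < missingReplicas; i++) {
            tracked.put(UUID.randomUUID(), "COPY container " + containerId);
        }
        return tracked;
    }
}
```

Reusing the original request's UUID would collapse all of these map keys into one entry, which is exactly why each DatanodeCommand needs its own identifier.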

The latest patch has been uploaded with all these fixes + new unit tests.



> Implement ReplicationManager to replicate ClosedContainers
> ----------------------------------------------------------
>
>                 Key: HDDS-199
>                 URL: https://issues.apache.org/jira/browse/HDDS-199
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: SCM
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>             Fix For: 0.2.1
>
>         Attachments: HDDS-199.001.patch, HDDS-199.002.patch, HDDS-199.003.patch, HDDS-199.004.patch,
HDDS-199.005.patch
>
>
> HDDS/Ozone supports Open and Closed containers. Under specific conditions (the container
is full, or the node has failed) the container will be closed and will be replicated in a different
way. The replication of Open containers is handled with Ratis and the PipelineManager.
> The ReplicationManager should handle the replication of the ClosedContainers. The replication
information will be sent as an event (UnderReplicated/OverReplicated). 
> The ReplicationManager will collect all of the events in a priority queue (to replicate
first the containers where more replicas are missing), calculate the destination datanode (first
with a very simple algorithm, later by calculating scatter-width) and send the Copy/Delete
container command to the datanode (CommandQueue).
> A CopyCommandWatcher/DeleteCommandWatcher is also included to retry the copy/delete
in case of failure. This is an in-memory structure (based on HDDS-195) which can requeue the
under-replicated/over-replicated events to the priority queue until the confirmation of the
copy/delete command arrives.
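The priority ordering described in the issue summary could be sketched as follows (the request type here is hypothetical): requests with more missing replicas come out of the queue first.

```java
import java.util.PriorityQueue;

// Hypothetical sketch of the priority queue from the issue summary:
// requests missing more replicas are polled first.
public class ReplicationQueueSketch {
    public static class Request {
        final String containerId;
        final int missingReplicas;
        public Request(String containerId, int missingReplicas) {
            this.containerId = containerId;
            this.missingReplicas = missingReplicas;
        }
    }

    public static PriorityQueue<Request> newQueue() {
        // higher missingReplicas => higher priority (polled first)
        return new PriorityQueue<>(
            (a, b) -> Integer.compare(b.missingReplicas, a.missingReplicas));
    }
}
```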



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

