hadoop-hdfs-issues mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-1172) Blocks in newly completed files are considered under-replicated too quickly
Date Thu, 25 Aug 2011 00:37:30 GMT

     [ https://issues.apache.org/jira/browse/HDFS-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HDFS-1172:
------------------------------

    Attachment: hdfs-1172.txt

Here's a new patch against trunk for this issue.

A few things changed since Hairong's original patch:

- I removed the part of the test that changes the replication factor of a file while it's
under construction. This part of the test wasn't succeeding reliably, since it was running
into a different bug, HDFS-2283.

- I added the test code from HDFS-1197, which allows the DNs to artificially delay blockReceived
calls in the tests. This exposed some other bugs in the patch.

- The new replicateLastBlock code needed to be called in a different place:
-- The original patch called it on every attempt of completeFile(), rather than only on
the final, successful attempt. As a result, if the replicas were very slow to check in,
the targets were added to pendingReplication many times, yielding a pending replica count
much larger than the actual replication factor.
-- The code needs to be called for all blocks, not just the last block in a file.
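
To illustrate the two points above, here is a minimal sketch of the bookkeeping problem. It is
not the NameNode code or the patch itself: the class PendingReplicationSketch, its completeFile()
signature, the block names, and the map standing in for pendingReplication are all hypothetical,
and it assumes a non-empty block list whose outstanding-replica count stays constant across
attempts.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the NameNode code: it only counts how many
// pending-replica entries each block accumulates across completeFile() retries.
final class PendingReplicationSketch {

    /**
     * Simulates a client calling completeFile() `attempts` times, where only
     * the last attempt succeeds. Returns the pending replica count per block.
     * Assumes `blocks` is non-empty.
     */
    static Map<String, Integer> completeFile(List<String> blocks,
                                             int outstandingPerBlock,
                                             int attempts,
                                             boolean originalPatchBehavior) {
        // Stand-in for pendingReplication: block name -> pending replica count.
        Map<String, Integer> pending = new LinkedHashMap<>();
        for (int attempt = 1; attempt <= attempts; attempt++) {
            boolean finalAttempt = (attempt == attempts);
            if (originalPatchBehavior) {
                // Recorded on every attempt, and only for the last block.
                String last = blocks.get(blocks.size() - 1);
                pending.merge(last, outstandingPerBlock, Integer::sum);
            } else if (finalAttempt) {
                // Recorded once, on the successful attempt, for every block.
                for (String block : blocks) {
                    pending.merge(block, outstandingPerBlock, Integer::sum);
                }
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("blk_1", "blk_2");
        System.out.println(completeFile(blocks, 2, 5, true));   // {blk_2=10}
        System.out.println(completeFile(blocks, 2, 5, false));  // {blk_1=2, blk_2=2}
    }
}
```

With two outstanding replicas per block and five completeFile() attempts, the first call
records ten pending replicas for the last block and nothing for the other block, while the
second records two for each block.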

I looped the new tests for a while and they pass reliably.

> Blocks in newly completed files are considered under-replicated too quickly
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-1172
>                 URL: https://issues.apache.org/jira/browse/HDFS-1172
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.21.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>             Fix For: 0.23.0
>
>         Attachments: HDFS-1172.patch, hdfs-1172.txt, replicateBlocksFUC.patch, replicateBlocksFUC1.patch, replicateBlocksFUC1.patch
>
>
> I've seen this for a long time, and imagine it's a known issue, but couldn't find an
> existing JIRA. It often happens that we see the NN schedule replication on the last block
> of files very quickly after they're completed, before the other DNs in the pipeline have a
> chance to report the new block. This results in a lot of extra replication work on the cluster,
> as we replicate the block and then end up with multiple excess replicas which are very quickly
> deleted.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
