Date: Mon, 9 Nov 2015 22:12:11 +0000 (UTC)
From: "Zhe Zhang (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-9079) Erasure coding: preallocate multiple generation stamps and serialize updates from data streamers

     [ https://issues.apache.org/jira/browse/HDFS-9079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhe Zhang updated HDFS-9079:
----------------------------
    Attachment: HDFS-9079.09.patch

Fixed test failures from the last Jenkins run by addressing the following corner cases:
# In some test cases a streamer has no bytes to write.
The coordinator should properly handle the status of such streamers.
# {{setExternalError}} should wait until the streamer is in the {{DATA_STREAMING}} stage (i.e. {{blockStream}} is not null)

> Erasure coding: preallocate multiple generation stamps and serialize updates from data streamers
> ------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-9079
>                 URL: https://issues.apache.org/jira/browse/HDFS-9079
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: erasure-coding
>    Affects Versions: HDFS-7285
>            Reporter: Zhe Zhang
>            Assignee: Zhe Zhang
>         Attachments: HDFS-9079-HDFS-7285.00.patch, HDFS-9079.01.patch, HDFS-9079.02.patch, HDFS-9079.03.patch, HDFS-9079.04.patch, HDFS-9079.05.patch, HDFS-9079.06.patch, HDFS-9079.07.patch, HDFS-9079.08.patch, HDFS-9079.09.patch
>
>
> A non-striped DataStreamer goes through the following steps in error handling:
> {code}
> 1) Finds error => 2) Asks NN for new GS => 3) Gets new GS from NN => 4) Applies new GS to DN (createBlockOutputStream) => 5) Ack from DN => 6) Updates block on NN
> {code}
> With multiple streamer threads running in parallel, we need to correctly handle a large number of possible combinations of interleaved thread events. For example, {{streamer_B}} starts step 2 in between events {{streamer_A.2}} and {{streamer_A.3}}.
> HDFS-9040 moves steps 1, 2, 3, 6 from the streamer to {{DFSStripedOutputStream}}. This JIRA proposes some further optimizations based on HDFS-9040:
> # We can preallocate GS when NN creates a new striped block group ({{FSN#createNewBlock}}). For each new striped block group we can reserve {{NUM_PARITY_BLOCKS}} GS's. If more than {{NUM_PARITY_BLOCKS}} errors have happened we shouldn't try to recover further anyway.
> # We can use a dedicated event processor to offload the error handling logic from {{DFSStripedOutputStream}}, which is not a long-running daemon.
> # We can limit the lifespan of a streamer to be a single block.
> A streamer ends either after finishing the current block or when encountering a DN failure.
> With the proposed change, a {{StripedDataStreamer}}'s flow becomes:
> {code}
> 1) Finds DN error => 2) Notifies coordinator (async, not waiting for response) => terminates
> 1) Finds external error => 2) Applies new GS to DN (createBlockOutputStream) => 3) Ack from DN => 4) Notifies coordinator (async, not waiting for response)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
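
The proposal in the quoted description (preallocated generation stamps plus a dedicated event processor that serializes streamer notifications) can be illustrated with a small language-neutral sketch. This is not code from the patch: the class name {{Coordinator}}, the {{notify}} method, the event names {{dn_error}}, {{gs_applied}} and {{stop}}, and the value 3 for {{NUM_PARITY_BLOCKS}} are all assumptions made for illustration. Streamers are not modeled as threads here; only their asynchronous notifications are.

```python
import queue
import threading

# Assumption: 3 parity blocks per block group; the real value depends on the EC schema.
NUM_PARITY_BLOCKS = 3


class Coordinator:
    """Hypothetical event processor for one striped block group."""

    def __init__(self, first_gs, num_reserved=NUM_PARITY_BLOCKS):
        # Generation stamps reserved up front, when the block group is created.
        self._reserved = list(range(first_gs + 1, first_gs + 1 + num_reserved))
        self._events = queue.Queue()
        self.current_gs = first_gs
        self.failed = set()        # ids of streamers that reported a DN error
        self.acked = {}            # streamer id -> GS it reported as applied
        self.unrecoverable = False

    def notify(self, kind, streamer_id, gs=None):
        # Called by streamers; returns immediately without waiting for a response.
        self._events.put((kind, streamer_id, gs))

    def run(self):
        # Events are handled one at a time, so updates never interleave.
        while True:
            kind, streamer_id, gs = self._events.get()
            if kind == "stop":
                return
            if kind == "dn_error":
                self._on_dn_error(streamer_id)
            elif kind == "gs_applied":
                self.acked[streamer_id] = gs

    def _on_dn_error(self, streamer_id):
        if streamer_id in self.failed:
            return  # duplicate report from the same streamer
        self.failed.add(streamer_id)
        if not self._reserved:
            # More errors than reserved stamps: give up on recovery.
            self.unrecoverable = True
            return
        # Consume the next preallocated stamp instead of asking the NN.
        self.current_gs = self._reserved.pop(0)


if __name__ == "__main__":
    coordinator = Coordinator(first_gs=1000)
    worker = threading.Thread(target=coordinator.run)
    worker.start()
    coordinator.notify("dn_error", 7)            # failing streamer reports, then terminates
    coordinator.notify("gs_applied", 0, 1001)    # healthy streamer reports the new GS
    coordinator.notify("stop", None)
    worker.join()
    print(coordinator.current_gs, sorted(coordinator.failed), coordinator.acked)
```

Because a single thread drains the queue, an interleaving such as {{streamer_B}} starting between {{streamer_A.2}} and {{streamer_A.3}} cannot occur inside the coordinator; each notification is fully handled before the next one is read.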