Date: Fri, 1 May 2015 18:57:07 +0000 (UTC)
From: "Jing Zhao (JIRA)"
To: hdfs-issues@hadoop.apache.org
Reply-To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-8281) Erasure Coding: implement parallel stateful reading for striped layout

    [ https://issues.apache.org/jira/browse/HDFS-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523684#comment-14523684 ]

Jing Zhao commented on HDFS-8281:
---------------------------------

Thanks for the comments, Zhe!

For 4&5, thanks for the explanation about the "short read"! Some of my thoughts here:
# At the current stage, I think our main use case is still sequential read, and it's good to read in parallel to serve this kind of request so that we can achieve better throughput. This means that the basic unit for each individual read should still be a cell.
# Actually the tradeoff here is between throughput and the worst-case latency of serving a single read request. The parallel read may get delayed by a slow/unavailable DN. But we always have to handle a slow/unavailable DN during the read. The difference is the stripe size during decoding: let's say each time we only return 64KB (for simplicity, assuming it all comes from the same DN); if the data is unavailable, a corresponding (64KB * 6) stripe will be read. In the current case we read 256KB * 6 (and if the cell size is 64KB it's actually the same).
# For the possible decoding use case we need a buffer to keep the data that has been served. If reading a complete stripe becomes a real concern because of its latency, a simple improvement is to read less data into the buffer each time without changing the buffer size. But currently, without detailed benchmark data, I'm not sure whether we want to add this logic immediately. I think this is something we must explore while doing the performance test, and we can make the improvement as follow-on work.
# One question is why we chose 256KB as the cell size instead of the original 64KB?

I will update the patch later to address comments 1~3.

> Erasure Coding: implement parallel stateful reading for striped layout
> ----------------------------------------------------------------------
>
>                 Key: HDFS-8281
>                 URL: https://issues.apache.org/jira/browse/HDFS-8281
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HDFS-8281-HDFS-7285.001.patch, HDFS-8281-HDFS-7285.001.patch, HDFS-8281.000.patch
>
>
> This jira aims to support parallel reading for stateful read in {{DFSStripedInputStream}}.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
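
The sizes quoted in points 2 and 3 of the comment can be checked with a small calculation. This is only an illustrative sketch, not code from the patch: the function names are hypothetical, and the factor of 6 is taken from the "64KB * 6" and "256KB * 6" figures in the comment rather than from any HDFS API.

```python
KB = 1024

# Assumption: 6 blocks contribute to one stripe, matching the
# "64KB * 6" and "256KB * 6" figures quoted in the comment.
BLOCKS_PER_STRIPE = 6


def stripe_read_bytes(cell_size: int, blocks: int = BLOCKS_PER_STRIPE) -> int:
    """Bytes read when a whole stripe is fetched for decoding:
    one cell of cell_size bytes from each of `blocks` blocks."""
    if cell_size <= 0 or blocks <= 0:
        raise ValueError("cell_size and blocks must be positive")
    return cell_size * blocks


def fills_per_cell(cell_size: int, fill_size: int) -> int:
    """Number of smaller reads needed to fill one cell's worth of buffer
    when each read brings in fill_size bytes (point 3: read less data
    each time while keeping the buffer size unchanged)."""
    if cell_size <= 0 or fill_size <= 0:
        raise ValueError("cell_size and fill_size must be positive")
    return -(-cell_size // fill_size)  # ceiling division


if __name__ == "__main__":
    for cell in (64 * KB, 256 * KB):
        total = stripe_read_bytes(cell)
        print(f"cell={cell // KB}KB -> stripe read={total // KB}KB")
    print("fills per 256KB cell at 64KB each:", fills_per_cell(256 * KB, 64 * KB))
```

With a 64KB cell the stripe read is 384KB, and with a 256KB cell it is 1536KB, four times as much data to wait for when a DN is slow or unavailable. Reading 64KB at a time into a buffer sized for 256KB cells takes four reads per cell.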