Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id B6BA9200CF2 for ; Sat, 19 Aug 2017 00:19:11 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id B523116D937; Fri, 18 Aug 2017 22:19:11 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 0789416D933 for ; Sat, 19 Aug 2017 00:19:10 +0200 (CEST) Received: (qmail 52493 invoked by uid 500); 18 Aug 2017 22:19:08 -0000 Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list hdfs-issues@hadoop.apache.org Received: (qmail 52482 invoked by uid 99); 18 Aug 2017 22:19:08 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Aug 2017 22:19:08 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 474CC1A0505 for ; Fri, 18 Aug 2017 22:19:08 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.002 X-Spam-Level: X-Spam-Status: No, score=-100.002 tagged_above=-999 required=6.31 tests=[RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id TudOZIleJ9M9 for ; Fri, 18 Aug 2017 22:19:07 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id ADCBE5F522 for ; Fri, 18 Aug 2017 22:19:06 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 0865AE099C for ; Fri, 18 Aug 2017 22:19:04 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id D47DC25381 for ; Fri, 18 Aug 2017 22:19:01 +0000 (UTC) Date: Fri, 18 Aug 2017 22:19:01 +0000 (UTC) From: "Kai Zheng (JIRA)" To: hdfs-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HDFS-12222) Add EC information to BlockLocation MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 18 Aug 2017 22:19:11 -0000 [ https://issues.apache.org/jira/browse/HDFS-12222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133699#comment-16133699 ] Kai Zheng commented on HDFS-12222: ---------------------------------- bq. Also, if someone is going below DFSClient to read, they can also get the cellSize from the LocatedBlock/HdfsFileStatus information. Yes, agree. It has to be that, since much more info would be needed to support that kind of hack and only HDFS specific API can provide. So looks like what's exactly needed would be just having the existing API return all the data block locations in EC case instead of the replication locations. The location info can be used to support data processing scheduling, pretty enough and clean. Cool! > Add EC information to BlockLocation > ----------------------------------- > > Key: HDFS-12222 > URL: https://issues.apache.org/jira/browse/HDFS-12222 > Project: Hadoop HDFS > Issue Type: Bug > Affects Versions: 3.0.0-alpha1 > Reporter: Andrew Wang > Assignee: Huafeng Wang > Labels: hdfs-ec-3.0-nice-to-have > Attachments: HDFS-12222.001.patch > > > HDFS applications query block location information to compute splits. One example of this is FileInputFormat: > https://github.com/apache/hadoop/blob/d4015f8628dd973c7433639451a9acc3e741d2a2/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/FileInputFormat.java#L346 > You see bits of code like this that calculate offsets as follows: > {noformat} > long bytesInThisBlock = blkLocations[startIndex].getOffset() + > blkLocations[startIndex].getLength() - offset; > {noformat} > EC confuses this since the block locations include parity block locations as well, which are not part of the logical file length. This messes up the offset calculation and thus topology/caching information too. > Applications can figure out what's a parity block by reading the EC policy and then parsing the schema, but it'd be a lot better if we exposed this more generically in BlockLocation instead. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org