Date: Wed, 7 Mar 2018 01:40:00 +0000 (UTC)
From: "Chris Douglas (JIRA)"
To: common-issues@hadoop.apache.org
Subject: [jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

    [ https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388859#comment-16388859 ]

Chris Douglas commented on HADOOP-15292:
----------------------------------------

Instead of passing a flag to {{readBytes}}, this can just call {{seek()}} outside the loop (and include the {{getPos() != position}} optimization). [~stevel@apache.org] are you set up to test S3?

{{pread}} happens to have an expensive implementation in HDFS (and other {{FileSystem}} impls), but creating a test for distcp to ensure the {{PositionedReadable}} APIs aren't used seems excessive.

bq. Not sure if it's worth extending that unit test to track how many times we open the stream.

From the description, it's inside the DN where {{pread}} creates multiple streams. IIRC the position of the stream isn't updated when using the {{PositionedReadable}} APIs. If the stream were shared that could be an issue, but that's not in the design. In HDFS, updating the set of locations for each read (without checking the distcp invariants) is also unused here. Demonstrating the fix with a demo in HDFS would be sufficient for commit, IMO.
It might be possible to add a test around the command itself to ensure the {{seek()}} is correct on retry, but wiring the flaw into a test would require a {{MiniDFSCluster}}.

> Distcp's use of pread is slowing it down.
> -----------------------------------------
>
>                 Key: HADOOP-15292
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15292
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Virajith Jalaparti
>            Priority: Minor
>         Attachments: HADOOP-15292.000.patch
>
>
> Distcp currently uses positioned-reads (in RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This results in unnecessary overheads (new BlockReader being created on the client-side, multiple readBlock() calls to the Datanodes, each of which requires the creation of a BlockSender and an inputstream to the ReplicaInfo).
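
For illustration only, the suggestion in the comment (call seek() once before the copy loop, and skip it when the stream is already at the requested offset) might look roughly like the sketch below. This is not the HADOOP-15292 patch and it does not use Hadoop's classes: SeekableStream, ByteArraySeekable and copyFrom are hypothetical stand-ins, introduced so the example is self-contained and can count how often seek() is called.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class SeekOnceCopy {

    /** Hypothetical stand-in for a seekable stream; NOT Hadoop's FSDataInputStream. */
    abstract static class SeekableStream extends InputStream {
        abstract void seek(long pos) throws IOException;
        abstract long getPos() throws IOException;
    }

    /** In-memory stream that counts seek() calls; for demonstration only. */
    static class ByteArraySeekable extends SeekableStream {
        private final byte[] data;
        private int pos;
        int seekCalls;

        ByteArraySeekable(byte[] data) {
            this.data = data;
        }

        @Override
        void seek(long newPos) throws IOException {
            if (newPos < 0 || newPos > data.length) {
                throw new IOException("seek out of range: " + newPos);
            }
            seekCalls++;
            pos = (int) newPos;
        }

        @Override
        long getPos() {
            return pos;
        }

        @Override
        public int read() {
            return pos < data.length ? (data[pos++] & 0xff) : -1;
        }

        @Override
        public int read(byte[] b, int off, int len) {
            if (len == 0) {
                return 0;
            }
            if (pos >= data.length) {
                return -1; // end of stream
            }
            int n = Math.min(len, data.length - pos);
            System.arraycopy(data, pos, b, off, n);
            pos += n;
            return n;
        }
    }

    /**
     * Copies everything from 'position' to end of stream using sequential reads.
     * The seek happens once, outside the loop, and only if the stream is not
     * already at 'position'. Assumes buf is non-empty.
     */
    static long copyFrom(SeekableStream in, OutputStream out, long position, byte[] buf)
            throws IOException {
        if (in.getPos() != position) {
            in.seek(position);
        }
        long total = 0;
        int n;
        while ((n = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        ByteArraySeekable in =
                new ByteArraySeekable("0123456789".getBytes(StandardCharsets.US_ASCII));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long copied = copyFrom(in, out, 4, new byte[3]);
        System.out.println(copied + " bytes, seeks=" + in.seekCalls); // prints "6 bytes, seeks=1"
    }
}
```

In this toy stream a seek is just an index assignment, so the sketch shows only the control flow: one conditional seek, then plain sequential reads. Whether that removes the overheads described in the issue would still need the HDFS demonstration mentioned in the comment.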