hadoop-common-issues mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.
Date Wed, 07 Mar 2018 01:40:00 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16388859#comment-16388859 ]

Chris Douglas commented on HADOOP-15292:

Instead of passing a flag to {{readBytes}}, this can just call {{seek()}} outside the loop
(and include the {{getPos() != position}} optimization).
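A rough sketch of that shape, for illustration only. It is not the attached patch and does not use the Hadoop classes: {{SeekableSource}}, {{ByteArraySource}} and {{copyFrom}} are hypothetical stand-ins, assumed here so the example runs with only the JDK. The point is that the stream is positioned once, before the loop, and only if it is not already at the requested offset; every read inside the loop is then a plain sequential read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class SeekOnceCopySketch {

  /** Hypothetical stand-in for a seekable input stream (not the Hadoop API). */
  interface SeekableSource {
    long getPos() throws IOException;
    void seek(long pos) throws IOException;
    int read(byte[] buf, int off, int len) throws IOException;
  }

  /** In-memory source that counts seek() calls so the optimization is observable. */
  static final class ByteArraySource implements SeekableSource {
    private final byte[] data;
    private long pos;
    int seekCalls;

    ByteArraySource(byte[] data) {
      this.data = data;
    }

    @Override
    public long getPos() {
      return pos;
    }

    @Override
    public void seek(long newPos) throws IOException {
      if (newPos < 0 || newPos > data.length) {
        throw new IOException("seek out of range: " + newPos);
      }
      seekCalls++;
      pos = newPos;
    }

    @Override
    public int read(byte[] buf, int off, int len) {
      if (pos >= data.length) {
        return -1; // end of data
      }
      int n = (int) Math.min(len, data.length - pos);
      System.arraycopy(data, (int) pos, buf, off, n);
      pos += n;
      return n;
    }
  }

  /** Copies everything from {@code position} to end of source using sequential reads. */
  static long copyFrom(SeekableSource in, OutputStream out, long position, int bufferSize)
      throws IOException {
    if (bufferSize <= 0) {
      throw new IllegalArgumentException("bufferSize must be positive");
    }
    // Position the stream once, outside the loop, and skip the seek if already there.
    if (in.getPos() != position) {
      in.seek(position);
    }
    byte[] buf = new byte[bufferSize];
    long total = 0;
    int n;
    while ((n = in.read(buf, 0, buf.length)) >= 0) {
      out.write(buf, 0, n);
      total += n;
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    ByteArraySource src = new ByteArraySource("0123456789".getBytes(StandardCharsets.US_ASCII));
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    long copied = copyFrom(src, out, 4, 3);
    System.out.println("copied=" + copied + " data=" + out.toString("US-ASCII")
        + " seeks=" + src.seekCalls);
  }
}
```

On a retry the same entry point would be called again with the resume offset, so the seek still happens at most once per attempt.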

[~stevel@apache.org] are you set up to test S3? {{pread}} happens to have an expensive implementation
in HDFS (and other {{FileSystem}} impls), but creating a test for distcp to ensure the {{PositionedReadable}}
APIs aren't used seems excessive.

bq. Not sure if it's worth extending that unit test to track how many times we open the stream.
From the description, it's inside the DN where {{pread}} creates multiple streams. IIRC
the position of the stream isn't updated when using PR APIs. If the stream were shared that
could be an issue, but that's not in the design. In HDFS, updating the set of locations for
each read (without checking the distcp invariants) is also unused, here.

Demonstrating the fix with a demo in HDFS would be sufficient for commit, IMO. It might be
possible to add a test around the command itself to ensure the {{seek()}} is correct on retry,
but wiring the flaw into a test would require a {{MiniDFSCluster}}.

> Distcp's use of pread is slowing it down.
> -----------------------------------------
>                 Key: HADOOP-15292
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15292
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 3.0.0
>            Reporter: Virajith Jalaparti
>            Priority: Minor
>         Attachments: HADOOP-15292.000.patch
> Distcp currently uses positioned-reads (in RetriableFileCopyCommand#copyBytes) when the
source offset is > 0. This results in unnecessary overheads (new BlockReader being created
on the client-side, multiple readBlock() calls to the Datanodes, each of which requires the creation
of a BlockSender and an inputstream to the ReplicaInfo).

