hadoop-common-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-14766) Cloudup: an object store high performance dfs put command
Date Mon, 06 Nov 2017 19:16:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Steve Loughran updated HADOOP-14766:
    Attachment: HADOOP-14766-001.patch

Patch 001

this is the initial PoC imported into Hadoop under hadoop-common; it eliminates the copy & paste
of ContractTestUtils.NanoTime by moving the class and retaining the old one as a subclass
of the moved one.

I'm not 100% sure this is the right home, but we don't yet have an explicit cloud module.

Note: this also works with HDFS, and even with the local FS...any FS which implements its
own version of {{copyFromLocalFile}} will benefit from it.

Testing: only manually against S3A and its copyFromLocalFile.

There's no check for changed files, i.e. against checksums, timestamps or similar, and none is
planned. This is primarily a local-to-store upload program with speed comparable to that shipped
with the AWS SDK, but able to work with any remote HCFS store; it is not an incremental backup
mechanism. Though if someone were to issue getChecksum(path) across all the stores, it'd be
good to log that, possibly even export a minimal Avro file summary.

> Cloudup: an object store high performance dfs put command
> ---------------------------------------------------------
>                 Key: HADOOP-14766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14766
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-14766-001.patch
> {{hdfs put local s3a://path}} is suboptimal as it treewalks down the source tree
then, sequentially, copies each file up: it opens the source file as a stream, copies the contents
to a buffer, writes that to the dest file, and repeats.
> For S3A that hurts because
> * it's doing the upload inefficiently: the file can be uploaded just by handing the
pathname to the AWS transfer manager
> * it is doing it sequentially, when some parallelised upload would work. 
> * as the ordering of the files to upload is a recursive treewalk, it doesn't spread the
upload across multiple shards. 
> Better:
> * build the list of files to upload
> * upload in parallel, picking entries from the list at random and spreading across a
pool of uploaders
> * upload straight from the local file ({{copyFromLocalFile()}})
> * track IO load (files created/second) to estimate risk of throttling.
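The list-then-parallel-upload strategy above can be sketched roughly as below. This is a hypothetical illustration, not the attached patch: the names are made up, and the uploader callback stands in for the real per-file {{copyFromLocalFile()}} call; shuffling the list is what spreads consecutive uploads across key prefixes/shards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Sketch of the proposed strategy: build the full file list first, shuffle
// it so consecutive uploads do not hit adjacent key prefixes, then drain
// the list from a fixed-size worker pool. The uploader callback is a
// stand-in for FileSystem.copyFromLocalFile(src, dest) in the real tool.
public class ParallelUploadSketch {

    /** Upload all files from a shuffled list using a pool of workers. */
    static List<String> uploadAll(List<String> files, int threads,
                                  Consumer<String> uploader)
            throws InterruptedException {
        List<String> shuffled = new ArrayList<>(files);
        Collections.shuffle(shuffled, new Random(42)); // seeded for the demo
        List<String> completed =
            Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String f : shuffled) {
            pool.submit(() -> {
                uploader.accept(f);  // real tool: fs.copyFromLocalFile(...)
                completed.add(f);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return completed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> files = Arrays.asList("a/0", "a/1", "b/0", "b/1", "c/0");
        List<String> done = uploadAll(files, 3, f -> { /* simulated upload */ });
        System.out.println("uploaded " + done.size() + " files");
    }
}
```

Picking entries at random rather than in treewalk order is the key point: a recursive walk emits lexicographically adjacent paths, which all land on the same store shard and serialize there.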

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org
