hadoop-common-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-13655) document object store use with fs shell and distcp
Date Wed, 16 Nov 2016 22:27:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-13655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671867#comment-15671867 ]

ASF GitHub Bot commented on HADOOP-13655:

Github user liuml07 commented on a diff in the pull request:

    --- Diff: hadoop-tools/hadoop-distcp/src/site/markdown/DistCp.md.vm ---
    @@ -470,6 +470,105 @@ $H3 SSL Configurations for HSFTP sources
       The SSL configuration file must be in the class-path of the DistCp program.
    +$H3 DistCp and Object Stores
    +DistCp works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.
    +1. The JAR containing the object store implementation is on the classpath,
    +along with all of its dependencies.
    +1. Unless the JAR automatically registers its bundled filesystem clients,
    +the configuration may need to be modified to state the class which
    +implements the filesystem schema. All of the ASF's own object store clients
    +are self-registering.
    +1. The relevant object store access credentials must be available in the cluster
    +configuration, or be otherwise available in all cluster hosts.
    +DistCp can be used to upload data
    +hadoop distcp hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
    +To download data
    +hadoop distcp s3a://bucket/generated/results hdfs://nn1:8020/results
    +To copy data between object stores
    +hadoop distcp s3a://bucket/generated/results \
    +  wasb://updates@example.blob.core.windows.net
    +And to copy data within an object store
    +hadoop distcp wasb://updates@example.blob.core.windows.net/current \
    +  wasb://updates@example.blob.core.windows.net/old
    +And to use `-update` to only copy changed files.
    +hadoop distcp -update -numListstatusThreads 20  \
    +  swift://history.cluster1/2016 \
    +  hdfs://nn1:8020/history/2016
    +Because object stores are slow to list files, consider setting the `-numListstatusThreads`
    +option when performing a `-update` operation
    +on a large directory tree (the limit is 40 threads).
    +When `DistCp -update` is used with object stores,
    +generally only the modification time and length of the individual files are compared,
    +not any checksums. The fact that most object stores do have valid timestamps
    +for directories is irrelevant; only the file timestamps are compared.
    +However, it is important to have the clock of the client computers close
    +to that of the infrastructure, so that timestamps are consistent between
    +the client/HDFS cluster and that of the object store. Otherwise, changed files may be
    +missed or copied too often.
    +* The `-atomic` option causes a rename of the temporary data, so significantly
    +increases the time to commit work at the end of the operation. Furthermore,
    +as Object Stores other than (optionally) `wasb://` do not offer atomic renames of directories,
    +the `-atomic` operation doesn't actually deliver what is promised. *Avoid*.
    +* The `-append` option is not supported.
    +* The `-diff` option is not supported.
    +* CRC checking will not be performed, irrespective of the value of the `-skipCrc` option.
    +* All `-p` options, including those to preserve permissions, user and group information,
    +checksums and replication are generally ignored. The `wasb://` connector will
    +preserve the information, but not enforce the permissions.
    +* Some object store connectors offer an option for in-memory buffering of
    +output, for example the S3A connector. Using such an option while copying
    +large files may trigger some form of out of memory event,
    +be it a heap overflow or a YARN container termination.
    +This is particularly common if the network bandwidth
    +between the cluster and the object store is limited (such as when working
    +with remote object stores). It is best to disable/avoid such options and
    +rely on disk buffering.
    +* Copy operations within a single object store still take place in the Hadoop cluster,
    +even when the object store implements a more efficient COPY operation internally.
    +    That is, an operation such as
    --- End diff --
    The indentation is unnecessary?
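
For reference, the `-update` guidance in the quoted diff can be sketched as a small dry-run wrapper. This is only an illustration under stated assumptions: `build_update_cmd` is a hypothetical helper, not part of Hadoop, and the default of 20 threads is simply the value used in the diff's own example. It prints the command rather than running it, and clamps the listing thread count to the 40-thread limit stated in the text.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): builds a `distcp -update` command
# line and prints it, so the result can be reviewed before it is run.
build_update_cmd() {
  local src="$1" dest="$2" threads="${3:-20}"
  # Clamp to the 40-thread limit quoted in the diff text.
  if [ "$threads" -gt 40 ]; then
    threads=40
  fi
  printf 'hadoop distcp -update -numListstatusThreads %s %s %s\n' \
    "$threads" "$src" "$dest"
}

# Dry run: a request for 64 threads is printed with the clamped value.
build_update_cmd swift://history.cluster1/2016 hdfs://nn1:8020/history/2016 64
```

Printing the command first keeps the sketch safe to run on a machine without a Hadoop client; on a cluster the printed line could then be executed by hand.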

> document object store use with fs shell and distcp
> --------------------------------------------------
>                 Key: HADOOP-13655
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13655
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: documentation, fs, fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
> There are no specific docs for working with object stores from the {{hadoop fs}} shell or in distcp; people either suffer from this (performance, billing), or learn through trial and error what to do.
> Add a section in both fs shell and distcp docs covering use with object stores.

This message was sent by Atlassian JIRA

