impala-dev mailing list archives

From Henry Robinson <he...@cloudera.com>
Subject Re: [Impala-CR](cdh5-trunk) IMPALA-2904: Support INSERT and LOAD DATA on S3 and between filesystems
Date Mon, 21 Mar 2016 04:52:47 GMT
Great! I'll start reviewing this tomorrow. Could take a couple of days to
get through.

On 18 March 2016 at 17:39, Sailesh Mukil (Code Review) <gerrit@cloudera.org>
wrote:

> Sailesh Mukil has uploaded a new patch set (#5).
>
> Change subject: IMPALA-2904: Support INSERT and LOAD DATA on S3 and
> between filesystems
> ......................................................................
>
> IMPALA-2904: Support INSERT and LOAD DATA on S3 and between filesystems
>
> Previously Impala disallowed LOAD DATA and INSERT on S3. This patch
> functionally enables LOAD DATA and INSERT on S3 without making major
> changes for the sake of improving performance over S3. This patch also
> enables both INSERT and LOAD DATA between file systems.
>
> Added a Python S3 client called 'boto3' to access S3 from the Python
> tests. A new class called S3Client is introduced, which wraps the
> boto3 functions and has the same function signatures as
> PyWebHdfsClient by deriving from an abstract base class,
> BaseFileSystem, so that the two can be used interchangeably through a
> 'generic_client'. test_load.py is refactored to use this generic
> client. The ImpalaTestSuite setup creates a client according to the
> TARGET_FILESYSTEM environment variable and assigns it to the
> 'generic_client'.
>
> P.S.: Currently, test_load.py runs 15x slower on S3 than on
> HDFS (even after removing one query for S3). Performance needs
> to be improved in future patches. INSERT performance is slower
> than on HDFS too. However, larger INSERTs come closer to HDFS
> performance than smaller INSERTs.
>
> ACLs are not taken care of for S3 in this patch. It is something
> that still needs to be discussed before implementing.
>
> Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
> ---
> M be/src/exec/hdfs-table-sink.cc
> M be/src/exec/hdfs-table-sink.h
> M be/src/runtime/coordinator.cc
> M be/src/runtime/hdfs-fs-cache.h
> M be/src/util/hdfs-bulk-ops.cc
> M be/src/util/hdfs-bulk-ops.h
> M be/src/util/hdfs-util.cc
> M be/src/util/hdfs-util.h
> M common/thrift/ImpalaInternalService.thrift
> M fe/src/main/java/com/cloudera/impala/analysis/InsertStmt.java
> M fe/src/main/java/com/cloudera/impala/analysis/LoadDataStmt.java
> M fe/src/main/java/com/cloudera/impala/common/FileSystemUtil.java
> M fe/src/main/java/com/cloudera/impala/service/Frontend.java
> M infra/python/deps/requirements.txt
> M testdata/workloads/functional-query/queries/QueryTest/insert_permutation.test
> M testdata/workloads/functional-query/queries/QueryTest/multiple-filesystems.test
> M testdata/workloads/functional-query/queries/QueryTest/truncate-table.test
> M testdata/workloads/tpch/queries/insert_parquet.test
> M tests/common/impala_test_suite.py
> M tests/common/skip.py
> M tests/custom_cluster/test_insert_behaviour.py
> M tests/custom_cluster/test_parquet_max_page_header.py
> M tests/data_errors/test_data_errors.py
> M tests/metadata/test_compute_stats.py
> M tests/metadata/test_ddl.py
> M tests/metadata/test_explain.py
> M tests/metadata/test_hdfs_encryption.py
> M tests/metadata/test_hdfs_permissions.py
> M tests/metadata/test_last_ddl_time_update.py
> M tests/metadata/test_load.py
> M tests/metadata/test_partition_metadata.py
> M tests/metadata/test_recover_partitions.py
> M tests/metadata/test_show_create_table.py
> M tests/query_test/test_aggregation.py
> M tests/query_test/test_cancellation.py
> M tests/query_test/test_chars.py
> M tests/query_test/test_compressed_formats.py
> M tests/query_test/test_insert.py
> M tests/query_test/test_insert_behaviour.py
> M tests/query_test/test_insert_parquet.py
> M tests/query_test/test_join_queries.py
> M tests/query_test/test_queries.py
> M tests/query_test/test_scanners.py
> A tests/util/filesystem_base.py
> M tests/util/filesystem_utils.py
> M tests/util/hdfs_util.py
> A tests/util/s3_util.py
> 47 files changed, 565 insertions(+), 295 deletions(-)
>
>
>   git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/74/2574/5
> --
> To view, visit http://gerrit.cloudera.org:8080/2574
> To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
>
> Gerrit-MessageType: newpatchset
> Gerrit-Change-Id: I94e15ad67752dce21c9b7c1dced6e114905a942d
> Gerrit-PatchSet: 5
> Gerrit-Project: Impala
> Gerrit-Branch: cdh5-trunk
> Gerrit-Owner: Sailesh Mukil <sailesh@cloudera.com>
> Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
> Gerrit-Reviewer: Sailesh Mukil <sailesh@cloudera.com>
>
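
For readers skimming the quoted description, the 'generic_client' arrangement might look roughly like the sketch below. Only BaseFileSystem, generic_client and TARGET_FILESYSTEM are names taken from the commit message. The method names (create_file, exists, delete_file), the InMemoryClient stand-in, the make_generic_client helper and the 'hdfs' default are assumptions made for illustration, not the patch's actual code, and no boto3 or WebHDFS calls are made.

```python
import os
from abc import ABC, abstractmethod


class BaseFileSystem(ABC):
    """Shared interface; the method names are illustrative assumptions."""

    @abstractmethod
    def create_file(self, path, file_data):
        ...

    @abstractmethod
    def exists(self, path):
        ...

    @abstractmethod
    def delete_file(self, path):
        ...


class InMemoryClient(BaseFileSystem):
    """Hypothetical stand-in backend so the sketch runs without HDFS or S3."""

    def __init__(self, scheme):
        self.scheme = scheme
        self._files = {}

    def create_file(self, path, file_data):
        self._files[path] = file_data
        return True

    def exists(self, path):
        return path in self._files

    def delete_file(self, path):
        # True only if the file was present and has been removed.
        return self._files.pop(path, None) is not None


def make_generic_client(target=None):
    """Choose a client from TARGET_FILESYSTEM (the 'hdfs' default is an assumption)."""
    target = target or os.environ.get("TARGET_FILESYSTEM", "hdfs")
    if target not in ("hdfs", "s3"):
        raise ValueError("unsupported filesystem: %s" % target)
    return InMemoryClient(target)


# Test code talks only to the shared interface, never to a concrete backend.
generic_client = make_generic_client("s3")
generic_client.create_file("test-warehouse/t/data.txt", b"1,2,3\n")
```

The point of the design is that a test such as test_load.py needs no filesystem-specific branches: swapping the backend changes only what the setup code assigns to the generic client.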



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679
