From: tarasbob@apache.org
To: commits@impala.incubator.apache.org
Date: Fri, 02 Jun 2017 00:53:04 -0000
Message-Id: <2abff4d91cb946489296b4b5797567d2@git.apache.org>
Subject: [3/4] incubator-impala git commit: IMPALA-5383: Fix PARQUET_FILE_SIZE option for ADLS

IMPALA-5383: Fix PARQUET_FILE_SIZE option for ADLS

The PARQUET_FILE_SIZE query option does not work with ADLS because
AdlFileSystem has no notion of block sizes, and Impala depends on the
filesystem remembering the requested block size, which is then used as
the target Parquet file size. (This is done for HDFS so that the Parquet
file size and the block size match even if PARQUET_FILE_SIZE is not a
valid block size.)
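For background, the HDFS behaviour Impala relies on here can be sketched
with the libhdfs C API (an illustrative sketch only, not part of this
change; the header path and exact types vary between Hadoop versions):
the block size requested when a file is created is remembered by HDFS and
reported back by a later stat, whereas AdlFileSystem and S3A only ever
report the filesystem default.

// Illustrative sketch, not Impala source. Assumes the libhdfs C API.
#include <fcntl.h>
#include <stdint.h>
#include <hdfs.h>  // shipped with Hadoop; the include path varies by install

// On HDFS, the block size requested at create time is persisted and is
// visible when the file is stat'ed later. On AdlFileSystem and S3A the
// stat'ed value is just the filesystem default, so the writer itself has
// to remember the size it asked for.
int64_t RoundTripBlockSize(hdfsFS fs, const char* path, int32_t requested) {
  hdfsFile f = hdfsOpenFile(fs, path, O_WRONLY, /*bufferSize=*/0,
                            /*replication=*/0, /*blocksize=*/requested);
  if (f == NULL) return -1;
  hdfsCloseFile(fs, f);
  hdfsFileInfo* info = hdfsGetPathInfo(fs, path);
  if (info == NULL) return -1;
  int64_t reported = info->mBlockSize;  // equals 'requested' on HDFS
  hdfsFreeFileInfo(info, 1);
  return reported;
}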
We special-case ADLS, just as we do for S3, to bypass the FileSystem
block size and instead use the requested PARQUET_FILE_SIZE as the output
partition's block_size (and consequently as the target Parquet file
size).

Testing: Re-enabled test_insert_parquet_verify_size() for ADLS. Also
fixed a miscellaneous bug in the ADLS client listing helper function.

Change-Id: I474a913b0ff9b2709f397702b58cb1c74251c25b
Reviewed-on: http://gerrit.cloudera.org:8080/7018
Reviewed-by: Sailesh Mukil
Tested-by: Impala Public Jenkins

Project: http://git-wip-us.apache.org/repos/asf/incubator-impala/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-impala/commit/117fc388
Tree: http://git-wip-us.apache.org/repos/asf/incubator-impala/tree/117fc388
Diff: http://git-wip-us.apache.org/repos/asf/incubator-impala/diff/117fc388

Branch: refs/heads/branch-2.9.0
Commit: 117fc388bff2a754be081eae7667627f84f1b33c
Parents: 2ffc86a
Author: Sailesh Mukil
Authored: Tue May 30 18:56:43 2017 +0000
Committer: Taras Bobrovytsky
Committed: Thu Jun 1 17:51:57 2017 -0700

----------------------------------------------------------------------
 be/src/exec/hdfs-table-sink.cc          | 8 +++++---
 be/src/util/hdfs-util.cc                | 7 +++++++
 be/src/util/hdfs-util.h                 | 3 +++
 tests/query_test/test_insert_parquet.py | 4 ----
 tests/util/adls_util.py                 | 3 ++-
 5 files changed, 17 insertions(+), 8 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/117fc388/be/src/exec/hdfs-table-sink.cc
----------------------------------------------------------------------
diff --git a/be/src/exec/hdfs-table-sink.cc b/be/src/exec/hdfs-table-sink.cc
index 9da6e57..b49451a 100644
--- a/be/src/exec/hdfs-table-sink.cc
+++ b/be/src/exec/hdfs-table-sink.cc
@@ -390,10 +390,12 @@ Status HdfsTableSink::CreateNewTmpFile(RuntimeState* state,
         output_partition->current_file_name));
   }
 
-  if (IsS3APath(output_partition->current_file_name.c_str())) {
+  if (IsS3APath(output_partition->current_file_name.c_str()) ||
+      IsADLSPath(output_partition->current_file_name.c_str())) {
     // On S3A, the file cannot be stat'ed until after it's closed, and even so, the block
-    // size reported will be just the filesystem default. So, remember the requested
-    // block size.
+    // size reported will be just the filesystem default. Similarly, the block size
+    // reported for ADLS will be the filesystem default. So, remember the requested block
+    // size.
     output_partition->block_size = block_size;
   } else {
     // HDFS may choose to override the block size that we've recommended, so for non-S3

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/117fc388/be/src/util/hdfs-util.cc
----------------------------------------------------------------------
diff --git a/be/src/util/hdfs-util.cc b/be/src/util/hdfs-util.cc
index 440b68d..28d318c 100644
--- a/be/src/util/hdfs-util.cc
+++ b/be/src/util/hdfs-util.cc
@@ -85,6 +85,13 @@ bool IsS3APath(const char* path) {
   return strncmp(path, "s3a://", 6) == 0;
 }
 
+bool IsADLSPath(const char* path) {
+  if (strstr(path, ":/") == NULL) {
+    return ExecEnv::GetInstance()->default_fs().compare(0, 6, "adl://") == 0;
+  }
+  return strncmp(path, "adl://", 6) == 0;
+}
+
 // Returns the length of the filesystem name in 'path' which is the length of the
 // 'scheme://authority'. Returns 0 if the path is unqualified.
 static int GetFilesystemNameLength(const char* path) {
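The two cases that IsADLSPath() distinguishes can be shown with a small
self-contained sketch (illustrative only, not part of the patch;
'default_fs' stands in for ExecEnv::GetInstance()->default_fs() in the
real code):

#include <cstring>
#include <iostream>
#include <string>

// A fully qualified path is recognized by its "adl://" prefix; an unqualified
// path (one with no "scheme://") lives on the process-wide default filesystem.
static bool IsAdlPathSketch(const char* path, const std::string& default_fs) {
  if (std::strstr(path, ":/") == nullptr) {
    return default_fs.compare(0, 6, "adl://") == 0;
  }
  return std::strncmp(path, "adl://", 6) == 0;
}

int main() {
  const std::string adl_fs = "adl://account.azuredatalakestore.net";
  std::cout << IsAdlPathSketch("adl://account.azuredatalakestore.net/t/x.parq",
                               "hdfs://nn:8020")                         // 1 (true)
            << IsAdlPathSketch("/user/hive/warehouse/t/x.parq", adl_fs)  // 1 (true)
            << IsAdlPathSketch("s3a://bucket/x.parq", adl_fs)            // 0 (false)
            << std::endl;
  return 0;
}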

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/117fc388/be/src/util/hdfs-util.h
----------------------------------------------------------------------
diff --git a/be/src/util/hdfs-util.h b/be/src/util/hdfs-util.h
index 32be643..b9f415b 100644
--- a/be/src/util/hdfs-util.h
+++ b/be/src/util/hdfs-util.h
@@ -50,6 +50,9 @@ bool IsHdfsPath(const char* path);
 /// Returns true iff the path refers to a location on an S3A filesystem.
 bool IsS3APath(const char* path);
 
+/// Returns true iff the path refers to a location on an ADL filesystem.
+bool IsADLSPath(const char* path);
+
 /// Returns true iff 'pathA' and 'pathB' are on the same filesystem.
 bool FilesystemsMatch(const char* pathA, const char* pathB);
 }

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/117fc388/tests/query_test/test_insert_parquet.py
----------------------------------------------------------------------
diff --git a/tests/query_test/test_insert_parquet.py b/tests/query_test/test_insert_parquet.py
index ee24549..c19363f 100644
--- a/tests/query_test/test_insert_parquet.py
+++ b/tests/query_test/test_insert_parquet.py
@@ -161,10 +161,6 @@ class TestInsertParquetVerifySize(ImpalaTestSuite):
     cls.ImpalaTestMatrix.add_dimension(
         ImpalaTestDimension("compression_codec", *PARQUET_CODECS))
 
-  # ADLS does not have a configurable block size, so the 'PARQUET_FILE_SIZE' option
-  # that's passed as a hint to Hadoop is ignored for AdlFileSystem. So, we skip this
-  # test for ADLS.
-  @SkipIfADLS.hdfs_block_size
   @SkipIfIsilon.hdfs_block_size
   @SkipIfLocal.hdfs_client
   def test_insert_parquet_verify_size(self, vector, unique_database):

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/117fc388/tests/util/adls_util.py
----------------------------------------------------------------------
diff --git a/tests/util/adls_util.py b/tests/util/adls_util.py
index f616074..b72b4c1 100644
--- a/tests/util/adls_util.py
+++ b/tests/util/adls_util.py
@@ -73,4 +73,5 @@ class ADLSClient(BaseFilesystem):
   def get_all_file_sizes(self, path):
     """Returns a list of integers which are all the file sizes of files found under
     'path'."""
-    return [self.adlsclient.info(f)['length'] for f in self.ls(path)]
+    return [self.adlsclient.info(f)['length'] for f in self.adlsclient.ls(path) \
+        if self.adlsclient.info(f)['type'] == 'FILE']