[22/51] [partial] incubator-impala git commit: IMPALA-3398: Add docs to main Impala branch.
From: jbapple@apache.org
Date: Thu, 17 Nov 2016 23:12:00 -0000

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_known_issues.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_known_issues.xml b/docs/topics/impala_known_issues.xml
new file mode 100644
index 0000000..e57ec62
--- /dev/null
+++ b/docs/topics/impala_known_issues.xml
@@ -0,0 +1,1812 @@
+
+ <ph audience="standalone">Known Issues and Workarounds in Impala</ph><ph audience="integrated">Apache Impala (incubating) Known Issues</ph>

+ The following sections describe known issues and workarounds in Impala, as of the current production release. This page summarizes the + most serious or frequently encountered issues in the current release, to help you make planning decisions about installing and + upgrading. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and + whether a fix is in the pipeline. +

+ + + The online issue tracking system for Impala contains comprehensive information and is updated in real time. To verify whether an issue + you are experiencing has already been reported, or which release an issue is fixed in, search on the + issues.cloudera.org JIRA tracker. + + +

+ +

+ For issues fixed in various Impala releases, see . +

+ + + +
+ + + + + + Impala Known Issues: Crashes and Hangs + + + +

+ These issues can cause Impala to quit or become unresponsive. +

+ +
+ + + + Setting BATCH_SIZE query option too large can cause a crash + + + +

+ Using a value in the millions for the BATCH_SIZE query option, together with wide rows or large string values in + columns, could cause a memory allocation of more than 2 GB resulting in a crash. +

+ +
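As a precaution, the option can be left at its default in impala-shell (a minimal sketch; the value 0 tells Impala to use its internal default of 1024 rows per batch):

```sql
-- Reset BATCH_SIZE to its default rather than an extreme value
-- in the millions.
set batch_size=0;
```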

+ Bug: IMPALA-3069 +

+ +

+ Severity: High +

+ +

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0.

+ +
+ +
+ + + + Queries on malformed Avro data can cause a crash + + + +

+ Malformed Avro data, such as out-of-bounds integers or values in the wrong format, could cause a crash when queried. +

+ +

+ Bug: IMPALA-3441 +

+ +

+ Severity: High +

+ +

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.2 / Impala 2.6.2.

+ +
+ +
+ + + + Queries may hang on server-to-server exchange errors + + + +

+ The DataStreamSender::Channel::CloseInternal() method does not close the channel on an error. This causes the node on 
 the other side of the channel to wait indefinitely, causing a hang. 

+ +

+ Bug: IMPALA-2592 +

+ +

+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0. +

+ +
+ +
+ + + + impalad crashes if the UDF JAR is no longer available at its HDFS location + + + +

+ If the JAR file corresponding to a Java UDF is removed from HDFS after the Impala CREATE FUNCTION statement is + issued, the impalad daemon crashes. +

+ +
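A hedged sketch of the pattern that triggers the crash (the database, path, and class names here are hypothetical): the JAR referenced at CREATE FUNCTION time must remain at its HDFS location.

```sql
-- The JAR must stay at this HDFS path after the function is created;
-- removing it and then using the UDF crashed impalad (IMPALA-2365).
create function my_db.my_udf(string) returns string
  location '/user/impala/udfs/my_udf.jar'
  symbol='com.example.MyUdf';
```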

+ Bug: IMPALA-2365 +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

+ +
+ +
+ +
+ + + + Impala Known Issues: Performance + + + +

+ These issues involve the performance of operations such as queries or DDL statements. +

+ +
+ + + + + + Slow DDL statements for tables with large number of partitions + + + +

+ DDL statements for tables with a large number of partitions might be slow. +

+ +

+ Bug: IMPALA-1480 +

+ +

+ Workaround: Run the DDL statement in Hive if the slowness is an issue. +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

+ +
+ +
+ +
+ + + + Impala Known Issues: Usability + + + +

+ These issues affect the convenience of interacting directly with Impala, typically through the Impala shell or Hue. +

+ +
+ + + + Unexpected privileges in show output + + + +

+ Due to a timing condition in updating cached policy data from Sentry, the SHOW statements for Sentry roles could + sometimes display out-of-date role settings. Because Impala rechecks authorization for each SQL statement, this discrepancy does + not represent a security issue for other statements. +

+ +

+ Bug: IMPALA-3133 +

+ +

+ Severity: High +

+ +


+ +

Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0 and CDH 5.7.1 / Impala 2.5.1.

+ +
+ +
+ + + + Less than 100% progress on completed simple SELECT queries + + + +

+ Simple SELECT queries show less than 100% progress even though they are already completed. +

+ +

+ Bug: IMPALA-1776 +

+ +
+ +
+ + + + Unexpected column overflow behavior with INT datatypes + + + +

+ +

+ Bug: + IMPALA-3123 +

+ +
+ +
+ +
+ + + + Impala Known Issues: JDBC and ODBC Drivers + + + +

+ These issues affect applications that use the JDBC or ODBC APIs, such as business intelligence tools or custom-written applications + in languages such as Java or C++. +

+ +
+ + + + + + ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column) + + + +

+ If the ODBC SQLGetData is called on a series of columns, the function calls must follow the same order as the + columns. For example, if data is fetched from column 2 then column 1, the SQLGetData call for column 1 returns + NULL. +

+ +

+ Bug: IMPALA-1792 +

+ +

+ Workaround: Fetch columns in the same order they are defined in the table. +

+ +
+ +
+ +
+ + + + Impala Known Issues: Security + + + +

+ These issues relate to security features, such as Kerberos authentication, Sentry authorization, encryption, auditing, and + redaction. +

+ +
+ + + + + + impala-shell requires Python with ssl module + + + +

+ On CentOS 5.10 and Oracle Linux 5.11 using the built-in Python 2.4, invoking the impala-shell with the + --ssl option might fail with the following error: +

+ + +Unable to import the python 'ssl' module. It is required for an SSL-secured connection. + + + + +

+ Severity: Low, workaround available +

+ +

+ Resolution: Customers are less likely to experience this issue over time, because the ssl module is included 
 in the Python releases packaged with more recent Linux releases. 

+ +

+ Workaround: To use SSL with impala-shell on these platform versions, install the ssl 
 Python module: 

+ + +yum install python-ssl + + +

+ impala-shell can then connect using SSL. For example: 

+ + +impala-shell -s impala --ssl --ca_cert /path_to_truststore/truststore.pem + + +
+ +
+ + + + + + Kerberos tickets must be renewable + + + +

+ In a Kerberos environment, the impalad daemon might not start if Kerberos tickets are not renewable. +

+ +

+ Workaround: Configure your KDC to allow tickets to be renewed, and configure krb5.conf to request + renewable tickets. +
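For example, the client side of this configuration might look like the following krb5.conf fragment (the lifetime values are placeholders; the KDC's maximum renewable lifetime must also permit renewal):

```
[libdefaults]
  ticket_lifetime = 24h
  renew_lifetime = 7d
```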

+ +
+ +
+ + + +
+ + + + + + Impala Known Issues: Resources + + + +

+ These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management + features. +

+ +
+ + + + Impala catalogd heap issues when upgrading to 5.7 + + + +

+ The default heap size for Impala catalogd has changed in CDH 5.7 / Impala 2.5 and higher: 

+ +
    +
  • Before CDH 5.7, by default catalogd used the JVM's default heap size, which is the smaller of 1/4 of the physical memory or 32 GB.
  • Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.
+ +

+ For example, on a host with 128 GB of physical memory, this change decreases the catalogd heap from 32 GB to 4 GB, which can result 
 in out-of-memory errors in catalogd and lead to query failures. 

+ +

+ Bug: TSB-168 +

+ +

+ Severity: High +

+ +

+ Workaround: Increase the catalogd memory limit as follows. + + +

+ +
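One way to raise the catalogd JVM heap on a package-based installation is through the daemon's environment. This is a sketch under the assumption that catalogd picks up JAVA_TOOL_OPTIONS; the file path and the 8 GB value are examples only, to be sized for your metadata volume:

```
# /etc/default/impala (hypothetical location; adjust -Xmx as needed)
export JAVA_TOOL_OPTIONS="-Xmx8g"
```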

+ + + + + + + + Breakpad minidumps can be very large when the thread count is high + + + +

+ The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the + minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads. +

+ +

+ Bug: IMPALA-3509 +

+ +

+ Severity: High +

+ +

+ Workaround: Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the + size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread + from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump + file can still grow larger than the hinted size. For example, if you have 10,000 threads, the minidump file can be more + than 20 MB. +

+ +
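For example, on a package-based installation the flag could be added to the daemon startup flags (the file path and the 20 MB limit are illustrative, not recommendations):

```
# /etc/default/impala (hypothetical location)
IMPALA_SERVER_ARGS="${IMPALA_SERVER_ARGS} -minidump_size_limit_hint_kb=20480"
```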
+ +
+ + + + Parquet scanner memory increase after IMPALA-2736 + + + +

+ The initial release containing this change sometimes has a higher peak memory usage than previous releases while reading 
 Parquet files. 

+ +

+ The fix for IMPALA-2736 improves the efficiency of Parquet scans by up to 2x. The faster scans 
 may result in a higher peak memory consumption compared to earlier versions of Impala due to the new column-wise row 
 materialization strategy. You are likely to experience higher memory consumption in any of the following scenarios: 

    +
  • Very wide rows due to projecting many columns in a scan.
  • Very large rows due to big column values, for example, long strings or nested collections with many items.
  • Producer/consumer speed imbalances, leading to more rows being buffered between a scan (producer) and downstream (consumer) plan nodes.
+

+ +

+ Bug: IMPALA-3662 +

+ +

+ Severity: High +

+ +

+ Workaround: The following query options might help to reduce memory consumption in the Parquet scanner: +

    +
  • Reduce the number of scanner threads, for example: set num_scanner_threads=30
  • Reduce the batch size, for example: set batch_size=512
  • Increase the memory limit, for example: set mem_limit=64g
+
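The options above can be applied together in a single impala-shell session (the values are the examples from the list, not tuned recommendations):

```sql
set num_scanner_threads=30;
set batch_size=512;
set mem_limit=64g;
```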

+ +
+ +
+ + + + Process mem limit does not account for the JVM's memory usage + + + + + +

+ Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the + impalad daemon. +

+ +

+ Bug: IMPALA-691 +

+ +

+ Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the + Impala web UI /memz tab to JVM memory usage shown on the /metrics tab. +

+ +
+ +
+ + + + + + Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false + + + +

+ +

+ Bug: IMPALA-2375 +

+ +

+ Workaround: Transition away from the old-style join and aggregation mechanism if practical. +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

+ +
+ +
+ +
+ + + + Impala Known Issues: Correctness + + + +

+ These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances. +

+ +
+ + + + Incorrect assignment of NULL checking predicate through an outer join of a nested collection. + + + +

+ A query could return wrong results (too many or too few NULL values) if it referenced an outer-joined nested + collection and also contained a null-checking predicate (IS NULL, IS NOT NULL, or the + <=> operator) in the WHERE clause. +

+ +

+ Bug: IMPALA-3084 +

+ +

+ Severity: High +

+ +

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0.

+ +
+ +
+ + + + Incorrect result due to constant evaluation in query with outer join + + + +

+ An OUTER JOIN query could omit some expected result rows due to a constant such as FALSE in + another join clause. For example: +

+ + + + +
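A hypothetical query of the affected shape combines an OUTER JOIN with a constant predicate in another join clause (the table and column names are illustrative only):

```sql
-- Hypothetical illustration: the constant FALSE in the second join's
-- ON clause could cause expected rows from the outer join to be
-- omitted (IMPALA-3094).
select t1.id
from t1 left outer join t2 on t1.id = t2.id
        inner join t3 on t2.id = t3.id and false;
```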

+ Bug: IMPALA-3094 +

+ +

+ Severity: High +

+ +

+ Resolution: +

+ +

+ Workaround: +

+ +
+ +
+ + + + Incorrect assignment of an inner join On-clause predicate through an outer join. + + + +

+ Impala may return incorrect results for queries that have the following properties: +

+ +
    +
  • +

    + There is an INNER JOIN following a series of OUTER JOINs. +

    +
  • + +
  • +

    + The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the + preceding OUTER JOINs. +

    +
  • +
+ +

+ The following query demonstrates the issue: +

+ + +select 1 from functional.alltypes a left outer join + functional.alltypes b on a.id = b.id left outer join + functional.alltypes c on b.id = c.id right outer join + functional.alltypes d on c.id = d.id inner join functional.alltypes e +on b.int_col = c.int_col; + + +

+ The following listing shows the incorrect EXPLAIN plan: +

+ + c.id | +| | | +| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | +| | hash predicates: b.id = a.id | +| | runtime filters: RF002 <- a.id | +| | | +| |--10:EXCHANGE [HASH(a.id)] | +| | | | +| | 00:SCAN HDFS [functional.alltypes a] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 09:EXCHANGE [HASH(b.id)] | +| | | +| 01:SCAN HDFS [functional.alltypes b] | +| partitions=24/24 files=24 size=478.45KB | +| runtime filters: RF001 -> b.int_col, RF002 -> b.id | ++-----------------------------------------------------------+ +]]> + + +

+ Bug: IMPALA-3126 +

+ +

+ Severity: High +

+ +

+ Workaround: 

+ +

+ For some queries, this problem can be worked around by placing the problematic ON clause predicate in the + WHERE clause instead, or changing the preceding OUTER JOINs to INNER JOINs (if + the ON clause predicate would discard NULLs). For example, to fix the problematic query above: +

+ + c.id | +| | | +| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] | +| | hash predicates: b.id = a.id | +| | runtime filters: RF001 <- a.id | +| | | +| |--10:EXCHANGE [HASH(a.id)] | +| | | | +| | 00:SCAN HDFS [functional.alltypes a] | +| | partitions=24/24 files=24 size=478.45KB | +| | | +| 09:EXCHANGE [HASH(b.id)] | +| | | +| 01:SCAN HDFS [functional.alltypes b] | +| partitions=24/24 files=24 size=478.45KB | +| runtime filters: RF001 -> b.id | ++-----------------------------------------------------------+ +]]> + + +
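One hedged way to apply this rewrite to the query above is to move the On-clause predicate into the WHERE clause; because the INNER JOIN on e then has no join predicate, it is written as a CROSS JOIN here. This is equivalent only when the predicate discards NULLs, which the equality predicate does:

```sql
-- Sketch of the workaround applied to the problematic query above.
select 1 from functional.alltypes a left outer join
  functional.alltypes b on a.id = b.id left outer join
  functional.alltypes c on b.id = c.id right outer join
  functional.alltypes d on c.id = d.id cross join functional.alltypes e
where b.int_col = c.int_col;
```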
+ +
+ + + + Impala may use incorrect bit order with BIT_PACKED encoding + + + +

+ Parquet BIT_PACKED encoding as implemented by Impala is LSB first. The Parquet standard says it is MSB first. 

+ +

+ Bug: IMPALA-3006 +

+ +

+ Severity: High, but rare in practice because BIT_PACKED is infrequently used, is not written by Impala, and is deprecated + in Parquet 2.0. +

+ +
+ +
+ + + + BST between 1972 and 1995 + + + +

+ The calculation of start and end times for the BST (British Summer Time) time zone could be incorrect between 1972 and 1995. + Between 1972 and 1995, BST began and ended at 02:00 GMT on the third Sunday in March (or second Sunday when Easter fell on the + third) and fourth Sunday in October. For example, both function calls should return 13, but actually return 12, in a query such + as: +

+ + +select + extract(from_utc_timestamp(cast('1970-01-01 12:00:00' as timestamp), 'Europe/London'), "hour") summer70start, + extract(from_utc_timestamp(cast('1970-12-31 12:00:00' as timestamp), 'Europe/London'), "hour") summer70end; + + +

+ Bug: IMPALA-3082 +

+ +

+ Severity: High +

+ +
+ +
+ + + + parse_url() returns incorrect result if @ character in URL + + + +

+ If a URL contains an @ character, the parse_url() function could return an incorrect value for + the hostname field. +

+ +

+ Bug: IMPALA-1170 +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.

+ +
+ +
+ + + + % escaping does not work correctly when it occurs at the end of a LIKE clause + + + +

+ If the final character in the RHS argument of a LIKE operator is an escaped \% character, it + does not match a % final character of the LHS argument. +

+ +
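A hypothetical probe for the behavior (literal values chosen for illustration; the backslash is the default LIKE escape character):

```sql
-- The escaped \% at the end of the pattern fails to match a literal
-- trailing % in the left-hand argument (IMPALA-2422).
select 'ab%' like 'ab\%';
```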

+ Bug: IMPALA-2422 +

+ +
+ +
+ + + + ORDER BY rand() does not work. + + + +

+ Because the value for rand() is computed early in a query, using an ORDER BY expression + involving a call to rand() does not actually randomize the results. +

+ +

+ Bug: IMPALA-397 +

+ +
+ +
+ + + + Duplicated column in inline view causes dropping null slots during scan + + + +

+ If the same column is queried twice within a view, NULL values for that column are omitted. For example, the + result of COUNT(*) on the view could be less than expected. +

+ +

+ Bug: IMPALA-2643 +

+ +

+ Workaround: Avoid selecting the same column twice within an inline view. +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.10 / Impala 2.2.10.

+ +
+ +
+ + + + + + Incorrect assignment of predicates through an outer join in an inline view. + + + +

+ A query involving an OUTER JOIN clause where one of the table references is an inline view might apply predicates + from the ON clause incorrectly. +

+ +

+ Bug: IMPALA-1459 +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.

+ +
+ +
+ + + + Crash: impala::Coordinator::ValidateCollectionSlots + + + +

+ A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving 
 subqueries. 

+ +

+ Bug: IMPALA-2603 +

+ +
+ +
+ + + + Incorrect assignment of On-clause predicate inside inline view with an outer join. + + + +

+ A query might return incorrect results due to wrong predicate assignment in the following scenario: +

+ +
    +
  1. There is an inline view that contains an outer join.
  2. That inline view is joined with another table in the enclosing query block.
  3. That join has an On-clause containing a predicate that only references columns originating from the outer-joined tables inside the inline view.
+ +

+ Bug: IMPALA-2665 +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.

+ +
+ +
+ + + + Wrong assignment of having clause predicate across outer join + + + +

+ In an OUTER JOIN query with a HAVING clause, the comparison from the HAVING + clause might be applied at the wrong stage of query processing, leading to incorrect results. +

+ +

+ Bug: IMPALA-2144 +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

+ +
+ +
+ + + + Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate + + + +

+ A NOT IN operator with a subquery that calls an aggregate function, such as NOT IN (SELECT + SUM(...)), could return incorrect results. +

+ +

+ Bug: IMPALA-2093 +

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.

+ +
+ +
+ +
+ + + + Impala Known Issues: Metadata + + + +

+ These issues affect how Impala interacts with metadata. They cover areas such as the metastore database, the COMPUTE + STATS statement, and the Impala catalogd daemon. +

+ +
+ + + + Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats + + + +

+ Incremental stats use up about 400 bytes per partition for each column. For example, for a table with 20K partitions and 100 + columns, the memory overhead from incremental statistics is about 800 MB. When serialized for transmission across the network, + this metadata exceeds the 2 GB Java array size limit and leads to a catalogd crash. +

+ +

+ Bugs: IMPALA-2647, + IMPALA-2648, + IMPALA-2649 +

+ +

+ Workaround: If feasible, compute full stats periodically and avoid computing incremental stats for that table. The + scalability of incremental stats computation is a continuing work item. +

+ +
+ +
+ + + + + + Can't update stats manually via alter table after upgrading to CDH 5.2 + + + +

+ +

+ Bug: IMPALA-1420 +

+ +

+ Workaround: On CDH 5.2, when adjusting table statistics manually by setting the numRows, you must also + enable the Boolean property STATS_GENERATED_VIA_STATS_TASK. For example, use a statement like the following to + set both properties with a single ALTER TABLE statement: +

+ +ALTER TABLE table_name SET TBLPROPERTIES('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK' = 'true'); + +

+ Resolution: The underlying cause is the issue + HIVE-8648 that affects the + metastore in Hive 0.13. The workaround is only needed until the fix for this issue is incorporated into a CDH release. +

+ +
+ +
+ +
+ + + + Impala Known Issues: Interoperability + + + +

+ These issues affect the ability to interchange data between Impala and other database systems. They cover areas such as data types + and file formats. +

+ +
+ + + + + + DESCRIBE FORMATTED gives error on Avro table + + + +

+ This issue can occur either on old Avro tables (created prior to Hive 1.1 / CDH 5.4) or when changing the Avro schema file by + adding or removing columns. Columns added to the schema file will not show up in the output of the DESCRIBE + FORMATTED command. Removing columns from the schema file will trigger a NullPointerException. +

+ +

+ As a workaround, you can use the output of SHOW CREATE TABLE to drop and recreate the table. This will populate + the Hive metastore database with the correct column definitions. +

+ + + Only use this for external tables, or Impala will remove the data files. In case of an internal table, set it to external first: + +ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE'); + + (The part in parentheses is case sensitive.) Make sure to pick the right choice between internal and external when recreating the + table. See for the differences between internal and external tables. + + +

+ Bug: CDH-41605 +

+ +

+ Severity: High +

+ +
+ +
+ + + + + + Deviation from Hive behavior: Impala does not do implicit casts between string and numeric and boolean types. + + + +

+ 

+ +

+ Anticipated Resolution: None +

+ +

+ Workaround: Use explicit casts. +

+ +
+ +
+ + + + + + Deviation from Hive behavior: Out-of-range float/double values are returned as the maximum allowed value of the type (Hive returns NULL) + + + +

+ Impala behavior differs from Hive with respect to out-of-range float/double values. Out-of-range values are returned as the maximum 
 allowed value of the type (Hive returns NULL). 

+ +

+ Bug: IMPALA-175 

+ +

+ Workaround: None +

+ +
+ +
+ + + + + + Configuration needed for Flume to be compatible with Impala + + + +

+ For compatibility with Impala, the value for the Flume HDFS Sink hdfs.writeFormat must be set to + Text, rather than its default value of Writable. The hdfs.writeFormat setting + must be changed to Text before creating data files with Flume; otherwise, those files cannot be read by either + Impala or Hive. +
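In a Flume agent configuration file, the setting looks like the following (the agent and sink names are placeholders):

```
# flume.conf fragment; "agent1" and "hdfsSink" are hypothetical names
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.writeFormat = Text
```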

+ +

+ Resolution: This information has been requested to be added to the upstream Flume documentation. +

+ +
+ +
+ + + + + + Avro Scanner fails to parse some schemas + + + +

+ Querying certain Avro tables could cause a crash or return no rows, even though Impala could DESCRIBE the table. +

+ +

+ Bug: IMPALA-635 +

+ +

+ Workaround: Swap the order of the fields in the schema specification. For example, ["null", "string"] + instead of ["string", "null"]. +

+ +

+ Resolution: Not allowing this syntax agrees with the Avro specification, so it may still cause an error even when the + crashing issue is resolved. +

+ +
+ +
+ + + + + + Impala BE cannot parse Avro schema that contains a trailing semi-colon + + + +

+ If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried. +

+ +

+ Bug: IMPALA-1024 +

+ +

+ Workaround: Remove the trailing semicolon from the Avro schema. 

+ +
+ +
+ + + + + + Fix decompressor to allow parsing gzips with multiple streams + + + +

+ Currently, Impala can only read gzipped files containing a single stream. If a gzipped file contains multiple concatenated + streams, the Impala query only processes the data from the first stream. +

+ +

+ Bug: IMPALA-2154 +

+ +

+ Workaround: Use a different gzip tool to compress the file into a single-stream file. 

+ +

Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.

+ +
+ +
+ + + + + + Impala incorrectly handles text data when the newline character pair \n\r is split between different HDFS blocks + + + +

+ If a carriage return / newline pair of characters in a text table is split between HDFS data blocks, Impala incorrectly processes + the row following the \n\r pair twice. +

+ +

+ Bug: IMPALA-1578 +

+ +

+ Workaround: Use the Parquet format for large volumes of data where practical. +

+ +

Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0.

+ +
+ +
+ + + + + + Invalid bool value not reported as a scanner error + + + +

+ In some cases, an invalid BOOLEAN value read from a table does not produce a warning message about the bad value. + The result is still NULL as expected. Therefore, this is not a query correctness issue, but it could lead to + overlooking the presence of invalid data. +

+ +

+ Bug: IMPALA-1862 +

+ +
+ +
+ + + + + + Incorrect results with basic predicate on CHAR typed column. + + + +

+ When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the + comparison might fail when it should match. +

+ +

+ Bug: IMPALA-1652 +

+ +

+ Workaround: Use the RPAD() function to blank-pad literals compared with CHAR columns to + the expected length. +
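A sketch of the workaround, assuming a hypothetical table with a CHAR(10) column:

```sql
-- Pad the literal to the column's declared length so the comparison
-- matches the blank-padded CHAR value.
select * from t where char10_col = rpad('abc', 10, ' ');
```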

+ +
+ +
+ +
+ + + + Impala Known Issues: Limitations + + + +

+ These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management + workflow. +

+ +
+ + + + + + Impala does not support running on clusters with federated namespaces + + + +

+ Impala does not support running on clusters with federated namespaces. The impalad process will not start on a + node running such a filesystem based on the org.apache.hadoop.fs.viewfs.ViewFs class. +

+ +

+ Bug: IMPALA-77 +

+ +

+ Anticipated Resolution: Limitation +

+ +

+ Workaround: Use standard HDFS on all Impala nodes. +

+ +
+ +
+ +
+ + + + Impala Known Issues: Miscellaneous / Older Issues + + + +

+ These issues do not fall into one of the above categories or have not been categorized yet. +

+ +
+ + + + + + A failed CTAS does not drop the table if the insert fails. + + + +

+ If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying + the source table or copying the data, the new table is left behind rather than being dropped. +

+ +

+ Bug: IMPALA-2005 +

+ +

+ Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT. +

+ +
+ +
+ + + + + + Casting scenarios with invalid/inconsistent results + + + +

+ Using a CAST() function to convert large literal values to smaller types, or to convert special values such as + NaN or Inf, produces values not consistent with other database systems. This could lead to + unexpected results from queries. +

+ +

+ Bug: IMPALA-1821 +

+ + + +
+ +
+ + + + + + Support individual memory allocations larger than 1 GB + + + +

+ The largest single block of memory that Impala can allocate during a query is 1 GiB. Therefore, a query could fail or Impala could + crash if a compressed text file resulted in more than 1 GiB of data in uncompressed form, or if a string function such as + group_concat() returned a value greater than 1 GiB. +

+ +

+ Bug: IMPALA-1619 +

+ +

Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.3 / Impala 2.6.3.

+ +
+ +
+ + + + + + Impala Parser issue when using fully qualified table names that start with a number. + + + +

+ A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market, + the decimal point followed by digits is interpreted as a floating-point number. +

+ +

+ Bug: IMPALA-941 +

+ +

+ Workaround: Surround each part of the fully qualified name with backticks (``). +
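Using the example name from above, the workaround looks like:

```sql
select * from `db`.`571_market`;
```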

+ +
+ +
+ + + + + + Impala should tolerate bad locale settings + + + +

+ If the LC_* environment variables specify an unsupported locale, Impala does not start. +

+ +

+ Bug: IMPALA-532 +

+ +

+ Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore + daemon. See for details about modifying these environment settings. +

+ +

+ Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution. +

+ +
+ +
+ + + + + + Log Level 3 Not Recommended for Impala + + + +

+ The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues. +

+ +

+ Workaround: Reduce the log level to its default value of 1, that is, GLOG_v=1. See + for details about the effects of setting different logging levels. +

+ +
+ +
+ +
+ +
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_kudu.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_kudu.xml b/docs/topics/impala_kudu.xml
new file mode 100644
index 0000000..c530cc1
--- /dev/null
+++ b/docs/topics/impala_kudu.xml
@@ -0,0 +1,167 @@
+
+ Using Impala to Query Kudu Tables

+ Kudu + You can use Impala to query Kudu tables. This capability allows convenient access to a storage system that is + tuned for different kinds of workloads than the default with Impala. The default Impala tables use data files + stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, Kudu can do + efficient queries for data organized either in data warehouse style (with full table scans) or for OLTP-style + workloads (with key-based lookups for single rows or small ranges of values). +

+ +

+ Certain Impala SQL statements, such as UPDATE and DELETE, only work with + Kudu tables. These operations were impractical from a performance perspective to perform at large scale on + HDFS data, or on HBase tables. +

+ +
+ + + + Benefits of Using Kudu Tables with Impala + + + +

+ The combination of Kudu and Impala works best for tables where scan performance is important, but data
+ arrives continuously, in small batches, or needs to be updated without being completely replaced. In these
+ scenarios (such as for streaming data), it might be impractical to use Parquet tables because Parquet works
+ best with multi-megabyte data files, requiring substantial overhead to replace or reorganize data files to
+ accommodate frequent additions or changes to data. Impala can query Kudu tables with scan performance close
+ to that of Parquet, and Impala can also perform update or delete operations without replacing the entire
+ table contents. You can also use the Kudu API to do ingestion or transformation operations outside of
+ Impala, and Impala can query the current data at any time.

+ +
+ +
+ + + + Primary Key Columns for Kudu Tables + + + +

+ Kudu tables introduce the notion of primary keys to Impala for the first time. The primary key is made up + of one or more columns, whose values are combined and used as a lookup key during queries. These columns + cannot contain any NULL values or any duplicate values, and can never be updated. For a + partitioned Kudu table, all the partition key columns must come from the set of primary key columns. +

+ +

+ Impala itself still does not have the notion of unique or non-NULL constraints. These + restrictions on the primary key columns are enforced on the Kudu side. +

+ +

+ The primary key columns must be the first ones specified in the CREATE TABLE statement. + You specify which column or columns make up the primary key in the table properties, rather than through + attributes in the column list. +
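+ A minimal sketch of the approach described above, with invented table and column
+ names. The Kudu DDL syntax and the exact set of required table properties have
+ changed across releases (some mandatory properties are omitted here), so treat
+ the property names as illustrative rather than definitive:
+
+ ```sql
+ -- The primary key is named in the table properties, not as a
+ -- column attribute in the column list.
+ CREATE TABLE customers (
+   id BIGINT,
+   name STRING
+ )
+ DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
+ TBLPROPERTIES (
+   'kudu.table_name' = 'customers',
+   'kudu.key_columns' = 'id'
+ );
+ ```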

+ +

+ Kudu can do extra optimizations for queries that refer to the primary key columns in the + WHERE clause. It is not crucial though to include the primary key columns in the + WHERE clause of every query. The benefit is mainly for partitioned tables, + which divide the data among various tablet servers based on the distribution of + data values in some or all of the primary key columns. +

+ +
+ +
+ + + + Impala DML Support for Kudu Tables + + + +

+ Impala supports certain DML statements for Kudu tables only. The UPDATE and + DELETE statements let you modify data within Kudu tables without rewriting substantial + amounts of table data. +
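+ For example (hypothetical table and column names), each statement modifies only
+ the matching rows in place:
+
+ ```sql
+ -- Change a non-key column in a single row identified by its primary key.
+ UPDATE customers SET name = 'New Name' WHERE id = 1234;
+
+ -- Remove all rows matching a predicate.
+ DELETE FROM customers WHERE id > 1000000;
+ ```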

+ +

+ The INSERT statement for Kudu tables honors the unique and non-NULL + requirements for the primary key columns. +

+ +

+ Because Impala and Kudu do not support transactions, the effects of any INSERT, + UPDATE, or DELETE statement are immediately visible. For example, you + cannot do a sequence of UPDATE statements and only make the change visible after all the + statements are finished. Also, if a DML statement fails partway through, any rows that were already + inserted, deleted, or changed remain in the table; there is no rollback mechanism to undo the changes. +

+ +
+ +
+ + + + Partitioning for Kudu Tables + + + +

+ Kudu tables use special mechanisms to evenly distribute data among the underlying tablet servers. Although + we refer to such tables as partitioned tables, they are distinguished from traditional Impala partitioned + tables by use of different clauses on the CREATE TABLE statement. Partitioned Kudu tables + use DISTRIBUTE BY, HASH, RANGE, and SPLIT + ROWS clauses rather than the traditional PARTITIONED BY clause. All of the + columns involved in these clauses must be primary key columns. These clauses let you specify different ways + to divide the data for each column, or even for different value ranges within a column. This flexibility + lets you avoid problems with uneven distribution of data, where the partitioning scheme for HDFS tables + might result in some partitions being much larger than others. By setting up an effective partitioning + scheme for a Kudu table, you can ensure that the work for a query can be parallelized evenly across the + hosts in a cluster. +
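+ A sketch combining these clauses (table names, column names, and split values
+ are invented, some required table properties are omitted, and the exact syntax
+ varies by release):
+
+ ```sql
+ -- Hash-partition on one key column and range-partition on another,
+ -- so data spreads evenly across the tablet servers.
+ CREATE TABLE metrics (
+   host STRING,
+   ts BIGINT,
+   reading DOUBLE
+ )
+ DISTRIBUTE BY HASH (host) INTO 16 BUCKETS,
+   RANGE (ts) SPLIT ROWS ((1000000), (2000000))
+ TBLPROPERTIES (
+   'kudu.key_columns' = 'host,ts'
+ );
+ ```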

+ +
+ +
+ + + + Impala Query Performance for Kudu Tables + + + +

+ For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to + Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type + of optimization is especially effective for partitioned Kudu tables, where the Impala query + WHERE clause refers to one or more primary key columns that are also used as partition key + columns. For example, if a partitioned Kudu table uses a HASH clause for + col1 and a RANGE clause for col2, a query using a clause + such as WHERE col1 IN (1,2,3) AND col2 > 100 can determine exactly which tablet servers + contain relevant data, and therefore parallelize the query very efficiently. +

+ +
+ +
+ +
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_langref.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_langref.xml b/docs/topics/impala_langref.xml new file mode 100644 index 0000000..f81b76f --- /dev/null +++ b/docs/topics/impala_langref.xml @@ -0,0 +1,74 @@ + + + + + Impala SQL Language Reference + SQL Reference + + + + + + + + + + + + +

+ Impala uses SQL as its query language. To protect user investment in skills development and query + design, Impala provides a high degree of compatibility with the Hive Query Language (HiveQL): +

+ +
    +
  • + Because Impala uses the same metadata store as Hive to record information about table structure and + properties, Impala can access tables defined through the native Impala CREATE TABLE + command, or tables created using the Hive data definition language (DDL). +
  • + +
  • + Impala supports data manipulation (DML) statements similar to the DML component of HiveQL. +
  • + +
  • + Impala provides many built-in functions with the same + names and parameter types as their HiveQL equivalents. +
  • +
+ +

+ Impala supports most of the same statements and
+ clauses as HiveQL, including, but not limited to, JOIN, AGGREGATE,
+ DISTINCT, UNION ALL, ORDER BY, and LIMIT, as well as
+ (uncorrelated) subqueries in the FROM clause. Impala also supports INSERT
+ INTO and INSERT OVERWRITE.
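+ For example, a query combining several of these clauses (with invented table
+ and column names) runs the same way in Impala as in Hive:
+
+ ```sql
+ SELECT c.region, COUNT(DISTINCT o.order_id) AS num_orders
+ FROM customers c
+   JOIN orders o ON c.id = o.customer_id
+ GROUP BY c.region
+ ORDER BY num_orders DESC
+ LIMIT 10;
+ ```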

+ +

+ Impala supports data types with the same names and semantics as the equivalent Hive data types:
+ STRING, TINYINT, SMALLINT, INT,
+ BIGINT, FLOAT, DOUBLE, BOOLEAN,
+ and TIMESTAMP.

+ +

+ For full details about Impala SQL syntax and semantics, see + . +

+ +

+ Most HiveQL SELECT and INSERT statements run unmodified with Impala. For + information about Hive syntax not available in Impala, see + . +

+ +

+ For a list of the built-in functions available in Impala queries, see + . +

+ +

+ + http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_langref_sql.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_langref_sql.xml b/docs/topics/impala_langref_sql.xml new file mode 100644 index 0000000..18b6726 --- /dev/null +++ b/docs/topics/impala_langref_sql.xml @@ -0,0 +1,35 @@ + + + + + Impala SQL Statements + SQL Statements + + + + + + + + + + + +

+ The Impala SQL dialect supports a range of standard elements, plus some extensions for Big Data use cases + related to data loading and data warehousing. +

+ + +

+ In the impala-shell interpreter, a semicolon at the end of each statement is required. + Since the semicolon is not actually part of the SQL syntax, we do not include it in the syntax definition + of each statement, but we do show it in examples intended to be run in impala-shell. +

+
+ +

+ The following sections show the major SQL statements that you work with in Impala: +

+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_langref_unsupported.xml ---------------------------------------------------------------------- diff --git a/docs/topics/impala_langref_unsupported.xml b/docs/topics/impala_langref_unsupported.xml new file mode 100644 index 0000000..82910d6 --- /dev/null +++ b/docs/topics/impala_langref_unsupported.xml @@ -0,0 +1,312 @@ + + + + + SQL Differences Between Impala and Hive + + + + + + + + + + + + + +

+ Hive + HiveQL + Impala's SQL syntax follows the SQL-92 standard, and includes many industry extensions in areas such as + built-in functions. See for a general discussion of adapting SQL + code from a variety of database systems to Impala. +

+ +

+ Because Impala and Hive share the same metastore database and their tables are often used interchangeably, + the following section covers differences between Impala and Hive in detail. +

+ +

+ + + + + HiveQL Features not Available in Impala + + + +

+ The current release of Impala does not support the following SQL features that you might be familiar with + from HiveQL: +

+ + + +
    + + +
  • + Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes. +
  • + +
  • + The DATE data type. +
  • + +
  • + XML and JSON functions. +
  • + +
  • + Certain aggregate functions from HiveQL: covar_pop, covar_samp, + corr, percentile, percentile_approx, + histogram_numeric, collect_set; Impala supports the set of aggregate + functions listed in and analytic + functions listed in . +
  • + +
  • + Sampling. +
  • + +
  • + Lateral views. In and higher, Impala supports queries on complex types + (STRUCT, ARRAY, or MAP), using join notation + rather than the EXPLODE() keyword. + See for details about Impala support for complex types. +
  • + +
  • + Multiple DISTINCT clauses per query, although Impala includes some workarounds for this + limitation. + +
  • +
+ +

+ User-defined functions (UDFs) are supported starting in Impala 1.2. See + for full details on Impala UDFs. +

    +
  • +

    + Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs. +

    +
  • + +
  • +

    + Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently + support user-defined table generating functions (UDTFs). +

    +
  • + +
  • +

    + Only Impala-supported column types are supported in Java-based UDFs. +

    +
  • + +
  • +

    +

  • +
+

+ +

+ Impala does not currently support these HiveQL statements: +

+ +
    +
  • + ANALYZE TABLE (the Impala equivalent is COMPUTE STATS) +
  • + +
  • + DESCRIBE COLUMN +
  • + +
  • + DESCRIBE DATABASE +
  • + +
  • + EXPORT TABLE +
  • + +
  • + IMPORT TABLE +
  • + +
  • + SHOW TABLE EXTENDED +
  • + +
  • + SHOW INDEXES +
  • + +
  • + SHOW COLUMNS +
  • + +
  • + INSERT OVERWRITE DIRECTORY; use INSERT OVERWRITE table_name + or CREATE TABLE AS SELECT to materialize query results into the HDFS directory associated + with an Impala table. +
  • +
+
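+ The workaround in the last item looks like this in practice (table names are
+ hypothetical):
+
+ ```sql
+ -- Instead of INSERT OVERWRITE DIRECTORY, write into an existing table
+ -- whose HDFS directory then holds the query results:
+ INSERT OVERWRITE results SELECT * FROM raw_events WHERE year = 2016;
+
+ -- Or materialize the results into a brand-new table:
+ CREATE TABLE results_2016 AS SELECT * FROM raw_events WHERE year = 2016;
+ ```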
+
+ + + + Semantic Differences Between Impala and HiveQL Features + + + +

+ This section covers instances where Impala and Hive have similar functionality, sometimes including the + same syntax, but there are differences in the runtime semantics of those features. +

+ +

+ Security: +

+ +

+ Impala utilizes the Apache + Sentry authorization framework, which provides fine-grained role-based access control + to protect data against unauthorized access or tampering. +

+ +

+ The Hive component included in CDH 5.1 and higher now includes Sentry-enabled GRANT,
+ REVOKE, and CREATE/DROP ROLE statements. Earlier Hive releases had a
+ privilege system with GRANT and REVOKE statements that was primarily
+ intended to prevent accidental deletion of data, rather than a security mechanism to protect against
+ malicious users.

+ +

+ Impala can make use of privileges set up through Hive GRANT and REVOKE statements. + Impala has its own GRANT and REVOKE statements in Impala 2.0 and higher. + See for the details of authorization in Impala, including + how to switch from the original policy file-based privilege model to the Sentry service using privileges + stored in the metastore database. +

+ +

+ SQL statements and clauses: +

+ +

+ The semantics of Impala SQL statements varies from HiveQL in some cases where they use similar SQL + statement and clause names: +

+ +
    +
  • + Impala uses different syntax and names for query hints, [SHUFFLE] and + [NOSHUFFLE] rather than MapJoin or StreamJoin. See + for the Impala details. +
  • + +
  • + Impala does not expose MapReduce specific features of SORT BY, DISTRIBUTE + BY, or CLUSTER BY. +
  • + +
  • + Impala does not require queries to include a FROM clause. +
  • +
+ +

+ Data types: +

+ +
    +
  • + Impala supports a limited set of implicit casts. This can help avoid undesired results from unexpected + casting behavior. +
      +
    • + Impala does not implicitly cast between string and numeric or Boolean types. Always use + CAST() for these conversions. +
    • + +
    • + Impala does perform implicit casts among the numeric types, when going from a smaller or less precise + type to a larger or more precise one. For example, Impala will implicitly convert a + SMALLINT to a BIGINT or FLOAT, but to convert from + DOUBLE to FLOAT or INT to TINYINT + requires a call to CAST() in the query. +
    • + +
    • + Impala does perform implicit casts from string to timestamp. Impala has a restricted set of literal + formats for the TIMESTAMP data type and the from_unixtime() format + string; see for details. +
    • +
    +

    + See for full details on implicit and explicit casting for + all types, and for details about + the CAST() function. +

    +
  • + +
  • + Impala does not store or interpret timestamps using the local timezone, to avoid undesired results from + unexpected time zone issues. Timestamps are stored and interpreted relative to UTC. This difference can + produce different results for some calls to similarly named date/time functions between Impala and Hive. + See for details about the Impala + functions. See for a discussion of how Impala handles + time zones, and configuration options you can use to make Impala match the Hive behavior more closely + when dealing with Parquet-encoded TIMESTAMP data or when converting between + the local time zone and UTC. +
  • + +
  • + The Impala TIMESTAMP type can represent dates ranging from 1400-01-01 to 9999-12-31. + This is different from the Hive date range, which is 0000-01-01 to 9999-12-31. +
  • + +
  • +

    +

  • + +
+ +
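+ The casting rules above, illustrated with invented table and column names:
+
+ ```sql
+ -- String-to-number conversion must be explicit:
+ SELECT CAST(price_text AS DOUBLE) FROM products_staging;
+
+ -- Widening numeric conversions happen implicitly
+ -- (the SMALLINT column promotes to BIGINT here):
+ SELECT smallint_col + bigint_col FROM t1;
+
+ -- Narrowing conversions require an explicit CAST():
+ SELECT CAST(double_col AS FLOAT) FROM t1;
+ ```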

+ Miscellaneous features: +

+ +
    +
  • + Impala does not provide virtual columns. +
  • + +
  • + Impala does not expose locking. +
  • + +
  • + Impala does not expose some configuration properties. +
  • +
+
+
+