Return-Path:
X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io
Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io
Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183])
by cust-asf2.ponee.io (Postfix) with ESMTP id 8A50A200BCE
for ; Fri, 18 Nov 2016 00:12:07 +0100 (CET)
Received: by cust-asf.ponee.io (Postfix)
id 88B8C160B1E; Thu, 17 Nov 2016 23:12:07 +0000 (UTC)
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
by cust-asf.ponee.io (Postfix) with SMTP id 46B58160B1C
for ; Fri, 18 Nov 2016 00:12:05 +0100 (CET)
Received: (qmail 41675 invoked by uid 500); 17 Nov 2016 23:12:04 -0000
Mailing-List: contact commits-help@impala.incubator.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@impala.incubator.apache.org
Delivered-To: mailing list commits@impala.incubator.apache.org
Received: (qmail 41666 invoked by uid 99); 17 Nov 2016 23:12:04 -0000
Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142)
by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Nov 2016 23:12:04 +0000
Received: from localhost (localhost [127.0.0.1])
by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id E4D2BC04FE
for ; Thu, 17 Nov 2016 23:12:03 +0000 (UTC)
X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org
X-Spam-Flag: NO
X-Spam-Score: -6.218
X-Spam-Level:
X-Spam-Status: No, score=-6.218 tagged_above=-999 required=6.31
tests=[KAM_ASCII_DIVIDERS=0.8, KAM_LAZY_DOMAIN_SECURITY=1,
RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01,
RP_MATCHES_RCVD=-2.999, URIBL_BLOCKED=0.001] autolearn=disabled
Received: from mx1-lw-eu.apache.org ([10.40.0.8])
by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024)
with ESMTP id ht2sYyE2qgzZ for ;
Thu, 17 Nov 2016 23:11:57 +0000 (UTC)
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with SMTP id EC35D60D99
for ; Thu, 17 Nov 2016 23:11:40 +0000 (UTC)
Received: (qmail 40123 invoked by uid 99); 17 Nov 2016 23:11:40 -0000
Received: from git1-us-west.apache.org (HELO git1-us-west.apache.org) (140.211.11.23)
by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Nov 2016 23:11:40 +0000
Received: by git1-us-west.apache.org (ASF Mail Server at git1-us-west.apache.org, from userid 33)
id 0346BF1745; Thu, 17 Nov 2016 23:11:40 +0000 (UTC)
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
From: jbapple@apache.org
To: commits@impala.incubator.apache.org
Date: Thu, 17 Nov 2016 23:12:00 -0000
Message-Id:
In-Reply-To:
References:
X-Mailer: ASF-Git Admin Mailer
Subject: [22/51] [partial] incubator-impala git commit: IMPALA-3398: Add docs
to main Impala branch.
archived-at: Thu, 17 Nov 2016 23:12:07 -0000
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_known_issues.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_known_issues.xml b/docs/topics/impala_known_issues.xml
new file mode 100644
index 0000000..e57ec62
--- /dev/null
+++ b/docs/topics/impala_known_issues.xml
@@ -0,0 +1,1812 @@
+
+
+
+
+ Known Issues and Workarounds in Impala / Apache Impala (incubating) Known Issues
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ The following sections describe known issues and workarounds in Impala, as of the current production release. This page summarizes the
+ most serious or frequently encountered issues in the current release, to help you make planning decisions about installing and
+ upgrading. Any workarounds are listed here. The bug links take you to the Impala issues site, where you can see the diagnosis and
+ whether a fix is in the pipeline.
+
+
+
+ The online issue tracking system for Impala contains comprehensive information and is updated in real time. To verify whether an issue
+ you are experiencing has already been reported, or which release an issue is fixed in, search on the
+ issues.cloudera.org JIRA tracker.
+
+
+
+
+
+ For issues fixed in various Impala releases, see .
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Crashes and Hangs
+
+
+
+
+ These issues can cause Impala to quit or become unresponsive.
+
+
+
+
+
+
+ Setting BATCH_SIZE query option too large can cause a crash
+
+
+
+
+ Using a value in the millions for the BATCH_SIZE query option, together with wide rows or large string values in
+ columns, could cause a memory allocation of more than 2 GB resulting in a crash.
+
+
+
+ Bug: IMPALA-3069
+
+
+
+ Severity: High
+
+
+ Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0.
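On releases without the fix, the practical mitigation is simply to keep BATCH_SIZE at a moderate value. A sketch, with a hypothetical table name:

```sql
-- Values in the millions, combined with wide rows, risked a >2 GB allocation.
SET BATCH_SIZE=1024;        -- moderate value; 0 selects the built-in default
SELECT * FROM wide_table;   -- wide_table is a hypothetical table
SET BATCH_SIZE=0;           -- restore the default afterwards
```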
+
+
+
+
+
+
+
+
+
+
+
+
+ Malformed Avro data, such as out-of-bounds integers or values in the wrong format, could cause a crash when queried.
+
+
+
+ Bug: IMPALA-3441
+
+
+
+ Severity: High
+
+
+ Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.2 / Impala 2.6.2.
+
+
+
+
+
+
+
+ Queries may hang on server-to-server exchange errors
+
+
+
+
+ The DataStreamSender::Channel::CloseInternal() does not close the channel on an error. This causes the node on
+ the other side of the channel to wait indefinitely, causing a hang.
+
+
+
+ Bug: IMPALA-2592
+
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
+
+
+
+
+
+
+
+
+ impalad crashes if the UDF JAR is not available in the HDFS location the first time
+
+
+
+
+ If the JAR file corresponding to a Java UDF is removed from HDFS after the Impala CREATE FUNCTION statement is
+ issued, the impalad daemon crashes.
+
+
+
+ Bug: IMPALA-2365
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Performance
+
+
+
+
+ These issues involve the performance of operations such as queries or DDL statements.
+
+
+
+
+
+
+
+
+ Slow DDL statements for tables with large number of partitions
+
+
+
+
+ DDL statements for tables with a large number of partitions might be slow.
+
+
+
+ Bug: IMPALA-1480
+
+
+
+ Workaround: Run the DDL statement in Hive if the slowness is an issue.
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Usability
+
+
+
+
+ These issues affect the convenience of interacting directly with Impala, typically through the Impala shell or Hue.
+
+
+
+
+
+
+ Unexpected privileges in show output
+
+
+
+
+ Due to a timing condition in updating cached policy data from Sentry, the SHOW statements for Sentry roles could
+ sometimes display out-of-date role settings. Because Impala rechecks authorization for each SQL statement, this discrepancy does
+ not represent a security issue for other statements.
+
+
+
+ Bug: IMPALA-3133
+
+
+
+ Severity: High
+
+
+
+ Resolution: Fixes have been issued for some but not all CDH / Impala releases. Check the JIRA for details of fix releases.
+
+
+ Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0 and CDH 5.7.1 / Impala 2.5.1.
+
+
+
+
+
+
+
+ Less than 100% progress on completed simple SELECT queries
+
+
+
+
+ Simple SELECT queries show less than 100% progress even though they are already completed.
+
+
+
+ Bug: IMPALA-1776
+
+
+
+
+
+
+
+
+ Unexpected column overflow behavior with INT datatypes
+
+
+
+
+
+
+ Bug:
+ IMPALA-3123
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: JDBC and ODBC Drivers
+
+
+
+
+ These issues affect applications that use the JDBC or ODBC APIs, such as business intelligence tools or custom-written applications
+ in languages such as Java or C++.
+
+
+
+
+
+
+
+
+ ImpalaODBC: Can not get the value in the SQLGetData(m-x th column) after the SQLBindCol(m th column)
+
+
+
+
+ If the ODBC SQLGetData is called on a series of columns, the function calls must follow the same order as the
+ columns. For example, if data is fetched from column 2 then column 1, the SQLGetData call for column 1 returns
+ NULL.
+
+
+
+ Bug: IMPALA-1792
+
+
+
+ Workaround: Fetch columns in the same order they are defined in the table.
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Security
+
+
+
+
+ These issues relate to security features, such as Kerberos authentication, Sentry authorization, encryption, auditing, and
+ redaction.
+
+
+
+
+
+
+
+
+ impala-shell requires Python with ssl module
+
+
+
+
+ On CentOS 5.10 and Oracle Linux 5.11 using the built-in Python 2.4, invoking the impala-shell with the
+ --ssl option might fail with the following error:
+
+
+
+Unable to import the python 'ssl' module. It is required for an SSL-secured connection.
+
+
+
+
+
+ Severity: Low, workaround available
+
+
+
+ Resolution: Customers are less likely to experience this issue over time, because the ssl module is included
+ in newer Python releases packaged with recent Linux releases.
+
+
+
+ Workaround: To use SSL with impala-shell on these platform versions, install the ssl
+ Python module:
+
+
+
+yum install python-ssl
+
+
+
+ Then impala-shell can run when using SSL. For example:
+
+
+
+impala-shell -s impala --ssl --ca_cert /path_to_truststore/truststore.pem
+
+
+
+
+
+
+
+
+
+
+ Kerberos tickets must be renewable
+
+
+
+
+ In a Kerberos environment, the impalad daemon might not start if Kerberos tickets are not renewable.
+
+
+
+ Workaround: Configure your KDC to allow tickets to be renewed, and configure krb5.conf to request
+ renewable tickets.
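A minimal sketch of the kind of configuration intended; the exact lifetimes are deployment-specific assumptions:

```ini
# /etc/krb5.conf -- request renewable tickets (values illustrative)
[libdefaults]
  ticket_lifetime = 24h
  renew_lifetime  = 7d
```

On the KDC side, the ticket-granting principal must also permit renewal (for example, via kadmin's maxrenewlife setting on the krbtgt principal for the realm).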
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Resources
+
+
+
+
+ These issues involve memory or disk usage, including out-of-memory conditions, the spill-to-disk feature, and resource management
+ features.
+
+
+
+
+
+
+ Impala catalogd heap issues when upgrading to 5.7
+
+
+
+
+ The default heap size for Impala catalogd changed in CDH 5.7.0 and higher:
+
+
+
+ -
+
+ Before CDH 5.7, catalogd used the JVM's default heap size, which is the smaller of 1/4 of the
+ physical memory or 32 GB.
+
+
+
+ -
+
+ Starting with CDH 5.7.0, the default catalogd heap size is 4 GB.
+
+
+
+
+
+ For example, on a host with 128 GB of physical memory, this change decreases the catalogd heap from 32 GB to 4 GB. This can result
+ in out-of-memory errors in catalogd, leading to query failures.
+
+
+
+ Bug: TSB-168
+
+
+
+ Severity: High
+
+
+
+ Workaround: Increase the catalogd memory limit as follows.
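As an illustrative sketch (assuming catalogd picks up JAVA_TOOL_OPTIONS from its startup environment; the 8 GB figure is an assumption to be sized to your metadata volume):

```shell
# Illustrative only: restore roughly the pre-5.7 sizing by raising the
# catalogd JVM heap limit in the environment used to (re)start the daemon.
export JAVA_TOOL_OPTIONS="-Xmx8g"
```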
+
+
+
+
+
+
+
+
+
+
+
+
+ Breakpad minidumps can be very large when the thread count is high
+
+
+
+
+ The size of the breakpad minidump files grows linearly with the number of threads. By default, each thread adds 8 KB to the
+ minidump size. Minidump files could consume significant disk space when the daemons have a high number of threads.
+
+
+
+ Bug: IMPALA-3509
+
+
+
+ Severity: High
+
+
+
+ Workaround: Add --minidump_size_limit_hint_kb=size to set a soft upper limit on the
+ size of each minidump file. If the minidump file would exceed that limit, Impala reduces the amount of information for each thread
+ from 8 KB to 2 KB. (Full thread information is captured for the first 20 threads, then 2 KB per thread after that.) The minidump
+ file can still grow larger than the hinted size. For example, if you have 10,000 threads, the minidump file can be more
+ than 20 MB.
+
+
+
+
+
+
+
+
+ Parquet scanner memory increase after IMPALA-2736
+
+
+
+
+ The initial release of sometimes has a higher peak memory usage than in previous releases while reading
+ Parquet files.
+
+
+
+ addresses the issue IMPALA-2736, which improves the efficiency of Parquet scans by up to 2x. The faster scans
+ may result in a higher peak memory consumption compared to earlier versions of Impala due to the new column-wise row
+ materialization strategy. You are likely to experience higher memory consumption in any of the following scenarios:
+
+ -
+
+ Very wide rows due to projecting many columns in a scan.
+
+
+
+ -
+
+ Very large rows due to big column values, for example, long strings or nested collections with many items.
+
+
+
+ -
+
+ Producer/consumer speed imbalances, leading to more rows being buffered between a scan (producer) and downstream (consumer)
+ plan nodes.
+
+
+
+
+
+
+ Bug: IMPALA-3662
+
+
+
+ Severity: High
+
+
+
+ Workaround: The following query options might help to reduce memory consumption in the Parquet scanner:
+
+ -
+ Reduce the number of scanner threads, for example: set num_scanner_threads=30
+
+
+ -
+ Reduce the batch size, for example: set batch_size=512
+
+
+ -
+ Increase the memory limit, for example: set mem_limit=64g
+
+
+
+
+
+
+
+
+
+
+ Process mem limit does not account for the JVM's memory usage
+
+
+
+
+
+
+ Some memory allocated by the JVM used internally by Impala is not counted against the memory limit for the
+ impalad daemon.
+
+
+
+ Bug: IMPALA-691
+
+
+
+ Workaround: To monitor overall memory usage, use the top command, or add the memory figures in the
+ Impala web UI /memz tab to JVM memory usage shown on the /metrics tab.
+
+
+
+
+
+
+
+
+
+
+ Fix issues with the legacy join and agg nodes using --enable_partitioned_hash_join=false and --enable_partitioned_aggregation=false
+
+
+
+
+
+
+ Bug: IMPALA-2375
+
+
+
+ Workaround: Transition away from the old-style join and aggregation mechanism if practical.
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Correctness
+
+
+
+
+ These issues can cause incorrect or unexpected results from queries. They typically only arise in very specific circumstances.
+
+
+
+
+
+
+ Incorrect assignment of NULL checking predicate through an outer join of a nested collection.
+
+
+
+
+ A query could return wrong results (too many or too few NULL values) if it referenced an outer-joined nested
+ collection and also contained a null-checking predicate (IS NULL, IS NOT NULL, or the
+ <=> operator) in the WHERE clause.
+
+
+
+ Bug: IMPALA-3084
+
+
+
+ Severity: High
+
+
+ Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0.
+
+
+
+
+
+
+
+ Incorrect result due to constant evaluation in query with outer join
+
+
+
+
+ An OUTER JOIN query could omit some expected result rows due to a constant such as FALSE in
+ another join clause. For example:
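A hypothetical query of the affected shape (table and column names here are assumptions, not taken from the original report): the constant FALSE conjunct in one ON clause could cause expected rows from the other outer join to be omitted.

```sql
-- Hypothetical illustration of the affected query shape.
SELECT t1.id
FROM t1 LEFT OUTER JOIN t2 ON (t1.id = t2.id AND FALSE)
        LEFT OUTER JOIN t3 ON (t1.id = t3.id);
```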
+
+
+
+
+
+
+ Bug: IMPALA-3094
+
+
+
+ Severity: High
+
+
+
+ Resolution:
+
+
+
+ Workaround:
+
+
+
+
+
+
+
+
+ Incorrect assignment of an inner join On-clause predicate through an outer join.
+
+
+
+
+ Impala may return incorrect results for queries that have the following properties:
+
+
+
+ -
+
+ There is an INNER JOIN following a series of OUTER JOINs.
+
+
+
+ -
+
+ The INNER JOIN has an On-clause with a predicate that references at least two tables that are on the nullable side of the
+ preceding OUTER JOINs.
+
+
+
+
+
+ The following query demonstrates the issue:
+
+
+
+select 1 from functional.alltypes a left outer join
+ functional.alltypes b on a.id = b.id left outer join
+ functional.alltypes c on b.id = c.id right outer join
+ functional.alltypes d on c.id = d.id inner join functional.alltypes e
+on b.int_col = c.int_col;
+
+
+
+ The following listing shows the incorrect EXPLAIN plan:
+
+
+ c.id |
+| | |
+| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] |
+| | hash predicates: b.id = a.id |
+| | runtime filters: RF002 <- a.id |
+| | |
+| |--10:EXCHANGE [HASH(a.id)] |
+| | | |
+| | 00:SCAN HDFS [functional.alltypes a] |
+| | partitions=24/24 files=24 size=478.45KB |
+| | |
+| 09:EXCHANGE [HASH(b.id)] |
+| | |
+| 01:SCAN HDFS [functional.alltypes b] |
+| partitions=24/24 files=24 size=478.45KB |
+| runtime filters: RF001 -> b.int_col, RF002 -> b.id |
++-----------------------------------------------------------+
+]]>
+
+
+
+ Bug: IMPALA-3126
+
+
+
+ Severity: High
+
+
+
+ Workaround:
+
+
+
+ For some queries, this problem can be worked around by placing the problematic ON clause predicate in the
+ WHERE clause instead, or changing the preceding OUTER JOINs to INNER JOINs (if
+ the ON clause predicate would discard NULLs). For example, to fix the problematic query above:
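One way to express the first suggested rewrite, moving the On-clause predicate into the WHERE clause (the final join then carries no join predicate, written here as CROSS JOIN; this sketch assumes the predicate's NULL-filtering behavior is acceptable for the intended results):

```sql
-- Sketch of the WHERE-clause rewrite of the query shown earlier.
SELECT 1 FROM functional.alltypes a
  LEFT OUTER JOIN functional.alltypes b ON a.id = b.id
  LEFT OUTER JOIN functional.alltypes c ON b.id = c.id
  RIGHT OUTER JOIN functional.alltypes d ON c.id = d.id
  CROSS JOIN functional.alltypes e
WHERE b.int_col = c.int_col;
```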
+
+
+ c.id |
+| | |
+| 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED] |
+| | hash predicates: b.id = a.id |
+| | runtime filters: RF001 <- a.id |
+| | |
+| |--10:EXCHANGE [HASH(a.id)] |
+| | | |
+| | 00:SCAN HDFS [functional.alltypes a] |
+| | partitions=24/24 files=24 size=478.45KB |
+| | |
+| 09:EXCHANGE [HASH(b.id)] |
+| | |
+| 01:SCAN HDFS [functional.alltypes b] |
+| partitions=24/24 files=24 size=478.45KB |
+| runtime filters: RF001 -> b.id |
++-----------------------------------------------------------+
+]]>
+
+
+
+
+
+
+
+
+ Impala may use incorrect bit order with BIT_PACKED encoding
+
+
+
+
+ Parquet BIT_PACKED encoding as implemented by Impala is LSB first. The Parquet standard says it is MSB first.
+
+
+
+ Bug: IMPALA-3006
+
+
+
+ Severity: High, but rare in practice because BIT_PACKED is infrequently used, is not written by Impala, and is deprecated
+ in Parquet 2.0.
+
+
+
+
+
+
+
+
+ BST between 1972 and 1995
+
+
+
+
+ The calculation of start and end times for the BST (British Summer Time) time zone could be incorrect between 1972 and 1995.
+ Between 1972 and 1995, BST began and ended at 02:00 GMT on the third Sunday in March (or second Sunday when Easter fell on the
+ third) and fourth Sunday in October. For example, both function calls should return 13, but actually return 12, in a query such
+ as:
+
+
+
+select
+ extract(from_utc_timestamp(cast('1970-01-01 12:00:00' as timestamp), 'Europe/London'), "hour") summer70start,
+ extract(from_utc_timestamp(cast('1970-12-31 12:00:00' as timestamp), 'Europe/London'), "hour") summer70end;
+
+
+
+ Bug: IMPALA-3082
+
+
+
+ Severity: High
+
+
+
+
+
+
+
+
+ parse_url() returns incorrect result if @ character in URL
+
+
+
+
+ If a URL contains an @ character, the parse_url() function could return an incorrect value for
+ the hostname field.
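A hypothetical illustration (the URL is invented for this sketch); on affected releases, the HOST part extracted from a URL containing @ could be wrong:

```sql
-- The expected hostname is 'example.com'; affected releases could return
-- an incorrect value because of the '@' character in the URL.
SELECT parse_url('http://user@example.com/index.html', 'HOST');
```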
+
+
+
+ Bug: IMPALA-1170
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.
+
+
+
+
+
+
+
+ % escaping does not work correctly when it occurs at the end of a LIKE clause
+
+
+
+
+ If the final character in the RHS argument of a LIKE operator is an escaped \% character, it
+ does not match a % final character of the LHS argument.
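A minimal illustration (values invented for this sketch): the escaped \% at the end of the pattern should match a literal trailing %, but on affected releases it does not:

```sql
SELECT '100%' LIKE '100\%';   -- should be TRUE; affected releases return FALSE
```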
+
+
+
+ Bug: IMPALA-2422
+
+
+
+
+
+
+
+
+ ORDER BY rand() does not work.
+
+
+
+
+ Because the value for rand() is computed early in a query, using an ORDER BY expression
+ involving a call to rand() does not actually randomize the results.
+
+
+
+ Bug: IMPALA-397
+
+
+
+
+
+
+
+
+ Duplicated column in inline view causes dropping null slots during scan
+
+
+
+
+ If the same column is queried twice within a view, NULL values for that column are omitted. For example, the
+ result of COUNT(*) on the view could be less than expected.
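A minimal hypothetical illustration (the table t and its nullable column c1 are assumptions):

```sql
-- On affected releases, rows where c1 IS NULL could be dropped,
-- making this count lower than SELECT COUNT(*) FROM t.
SELECT COUNT(*) FROM (SELECT c1, c1 AS c1_again FROM t) v;
```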
+
+
+
+ Bug: IMPALA-2643
+
+
+
+ Workaround: Avoid selecting the same column twice within an inline view.
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.10 / Impala 2.2.10.
+
+
+
+
+
+
+
+
+
+ Incorrect assignment of predicates through an outer join in an inline view.
+
+
+
+
+ A query involving an OUTER JOIN clause where one of the table references is an inline view might apply predicates
+ from the ON clause incorrectly.
+
+
+
+ Bug: IMPALA-1459
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.
+
+
+
+
+
+
+
+ Crash: impala::Coordinator::ValidateCollectionSlots
+
+
+
+
+ A query could encounter a serious error if it includes multiple nested levels of INNER JOIN clauses involving
+ subqueries.
+
+
+
+ Bug: IMPALA-2603
+
+
+
+
+
+
+
+
+ Incorrect assignment of On-clause predicate inside inline view with an outer join.
+
+
+
+
+ A query might return incorrect results due to wrong predicate assignment in the following scenario:
+
+
+
+ -
+ There is an inline view that contains an outer join
+
+
+ -
+ That inline view is joined with another table in the enclosing query block
+
+
+ -
+ That join has an On-clause containing a predicate that only references columns originating from the outer-joined tables inside
+ the inline view
+
+
+
+
+ Bug: IMPALA-2665
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0, CDH 5.5.2 / Impala 2.3.2, and CDH 5.4.9 / Impala 2.2.9.
+
+
+
+
+
+
+
+ Wrong assignment of having clause predicate across outer join
+
+
+
+
+ In an OUTER JOIN query with a HAVING clause, the comparison from the HAVING
+ clause might be applied at the wrong stage of query processing, leading to incorrect results.
+
+
+
+ Bug: IMPALA-2144
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
+
+
+
+
+
+
+
+ Wrong plan of NOT IN aggregate subquery when a constant is used in subquery predicate
+
+
+
+
+ A NOT IN operator with a subquery that calls an aggregate function, such as NOT IN (SELECT
+ SUM(...)), could return incorrect results.
+
+
+
+ Bug: IMPALA-2093
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0 and CDH 5.5.4 / Impala 2.3.4.
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Metadata
+
+
+
+
+ These issues affect how Impala interacts with metadata. They cover areas such as the metastore database, the COMPUTE
+ STATS statement, and the Impala catalogd daemon.
+
+
+
+
+
+
+ Catalogd may crash when loading metadata for tables with many partitions, many columns and with incremental stats
+
+
+
+
+ Incremental stats use up about 400 bytes per partition for each column. For example, for a table with 20K partitions and 100
+ columns, the memory overhead from incremental statistics is about 800 MB. When serialized for transmission across the network,
+ this metadata exceeds the 2 GB Java array size limit and leads to a catalogd crash.
+
+
+
+ Bugs: IMPALA-2647,
+ IMPALA-2648,
+ IMPALA-2649
+
+
+
+ Workaround: If feasible, compute full stats periodically and avoid computing incremental stats for that table. The
+ scalability of incremental stats computation is a continuing work item.
+
+
+
+
+
+
+
+
+
+
+ Can't update stats manually via alter table after upgrading to CDH 5.2
+
+
+
+
+
+
+ Bug: IMPALA-1420
+
+
+
+ Workaround: On CDH 5.2, when adjusting table statistics manually by setting the numRows, you must also
+ enable the Boolean property STATS_GENERATED_VIA_STATS_TASK. For example, use a statement like the following to
+ set both properties with a single ALTER TABLE statement:
+
+
+ALTER TABLE table_name SET TBLPROPERTIES('numRows'='new_value', 'STATS_GENERATED_VIA_STATS_TASK' = 'true');
+
+
+ Resolution: The underlying cause is the issue
+ HIVE-8648 that affects the
+ metastore in Hive 0.13. The workaround is only needed until the fix for this issue is incorporated into a CDH release.
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Interoperability
+
+
+
+
+ These issues affect the ability to interchange data between Impala and other database systems. They cover areas such as data types
+ and file formats.
+
+
+
+
+
+
+
+
+ DESCRIBE FORMATTED gives error on Avro table
+
+
+
+
+ This issue can occur either on old Avro tables (created prior to Hive 1.1 / CDH 5.4) or when changing the Avro schema file by
+ adding or removing columns. Columns added to the schema file will not show up in the output of the DESCRIBE
+ FORMATTED command. Removing columns from the schema file will trigger a NullPointerException.
+
+
+
+ As a workaround, you can use the output of SHOW CREATE TABLE to drop and recreate the table. This will populate
+ the Hive metastore database with the correct column definitions.
+
+
+
+ Only use this for external tables, or Impala will remove the data files. In case of an internal table, set it to external first:
+
+ALTER TABLE table_name SET TBLPROPERTIES('EXTERNAL'='TRUE');
+
+ (The part in parentheses is case sensitive.) Make sure to pick the right choice between internal and external when recreating the
+ table. See for the differences between internal and external tables.
+
+
+
+ Bug: CDH-41605
+
+
+
+ Severity: High
+
+
+
+
+
+
+
+
+
+
+ Deviation from Hive behavior: Impala does not do implicit casts between string and numeric and boolean types.
+
+
+
+
+ Cloudera Bug:
+
+
+
+ Anticipated Resolution: None
+
+
+
+ Workaround: Use explicit casts.
+
+
+
+
+
+
+
+
+
+
+ Deviation from Hive behavior: Out-of-range float/double values are returned as the maximum allowed value of the type (Hive returns NULL)
+
+
+
+
+ Impala behavior differs from Hive with respect to out-of-range float/double values: out-of-range values are returned as the maximum
+ allowed value of the type, whereas Hive returns NULL.
+
+
+
+ Cloudera Bug: IMPALA-175
+
+
+
+ Workaround: None
+
+
+
+
+
+
+
+
+
+
+ Configuration needed for Flume to be compatible with Impala
+
+
+
+
+ For compatibility with Impala, the value for the Flume HDFS Sink hdfs.writeFormat must be set to
+ Text, rather than its default value of Writable. The hdfs.writeFormat setting
+ must be changed to Text before creating data files with Flume; otherwise, those files cannot be read by either
+ Impala or Hive.
+
+
+
+ Resolution: This information has been requested to be added to the upstream Flume documentation.
+
+
+
+
+
+
+
+
+
+
+ Avro Scanner fails to parse some schemas
+
+
+
+
+ Querying certain Avro tables could cause a crash or return no rows, even though Impala could DESCRIBE the table.
+
+
+
+ Bug: IMPALA-635
+
+
+
+ Workaround: Swap the order of the fields in the schema specification. For example, ["null", "string"]
+ instead of ["string", "null"].
+
+
+
+ Resolution: Not allowing this syntax agrees with the Avro specification, so it may still cause an error even when the
+ crashing issue is resolved.
+
+
+
+
+
+
+
+
+
+
+ Impala BE cannot parse Avro schema that contains a trailing semi-colon
+
+
+
+
+ If an Avro table has a schema definition with a trailing semicolon, Impala encounters an error when the table is queried.
+
+
+
+ Bug: IMPALA-1024
+
+
+
+ Workaround: Remove the trailing semicolon from the Avro schema.
+
+
+
+
+
+
+
+
+
+
+ Fix decompressor to allow parsing gzips with multiple streams
+
+
+
+
+ Currently, Impala can only read gzipped files containing a single stream. If a gzipped file contains multiple concatenated
+ streams, the Impala query only processes the data from the first stream.
+
+
+
+ Bug: IMPALA-2154
+
+
+
+ Workaround: Use a different gzip tool to compress file to a single stream file.
+
+
+ Resolution: Fixed in CDH 5.7.0 / Impala 2.5.0.
+
+
+
+
+
+
+
+
+
+ Impala incorrectly handles text data when the newline sequence \n\r is split across HDFS blocks
+
+
+
+
+ If a carriage return / newline pair of characters in a text table is split between HDFS data blocks, Impala incorrectly processes
+ the row following the \n\r pair twice.
+
+
+
+ Bug: IMPALA-1578
+
+
+
+ Workaround: Use the Parquet format for large volumes of data where practical.
+
+
+ Resolution: Fixed in CDH 5.8.0 / Impala 2.6.0.
+
+
+
+
+
+
+
+
+
+ Invalid bool value not reported as a scanner error
+
+
+
+
+ In some cases, an invalid BOOLEAN value read from a table does not produce a warning message about the bad value.
+ The result is still NULL as expected. Therefore, this is not a query correctness issue, but it could lead to
+ overlooking the presence of invalid data.
+
+
+
+ Bug: IMPALA-1862
+
+
+
+
+
+
+
+
+
+
+ Incorrect results with basic predicate on CHAR typed column.
+
+
+
+
+ When comparing a CHAR column value to a string literal, the literal value is not blank-padded and so the
+ comparison might fail when it should match.
+
+
+
+ Bug: IMPALA-1652
+
+
+
+ Workaround: Use the RPAD() function to blank-pad literals compared with CHAR columns to
+ the expected length.
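A sketch of the workaround, assuming a table t with a CHAR(10) column c (both names hypothetical):

```sql
-- Pad the literal to the declared CHAR length so the comparison matches.
SELECT * FROM t WHERE c = RPAD('abc', 10, ' ');
```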
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Limitations
+
+
+
+
+ These issues are current limitations of Impala that require evaluation as you plan how to integrate Impala into your data management
+ workflow.
+
+
+
+
+
+
+
+
+ Impala does not support running on clusters with federated namespaces
+
+
+
+
+ Impala does not support running on clusters with federated namespaces. The impalad process will not start on a
+ node running such a filesystem, which is based on the org.apache.hadoop.fs.viewfs.ViewFs class.
+
+
+
+ Bug: IMPALA-77
+
+
+
+ Anticipated Resolution: Limitation
+
+
+
+ Workaround: Use standard HDFS on all Impala nodes.
+
+
+
+
+
+
+
+
+
+
+ Impala Known Issues: Miscellaneous / Older Issues
+
+
+
+
+ These issues do not fall into one of the above categories or have not been categorized yet.
+
+
+
+
+
+
+
+
+ A failed CTAS does not drop the table if the insert fails.
+
+
+
+
+ If a CREATE TABLE AS SELECT operation successfully creates the target table but an error occurs while querying
+ the source table or copying the data, the new table is left behind rather than being dropped.
+
+
+
+ Bug: IMPALA-2005
+
+
+
+ Workaround: Drop the new table manually after a failed CREATE TABLE AS SELECT.
+
+
+
+
+
+
+
+
+
+
+ Casting scenarios with invalid/inconsistent results
+
+
+
+
+ Using a CAST() function to convert large literal values to smaller types, or to convert special values such as
+ NaN or Inf, produces values not consistent with other database systems. This could lead to
+ unexpected results from queries.
+
+
+
+ Bug: IMPALA-1821
+
+
+
+
+
+
+
+
+
+
+
+
+ Support individual memory allocations larger than 1 GB
+
+
+
+
+ The largest single block of memory that Impala can allocate during a query is 1 GiB. Therefore, a query could fail or Impala could
+ crash if a compressed text file resulted in more than 1 GiB of data in uncompressed form, or if a string function such as
+ group_concat() returned a value greater than 1 GiB.
+
+
+
+ Bug: IMPALA-1619
+
+
+ Resolution: Fixed in CDH 5.9.0 / Impala 2.7.0 and CDH 5.8.3 / Impala 2.6.3.
+
+
+
+
+
+
+
+
+
+ Impala Parser issue when using fully qualified table names that start with a number.
+
+
+
+
+ A fully qualified table name starting with a number could cause a parsing error. In a name such as db.571_market,
+ the decimal point followed by digits is interpreted as a floating-point number.
+
+
+
+ Bug: IMPALA-941
+
+
+
+ Workaround: Surround each part of the fully qualified name with backticks (``).
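Using the table name from the example above:

```sql
-- Quote each part of the qualified name separately.
SELECT * FROM `db`.`571_market`;
```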
+
+
+
+
+
+
+
+
+
+
+ Impala should tolerate bad locale settings
+
+
+
+
+ If the LC_* environment variables specify an unsupported locale, Impala does not start.
+
+
+
+ Bug: IMPALA-532
+
+
+
+ Workaround: Add LC_ALL="C" to the environment settings for both the Impala daemon and the Statestore
+ daemon. See for details about modifying these environment settings.
+
+
+
+ Resolution: Fixing this issue would require an upgrade to Boost 1.47 in the Impala distribution.
+
+
+
+
+
+
+
+
+
+
+ Log Level 3 Not Recommended for Impala
+
+
+
+
+ The extensive logging produced by log level 3 can cause serious performance overhead and capacity issues.
+
+
+
+ Workaround: Reduce the log level to its default value of 1, that is, GLOG_v=1. See
+ for details about the effects of setting different logging levels.
+
+
+
+
+
+
+
+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_kudu.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_kudu.xml b/docs/topics/impala_kudu.xml
new file mode 100644
index 0000000..c530cc1
--- /dev/null
+++ b/docs/topics/impala_kudu.xml
@@ -0,0 +1,167 @@
+
+
+
+
+ Using Impala to Query Kudu Tables
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Kudu
+ You can use Impala to query Kudu tables. This capability allows convenient access to a storage system that is
+ tuned for different kinds of workloads than Impala's default storage. The default Impala tables use data files
+ stored on HDFS, which are ideal for bulk loads and queries using full-table scans. In contrast, Kudu can do
+ efficient queries for data organized either in data warehouse style (with full table scans) or for OLTP-style
+ workloads (with key-based lookups for single rows or small ranges of values).
+
+
+
+ Certain Impala SQL statements, such as UPDATE and DELETE, only work with
+ Kudu tables. These operations were impractical from a performance perspective to perform at large scale on
+ HDFS data, or on HBase tables.
+
+
+
+
+
+
+ Benefits of Using Kudu Tables with Impala
+
+
+
+
+ The combination of Kudu and Impala works best for tables where scan performance is important, but data
+ arrives continuously, in small batches, or needs to be updated without being completely replaced. In these
+ scenarios (such as for streaming data), it might be impractical to use Parquet tables because Parquet works
+ best with multi-megabyte data files, requiring substantial overhead to replace or reorganize data files to
+ accommodate frequent additions or changes to data. Impala can query Kudu tables with scan performance close
+ to that of Parquet, and Impala can also perform update or delete operations without replacing the entire
+ table contents. You can also use the Kudu API to do ingestion or transformation operations outside of
+ Impala, and Impala can query the current data at any time.
+
+
+
+
+
+
+
+
+ Primary Key Columns for Kudu Tables
+
+
+
+
+ Kudu tables introduce the notion of primary keys to Impala for the first time. The primary key is made up
+ of one or more columns, whose values are combined and used as a lookup key during queries. These columns
+ cannot contain any NULL values or any duplicate values, and can never be updated. For a
+ partitioned Kudu table, all the partition key columns must come from the set of primary key columns.
+
+
+
+ Impala itself still does not have the notion of unique or non-NULL constraints. These
+ restrictions on the primary key columns are enforced on the Kudu side.
+
+
+
+ The primary key columns must be the first ones specified in the CREATE TABLE statement.
+ You specify which column or columns make up the primary key in the table properties, rather than through
+ attributes in the column list.
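+
+ For illustration only, a minimal sketch of this beta-era syntax. The storage handler class, the
+ kudu.key_columns property name, and the master address are assumptions that may differ in your release:

```sql
-- Hypothetical example: the primary key (id) is declared in the table
-- properties rather than as an attribute in the column list.
CREATE TABLE kudu_example (
  id BIGINT,
  name STRING
)
DISTRIBUTE BY HASH (id) INTO 4 BUCKETS
TBLPROPERTIES (
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'kudu_example',
  'kudu.master_addresses' = 'kudu-master.example.com:7051',
  'kudu.key_columns' = 'id'
);
```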
+
+
+
+ Kudu can do extra optimizations for queries that refer to the primary key columns in the
+ WHERE clause. It is not crucial, though, to include the primary key columns in the
+ WHERE clause of every query. The benefit is mainly for partitioned tables,
+ which divide the data among various tablet servers based on the distribution of
+ data values in some or all of the primary key columns.
+
+
+
+
+
+
+
+
+ Impala DML Support for Kudu Tables
+
+
+
+
+ Impala supports certain DML statements for Kudu tables only. The UPDATE and
+ DELETE statements let you modify data within Kudu tables without rewriting substantial
+ amounts of table data.
+
+
+
+ The INSERT statement for Kudu tables honors the unique and non-NULL
+ requirements for the primary key columns.
+
+
+
+ Because Impala and Kudu do not support transactions, the effects of any INSERT,
+ UPDATE, or DELETE statement are immediately visible. For example, you
+ cannot do a sequence of UPDATE statements and only make the change visible after all the
+ statements are finished. Also, if a DML statement fails partway through, any rows that were already
+ inserted, deleted, or changed remain in the table; there is no rollback mechanism to undo the changes.
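+
+ As a sketch (the table and column names are hypothetical), typical DML against a Kudu table looks like:

```sql
-- Each statement's effect is immediately visible; there is no rollback.
INSERT INTO kudu_example VALUES (42, 'first');  -- subject to the unique,
                                                -- non-NULL primary key checks
UPDATE kudu_example SET name = 'renamed' WHERE id = 42;
DELETE FROM kudu_example WHERE id = 42;
```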
+
+
+
+
+
+
+
+
+ Partitioning for Kudu Tables
+
+
+
+
+ Kudu tables use special mechanisms to evenly distribute data among the underlying tablet servers. Although
+ we refer to such tables as partitioned tables, they are distinguished from traditional Impala partitioned
+ tables by use of different clauses on the CREATE TABLE statement. Partitioned Kudu tables
+ use DISTRIBUTE BY, HASH, RANGE, and SPLIT
+ ROWS clauses rather than the traditional PARTITIONED BY clause. All of the
+ columns involved in these clauses must be primary key columns. These clauses let you specify different ways
+ to divide the data for each column, or even for different value ranges within a column. This flexibility
+ lets you avoid problems with uneven distribution of data, where the partitioning scheme for HDFS tables
+ might result in some partitions being much larger than others. By setting up an effective partitioning
+ scheme for a Kudu table, you can ensure that the work for a query can be parallelized evenly across the
+ hosts in a cluster.
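+
+ A sketch combining these clauses (the table, columns, and split values are hypothetical, and the exact
+ clause syntax may vary by release):

```sql
-- Hash partitioning on host, plus range partitioning on ts; all the
-- partitioning columns are primary key columns. The Kudu storage
-- table properties are omitted for brevity.
CREATE TABLE metrics (
  host STRING,
  ts BIGINT,
  value DOUBLE
)
DISTRIBUTE BY HASH (host) INTO 16 BUCKETS,
RANGE (ts) SPLIT ROWS ((1000000), (2000000));
```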
+
+
+
+
+
+
+
+
+ Impala Query Performance for Kudu Tables
+
+
+
+
+ For queries involving Kudu tables, Impala can delegate much of the work of filtering the result set to
+ Kudu, avoiding some of the I/O involved in full table scans of tables containing HDFS data files. This type
+ of optimization is especially effective for partitioned Kudu tables, where the Impala query
+ WHERE clause refers to one or more primary key columns that are also used as partition key
+ columns. For example, if a partitioned Kudu table uses a HASH clause for
+ col1 and a RANGE clause for col2, a query using a clause
+ such as WHERE col1 IN (1,2,3) AND col2 > 100 can determine exactly which tablet servers
+ contain relevant data, and therefore parallelize the query very efficiently.
+
+
+
+
+
+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_langref.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref.xml b/docs/topics/impala_langref.xml
new file mode 100644
index 0000000..f81b76f
--- /dev/null
+++ b/docs/topics/impala_langref.xml
@@ -0,0 +1,74 @@
+
+
+
+
+ Impala SQL Language Reference
+ SQL Reference
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Impala uses SQL as its query language. To protect user investment in skills development and query
+ design, Impala provides a high degree of compatibility with the Hive Query Language (HiveQL):
+
+
+
+ -
+ Because Impala uses the same metadata store as Hive to record information about table structure and
+ properties, Impala can access tables defined through the native Impala CREATE TABLE
+ command, or tables created using the Hive data definition language (DDL).
+
+
+ -
+ Impala supports data manipulation (DML) statements similar to the DML component of HiveQL.
+
+
+ -
+ Impala provides many built-in functions with the same
+ names and parameter types as their HiveQL equivalents.
+
+
+
+
+ Impala supports most of the same statements and
+ clauses as HiveQL, including, but not limited to, JOIN, AGGREGATE,
+ DISTINCT, UNION ALL, ORDER BY, LIMIT, and
+ (uncorrelated) subqueries in the FROM clause. Impala also supports INSERT
+ INTO and INSERT OVERWRITE.
+
+
+
+ Impala supports data types with the same names and semantics as the equivalent Hive data types:
+ STRING, TINYINT, SMALLINT, INT,
+ BIGINT, FLOAT, DOUBLE, BOOLEAN,
+ TIMESTAMP.
+
+
+
+ For full details about Impala SQL syntax and semantics, see
+ .
+
+
+
+ Most HiveQL SELECT and INSERT statements run unmodified with Impala. For
+ information about Hive syntax not available in Impala, see
+ .
+
+
+
+ For a list of the built-in functions available in Impala queries, see
+ .
+
+
+
+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_langref_sql.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref_sql.xml b/docs/topics/impala_langref_sql.xml
new file mode 100644
index 0000000..18b6726
--- /dev/null
+++ b/docs/topics/impala_langref_sql.xml
@@ -0,0 +1,35 @@
+
+
+
+
+ Impala SQL Statements
+ SQL Statements
+
+
+
+
+
+
+
+
+
+
+
+
+ The Impala SQL dialect supports a range of standard elements, plus some extensions for Big Data use cases
+ related to data loading and data warehousing.
+
+
+
+
+ In the impala-shell interpreter, a semicolon at the end of each statement is required.
+ Since the semicolon is not actually part of the SQL syntax, we do not include it in the syntax definition
+ of each statement, but we do show it in examples intended to be run in impala-shell.
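+
+ For example (the table name is hypothetical):

```sql
-- In impala-shell, the trailing semicolon terminates the statement.
SELECT COUNT(*) FROM my_table;
```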
+
+
+
+
+ The following sections show the major SQL statements that you work with in Impala:
+
+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_langref_unsupported.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_langref_unsupported.xml b/docs/topics/impala_langref_unsupported.xml
new file mode 100644
index 0000000..82910d6
--- /dev/null
+++ b/docs/topics/impala_langref_unsupported.xml
@@ -0,0 +1,312 @@
+
+
+
+
+ SQL Differences Between Impala and Hive
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Hive
+ HiveQL
+ Impala's SQL syntax follows the SQL-92 standard, and includes many industry extensions in areas such as
+ built-in functions. See for a general discussion of adapting SQL
+ code from a variety of database systems to Impala.
+
+
+
+ Because Impala and Hive share the same metastore database and their tables are often used interchangeably,
+ the following sections cover differences between Impala and Hive in detail.
+
+
+
+
+
+
+
+ HiveQL Features not Available in Impala
+
+
+
+
+ The current release of Impala does not support the following SQL features that you might be familiar with
+ from HiveQL:
+
+
+
+
+
+
+
+ -
+ Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.
+
+
+ -
+ The DATE data type.
+
+
+ -
+ XML and JSON functions.
+
+
+ -
+ Certain aggregate functions from HiveQL: covar_pop, covar_samp,
+ corr, percentile, percentile_approx,
+ histogram_numeric, collect_set; Impala supports the set of aggregate
+ functions listed in and analytic
+ functions listed in .
+
+
+ -
+ Sampling.
+
+
+ -
+ Lateral views. In and higher, Impala supports queries on complex types
+ (STRUCT, ARRAY, or MAP), using join notation
+ rather than the EXPLODE() keyword.
+ See for details about Impala support for complex types.
+
+
+ -
+ Multiple DISTINCT clauses per query, although Impala includes some workarounds for this
+ limitation.
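+
+ One commonly used workaround substitutes the NDV() function, an approximate distinct count, for all
+ but one of the exact COUNT(DISTINCT) expressions (the table and column names are hypothetical):

```sql
-- Impala rejects two exact COUNT(DISTINCT) expressions in one query;
-- NDV() provides an approximate count as a substitute for the second.
SELECT COUNT(DISTINCT country) AS distinct_countries,
       NDV(city) AS approx_distinct_cities
FROM customers;
```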
+
+
+
+
+
+ User-defined functions (UDFs) are supported starting in Impala 1.2. See
+ for full details on Impala UDFs.
+
+ -
+
+ Impala supports high-performance UDFs written in C++, as well as reusing some Java-based Hive UDFs.
+
+
+
+ -
+
+ Impala supports scalar UDFs and user-defined aggregate functions (UDAFs). Impala does not currently
+ support user-defined table generating functions (UDTFs).
+
+
+
+ -
+
+ Only Impala-supported column types are supported in Java-based UDFs.
+
+
+
+
+
+
+
+
+
+ Impala does not currently support these HiveQL statements:
+
+
+
+ -
+ ANALYZE TABLE (the Impala equivalent is COMPUTE STATS)
+
+
+ -
+ DESCRIBE COLUMN
+
+
+ -
+ DESCRIBE DATABASE
+
+
+ -
+ EXPORT TABLE
+
+
+ -
+ IMPORT TABLE
+
+
+ -
+ SHOW TABLE EXTENDED
+
+
+ -
+ SHOW INDEXES
+
+
+ -
+ SHOW COLUMNS
+
+
+ -
+ INSERT OVERWRITE DIRECTORY; use INSERT OVERWRITE table_name
+ or CREATE TABLE AS SELECT to materialize query results into the HDFS directory associated
+ with an Impala table.
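+
+ For example, the Impala alternatives mentioned above can be sketched as (the table names are
+ hypothetical):

```sql
-- Instead of Hive's ANALYZE TABLE:
COMPUTE STATS sales;

-- Instead of INSERT OVERWRITE DIRECTORY, materialize the query results
-- into the HDFS directory associated with an Impala table:
CREATE TABLE sales_summary AS
SELECT region, SUM(amount) AS total_amount
FROM sales
GROUP BY region;
```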
+
+
+
+
+
+
+
+ Semantic Differences Between Impala and HiveQL Features
+
+
+
+
+ This section covers instances where Impala and Hive have similar functionality, sometimes including the
+ same syntax, but differ in the runtime semantics of those features.
+
+
+
+ Security:
+
+
+
+ Impala utilizes the Apache
+ Sentry authorization framework, which provides fine-grained role-based access control
+ to protect data against unauthorized access or tampering.
+
+
+
+ The Hive component included in CDH 5.1 and higher now includes Sentry-enabled GRANT,
+ REVOKE, and CREATE/DROP ROLE statements. Earlier Hive releases had a
+ privilege system with GRANT and REVOKE statements that were primarily
+ intended to prevent accidental deletion of data, rather than a security mechanism to protect against
+ malicious users.
+
+
+
+ Impala can make use of privileges set up through Hive GRANT and REVOKE statements.
+ Impala has its own GRANT and REVOKE statements in Impala 2.0 and higher.
+ See for the details of authorization in Impala, including
+ how to switch from the original policy file-based privilege model to the Sentry service using privileges
+ stored in the metastore database.
+
+
+
+ SQL statements and clauses:
+
+
+
+ The semantics of Impala SQL statements vary from HiveQL in some cases where they use similar SQL
+ statement and clause names:
+
+
+
+ -
+ Impala uses different syntax and names for query hints, [SHUFFLE] and
+ [NOSHUFFLE] rather than MapJoin or StreamJoin. See
+ for the Impala details.
+
+
+ -
+ Impala does not expose MapReduce-specific features such as SORT BY, DISTRIBUTE
+ BY, or CLUSTER BY.
+
+
+ -
+ Impala does not require queries to include a FROM clause.
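+
+ Two of these differences, sketched in Impala syntax (the table names are hypothetical):

```sql
-- An Impala-style join hint in square brackets:
SELECT t1.id, t2.name
FROM t1 JOIN [SHUFFLE] t2 ON t1.id = t2.id;

-- A query that needs no FROM clause:
SELECT now();
```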
+
+
+
+
+ Data types:
+
+
+
+
+
+ Miscellaneous features:
+
+
+
+ -
+ Impala does not provide virtual columns.
+
+
+ -
+ Impala does not expose locking.
+
+
+ -
+ Impala does not expose some configuration properties.
+
+
+
+
+