From: jbapple@apache.org
To: commits@impala.incubator.apache.org
Date: Thu, 17 Nov 2016 23:12:19 -0000
Subject: [41/51] [partial] incubator-impala git commit: IMPALA-3398: Add docs to main Impala branch.

http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_char.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_char.xml b/docs/topics/impala_char.xml
new file mode 100644
index 0000000..0298d57
--- /dev/null
+++ b/docs/topics/impala_char.xml
@@ -0,0 +1,278 @@

CHAR Data Type (Impala 2.0 or higher only)
CHAR

CHAR data type
A fixed-length character type, padded with trailing spaces if necessary to reach the specified length. If a
value is longer than the specified length, Impala truncates the characters beyond that length.

+ +

+ +

+ In the column definition of a CREATE TABLE statement: +

+ +column_name CHAR(length) + +

+ The maximum length you can specify is 255. +
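For illustration only, here is a minimal sketch of this syntax inside a CREATE TABLE statement (the table and column names are hypothetical, not taken from the examples later in this topic):

create table example_codes
(
  id bigint,
  country_code char(2)   -- fixed-length column; the maximum length is 255
);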

+ +

+ Semantics of trailing spaces: +

+ +
    +
  • When you store a CHAR value shorter than the specified length in a table, queries return
    the value padded with trailing spaces if necessary; the resulting value has the same length as specified in
    the column definition.

  • If you store a CHAR value containing trailing spaces in a table, those trailing spaces are
    not stored in the data file. When the value is retrieved by a query, the result could have a different
    number of trailing spaces. That is, the value includes however many spaces are needed to pad it to the
    specified length of the column.

  • If you compare two CHAR values that differ only in the number of trailing spaces, those
    values are considered identical.
+ +

+ +

+ +

+ +

    +
  • This type can be read from and written to Parquet files.

  • There is no requirement for a particular level of Parquet.

  • Parquet files generated by Impala and containing this type can be freely interchanged with other components
    such as Hive and MapReduce.

  • Any trailing spaces, whether implicitly or explicitly specified, are not written to the Parquet data files
    (see the sketch after this list).

  • Parquet data files might contain values that are longer than allowed by the
    CHAR(n) length limit. Impala ignores any extra trailing characters when
    it processes those values during a query.
+ +
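As a rough sketch of how these points look in practice (the table name and value are hypothetical, used only for illustration), a CHAR column is declared in a Parquet table like any other column, and the stored values do not include the padding spaces:

-- Hypothetical example: a CHAR column in a Parquet table.
create table char_parquet (code char(4)) stored as parquet;

-- The value is logically padded to 4 characters when queried, but the
-- trailing spaces are not written to the Parquet data file.
insert into char_parquet values (cast('ab' as char(4)));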

+ +

+ Text data files might contain values that are longer than allowed for a particular + CHAR(n) column. Any extra trailing characters are ignored when Impala + processes those values during a query. Text data files can also contain values that are shorter than the + defined length limit, and Impala pads them with trailing spaces up to the specified length. Any text data + files produced by Impala INSERT statements do not include any trailing blanks for + CHAR columns. +

+ +

Avro considerations:

+

+ +

+ +

+ This type is available using Impala 2.0 or higher under CDH 4, or with Impala on CDH 5.2 or higher. There are + no compatibility issues with other components when exchanging data files or running Impala on CDH 4. +

+ +

+ Some other database systems make the length specification optional. For Impala, the length is required. +
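As a small sketch of this rule (hypothetical table and column names, shown only for illustration):

-- Accepted: the length is specified explicitly.
create table t1 (c char(10));

-- Not accepted by Impala: CHAR with no length, which some other database
-- systems treat as shorthand for CHAR(1).
-- create table t2 (c char);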

+ + + +

+ +

+ +

+ + + +

+ +

+ +

+ These examples show how trailing spaces are not considered significant when comparing or processing + CHAR values. CAST() truncates any longer string to fit within the defined + length. If a CHAR value is shorter than the specified length, it is padded on the right with + spaces until it matches the specified length. Therefore, LENGTH() represents the length + including any trailing spaces, and CONCAT() also treats the column value as if it has + trailing spaces. +

+ +select cast('x' as char(4)) = cast('x ' as char(4)) as "unpadded equal to padded"; ++--------------------------+ +| unpadded equal to padded | ++--------------------------+ +| true | ++--------------------------+ + +create table char_length(c char(3)); +insert into char_length values (cast('1' as char(3))), (cast('12' as char(3))), (cast('123' as char(3))), (cast('123456' as char(3))); +select concat("[",c,"]") as c, length(c) from char_length; ++-------+-----------+ +| c | length(c) | ++-------+-----------+ +| [1 ] | 3 | +| [12 ] | 3 | +| [123] | 3 | +| [123] | 3 | ++-------+-----------+ + + +

This example shows a case where data values are known to have a specific length, making CHAR
a logical data type to use.

+ +create table addresses + (id bigint, + street_name string, + state_abbreviation char(2), + country_abbreviation char(2)); + + +

+ The following example shows how values written by Impala do not physically include the trailing spaces. It + creates a table using text format, with CHAR values much shorter than the declared length, + and then prints the resulting data file to show that the delimited values are not separated by spaces. The + same behavior applies to binary-format Parquet data files. +

+ +create table char_in_text (a char(20), b char(30), c char(40)) + row format delimited fields terminated by ','; + +insert into char_in_text values (cast('foo' as char(20)), cast('bar' as char(30)), cast('baz' as char(40))), (cast('hello' as char(20)), cast('goodbye' as char(30)), cast('aloha' as char(40))); + +-- Running this Linux command inside impala-shell using the ! shortcut. +!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*'; +foo,bar,baz +hello,goodbye,aloha + + +

+ The following example further illustrates the treatment of spaces. It replaces the contents of the previous + table with some values including leading spaces, trailing spaces, or both. Any leading spaces are preserved + within the data file, but trailing spaces are discarded. Then when the values are retrieved by a query, the + leading spaces are retrieved verbatim while any necessary trailing spaces are supplied by Impala. +

+ +insert overwrite char_in_text values (cast('trailing ' as char(20)), cast(' leading and trailing ' as char(30)), cast(' leading' as char(40))); +!hdfs dfs -cat 'hdfs://127.0.0.1:8020/user/hive/warehouse/impala_doc_testing.db/char_in_text/*.*'; +trailing, leading and trailing, leading + +select concat('[',a,']') as a, concat('[',b,']') as b, concat('[',c,']') as c from char_in_text; ++------------------------+----------------------------------+--------------------------------------------+ +| a | b | c | ++------------------------+----------------------------------+--------------------------------------------+ +| [trailing ] | [ leading and trailing ] | [ leading ] | ++------------------------+----------------------------------+--------------------------------------------+ + + +

+ +

+ Because the blank-padding behavior requires allocating the maximum length for each value in memory, for + scalability reasons avoid declaring CHAR columns that are much longer than typical values in + that column. +

+ +

+ +

+ When an expression compares a CHAR with a STRING or + VARCHAR, the CHAR value is implicitly converted to STRING + first, with trailing spaces preserved. +

+ +select cast("foo " as char(5)) = 'foo' as "char equal to string"; ++----------------------+ +| char equal to string | ++----------------------+ +| false | ++----------------------+ + + +

+ This behavior differs from other popular database systems. To get the expected result of + TRUE, cast the expressions on both sides to CHAR values of the appropriate + length: +

+ +select cast("foo " as char(5)) = cast('foo' as char(3)) as "char equal to string"; ++----------------------+ +| char equal to string | ++----------------------+ +| true | ++----------------------+ + + +

+ This behavior is subject to change in future releases. +

+ +

+ +


+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_cluster_sizing.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_cluster_sizing.xml b/docs/topics/impala_cluster_sizing.xml
new file mode 100644
index 0000000..382f68c
--- /dev/null
+++ b/docs/topics/impala_cluster_sizing.xml
@@ -0,0 +1,353 @@

Cluster Sizing Guidelines for Impala
Cluster Sizing

+ cluster sizing + This document provides a very rough guideline to estimate the size of a cluster needed for a specific + customer application. You can use this information when planning how much and what type of hardware to + acquire for a new cluster, or when adding Impala workloads to an existing cluster. +

+ + + Before making purchase or deployment decisions, consult your Cloudera representative to verify the + conclusions about hardware requirements based on your data volume and workload. + + + + +

Always use hosts with identical specifications and capacities for all the nodes in the cluster. Currently,
Impala divides the work evenly between cluster nodes, regardless of their exact hardware configuration.
Because work can be distributed in different ways for different queries, if some hosts are overloaded
compared to others in terms of CPU, memory, I/O, or network, you might experience inconsistent performance
and overall slowness.

+ +

For analytic workloads with star/snowflake schemas, and using consistent hardware for all nodes (64 GB RAM,
twelve 2 TB hard drives, 2x E5-2630L CPUs with 12 cores total, 10 Gigabit network), the following table
estimates the number of DataNodes needed in the cluster based on data size and the number of concurrent
queries, for workloads similar to TPC-DS benchmark queries:

Cluster size estimation based on the number of concurrent queries and data size,
with a 20-second average query response time:

+-----------+---------+------------+-------------+--------------+--------------+
| Data Size | 1 query | 10 queries | 100 queries | 1000 queries | 2000 queries |
+-----------+---------+------------+-------------+--------------+--------------+
| 250 GB    | 2       | 2          | 5           | 35           | 70           |
| 500 GB    | 2       | 2          | 10          | 70           | 135          |
| 1 TB      | 2       | 2          | 15          | 135          | 270          |
| 15 TB     | 2       | 20         | 200         | N/A          | N/A          |
| 30 TB     | 4       | 40         | 400         | N/A          | N/A          |
| 60 TB     | 8       | 80         | 800         | N/A          | N/A          |
+-----------+---------+------------+-------------+--------------+--------------+
+ +
+ + Factors Affecting Scalability + +

+ A typical analytic workload (TPC-DS style queries) using recommended hardware is usually CPU-bound. Each + node can process roughly 1.6 GB/sec. Both CPU-bound and disk-bound workloads can scale almost linearly with + cluster size. However, for some workloads, the scalability might be bounded by the network, or even by + memory. +

+ +

If the workload is already network-bound (on a 10 Gigabit network), increasing the cluster size will not
reduce the network load; in fact, a larger cluster could increase network traffic because some queries
involve broadcast operations to all DataNodes. Therefore, boosting the cluster size does not improve query
throughput in a network-constrained environment.

+ +

Let's look at a memory-bound workload. A workload is memory-bound if Impala cannot run any additional
concurrent queries because all the memory allocated to Impala has already been consumed, but neither CPU,
disk, nor network is saturated yet. This can happen because currently Impala uses only a single core per
node to process join and aggregation queries. For a node with 128 GB of RAM, if the join node in a query
plan takes 50 GB of memory, the system cannot run more than 2 such queries at the same time.

+ +

Therefore, at most 2 cores are used per node and the CPU is not saturated. Throughput can still scale almost
linearly even for a memory-bound workload, but per-node throughput will be lower than 1.6 GB/sec. In this
situation, consider increasing the memory per node.

+ +

+ As long as the workload is not network- or memory-bound, we can use the 1.6 GB/second per node as the + throughput estimate. +

+
+ +
+ + A More Precise Approach + +

A more precise sizing estimate requires not only the queries per minute (QPM), but also the average data
size scanned per query (D). With a proper partitioning strategy, D is usually a fraction of the total data
size. The following equation can be used as a rough guide to estimate the number of nodes (N) needed; the
100 GB divisor reflects the per-node processing rate discussed above (1.6 GB/second is roughly 100 GB per
minute):

+ +Eq 1: N > QPM * D / 100 GB + + +

Here is an example. Suppose, on average, a query scans 50 GB of data and the average response time is
required to be 15 seconds or less when there are 100 concurrent queries. Then QPM = 100 / 15 * 60 = 400. We
can estimate the number of nodes using the equation above.

+ +N > QPM * D / 100GB +N > 400 * 50GB / 100GB +N > 200 + + +

+ Because this figure is a rough estimate, the corresponding number of nodes could be between 100 and 500. +

+ +

Depending on the complexity of the query, the processing rate might vary. If the query has more joins,
aggregation functions, or CPU-intensive functions such as string processing or complex UDFs, the processing
rate will be lower than 1.6 GB/second per node. On the other hand, if the query only does scans and
filtering on numbers, the processing rate can be higher.

+
+ +
+ + Estimating Memory Requirements + + +

+ Impala can handle joins between multiple large tables. Make sure that statistics are collected for all the + joined tables, using the COMPUTE + STATS statement. However, joining big tables does consume more memory. Follow the steps + below to calculate the minimum memory requirement. +

+ +

+ Suppose you are running the following join: +

+ +select a.*, b.col_1, b.col_2, … b.col_n +from a, b +where a.key = b.key +and b.col_1 in (1,2,4...) +and b.col_4 in (....); + + +

+ And suppose table B is smaller than table A (but still a large table). +
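In practice, statistics for both tables would be gathered first with the COMPUTE STATS statement mentioned above; a minimal sketch using the table names from this example:

-- Collect table and column statistics so the planner can estimate
-- join sizes and memory requirements.
compute stats a;
compute stats b;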

+ +

The memory requirement for the query is that the size of the right-hand table (B), after
decompression, filtering (b.col_n in ...), and projection (retaining only certain columns), must be less
than the total memory of the entire cluster.

+ +Cluster Total Memory Requirement = Size of the smaller table * + selectivity factor from the predicate * + projection factor * compression ratio + + +

In this case, assume that table B is 100 TB in Parquet format with 200 columns. The
predicate on B (b.col_1 in ... and b.col_4 in ...) selects only 10% of
the rows from B, and the projection retains only 5 of the 200 columns.
Snappy compression typically gives about 3 times compression, so we estimate a 3x compression factor.

+ +Cluster Total Memory Requirement = Size of the smaller table * + selectivity factor from the predicate * + projection factor * compression ratio + = 100TB * 10% * 5/200 * 3 + = 0.75TB + = 750GB + + +

So, if you have a 10-node cluster where each node has 128 GB of RAM and you allocate 80% of that memory to
Impala, you have roughly 1 TB of usable memory for Impala, which is more than 750 GB. Therefore, your
cluster can handle join queries of this magnitude.

+
+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_cm_installation.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_cm_installation.xml b/docs/topics/impala_cm_installation.xml
new file mode 100644
index 0000000..2cc2ac5
--- /dev/null
+++ b/docs/topics/impala_cm_installation.xml
@@ -0,0 +1,56 @@

Installing Impala with Cloudera Manager

+ Before installing Impala through the Cloudera Manager interface, make sure all applicable nodes have the + appropriate hardware configuration and levels of operating system and CDH. See + for details. +

+ + +

+ To install the latest Impala under CDH 4, upgrade Cloudera Manager to 4.8 or higher. Cloudera Manager 4.8 is + the first release that can manage the Impala catalog service introduced in Impala 1.2. Cloudera Manager 4.8 + requires this service to be present, so if you upgrade to Cloudera Manager 4.8, also upgrade Impala to the + most recent version at the same time. + +

+
+ +

+ For information on installing Impala in a Cloudera Manager-managed environment, see + Installing Impala. +

+ +

+ Managing your Impala installation through Cloudera Manager has a number of advantages. For example, when you + make configuration changes to CDH components using Cloudera Manager, it automatically applies changes to the + copies of configuration files, such as hive-site.xml, that Impala keeps under + /etc/impala/conf. It also sets up the Hive Metastore service that is required for + Impala running under CDH 4.1. +

+ +

+ In some cases, depending on the level of Impala, CDH, and Cloudera Manager, you might need to add particular + component configuration details in some of the free-form option fields on the Impala configuration pages + within Cloudera Manager. +

+
+
http://git-wip-us.apache.org/repos/asf/incubator-impala/blob/3be0f122/docs/topics/impala_comments.xml
----------------------------------------------------------------------
diff --git a/docs/topics/impala_comments.xml b/docs/topics/impala_comments.xml
new file mode 100644
index 0000000..07531dc
--- /dev/null
+++ b/docs/topics/impala_comments.xml
@@ -0,0 +1,53 @@

Comments

+ comments (SQL) + Impala supports the familiar styles of SQL comments: +

+ +
    +
  • All text from a -- sequence to the end of the line is considered a comment and ignored.
    This type of comment can occur on a single line by itself, or after all or part of a statement.

  • All text from a /* sequence to the next */ sequence is considered a
    comment and ignored. This type of comment can stretch over multiple lines. This type of comment can occur
    on one or more lines by itself, in the middle of a statement, or before or after a statement.
+ +

+ For example: +

+ +-- This line is a comment about a table. +create table ...; + +/* +This is a multi-line comment about a query. +*/ +select ...; + +select * from t /* This is an embedded comment about a query. */ where ...; + +select * from t -- This is a trailing comment within a multi-line command. +where ...; + +
+