impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <>
Subject [Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page
Date Tue, 24 Jan 2017 23:52:22 GMT
Dimitris Tsirogiannis has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page

Patch Set 7:

File docs/shared/impala_common.xml:

PS7, Line 3736: The <codeph>REFRESH</codeph> and <codeph>INVALIDATE METADATA</codeph>
              :         statements are needed less frequently for Kudu tables than for
              :         HDFS-backed tables. Neither statement is needed when data is
              :         added to, removed, or updated in a Kudu table, even if the changes
              :         are made directly to Kudu through a client program using the Kudu
              :         Run <codeph>REFRESH <varname>table_name</varname></codeph>
              :         <codeph>INVALIDATE METADATA <varname>table_name</varname></codeph>
              :         for a Kudu table only after making a change to the Kudu table schema,
              :         such as adding or dropping a column, by a mechanism other than
              :         Impala.
I would start by saying when to use REFRESH and INVALIDATE METADATA for Kudu table and then
talk about the cases where it is not required.

PS7, Line 3751: internal
Don't we also call them "managed"? Not sure.

PS7, Line 3749: The distinction between internal and external tables has some special
              :         details for Kudu tables. Tables created entirely through Impala are
              :         internal tables.
It also has "special details" for non-Kudu tables. I would simply say "Impala supports both
internal (managed) and external Kudu tables. Internal tables.... "
File docs/topics/impala_alter_table.xml:

PS7, Line 833: can specify
"can only specify"
File docs/topics/impala_create_table.xml:

PS7, Line 338: clauses
I think you forgot caching which is also not supported for Kudu.

PS7, Line 352: <codeph>CREATE TABLE AS SELECT</codeph>
This is wrong. We do support CTAS for internal Kudu tables. See for some
File docs/topics/impala_drop_table.xml:

PS7, Line 160: managed
It's good to be consistent. In some places we use "managed" and in others "internal".
File docs/topics/impala_explain.xml:

PS7, Line 250:  
"in a SCAN KUDU node"

PS7, Line 251: , and might involve transmitting
             :       non-matching rows that are filtered out on the Impala side.
File docs/topics/impala_kudu.xml:

PS7, Line 76: scan performance close to that of Parquet
Make sure you check with Mostafa about this claim.

PS7, Line 147: The work is parallelized
             :               across units of computing called <term>tablet servers</term>.
I believe the unit of computing is the tablet not the tablet server, unless I am misinterpreting
the term. Please check with Todd.

PS7, Line 149: between the tablets and tablet servers.
I think it would be nice to mention that the user is not responsible for managing this mapping
of tablet to tablet server.

It is weird to mention CREATE without DROP. Please remove this and mention in the paragraph
below that you can also drop Kudu tables through Impala.

PS7, Line 397: <p>
             :             For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph>
             :             values, because it is not practical to enforce a <q>not null</q>
constraint on HDFS
             :             data files that could be prepared using external tools and ETL
             :           </p>
             :           <p conref="../shared/impala_common.xml#common/pk_implies_not_null"/>
             :           <p>
             :             For example, a table containing geographic information might require
the latitude
             :             and longitude coordinates to always be specified. Other attributes
might be allowed
             :             to be <codeph>NULL</codeph>. For example, a location
might not have a designated
             :             place name, its altitude might be unimportant, and its population
might be initially
             :             unknown, to be filled in later.
             :           </p>
You may want to swap these two paragraphs.

Impala still uses "PARTITIONED BY" for HDFS tables.

PS7, Line 727: . By setting
             :           up an effective partitioning scheme for a Kudu table, you can ensure
that the work for
             :           a query can be parallelized evenly across the hosts in a cluster.
Remove. Sometimes the goal is to scan as little as possible. You can say that more flexible
schema designs allows you to perform more elaborate tuning based on the characteristics of
the workload or something along these lines.

PS7, Line 936: To see the distribution of data in a Kudu table across the underlying buckets
             :             partitions, use the <codeph>SHOW TABLE STATS</codeph>
This is unfortunately not accurate. SHOW TABLE STATS will only show the distribution of replicas
to tablet servers. Also, it will always show -1 in the #rows column and the key ranges in
hex, so the output is confusing. I would just remove this paragraph.

PS7, Line 1122: change

PS7, Line 1159: strong consistency for order of operations
I am not sure I know what that means.

PS7, Line 1159: total
              :         success or total failure of a multi-row statement
This is "atomicity". Maybe just mention atomic multi-row statements.

PS7, Line 1160: or data that is read while a write
              :         operation is in progress

PS7, Line 1288: <title>Memory Usage for Operations on Kudu Tables</title>
              :       <conbody>
              :         <p>
              :           The Apache Kudu architecture, topology, and data storage techniques
result in
              :           different patterns of memory usage for Impala statements than with
HDFS-backed tables.
              :         </p>
I don't find this particularly informative and suggest we remove it unless someone objects.
File docs/topics/impala_literals.xml:

PS7, Line 408: most Impala tables
Impala tables backed by HDFS or S3? "most" is kind of vague
File docs/topics/impala_revoke.xml:

PS7, Line 115: access to a Kudu table is <q>all or nothing</q>.
"only table-level permissions are enforced in Kudu tables. Column-level permissions as well
as permissions on certain operations, such as INSERT, are not supported. "
File docs/topics/impala_tables.xml:

PS7, Line 293: using the Apache Kudu storage system
"stored in Apache Kudu"

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <>
Gerrit-Reviewer: Ambreen Kazi <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Jean-Daniel Cryans <>
Gerrit-Reviewer: John Russell <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-Reviewer: Todd Lipcon <>
Gerrit-HasComments: Yes

View raw message