impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Russell (Code Review)" <>
Subject [Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page
Date Fri, 20 Jan 2017 06:49:55 GMT
John Russell has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page

Patch Set 6:

File docs/topics/impala_invalidate_metadata.xml:

PS4, Line 245: <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
             :     <p conref="../shared/impala_common.xml#common/kudu_metadata_intro"/>
> Where do we talk about Kudu table metadata? We need to mention that source 
OK, I'm going to implement that with a #include-style mechanism to use the same wording under
both REFRESH and INVALIDATE METADATA. The actual text will go in docs/shared/impala_common.xml.
File docs/topics/impala_kudu.xml:

PS4, Line 39: d by the Apache Kudu componen
> stored by Apache Kudu
OK. I still phrase it as "the Apache Kudu component" just because trademark rules say to always
use the trademark as an adjective. E.g. "the ___ database system", "the ___ product", "the
___ storage engine", etc.

PS4, Line 98: ultiple Kudu hosts 
> should say "separated by commas"

> should say which table property to set

PS4, Line 147: physicall
> I'd say these are physical

PS4, Line 149: tablet servers is independent of the number of DataNodes in the cluster.
> One question that someone might have is whether tablet servers and data nod
I will mention that it's possible but not required. That's something I got from Juan's training
session. Is there some nuance where, e.g. a join of a Kudu table with a Parquet table would
benefit from tablet server and impalad running on the same host?

PS4, Line 156: 
             :           One consideration for the cluster topology is that the number of
replicas for a Kudu table
             :           must be odd.
             :         </p>
> I'd scrap this whole <p>
Leave the first sentence or not? There was a fair bit of discussion of tablets and tablet
servers in the JD and Juan training sessions. I feel like I'm undercovering it from the Impala
side. Although I'm happy to leave that for a future update, or link over to the Kudu docs
for such detail.

PS4, Line 200: umns, whose values are combined and
> worth noting that it's the _tuple_ that has to be unique. EG you can have P

PS4, Line 201:  be unique and cannot contain any
             :           <codeph>NULL</codeph> v
> Remove. There are only partitioned Kudu tables.

PS4, Line 221: On the logical side, the uniqueness constraint allows you to avoid duplicate
data in a table.
             :               For example, if an <codeph>INSERT</codeph> operation
fails partway through, only some of the
             :               new rows might be present in the table. You can re-run the same
<codeph>INSERT</codeph>, and
             :               only the missing rows will be added. Or if data in the table
is stale, you can run an
             :               <codeph>UPSERT</codeph> statement that brings 
> Why blending the semantics of upsert and insert with the description of pri
This is a subject I see come up again and again in Kudu discussions, so I might need to include
it in multiple places -- when introducing the notion of the PK uniqueness constraint, in the
reference pages for INSERT / UPSERT, and in discussions of ETL procedures. Why don't I leave
it here for the time being and see what user feedback is like.

PS4, Line 233: codeph> clauses and <codeph>NOT NULL</codeph>
> How about the nullability restrictions on non-primary key columns. Where ar
I'll take out the "on the primary key columns" wording.

PS4, Line 247: onbody>
             :         <p>
             :           For the general syntax of the <codeph>CREATE TABLE</codeph>
             :           st
> I am not sure I understand the purpose of this paragraph here. This either 
I'll remove it from here.

PS4, Line 269: 
             :         <p outputclass="toc inpage">
             :           See the following sections for details about each column attribute.
             :         </p>
             :       </conbody>
> Wouldn't it make more sense to first describe the Kudu-specific CREATE TABL
I've rearranged this area to be under a "DDL" subheading, and added a couple of links back
to the CREATE TABLE page. It'll be a work item in the future to fine-tune how much syntax
gets repeated here, or how much detail and examples goes into the SQL Ref pages for DDL and
DML statements.

PS4, Line 299: 
             :           <p>
> Even though this is the place to formally define the syntax for PRIMARY KEY

PS4, Line 368: 
             :             The contents of the primary key columns cannot be changed by an
             :             <codeph>UPDATE</codeph> or <co
> Remove.
Why don't I rephrase it a little. Juan's presentation mentioned how performance could suffer
if there were more then 5-6 columns in the primary key.

PS4, Line 383: ept>
             :       <concept id="kudu_not_null_attribute">
> Only primary key columns can be used in the PARTITION BY clause. This is a 

PS4, Line 399: <p>
             :             For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph>
             :             values, because it is not practical to enforce a <q>not null</q>
constraint on HDFS
             :             data files that could be prepared using external tools and ETL
             :           </p>
> Move below L411.

PS4, Line 407: 
             :             For example, a table containing geographic information m
> I would rephrase it to something like "If an application requires a field t

Line 438:           </p>
> Maybe we should extend this paragraph and say that for those reasons, it is

PS4, Line 449: 
             :       <concept id="kudu_default_attribute">
> Do you suggest that this is not executed because we know it doesn't return 
This deduction was disproved by the EXPLAIN plan, so removing this example.

PS4, Line 460:             function calls.
             :           </p>
> that's true, but I think it's good practice to specify NOT NULL regardless,
I won't hint at future features, but why don't I say the code is more self-describing with
NOT NULL included.

PS4, Line 531: 
             :               <!-- GROUP_VARINT is internal use only, not documenting that
> I would mention the encodings and maybe some guidelines on which one to use
Got some more details on these from Dan Burkert. For the real heavy lifting of definitions
and usage, I'll link to the Kudu docs.

I see the Impala parser accepts GROUP_VARINT but that gave me an error applying it to TINYINT
or BIGINT columns. Does it serve any purpose right now?

Line 553:               <li>
> I'd disagree with this -- in many cases encodings have a net speedup - ther

PS4, Line 567: >
             :             The following example shows the Impala keywords representing the
encoding types.
             :             (The Impala keywords match the symbolic names used within Kudu.)
             :             For usage guidelines on the different kinds of encoding, see
             :             <xref href=""
scope="external" format="html">the Kudu documentation</xref>.
             :             The <codeph>DESCRIBE</codeph> output
> Every compression alg has a tradeoff of compression ratio and cpu overhead 
I'll ping him offline rather than add him as a reviewer and dump the whole thing on him.

PS4, Line 592: ------+-----------------+
> that's slightly unexpected. I think a better example for dictionary would b

PS4, Line 597:    | AUTO_ENCODING   |
> seems somewhat contrived. Prefix encoding isn't super useful for most cases
OK, I'll take out this sentence entirely.

PS4, Line 601: 
             : | c7   | string  | false       | true     | PREFIX_ENCODING |
             : +------+---------+-------------+----------+-----------------+
             : </codeblock>
> hrm, I'm not sure I'd draw such a distinction between LZ4 and SNAPPY. In fa

PS4, Line 636: 
             :           <p>
> I think I'd not mention cfiles here, but say that there is an underlying un

PS4, Line 642:   therefore this column is a good candidate for dictionary encoding. The
             :             <codeph>post_id</codeph> column contains an ascending
sequence of integers, where
             :             several leading bits are likely to be all zeroes, therefore this
column is a good
> nope, not this block.

PS4, Line 648:   they employ the <codeph>COMPRESSION</codeph> attribute instead.
The ideal compression
             :             codec in each case would require some experimentation to determine
how much space
             :             savings it provided and how much CPU overhead it added, based on
real-world data.
             :           </p>
             : <codeblock>
             : CREATE TABLE blog_posts
             : (
             :   user_id STRING ENCODING DICT_ENCODING,
             :   post_id BIGINT ENCODING BIT_SHUFFLE,
             :   subject STRING ENCODING PLAIN_ENCODING,
             :   body STRING COMPRESSION LZ4,
             :   spanish_translation STRING COMPRESSION SNAPPY,
             :   esperanto_translation STRING COMPRESSION ZLIB,
             :   PRIMARY KEY (user_id, post_id)
             : ) PARTITION
> I think it's probably better to just defer the user to the Kudu docs for bl
Where's the right page to link to? I find most references to block size on client API reference
pages like

which don't have much usage advice.

PS4, Line 690: 

PS4, Line 691: LE performance_for_benchmark_xyz
             : (
             :   id BIGINT PRIMARY KEY,
> Why not putting the formal syntax instead of trying to describe the differe
I'm trying to outline the concepts behind Kudu partitioning before sending readers off to
the CREATE TABLE or ALTER TABLE pages for the full syntax.
File docs/topics/impala_refresh.xml:

Line 337:     <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
> AFAIK no
I'm going to reuse the same discussion of when to run REFRESH or INVALIDATE METADATA, and
why they're needed less frequently, as I put under INVALIDATE METADATA.
File docs/topics/impala_show.xml:

Line 1240:         partition covers the same range as originally specified.
> is this example supposed to be here?
This is the spot for examples of this statement for tables of all types. I'll clarify with
some intro text, and move the earlier Kudu example down here too.
File docs/topics/impala_tables.xml:

PS4, Line 293: Tables using the Apache Kudu storage system are treated specially, because
Kudu manages its data independently of HDFS files.
             :         Some information about the table is stored in the metastore database
for use by Impala. Other table metadata is
             :         managed internally by Kudu.
> Where do we talk about external vs managed Kudu tables?
Details to be added. I have been working purely with Impala-created tables so my knowledge
about external tables is mainly from the Kudu training sessions.

PS4, Line 308:         <codeph># Rows</codeph> (although this number is not currently
computed), <codeph>Start Key</codeph>, <codeph>Stop Key</codeph>,
<codeph>Leader Replica</codeph>, and <codeph># Replicas</codeph>.
> worth noting that #rows is inaccurate

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <>
Gerrit-Reviewer: Ambreen Kazi <>
Gerrit-Reviewer: Dimitris Tsirogiannis <>
Gerrit-Reviewer: Jean-Daniel Cryans <>
Gerrit-Reviewer: John Russell <>
Gerrit-Reviewer: Matthew Jacobs <>
Gerrit-Reviewer: Todd Lipcon <>
Gerrit-HasComments: Yes

View raw message