impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dimitris Tsirogiannis (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] Major update to Impala + Kudu page
Date Fri, 13 Jan 2017 23:49:59 GMT
Dimitris Tsirogiannis has posted comments on this change.

Change subject: Major update to Impala + Kudu page
......................................................................


Patch Set 4:

(31 comments)

Initial round of comments.

http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_explain.xml
File docs/topics/impala_explain.xml:

PS4, Line 239: he <codeph>EXPLAIN</codeph> statement displays equivalent plan
             :       information for queries against Kudu tables as for queries
             :       against HDFS-based tables.
Don't we need to talk about the predicates that get pushed to Kudu?


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_invalidate_metadata.xml
File docs/topics/impala_invalidate_metadata.xml:

PS4, Line 245: <!-- Anything to say for INVALIDATE METADATA with Kudu? -->
             :     <!-- <p rev="kudu" conref="../shared/impala_common.xml#common/kudu_blurb"/>
-->
Where do we talk about Kudu table metadata? We need to mention that source of truth for table
metadata (e.g. schema, partitions) is Kudu and not HMS. Invalidate metadata and refresh cause
the table metadata to be reloaded from Kudu and any external schema changes to be reflected
to the catalog.


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS4, Line 149: tablet servers is independent of the number of DataNodes in the cluster.
One question that someone might have is whether tablet servers and data nodes need to be collocated
or not for functional or performance reasons. It would be nice to comment on that here.


PS4, Line 201: For a
             :           partitioned Kudu table,
Remove. There are only partitioned Kudu tables.


PS4, Line 221: For example, if an <codeph>INSERT</codeph> operation fails partway
through, only some of the
             :               new rows might be present in the table. You can re-run the same
<codeph>INSERT</codeph>, and
             :               only the missing rows will be added. Or if data in the table
is stale, you can run an
             :               <codeph>UPSERT</codeph> statement that brings the
data up to date, without the possibility
             :               of creating duplicate copies of existing rows.
Why blending the semantics of upsert and insert with the description of primary keys? I don't
think this section belongs here.


PS4, Line 233: These restrictions on the primary key columns
How about the nullability restrictions on non-primary key columns. Where are they enforced?


PS4, Line 247: <p>
             :           Kudu can do extra optimizations for queries that refer to the primary
key columns in
             :           the <codeph>WHERE</codeph> clause. It is not required
though to include the primary
             :           key columns in the <codeph>WHERE</codeph> clause of every
query.
             :         </p>
I am not sure I understand the purpose of this paragraph here. This either belongs to the
place where you discuss schema design or even better performance tuning.


PS4, Line 269: PRIMARY KEY
             : | [NOT] NULL
             : | ENCODING <varname>codec</varname>
             : | COMPRESSION <varname>algorithm</varname>
             : | DEFAULT <varname>expression</varname>
             : | BLOCK_SIZE <varname>number</varname>
Wouldn't it make more sense to first describe the Kudu-specific CREATE TABLE syntax and then
analyze the individual components of it?


PS4, Line 299: You can specify the <codeph>PRIMARY KEY</codeph> attribute either
inline in a single
             :             column definition, or as a separate clause at the end of the column
list:
Even though this is the place to formally define the syntax for PRIMARY KEYs, you seem to
have more detailed information about PKs in the paragraph in L238. I would remove that paragraph
entirely and move that content here.


PS4, Line 360: <p>
             :             The notion of primary key only applies to Kudu tables. Every Kudu
table requires a
             :             primary key. The primary key consists of one or more columns. You
must specify any
             :             primary key columns first in the column list.
             :           </p>
Move to the beginning or in the general description of primary keys for Kudu tables.


PS4, Line 368: Therefore, only
             :             include a column in the primary key specification if you are certain
of its value at
             :             the time each row is inserted.
Remove.


PS4, Line 383: Because the primary key columns are typically carefully planned and have known
             :             characteristics in terms of range and distribution, those columns
are usually good
             :             candidates for using with the <codeph>PARTITION BY</codeph>
clause of the
             :             <codeph>CREATE TABLE</codeph> statement.
Only primary key columns can be used in the PARTITION BY clause. This is a requirement and
not some nice-to-have property.


PS4, Line 399: <p>
             :             For non-Kudu tables, Impala allows any column to contain <codeph>NULL</codeph>
             :             values, because it is not practical to enforce a <q>not null</q>
constraint on HDFS
             :             data files that could be prepared using external tools and ETL
processes.
             :           </p>
Move below L411.


PS4, Line 407: If a
             :             missing data field is considered to be a serious problem
I would rephrase it to something like "If an application requires a field to be always specified...".


Line 438:             operations.
Maybe we should extend this paragraph and say that for those reasons, it is recommended that
users specify nullability constraints when they are known.


PS4, Line 449: -- This query can return a count of 0 very quickly because
             : -- column C3 by definition cannot contain any null values.
             : SELECT COUNT(*) FROM c3_has_no_nulls WHERE c3 IS NULL;
Do you suggest that this is not executed because we know it doesn't return any results? Have
you verified this? The profile of a query like that doesn't seem to suggest any optimization
like that, but I could be wrong.


PS4, Line 531: By default, each
             :             column uses the <q>plain</q> encoding where the data
is stored unchanged
I would mention the encodings and maybe some guidelines on which one to use for different
data types. Maybe check with Todd on recommendations.


PS4, Line 567: You can specify a compression algorithm to use for each column in a Kudu table.
This
             :             attribute imposes more CPU overhead when retrieving the values
than the
             :             <codeph>ENCODING</codeph> attribute does. Therefore,
use it primarily for columns
             :             that are rarely included in result sets, or rarely used in <codeph>WHERE</codeph>
             :             clauses, or for very long strings that do not benefit much from
the less-expensive
             :             <codeph>ENCODING</codeph> attribute.
Every compression alg has a tradeoff of compression ratio and cpu overhead and choosing the
right comp. algorithm is always tricky and data dependent. Have you discussed these recommendations
with Mostafa?


PS4, Line 690: PARTITIONED
PARTITION


PS4, Line 691: range specification clauses rather than a simple <codeph>PARTITIONED
BY</codeph>
             :           clause that specifies only a column name and creates a new partition
for each
             :           different value.
Why not putting the formal syntax instead of trying to describe the differences between Kudu
and HDFS partitioning?


PS4, Line 697: With Kudu tables, all of the columns involved in these clauses must be primary
key
             :           columns. These clauses let you specify different ways to divide the
data for each
             :           column, or even for different value ranges within a column. This
flexibility lets you
             :           avoid problems with uneven distribution of data, where the partitioning
scheme for
             :           HDFS tables might result in some partitions being much larger than
others. By setting
             :           up an effective partitioning scheme for a Kudu table, you can ensure
that the work for
             :           a query can be parallelized evenly across the hosts in a cluster.
Remove this paragraph. You can use Kudu's white paper (section 3.2) for a better description
of table partitioning.


PS4, Line 706: <note>
             :           <p>
             :             The Impala DDL syntax for Kudu tables is different than in early
Kudu versions,
             :             which used an experimental fork of the Impala code. For example,
the
             :             <codeph>DISTRIBUTE BY</codeph> clause is now <codeph>PARTITION
BY</codeph>, the
             :             <codeph>INTO <varname>n</varname> BUCKETS</codeph>
clause is now
             :             <codeph>PARTITIONS <varname>n</varname></codeph>and
the range partitioning syntax
             :             is reworked to replace the <codeph>SPLIT ROWS</codeph>
clause with more expressive
             :             syntax involving comparison operators.
             :           </p>
             :         </note>
I don't understand the point of this paragraph before even presenting the formal syntax for
partitioning Kudu tables. If you want to mention that the old syntax is not supported add
a note at the end.


PS4, Line 727: simplest
remove


PS4, Line 727: default type
Why "default"? Both hash and range need to be explicitly specified. None of them is implied.


PS4, Line 728: inserted rows are divided up between a fixed number
             :             of <q>buckets</q> in an essentially random way. Applying
the hash function to the
             :             column values ensures that rows with similar values are evenly
distributed, instead of
             :             clumping together all in the same bucket.
Why not just saying that rows are assigned to buckets by applying a hash function on the values
of the columns used to define the hash partitioning?


PS4, Line 731: Spreading new rows across the buckets this
             :             way lets insertion operations work in parallel across all the tablet
servers.
You need to mention the drawback of using hash partitioning as well. i.e. queries with range
predicates may end up scanning multiple tables.


PS4, Line 743: -- No matter how the values are distributed in the source table, the data will
be
             : -- evenly distributed between the buckets in the destination table.
That claim in not universally true. I would just remove it or make it case specific. e.g.
since ids are unique you expect the rows to be evenly distributed across partitions.


PS4, Line 765: for when you know the
             :             expected distribution of values in a column and can make informed
decisions about
             :             how to divide them.
That's not necessarily the primary reason for using range partitioning. You may want to stick
to the formal description and syntax here and add a section about use cases (schema design)
later.


PS4, Line 767: more <codeph>RANGE</codeph> clauses to the
             :             <codeph>CREATE TABLE</codeph> statement, following
the <codeph>PARTITION BY</codeph
Can we add a formal syntax here?


PS4, Line 774: <codeph>VALUE</codeph> or <codeph>VALUES</codeph>
You need to talk when VALUE and VALUES is used (single vs multi-column range) and give proper
examples for both cases.


http://gerrit.cloudera.org:8080/#/c/5649/4/docs/topics/impala_tables.xml
File docs/topics/impala_tables.xml:

PS4, Line 293: Tables using the Apache Kudu storage system are treated specially, because
Kudu manages its data independently of HDFS files.
             :         Some information about the table is stored in the metastore database
for use by Impala. Other table metadata is
             :         managed internally by Kudu.
Where do we talk about external vs managed Kudu tables?


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.kazi@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcryans@apache.org>
Gerrit-Reviewer: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <todd@apache.org>
Gerrit-HasComments: Yes

Mime
View raw message