impala-reviews mailing list archives

From "Dimitris Tsirogiannis (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page
Date Fri, 20 Jan 2017 21:21:21 GMT
Dimitris Tsirogiannis has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................


Patch Set 6:

(15 comments)

Another round of comments. Not all of the previous comments have been addressed yet, so
I'll wait for a new patch before continuing this review.

http://gerrit.cloudera.org:8080/#/c/5649/6/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS6, Line 39: the Apache Kudu component
That still sounds weird. I'd switch to what Todd suggested.


PS6, Line 45: The default Impala tables use data files stored on HDFS, which are ideal for bulk loads
            :       and queries using full-table scans. In contrast, Kudu can do efficient queries for data
            :       organized either in data warehouse style (with full table scans) or for OLTP-style
            :       workloads (with key-based and range-based lookups for single rows or groups of rows). Kudu
            :       tables are suitable for frequent small additions or changes.
By default, Impala tables are stored in HDFS using various file formats. HDFS files allow
for fast bulk loads (appends) and full-table scans, but do not support in-place modifications
(updates, deletes). Kudu is an alternative storage engine that can be used in Impala and supports
both in-place modifications (for OLTP-style operations) and fast scans (for data-warehouse/analytic
operations).


PS6, Line 55: work 
work only


PS6, Line 73: In these scenarios (such as for streaming data), it
            :         might be impractical to use Parquet tables because Parquet works best with
            :         multi-megabyte data files, requiring substantial overhead to replace or reorganize data
            :         files to accomodate frequent additions or changes to data.
I don't think we should emphasize Parquet here. This is a limitation of the storage engine, not
the file format. You can mention Parquet as an example of a commonly used file format.


PS6, Line 78: without replacing the entire table contents
remove. Just say "efficiently".


PS6, Line 79: API
Maybe mention the supported languages (Python, Java, etc.).


PS6, Line 138: Data is physically divided automatically by Kudu. You do not deal with explicit
             :               partitions, as in typical large Impala tables. New data that arrives is organized
             :               based on the data values of each row, not kept together in partitions that must be
             :               created and managed individually.
I don't agree with this description. For each table, you have to decide on the partitioning scheme
and all its details (number of partitions, actual range partitions, etc.). What you don't control
is the mapping of rows to physical nodes.
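
To make that split concrete, here is a toy Python sketch. It is not Kudu's actual hashing
or placement logic; the function names, the bucket count, and the range bounds are all made up
for illustration, and crc32 is only a stand-in hash. The point is that the scheme chosen by the
table designer alone decides which partition a row falls into, while the physical location of
each partition is a separate matter the designer does not control.

```python
import bisect
import zlib
from typing import Tuple

# Hypothetical scheme chosen by the table designer: 4 hash buckets on `id`,
# plus range partitions on `year` split at 2016 and 2017.
NUM_HASH_BUCKETS = 4
RANGE_SPLITS = [2016, 2017]  # ranges: (< 2016), [2016, 2017), (>= 2017)


def partition_for_row(row_id: int, year: int) -> Tuple[int, int]:
    """Return (hash_bucket, range_index) for a row under the scheme above."""
    # crc32 is a stand-in; the real hash function is not modeled here.
    bucket = zlib.crc32(str(row_id).encode("utf-8")) % NUM_HASH_BUCKETS
    # bisect_right finds which range the year belongs to.
    range_index = bisect.bisect_right(RANGE_SPLITS, year)
    return bucket, range_index


def total_partitions() -> int:
    """Number of partitions implied by the scheme (buckets x ranges)."""
    return NUM_HASH_BUCKETS * (len(RANGE_SPLITS) + 1)
```

Nothing in this sketch says which node holds a given partition, which is exactly the part
that is not under the user's control.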


PS6, Line 147: Data is physically divided, and work is parallelized, based on units called
             :               <term>tablets</term> and <term>tablet servers</term>.
This is pretty vague. You need to make the distinction between tablets and tablet servers
clearer.
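
One way to draw the line, as a toy Python model rather than anything taken from the Kudu code
base (the class names, field names, and round-robin placement are made up for illustration, and
a replication factor of 3 is assumed): a tablet is one partition of a table's rows, while a
tablet server is a process that stores and serves replicas of many tablets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Tablet:
    """One partition of one table's rows (a logical unit of data)."""
    table: str
    partition_id: int


@dataclass
class TabletServer:
    """A process that stores and serves replicas of many tablets."""
    name: str
    replicas: List[Tablet] = field(default_factory=list)


def place_replicas(tablets: List[Tablet],
                   servers: List[TabletServer],
                   replication_factor: int = 3) -> Dict[Tablet, List[str]]:
    """Round-robin placement; a stand-in for the real placement policy."""
    if replication_factor > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement: Dict[Tablet, List[str]] = {}
    for start, tablet in enumerate(tablets):
        # Pick `replication_factor` distinct servers, starting at a rotating offset.
        chosen = [servers[(start + i) % len(servers)]
                  for i in range(replication_factor)]
        for server in chosen:
            server.replicas.append(tablet)
        placement[tablet] = [s.name for s in chosen]
    return placement


# Hypothetical table with 4 tablets spread over 5 tablet servers.
tablets = [Tablet("metrics", i) for i in range(4)]
servers = [TabletServer(f"ts-{i}") for i in range(5)]
placement = place_replicas(tablets, servers)
```

In this model each tablet appears on several servers and each server holds replicas of several
tablets, so the two terms name different things: the data unit and the process hosting it.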


PS6, Line 169: CREATE TABLE and ALTER TABLE
How about DROP TABLE?


PS6, Line 181: Because Kudu
incomplete sentence


PS6, Line 184: tables have features and properties that do not apply to other kinds of Impala tables,
             :         familiarize yourself with Kudu-related concepts and syntax first.
incomplete sentence


PS6, Line 214: arrange
What does "arrange" mean? If you mean the mapping of rows to tablets, say so; otherwise remove it.


PS6, Line 215: The primary key columns are typically ones that are frequently used in <codeph>WHERE</codeph>
             :               clauses and are highly selective.
That is not necessarily true.


PS6, Line 234: These restrictions
You mean the uniqueness and nullability constraints? These are indeed enforced in Kudu, but
I wouldn't call them restrictions. Allowing PRIMARY KEY and NOT NULL only on Kudu tables
is a restriction enforced by Impala during analysis.


PS6, Line 714: evenly
remove


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.kazi@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcryans@apache.org>
Gerrit-Reviewer: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <todd@apache.org>
Gerrit-HasComments: Yes
