impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Russell (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] [DOCS] Major update to Impala + Kudu page
Date Tue, 24 Jan 2017 18:40:03 GMT
John Russell has posted comments on this change.

Change subject: [DOCS] Major update to Impala + Kudu page
......................................................................


Patch Set 4:

(32 comments)

Did all of the recent comments. I see some earlier ones still to do.

http://gerrit.cloudera.org:8080/#/c/5649/6/docs/shared/impala_common.xml
File docs/shared/impala_common.xml:

PS6, Line 3741: 
              : 
              : 
              : 
              : 
> We do need it when the changes happen externally.
Done. Per Dimitris's comment, I'm leaving this text unchanged.


http://gerrit.cloudera.org:8080/#/c/5649/6/docs/topics/impala_kudu.xml
File docs/topics/impala_kudu.xml:

PS6, Line 39: he Apache Kudu component.
> That still sounds weird. I'd switch to what Todd suggested.
Done. It is the nature though of language involving trademarks to always sound weird.


PS6, Line 45: The default Impala tables use data files stored on HDFS, which are ideal for
bulk loads
            :       and queries using full-table scans. In contrast, Kudu can do efficient
queries for data
            :       organized either in data warehouse style (with full table scans) or for
OLTP-style
            :       workloads (with key-based and range-based lookups for single rows or groups
of rows). Kudu
            :       tables are suitable for frequent small additions or changes.
> By default, Impala tables are stored in HDFS using various file formats. HD
Done


PS6, Line 55: work 
> work only
Done


PS6, Line 73: In these scenarios (such as for streaming data), it
            :         might be impractical to use Parquet tables because Parquet works best
with
            :         multi-megabyte data files, requiring substantial overhead to replace
or reorganize data
            :         files to accomodate frequent additions or changes to data. 
> I don't think we should emphasize Parquet here. It is a limitation of the s
Done. I'm going to give a nod to "simplifying the ETL pipeline" in the reworded 2nd paragraph
on this page.


PS6, Line 78: without replacing the entire table contents
> remove. Just say "efficiently".
Done


PS6, Line 79: API
> Maybe mention supported languages (Python, Java, etc).
Done


PS6, Line 138: Data is physically divided automatically by Kudu. You do not deal with explicit
             :               partitions, as in typical large Impala tables. New data that
arrives is organized
             :               based on the data values of each row, not kept together in partitions
that must be
             :               created and managed individually.
> I don't agree with this description. You have to decide for each table the 
Let me reword to say you get a combination of control and flexibility, since you can still
make as many narrow range partitions as you like, or specify wide range partitions or hashing
on top of range partitions.


PS6, Line 147: Data is logically divided, and work is parallelized, based on units called
             :               <term>tablets</term> and <term>tablet servers</term>.
> This is pretty vague. You need to make the distinction between tablets and 
I'll elaborate a little more, than link to the Kudu docs for full definitions.


PS6, Line 169: 
> How about DROP TABLE?
I'm primarily covering new syntax here. Why don't I say Impala DDL "Enhancements" in the title
since DROP TABLE is the same syntax as always.


PS6, Line 181: TABLE</codep
> incomplete sentence
Done


PS6, Line 184: familiarize yourself with Kudu-related concepts and syntax first.
             :       </p>
> incomplete sentence
Done. A sentence got inserted in the middle of another sentence, so the 2 incomplete ones
are part of the same original sentence.


PS6, Line 214: y ones 
> What does "arrange" mean? If you refer to mapping of rows to tablets say so
Done. I'll say it maps the rows to tablets.


PS6, Line 215: clauses and are highly selective.
             :             </p>
> That is not necessarily true.
Done. I'll take out the mention of WHERE clauses. I think by definition the combination of
primary key columns is highly selective because of the uniqueness aspect. So having a repetitive
column as part of the primary key would be wasteful and therefore rare.


PS6, Line 234: 
> You mean the uniqueness and nullability constraints? These are indeed enfor
Done. OK, reworded as "these constraints".


PS6, Line 266: 
> constant expression
Done


PS6, Line 490: 
> colloquial phrasing, how about among rel. db mgmt systems
Done. I'll leave out the word "relational" because I always worry about people getting the
wrong idea w.r.t. transactions, foreign keys, etc.


PS6, Line 561: 
             :         <title>COMPRESSION Attribute</title>
             : 
             :         <conbody>
> ?
That's a note to myself in an XML comment, it doesn't appear in the output. If I specify a
bogus keyword with the ENCODING clause, the Impala error message tells me that it's expecting
keywords including UNKNOWN and GROUP_VARINT, but in experiments I couldn't get Impala to accept
a CREATE TABLE statement with ENCODING UNKNOWN or ENCODING GROUP_VARINT. Maybe there are some
obsolete keywords left in our parser from earlier revisions of Kudu?


PS6, Line 677: 
> internally
Done.


PS6, Line 714: s.
> remove
Done


Line 740:   PARTITION BY HASH(id) PARTITIONS 50
> missing space
Done


PS6, Line 755: 
> there is no default
Done


PS6, Line 760: 
> multiple
Done


PS6, Line 778: 
             : <codeblock><![CDATA[
> I don't think this is a limitation
Ah right, I think it depends on the number of tablet servers and the cluster where I tried
it happened to state an upper limit of 60 in the error message. But a bigger cluster would
have stated a higher limit.  I'll reword the upper limit without trying to be too specific.


PS6, Line 856: p>
             : 
> I see what this is saying but I think this sentence will be  confusing. It 
Done. I'll clarify this particular sentence on this pass. Perhaps go into more detail about
the gap aspect in a subsequent iteration.


PS6, Line 903: <codeph>CREATE TABLE</codeph> syntax displayed by this statement
includes all the
             :             hash, range, or both clauses that reflect the original table structure
pl
> this makes it sound like you can't drop a range unless it's empty which is 
Done


PS6, Line 993:   <concept id="kudu_etl">
             : 
> one of these says:
Already discussed in impala_common.xml. I'll leave as-is.


PS6, Line 1099: e. For example, you cannot do a sequence of
              :         <codeph>UPDATE</codeph> statements and only make the change
visible after all the
              :         statements are finished. Also, if a DML statement fails partway through,
any rows that
              :         were already inserted, deleted, or changed remain in the table; there
is no rollback
              :         mechanism to undo the changes.
              :    
> looks like something is missing
I'm going to use some discretion about which assertions to back up and illustrate with examples.
I'll remove most of the empty <codeblock> elements on this page for now, and fill in
additional examples in a subsequent iteration.


PS6, Line 1207: 
              :           <p>
              :             Sentry authorization.
              :           </p>
              :         </li>
              : 
> empty code block?
This one in particular I'll delete for now rather than fill in. Because the success messages
for DELETE / UPDATE / UPSERT don't report number of rows affected.


PS6, Line 1250: y using a clause such
> list the limitations?
OK, I'll reuse the same verbiage as under the GRANT statement. (There's a #include-like mechanism
in this XML dialect so it'll show up as a 1-liner here, referencing an ID with the main text
in impala_common.xml.)


PS6, Line 1308: 
              : 
              : 
> I don't see any other content
I've been toning down the compare-and-contrast with Parquet in other areas. I don't have enough
concrete info right now to make a compelling section. I'm going to suppress this topic in
the output by applying audience="hidden" in the <concept> tag.


PS6, Line 1323: 
              : 
              : 
> not sure if this section is useful as-is
Exactly, I'll hide but leave the skeleton here in case it makes sense to revisit later.


-- 
To view, visit http://gerrit.cloudera.org:8080/5649
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I76dcb948dab08532fe41326b22ef78d73282db2c
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Ambreen Kazi <ambreen.kazi@cloudera.com>
Gerrit-Reviewer: Dimitris Tsirogiannis <dtsirogiannis@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jdcryans@apache.org>
Gerrit-Reviewer: John Russell <jrussell@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <todd@apache.org>
Gerrit-HasComments: Yes

Mime
View raw message