impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input
Date Fri, 28 Oct 2016 16:17:21 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input

Patch Set 4:


Thanks for adding the tests, this is looking pretty good - I think it could be refined and
polished a bit more but no major concerns here.
File be/src/exec/

Line 268:   // Pass the row batch to the writer. If new_file is returned true then the current
I find this comment a little confusing. Maybe explain more the "why", i.e. that the rows may
be written to multiple files.

Line 272:   do {
I feel like this might be clearer if you write it as while(true) and then break out if 'new_file'
is false.

Line 288:   // TODO: Should we adapt hdfs writer to accept start,num pair instead of a vector?
My guess is that it would only help a little bit. It's nice to use the same code path for
this and dynamic partitioning at least for now.

Line 296:     PartitionPair* next_partition_pair = NULL;
It would be more efficient to keep track of the previous key and comparing directly instead
of looking up the partition hash table for every row. It's a little more complicated but I
think would help perf.

PS4, Line 301: Flush
I think it might be clearer to just say "Write rows" instead of "flush". Flush makes me think
it's writing to the filesystem, but it's really just pushing the rows into the HdfsTableWriter,
which does its own buffering.

Line 301:         // Flush previous partition
I think the comments in here are a little too low level. Here maybe a single comment along
the lines of "Done with previous partition - write rows and close." might be more helpful.

Line 310:     // Collect row index
This comment isn't too helpful.

PS4, Line 313: Flush
"Write rows to" is probably clearer than "flush".

PS4, Line 607: WriteRowsOfPartition
File be/src/exec/hdfs-table-sink.h:

PS4, Line 183: rows
Maybe reword to be clearer, took me a while to figure out:
"a vector of indices of the rows in the current batch to insert into the partition"

Line 209:   /// Writes all rows of a partition to the partitions writer and clears the row
This could be clearer that it's only writing the rows with indices in 'partition_pair', it
kind-of sounds like it's writing all the rows in 'batch' into the partition.

PS4, Line 209: partitions
partition's (missing apostrophe).

PS4, Line 215:  is expected to
This could be more declarative - i.e. "must" instead of "is expected to"
File tests/query_test/

PS4, Line 71: text
parquet, right?
File tests/query_test/

Line 476: 
Hmm, I think we could also do with a test that writes some large partitions split across files.
E.g. do a CTAS of lineitem partitioned by some low-NDV column. We could also maybe adjust
PARQUET_FILE_SIZE to force it to split it into more files.

Depending on how long it takes, might make sense to only run it in exhaustive.

Line 477: 
Extra blank line

Line 501:       assert len(files) == 1
Comment why we only expect one file - it might not be that obvious to people looking at this
code later.

To view, visit
To unsubscribe, visit

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker <>
Gerrit-Reviewer: Lars Volker <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-HasComments: Yes

View raw message