impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Henry Robinson (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
Date Thu, 17 Nov 2016 05:54:35 GMT
Henry Robinson has posted comments on this change.

Change subject: IMPALA-2494: Support for byte array-encoded decimals in Parquet scanner
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5115/1//COMMIT_MSG
Commit Message:

Line 21:  * Tested computing SUM(col) for 1 billion distinct dictionary-encoded
> I'm assuming you are measuring response time. Since there is overall more w
Right, but I'm measuring the absolute difference in response time, not  the relative difference.
So it doesn't totally matter what the rest of the query is doing, as long as it's stable (I
agree that variance can drown out the signal). Thankfully all the runtimes I measured were
remarkably stable. However, your next point is a very good one.


Line 23:  * No performance difference measured by introduction of extra
> I'm assuming you compared response times. With multi-threaded scans the los
Doh, of course you're right, and I wasn't properly measuring the overhead, because of parallelism.

With a single thread, decoding 1B decimals is about 1 second slower, which is about 1ns per
decode. I've updated the commit message (still think that's not a perf difference worth chasing).


http://gerrit.cloudera.org:8080/#/c/5115/1/be/src/exec/parquet-common.h
File be/src/exec/parquet-common.h:

Line 203:   /// assumed to be BYTE_ARRAY, otherwise it is assumed to be FIXED_LEN_BYTE_ARRAY.
> Document return value?
Done


PS1, Line 328: fixed_len_size
> This name seems inaccurate since it can be variable. Maybe 'encoded_size'?
Hm, I think it's accurate - either it's set (> 0) and fixed, or it's not set (-1). 'encoded_size'
wouldn't work well later in the method (because it's set to the size of the buffer, not the
total encoded size). I could of course have another variable to manage that, but this seems
ok to me.


PS1, Line 338: fixed_len_size
> What if fixed_len_size is negative?
Done.


http://gerrit.cloudera.org:8080/#/c/5115/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

Line 150:   # Decimal Parquet encodings written by Java Parquet library.
> If you're going to create the table as part of the test, might as well copy
Done.


-- 
To view, visit http://gerrit.cloudera.org:8080/5115
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: If95171e65aa48f08b08b8e87f4555dc75e867977
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Henry Robinson <henry@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message