impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tianyi Wang (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics
Date Tue, 12 Sep 2017 20:33:56 GMT
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 1:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG
Commit Message:

Line 9: This patch makes the semantics output_length parameter in
> the semantics of the output_length
Done


Line 10: Codec::ProcessBlock to be the same. In existing code different
> Codec::ProcessBlock() to be the same across all codecs. In the existing cod
Done


Line 13:    to actual decompressed length, but it does not set it to actual
> to the actual decompressed length
Done


Line 16:    be exactly the same as actual decompressed length, otherwise
> as the actual decompressed length
Done


Line 19:    actual decompressed length and will set it to actual decompressed
> the actual decompressed length
Done


Line 21: This inconsistency leads to a bug where the error message is
> This inconsistency lead to a bug where ....
Do you mean changing it to plural?


Line 24: decompressors can handle oversized output buffer correctly.
> can handle an oversized output buffer correctly.
Done


Line 25: Lz4Decompressor will use the "safe" version of decompression function
> will use the "safe" instead of the "fast" decompression function
Done


Line 26: instead of the "fast" version, for the latter is insecure with corrupted
> We use LZ4 for compressing exchange data, so we check the perf impact of th
Is there an individual benchmark?
There is one for compression in be/src/experiments/compression-test.cc.


Line 27: data and requires decompressed length to be known.
> and requires the decompressed length to be known
Done


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 153:   // Test the behavior when the decompressor is given too little / too much space
> "." at end of sentence
Done


Line 155:   // correct output size when the space is enough, and does not write beyond the
output
> the correct output size
Done


Line 157:   void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec *decompressor,
> Out style is "Codec* compressor"
Done


Line 196:     }
> I think we should test more directly the scenario mentioned in the JIRA whe
I'm not sure if my interpretation of JIRA is correct. In my understanding:
1. The parquet file is corrupted so either current_page_header_.uncompressed_page_size is
greater or current_page_header_.compressed_page_size is less than what it should be. 
2. Either way we allocated a larger buffer and the decompressor decompressed the (probably
incomplete) data successfully. 
3. The part of the buffer not written by the decompressor is uninitilized. Because the decompressor
returned OK, somehow we read this uninitilized part later and trigger undeterministic errors.
The problem here is in 2: The decompressor should report how much data it decompressed even
if the decompression is successful. Then ReadDataPage() can check whether it is the same as
current_page_header_.uncompressed_page_size.
So I think we should test whether output_len is set properly with an successful decompression.
I don't understand "set to 0" part. When should it be set to 0?


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len
is
> Mention what output_len is set to if a non-OK status is returned.
Done. It is set to 0 in this case, while other decompressor may leave it unmodified or set
it to an arbitrary intermediate value.
I think the complication here is caused by the design that 'output_len' and 'output' are output
parameters. When a parameter is both input and output, and could be passed as reference further
into callees of callee, there could be all kinds of bugs forgetting to set the correct value
here and there. I think if output and output_len is in the return value, creating such bugs
is much more difficult, and the code will be much easier to follow since the data flow is
clear and explicit.

I think we should have a Status<T> type for return values. It represents either a value
or an error status, and then we can avoid output parameters in new codes.


Line 427: // also updated with actual output size.
> the actual output size
Done


Line 575:   // LZ4_decompress_fast will cause a segmentation fault if passed a nullptr output.
> Comment still relevant?
Removed.


Line 579:   if (ret < 0) {
> single line
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <twang@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <twang@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message