impala-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael Ho (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3079: Fix sequence file writer
Date Mon, 13 Mar 2017 20:59:28 GMT
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 4:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
Is this the reason the old sequence file writer created corrupted files ?


PS4, Line 139: reinterpret_cast<const uint8_t*>(KEY_CLASS_NAME));
nit: indent 4. Same below.


Line 191:   record.WriteBytes(output_length, output);
It seems a bit unfortunate that we don't free the temp buffer (i.e. output) after copying
the data to record.

Do you know if there may be other allocations from mem_pool_ which need to persist across
calls to WriteCompressedBlock() ? Otherwise, we should consider calling mem_pool_->FreeAll()
at the caller of this function to avoid accumulating too much memory when writing a lot of
data.


PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key for each row
(this is what
             :   // Hive does).
Does not writing key-lengths block and key block prevent the file from being read back correctly
in Hive ?


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
Would be good to add the details of the Sequence file's layout as top level comment here.
The following link seems to include quite a bit of useful information:

https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/SequenceFile.html


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/io/WritableUtils.html
DCHECK_GE(num_bytes, 2);
DCHECK_LE(num_bytes, 9);


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248:     outp += size;
DCHECK_LE(outp - out_buffer_,  length);


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: SET COMPRESSION_CODEC=SNAPPY_BLOCKED;
May be helpful to also add a test for writing empty file:

select * from tpcds_parquet.store_sales where ss_item_sk=-1;


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170:     # Write block-compressed sequence file and then read it back in Hive
             :     table_name = '%s.seq_tbl_block' % unique_database
             :     self.client.execute("set COMPRESSION_CODEC=DEFAULT")
             :     self.client.execute("set SEQ_COMPRESSION_MODE=BLOCK")
             :     self.client.execute("create table %s like functional.alltypes" % table_name)
             :     self.client.execute("insert into %s partition(year, month) "
             :         "select * from functional.alltypes" % table_name)
             :     output = self.run_stmt_in_hive("select count(*) from %s" % table_name)
             :     assert "7300" == output.split('\n')[1]
Doesn't this duplicate the second test ? May help to test with empty file and file less than
4K and greater than 4K.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-HasComments: Yes

Mime
View raw message