impala-reviews mailing list archives

From "Michael Ho (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3079: Fix sequence file writer
Date Mon, 13 Mar 2017 20:59:28 GMT
Michael Ho has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer

Patch Set 4:

File be/src/exec/

PS4, Line 179: 
Is this the reason the old sequence file writer created corrupted files?

PS4, Line 139: reinterpret_cast<const uint8_t*>(KEY_CLASS_NAME));
nit: indent 4. Same below.

Line 191:   record.WriteBytes(output_length, output);
It seems a bit unfortunate that we don't free the temp buffer (i.e. output) after copying
the data to record.

Do you know if there may be other allocations from mem_pool_ which need to persist across
calls to WriteCompressedBlock()? Otherwise, we should consider calling mem_pool_->FreeAll()
at the caller of this function to avoid accumulating too much memory when writing a lot of
data.
PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key for each row
             :   // (this is what Hive does).
Does omitting the key-lengths block and keys block prevent the file from being read back
correctly in Hive?
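For illustration, the key-lengths and keys blocks described in the quoted comment could be
built as sketched below. This is a hypothetical Python sketch, not Impala's writer code:
`build_key_blocks` and `write_vint` are made-up helper names, and zlib merely stands in for
whatever codec COMPRESSION_CODEC selects.

```python
import zlib


def write_vint(value: int) -> bytes:
    """Hadoop variable-length int encoding, restricted to the one-byte
    case (values in [-112, 127]) which is all this sketch needs."""
    assert -112 <= value <= 127
    return bytes([value & 0xFF])


def build_key_blocks(num_rows: int, compress=zlib.compress):
    """Build the compressed key-lengths block and keys block of a
    block-compressed SequenceFile, mimicking Hive's convention of
    writing a dummy 4-byte "\\0\\0\\0\\0" key for every row."""
    key_lengths = write_vint(4) * num_rows   # every key is 4 bytes long
    keys = b"\x00\x00\x00\x00" * num_rows    # the dummy keys themselves
    return compress(key_lengths), compress(keys)
```

A reader (e.g. Hive) expects both blocks to be present even though the keys carry no
information, which is why dropping them could break readback.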
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
It would be good to add the details of the SequenceFile layout as a top-level comment here.
The following link seems to include quite a bit of useful information:
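For reference, the header of a block-compressed SequenceFile (format version 6) can be
sketched as below. This is a hedged illustration of the Hadoop on-disk layout, not Impala's
writer: the `sequence_file_header` / `write_text` helpers and the key/value class names are
assumptions chosen for the example.

```python
import os
import struct


def write_text(s: str) -> bytes:
    """Hadoop Text serialization: vint length followed by UTF-8 bytes.
    Assumes names shorter than 128 bytes so the vint fits in one byte."""
    data = s.encode("utf-8")
    assert len(data) <= 127
    return bytes([len(data)]) + data


def sequence_file_header(codec_class: str) -> bytes:
    """SequenceFile v6 header: SEQ magic + version byte, key/value class
    names, two compression flags, codec class name, metadata entry
    count, and a 16-byte sync marker."""
    sync = os.urandom(16)  # random sync marker, repeated before each block
    return (b"SEQ\x06"
            + write_text("org.apache.hadoop.io.BytesWritable")  # key class (illustrative)
            + write_text("org.apache.hadoop.io.Text")           # value class (illustrative)
            + b"\x01"                      # compressed? true
            + b"\x01"                      # block-compressed? true
            + write_text(codec_class)      # e.g. DefaultCodec
            + struct.pack(">i", 0)         # no metadata entries
            + sync)
```

After the header, each block holds a record count followed by the four compressed
sub-blocks (key lengths, keys, value lengths, values), each preceded by its size.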
File be/src/exec/read-write-util.h:

Line 230: //
DCHECK_GE(num_bytes, 2);
DCHECK_LE(num_bytes, 9);
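For context, the 2..9-byte bound comes from Hadoop's VLong encoding: values in
[-112, 127] fit in a single byte, and everything else takes a marker byte plus 1 to 8
payload bytes. A hypothetical Python sketch of that encoding (mirroring
WritableUtils.writeVLong), with an assert in place of the suggested DCHECKs:

```python
def write_vlong(i: int) -> bytes:
    """Hadoop variable-length long encoding. Single byte for values in
    [-112, 127]; otherwise a marker byte encoding sign and payload
    length, followed by 1..8 big-endian payload bytes (2..9 total)."""
    if -112 <= i <= 127:
        return bytes([i & 0xFF])
    length = -112
    if i < 0:
        i = ~i          # store the complement; sign lives in the marker
        length = -120
    tmp = i
    while tmp != 0:     # count payload bytes needed
        tmp >>= 8
        length -= 1
    out = bytearray([length & 0xFF])
    num_payload = -(length + 120) if length < -120 else -(length + 112)
    for idx in range(num_payload, 0, -1):
        out.append((i >> ((idx - 1) * 8)) & 0xFF)
    assert 2 <= len(out) <= 9  # mirrors DCHECK_GE / DCHECK_LE above
    return bytes(out)
```

Any multi-byte encoding outside 2..9 bytes would indicate a corrupt value, so the DCHECKs
are a cheap sanity check.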
File be/src/util/

Line 248:     outp += size;
DCHECK_LE(outp - out_buffer_, length);
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

It may be helpful to also add a test for writing an empty file:

select * from tpcds_parquet.store_sales where ss_item_sk=-1;
File tests/query_test/

PS3, Line 170:     # Write block-compressed sequence file and then read it back in Hive
             :     table_name = '%s.seq_tbl_block' % unique_database
             :     self.client.execute("set COMPRESSION_CODEC=DEFAULT")
             :     self.client.execute("set SEQ_COMPRESSION_MODE=BLOCK")
             :     self.client.execute("create table %s like functional.alltypes" % table_name)
             :     self.client.execute("insert into %s partition(year, month) "
             :         "select * from functional.alltypes" % table_name)
             :     output = self.run_stmt_in_hive("select count(*) from %s" % table_name)
             :     assert "7300" == output.split('\n')[1]
Doesn't this duplicate the second test? It may help to test with an empty file, a file
smaller than 4KB, and a file larger than 4KB.


Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <>
Gerrit-Reviewer: Attila Jeges <>
Gerrit-Reviewer: Marcel Kornacker <>
Gerrit-Reviewer: Michael Ho <>
Gerrit-HasComments: Yes
