impala-reviews mailing list archives

From "Attila Jeges (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-3079: Fix sequence file writer
Date Wed, 22 Mar 2017 18:33:27 GMT
Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 5:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179: 
             : 
> Is this the reason the old sequence file writer created corrupted files ?
The old sequence file writer had multiple issues:

1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong() were broken. As a result,
Impala could not read back uncompressed sequence files created by Impala (see be/src/exec/read-write-util.h).

2. KEY_CLASS_NAME was missing from the sequence file header. As a result, Hive could not read
back uncompressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

3. Keys were missing from record-compressed sequence files. Hive could not read back record-compressed
sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

4. In some cases the wrong record-compression flag was written to the sequence file header.
As a result, Hive could not read back record-compressed sequence files created by Impala (see
be/src/exec/hdfs-sequence-table-writer.cc).

5. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the beginning of blocks in block-compressed
sequence files. As a result, Hive could not read these files back.

6. Impala created block-compressed files with:
- empty key-lengths block (L176)
- empty keys block (L177)
- empty value-lengths block (L180)

This resulted in invalid block-compressed sequence files that Hive could not read back (see
HdfsSequenceTableWriter::WriteCompressedBlock()).
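
To make issue 1 concrete: Hadoop's 'WritableUtils.writeVLong()' defines the variable-length
encoding that ReadWriteUtil::PutVLong() has to reproduce. A minimal sketch of that encoding
follows (the function name and signature here are illustrative, not Impala's actual
read-write-util API):

```cpp
#include <cstdint>
#include <vector>

// Sketch of Hadoop's WritableUtils.writeVLong() encoding. Values in
// [-112, 127] fit in a single byte; otherwise the first byte encodes the
// sign and the number of payload bytes, followed by the value big-endian.
void PutVLong(int64_t i, std::vector<uint8_t>* out) {
  if (i >= -112 && i <= 127) {
    out->push_back(static_cast<uint8_t>(i));
    return;
  }
  int len = -112;
  uint64_t u = static_cast<uint64_t>(i);
  if (i < 0) {
    u = ~u;       // equivalent to i ^= -1 in the Java implementation
    len = -120;
  }
  for (uint64_t tmp = u; tmp != 0; tmp >>= 8) --len;
  out->push_back(static_cast<uint8_t>(len));  // sign/length prefix byte
  int num_bytes = (len < -120) ? -(len + 120) : -(len + 112);
  for (int idx = num_bytes; idx != 0; --idx) {
    int shift = (idx - 1) * 8;
    out->push_back(static_cast<uint8_t>((u >> shift) & 0xFF));
  }
}
```

For example, 127 encodes as the single byte 0x7F, while 128 needs a prefix byte (0x8F)
followed by the payload byte 0x80.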


PS4, Line 139: _cast<const uint8_t*>(KEY_CLASS_NAME));
> nit: indent 4. Same below.
Done


Line 191:   record.WriteBytes(output_length, output);
> It seems a bit unfortunate that we don't free the temp buffer (i.e. output)
Added FreeAll to the end of the 'Flush()' function.


PS4, Line 193:   // Output compressed keys block-size & compressed keys block.
             :   // The keys block contains "\0\0\0\0" byte sequence as a key for each row
             :   // (this is what Hive does).
> Does not writing key-lengths block and key block prevent the file from bein
Yes, Hive failed with an exception when I tried that.
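
For illustration only (this is not the patch's actual code), the pre-compression key buffers
described above could be built like this, assuming one "\0\0\0\0" key per row with a one-byte
VInt key length of 4 per row:

```cpp
#include <string>

// Builds the (pre-compression) key-lengths and keys blocks for a
// block-compressed sequence file: one dummy 4-byte "\0\0\0\0" key per row,
// matching what Hive writes and expects to read back.
void BuildKeyBlocks(int num_rows, std::string* key_lengths, std::string* keys) {
  for (int i = 0; i < num_rows; ++i) {
    key_lengths->push_back('\x04');   // VInt encoding of 4 (fits in one byte)
    keys->append("\0\0\0\0", 4);      // dummy key for this row
  }
}
```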


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29: 
> Would be good to add the details of the Sequence file's layout as top level
Done
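
For reference, the header part of that layout can be sketched as follows (a hedged
illustration, not the actual code in hdfs-sequence-table-writer.cc; helper names are made up):

```cpp
#include <string>

// Appends a string in Hadoop 'Text' serialization: VInt length + bytes.
// Assumes s.size() <= 127 so the VInt length fits in a single byte.
void AppendText(const std::string& s, std::string* out) {
  out->push_back(static_cast<char>(s.size()));
  out->append(s);
}

// Sketch of an uncompressed SequenceFile header. Note that the key class
// name must be present (issue 2) and the compression flags must match the
// chosen format (issue 4).
std::string BuildHeader(const std::string& key_class,
                        const std::string& value_class,
                        bool record_compressed, bool block_compressed,
                        const std::string& sync_marker) {
  std::string h;
  h.append("SEQ", 3);                 // magic
  h.push_back(6);                     // SequenceFile version
  AppendText(key_class, &h);          // KEY_CLASS_NAME: must not be omitted
  AppendText(value_class, &h);
  h.push_back((record_compressed || block_compressed) ? 1 : 0);
  h.push_back(block_compressed ? 1 : 0);
  // A codec class name (Text) would follow here if compression is enabled.
  h.append("\0\0\0\0", 4);            // metadata: 0 entries (4-byte int)
  h.append(sync_marker);              // 16-byte sync marker
  return h;
}
```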


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // For more information, see the documentation for 'WritableUtils.writeVLong()' method:
> DCHECK_GE(num_bytes, 2);
Done


PS3, Line 233: nt64_t num_b
> May also want to state that the source of this behavior.
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248:     outp += size;
> DCHECK_LE(outp - out_buffer_,  length);
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: stored as SEQUENCEFILE;
> May be helpful to also add a test for writing empty file:
I tried this and it doesn't write an empty file; it doesn't create a file at all. There is
probably no easy way to force the sequence file writer to create an empty file.


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170:       # Read it back in Impala
             :       output = self.client.execute('select count(*) from %s' % table_name)
             :       assert '16541' == output.get_data()
             :       # Read it back in Hive
             :       output = self.run_stmt_in_hive('select count(*) from %s' % table_name)
             :       assert '16541' == output.split('\n')[1]
             : 
             :   def test_avro_writer(self, vector):
             :     self.run_test_case('QueryTest/avro-wri
> Doesn't this duplicate the second test ? May help to test with empty file a
The 2nd test is for record-compressed sequence files, while the 3rd is for block-compressed
sequence files.
Added tests for files greater than 4K and files less than 4K. I could not figure out how to
create an empty file.


-- 
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Attila Jeges <attilaj@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <marcel@cloudera.com>
Gerrit-Reviewer: Michael Ho <kwho@cloudera.com>
Gerrit-HasComments: Yes
