Message-Id: <201703221833.v2MIXRFV015814@ip-10-146-233-104.ec2.internal>
Date: Wed, 22 Mar 2017 18:33:27 +0000
From: "Attila Jeges (Code Review)"
To: impala-cr@cloudera.com, reviews@impala.incubator.apache.org
CC: Marcel Kornacker, Michael Ho
Reply-To: attilaj@cloudera.com
Subject: [Impala-ASF-CR] IMPALA-3079: Fix sequence file writer
X-Gerrit-MessageType: comment
X-Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
X-Gerrit-Commit: 24d588877e5d799f2897c4a9d596da8097af09bc
User-Agent: Gerrit/2.12.7

Attila Jeges has posted comments on this change.

Change subject: IMPALA-3079: Fix sequence file writer
......................................................................


Patch Set 5:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.cc
File be/src/exec/hdfs-sequence-table-writer.cc:

PS4, Line 179:
             :
> Is this the reason the old sequence file writer created corrupted files ?
The old sequence file writer had multiple issues:

1. ReadWriteUtil::VLongRequiredBytes() and ReadWriteUtil::PutVLong() were broken. As a result, Impala could not read back uncompressed sequence files created by Impala (see be/src/exec/read-write-util.h).

2. KEY_CLASS_NAME was missing from the sequence file header. As a result, Hive could not read back uncompressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

3. Keys were missing from record-compressed sequence files. Hive could not read back record-compressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

4. In some cases the wrong record-compression flag was written to the sequence file header. As a result, Hive could not read back record-compressed sequence files created by Impala (see be/src/exec/hdfs-sequence-table-writer.cc).

5. Impala added 'sync_marker' instead of 'neg1_sync_marker' to the beginning of blocks in block-compressed sequence files. Hive could not read these files back.

6. Impala created block-compressed files with:
   - empty key-lengths block (L176)
   - empty keys block (L177)
   - empty value-lengths block (L180)
   This resulted in invalid block-compressed sequence files that Hive could not read back (see HdfsSequenceTableWriter::WriteCompressedBlock()).


PS4, Line 139: _cast(KEY_CLASS_NAME));
> nit: indent 4. Same below.
Done


Line 191: record.WriteBytes(output_length, output);
> It seems a bit unfortunate that we don't free the temp buffer (i.e. output)
Added FreeAll to the end of the 'Flush()' function.


PS4, Line 193: // Output compressed keys block-size & compressed keys block.
             : // The keys block contains "\0\0\0\0" byte sequence as a key for each row (this is what
             : // Hive does).
> Does not writing key-lengths block and key block prevent the file from bein
Yes, Hive failed with an exception when I tried that.


http://gerrit.cloudera.org:8080/#/c/6107/4/be/src/exec/hdfs-sequence-table-writer.h
File be/src/exec/hdfs-sequence-table-writer.h:

Line 29:
> Would be good to add the details of the Sequence file's layout as top level
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/exec/read-write-util.h
File be/src/exec/read-write-util.h:

Line 230: // For more information, see the documentation for 'WritableUtils.writeVLong()' method:
> DCHECK_GE(num_bytes, 2);
Done


PS3, Line 233: nt64_t num_b
> May also want to state that the source of this behavior.
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/be/src/util/compress.cc
File be/src/util/compress.cc:

Line 248: outp += size;
> DCHECK_LE(outp - out_buffer_, length);
Done


http://gerrit.cloudera.org:8080/#/c/6107/3/testdata/workloads/functional-query/queries/QueryTest/seq-writer.test
File testdata/workloads/functional-query/queries/QueryTest/seq-writer.test:

Line 212: stored as SEQUENCEFILE;
> May be helpful to also add a test for writing empty file:
I tried this and it doesn't write an empty file; it doesn't create a file at all. There is probably no easy way to force the sequence file writer to create an empty file.


http://gerrit.cloudera.org:8080/#/c/6107/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

PS3, Line 170: # Read it back in Impala
             : output = self.client.execute('select count(*) from %s' % table_name)
             : assert '16541' == output.get_data()
             : # Read it back in Hive
             : output = self.run_stmt_in_hive('select count(*) from %s' % table_name)
             : assert '16541' == output.split('\n')[1]
             :
             : def test_avro_writer(self, vector):
             : self.run_test_case('QueryTest/avro-wri
> Doesn't this duplicate the second test ? May help to test with empty file a
The 2nd test is for record-compressed sequence files, while the 3rd is for block-compressed sequence files. Added tests for files greater than 4K and less than 4K. I could not figure out how to create an empty file.


--
To view, visit http://gerrit.cloudera.org:8080/6107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0db642ad35132a9a5a6611810a6cafbbe26e7487
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges
Gerrit-Reviewer: Attila Jeges
Gerrit-Reviewer: Marcel Kornacker
Gerrit-Reviewer: Michael Ho
Gerrit-HasComments: Yes
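
Note on issue 1 in the list of writer issues: the variable-length integer format that ReadWriteUtil::PutVLong() and ReadWriteUtil::VLongRequiredBytes() have to match is the one produced by Hadoop's 'WritableUtils.writeVLong()'. The sketch below illustrates that encoding in Python. It is only an illustration of the Hadoop format, not the Impala C++ code from this change; the names vlong_required_bytes, put_vlong and get_vlong are hypothetical and exist only for this example.

```python
def vlong_required_bytes(value: int) -> int:
    """Bytes needed to encode a signed 64-bit value in the writeVLong format."""
    if -112 <= value <= 127:
        return 1  # small values are stored directly in a single byte
    if value < 0:
        value = ~value  # one's complement, so only the magnitude is stored
    # One marker byte plus the minimal number of big-endian payload bytes.
    return 1 + (value.bit_length() + 7) // 8


def put_vlong(value: int) -> bytes:
    """Encode a signed 64-bit value the way writeVLong lays it out."""
    if not -(1 << 63) <= value < (1 << 63):
        raise ValueError("value does not fit in a signed 64-bit integer")
    if -112 <= value <= 127:
        return value.to_bytes(1, "big", signed=True)
    marker = -112  # markers -113..-120: positive value, 1..8 payload bytes
    if value < 0:
        value = ~value
        marker = -120  # markers -121..-128: negative value, 1..8 payload bytes
    payload_len = (value.bit_length() + 7) // 8
    marker -= payload_len
    return marker.to_bytes(1, "big", signed=True) + value.to_bytes(payload_len, "big")


def get_vlong(buf: bytes) -> int:
    """Decode a value produced by put_vlong (used here as a round-trip check)."""
    first = int.from_bytes(buf[:1], "big", signed=True)
    if first >= -112:
        return first
    negative = first < -120
    payload_len = -(first + 120) if negative else -(first + 112)
    value = int.from_bytes(buf[1:1 + payload_len], "big")
    return ~value if negative else value
```

Any value outside the single-byte range takes at least two bytes (marker plus payload), which is consistent with the DCHECK_GE(num_bytes, 2) suggested for read-write-util.h.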