impala-reviews mailing list archives

From "Alex Behm (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet
Date Tue, 04 Apr 2017 06:22:27 GMT
Alex Behm has posted comments on this change.

Change subject: IMPALA-2716: Hive/Impala incompatibility for timestamp data in Parquet

Patch Set 5:

File fe/src/main/java/org/apache/impala/analysis/

Line 169:   private static void analyzeParquetMrWriteZone(Table table,
These cases need tests in AnalyzeDDL
File fe/src/main/java/org/apache/impala/analysis/

Line 183:     if (getTblProperties().containsKey(HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE))
These cases need tests in AnalyzeDDL
File fe/src/main/java/org/apache/impala/catalog/

Line 54: import org.apache.impala.common.InternalException;
not needed
File tests/custom_cluster/

Line 25: class TestParquetTimestampCompatibility(CustomClusterTestSuite):
Test needs a comment. In particular, what is covered here and what is covered by integration
tests that are not part of the Impala test suite.

Line 26:   TEST_DB_NAME = 'timezone_test_db'
Why not use the unique_database fixture? That's the best practice.

Line 39:     self.client = impalad.service.create_beeswax_client()
I think we already have a self.client from ImpalaTestSuite.

Line 49:         "/test-warehouse/alltypesagg_hive_13_1_parquet/alltypesagg_hive_13_1.parquet"
We need the appropriate filesystem prefix here. This will break on local fs or S3 builds.
Take a look at: get_fs_path()

and how that is used elsewhere
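Sketch of the pattern (hedged: the real `get_fs_path()` lives in Impala's test utilities and reads the build's filesystem prefix from the environment; the stand-in below assumes a `FILESYSTEM_PREFIX` environment variable, empty for plain HDFS, something like `s3a://bucket` for S3 builds):

```python
import os

# Assumed environment-derived prefix; empty string means default HDFS.
FILESYSTEM_PREFIX = os.environ.get("FILESYSTEM_PREFIX", "")

def get_fs_path(path):
    # Prepend the build's filesystem prefix instead of hard-coding a bare
    # /test-warehouse path, so the same test works on HDFS, local fs, and S3.
    return "%s%s" % (FILESYSTEM_PREFIX, path)

src = get_fs_path("/test-warehouse/alltypesagg_hive_13_1_parquet/"
                  "alltypesagg_hive_13_1.parquet")
```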

Line 56:     self.client.execute('USE ' + self.TEST_DB_NAME)
Why? Better to avoid changing the session state

Line 59:     self.client.execute('INVALIDATE METADATA')

Line 61:   def _set_impala_table_timezone(self, timezone):
Simplify by passing the table name as a parameter.
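Something along these lines (a sketch only; the table property name is assumed to be `parquet.mr.int96.write.zone`, matching `HdfsTable.TBL_PROP_PARQUET_MR_WRITE_ZONE` in this patch, and the helper is reduced to a statement builder so the idea stands alone):

```python
def set_table_timezone_sql(table_name, timezone):
    # One parameterized helper instead of near-identical
    # _set_impala_table_timezone / _set_hive_table_timezone variants.
    return ("ALTER TABLE %s SET TBLPROPERTIES"
            "('parquet.mr.int96.write.zone'='%s')" % (table_name, timezone))
```

The test would then call `self.client.execute(set_table_timezone_sql('impala_table', 'EST'))` and the same helper covers every table in the suite.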

Line 71:   @pytest.mark.execute_serially
How long does this test take? I'm thinking we should only have minimal tests for the impalad
startup option in a custom cluster test, probably in the existing test for convert_legacy_hive_parquet_utc_timestamps.

The other tests should go into a regular test (subclass of ImpalaTestSuite) and run in parallel.

Line 81:     statement = '''ALTER TABLE hive_table
Should be an analyzer test in AnalyzeDDL

Line 104:     select_from_impala_table = '''SELECT timestamp_col FROM impala_table WHERE id = 1'''
Can you think of a way to validate all values in the table in bulk, e.g. using our existing
timestamp conversion functions?
Having a few example values is still good, but we should also test that we are internally
consistent with our conversion functions.
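One possible shape for such a bulk check (hedged: `from_utc_timestamp()` is an existing Impala builtin, but `hive_table_utc` is a hypothetical helper table holding the raw UTC values; the point is only that a single count query can cover every row):

```python
# A zero result means the per-file conversion agrees with Impala's own
# from_utc_timestamp() builtin for every row, not just the sampled ids.
BULK_CHECK = (
    "SELECT count(*) FROM hive_table t "
    "JOIN hive_table_utc u ON t.id = u.id "
    "WHERE t.timestamp_col != from_utc_timestamp(u.timestamp_col, 'EST')"
)
```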

Line 113:     self._set_impala_table_timezone('EST')
Add some comments about the behavior here. We get different values for the same timezone
because the underlying Parquet files were written by Hive vs. Impala.

Line 141:     statement = '''ALTER TABLE hive_table
An analyzer test is more appropriate (also fix the other occurrences).

Line 185:   def test_ddl(self, vector):
We need a test where Hive sets a garbage timezone and Impala throws an error when analyzing
a query against that table.

We also need a test that shows what happens when a table property timezone is set and the
convert_legacy_hive_parquet_utc_timestamps option is used. Probably good to consolidate with
the existing custom cluster test.


Gerrit-MessageType: comment
Gerrit-Change-Id: I3f24525ef45a2814f476bdee76655b30081079d6
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Attila Jeges <>
Gerrit-Reviewer: Alex Behm <>
Gerrit-Reviewer: Attila Jeges <>
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Taras Bobrovytsky <>
Gerrit-Reviewer: Zoltan Ivanfi <>
Gerrit-HasComments: Yes
