hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashish Singh" <asi...@cloudera.com>
Subject Re: Review Request 28372: HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file
Date Sun, 07 Dec 2014 01:41:11 GMT


> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> > common/src/java/org/apache/hadoop/hive/conf/HiveConf.java, line 762
> > <https://reviews.apache.org/r/28372/diff/2/?file=777120#file777120line762>
> >
> >     What does this change do?

Serdes in this list must have one or more columns in the create table statement. This change
removes this check for ParquetHiveSerDe.


> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ColInfoFromParquetFile.java,
line 36
> > <https://reviews.apache.org/r/28372/diff/2/?file=777121#file777121line36>
> >
> >     I don't think ColInfoFromParquetFile is a great name for this class. What you've
implemented here is a schema converter, like HiveSchemaConverter but from MessageType to TypeInfo
rather than the other way around. I think this should either go in HiveSchemaConverter or
a new class, like ParquetToHiveSchemaConverter.
> >     
> >     This update would also help clarify the methods exposed. I think that the primary
method this should expose is:
> >       StructType convert(GroupType parquetSchema)
> >       
> >     File reading should be done elsewhere if it is necessary.
> >     
> >     Also, I think this should return a StructType instead of a Pair with Strings.
That would be a lot cleaner and would avoid the custom type string building immediately followed
by parsing those type strings.

Made appropriate changes.


> On Dec. 1, 2014, 8:13 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java, line
107
> > <https://reviews.apache.org/r/28372/diff/2/?file=777122#file777122line107>
> >
> >     Is it possible to avoid setting a file property? There are lots of cases where
a file in the dataset would be removed so this is a brittle method of configuring the table.
Ideally, we would check for the LIST_COLUMN_TYPES property and if we don't find it, conver
the schema of the first file that we find.

Below are the different scenarios possible and how they will be handled.

1. Managed table with parquet.file specified:
1.a. While creating the table ParquetHiveSerDe will use that file to create hive schema. 
1.b. If at a later point of time, say while select data from table, file specified by parquet.file
is deleted, then ParquetHiveSerDe will look for a file in table dir and will try to create
hive schema from that file's parquet metadata.
1.c. If parquet.file is deleted and table dir is empty, there is no way hive schema can be
created. So, an informative exception will be thrown, "Either provide schema for table or
point to parquet file using parquet.file in tblproperties or make sure that table has atleast
one parquet file with required metadata". Note that if a table reaches this state, it can
be fixed either by altering table and pointing parquet.file to a valid file or by copying
a parquet file to table dir. This is valid for all cases.

2. Managed table without parquet.file specified:
2.a. While creating the table ParquetHiveSerDe will use any file in table dir to create hive
schema. If table dir is empty, it will throw an informative exception.

3. External table with parquet.file specified: Same as (1).
4. External table without parquet.file specified and empty table dir: Same as (1.c)
5. External table without parquet.file specified and a parquet file present in table dir:
Same as (2)


- Ashish


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/28372/#review63424
-----------------------------------------------------------


On Nov. 27, 2014, 1:08 a.m., Ashish Singh wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/28372/
> -----------------------------------------------------------
> 
> (Updated Nov. 27, 2014, 1:08 a.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-8950
>     https://issues.apache.org/jira/browse/HIVE-8950
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> HIVE-8950: Add support in ParquetHiveSerde to create table schema from a parquet file
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/conf/HiveConf.java fafd78e63e9b41c9fdb0e017b567dc719d151784

>   metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java 2db2658fbc57fba01c892c9213baef6c498e659b

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ParquetSchemaReader.java PRE-CREATION

>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/convert/ParquetToHiveSchemaConverter.java
PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.java 4effe736fcf9d3715f03eed9885c299a7aa040dd

>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java cd3d349e8bd8785d6cadaf9ed8fa7598f223774a

>   ql/src/test/queries/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_optional_elements_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_required_elements_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_single_field_struct_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema.q PRE-CREATION

>   ql/src/test/queries/clientpositive/parquet_array_of_structs_gen_schema_ext.q PRE-CREATION

>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_avro_array_of_primitives_gen_schema.q PRE-CREATION

>   ql/src/test/queries/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_decimal_gen_schema.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q
PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_multi_field_struct_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_optional_elements_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_required_elements_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_single_field_struct_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema.q.out PRE-CREATION

>   ql/src/test/results/clientpositive/parquet_array_of_structs_gen_schema_ext.q.out PRE-CREATION

>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_groups_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_array_of_unannotated_primitives_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_primitives_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_avro_array_of_single_field_struct_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_decimal_gen_schema.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_primitives_gen_schema.q.out
PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_thrift_array_of_single_field_struct_gen_schema.q.out
PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/28372/diff/
> 
> 
> Testing
> -------
> 
> Tested by adding appropriate qTests.
> 
> 
> Thanks,
> 
> Ashish Singh
> 
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message