hive-dev mailing list archives

From "Sergio Pena" <sergio.p...@cloudera.com>
Subject Re: Review Request 32499: HIVE-10086: Hive throws error when accessing Parquet file schema using field name match
Date Thu, 26 Mar 2015 20:21:45 GMT


> On March 26, 2015, 6:17 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java, line 214
> > <https://reviews.apache.org/r/32499/diff/1/?file=906071#file906071line214>
> >
> >     This gets the columns without changing the order, and the selected columns are
> >     the first N where N is the size of the list of names. So the only effect of this line is to
> >     shorten the schema to just what is defined in the table? In that case is it necessary to do
> >     this or can we just pass the table schema to the projection call later? Assuming the projected
> >     ids are always `< columnNamesList.size()` then it should do the same thing.

My understanding of the parquet.column.index.access variable is that it is used when the table
column names do not match the Parquet file schema. In that case, users enable index access to
read each column by its position instead of its name. This is different from the ordering issue.

See the parquet_columnar.q test for how it is used.
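
To make the distinction concrete, here is a minimal standalone sketch of the two lookup modes. It is only an illustration, not the code in DataWritableReadSupport: the class and method names are made up, and it uses plain Java lists rather than the Parquet or Hive schema classes. Name matching tolerates a different column order and returns null for a table column that is missing from the file, while index access ignores names entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Hive or Parquet.
class ColumnResolutionSketch {

    // Name matching: each table column is looked up by name in the file
    // schema. Column order may differ; an unmatched column yields null.
    static List<Object> resolveByName(List<String> tableColumns,
                                      List<String> fileColumns,
                                      List<Object> fileRow) {
        List<Object> out = new ArrayList<>();
        for (String name : tableColumns) {
            int idx = fileColumns.indexOf(name);
            out.add(idx >= 0 ? fileRow.get(idx) : null);
        }
        return out;
    }

    // Index access: the i-th table column reads the i-th file column,
    // regardless of what either column is called.
    static List<Object> resolveByIndex(List<String> tableColumns,
                                       List<Object> fileRow) {
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < tableColumns.size(); i++) {
            out.add(i < fileRow.size() ? fileRow.get(i) : null);
        }
        return out;
    }
}
```

For example, with file columns (id, name, price), a table declared as (price, id, missing) gets price and id by name plus a null for the unmatched column, whereas a table declared as (a, b) with index access simply reads the first two file columns.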


> On March 26, 2015, 6:17 p.m., Ryan Blue wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java, line 222
> > <https://reviews.apache.org/r/32499/diff/1/?file=906071#file906071line222>
> >
> >     This isn't a blocker, but I find it odd that the "HIVE_TABLE_SCHEMA" isn't a
> >     Hive schema. It's a Parquet schema. It might be too late to rename the constant's value, but
> >     renaming the variable might help readability.

Thanks. I renamed the variable, since it is used only when checking the table schema in DataWritableRecordConverter.java.


- Sergio


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/32499/#review77918
-----------------------------------------------------------


On March 25, 2015, 10:42 p.m., Sergio Pena wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/32499/
> -----------------------------------------------------------
> 
> (Updated March 25, 2015, 10:42 p.m.)
> 
> 
> Review request for hive.
> 
> 
> Bugs: HIVE-10086
>     https://issues.apache.org/jira/browse/HIVE-10086
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> Attached is the patch that handles schemas that do not match between Parquet and Hive.
> 
> In this case, Parquet data is accessed by name matching. The table columns may be in a
> different order than in the file schema, but if a column name matches the Parquet column
> name, then the value is retrieved.
> 
> Also, if the Hive schema has columns or struct elements that do not match the Parquet
> schema, NULL values are returned for them instead.
> 
> 
> Diffs
> -----
> 
>   ql/src/java/org/apache/hadoop/hive/ql/io/parquet/read/DataWritableReadSupport.java 57ae7a9740d55b407cadfc8bc030593b29f90700 
>   ql/src/test/queries/clientpositive/parquet_schema_evolution.q PRE-CREATION 
>   ql/src/test/queries/clientpositive/parquet_table_with_subschema.q PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_schema_evolution.q.out PRE-CREATION 
>   ql/src/test/results/clientpositive/parquet_table_with_subschema.q.out PRE-CREATION

> 
> Diff: https://reviews.apache.org/r/32499/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Sergio Pena
> 
>

