spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-14543) SQL/Hive insertInto has unexpected results
Date Mon, 20 Jun 2016 19:59:58 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-14543:
-----------------------------
    Description: 
*Updated description*
There should be an option to match input data to output columns by name. The API allows operations on whole tables, which hides the column resolution problem: it's easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. Matching columns by name is closer to what users expect, so I propose adding something like this:

{code}
CREATE TABLE src (id bigint, count int, total bigint)
CREATE TABLE dst (id bigint, total bigint, count int)

sqlContext.table("src").write.byName.insertInto("dst")
{code}
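
Until an option like this exists, a caller can approximate by-name matching by reordering the input to the destination's column order before the positional insert. A workaround sketch against the Spark 1.x DataFrame API (the {{byName}} call above is only proposed):

{code}
import org.apache.spark.sql.functions.col

// Look up dst's column order and project src into that order,
// so the positional insertInto lines the columns up correctly.
val dstColumns = sqlContext.table("dst").schema.fieldNames
sqlContext.table("src")
  .select(dstColumns.map(col): _*)
  .write.insertInto("dst")
{code}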


*Original description*
The Hive write path adds a pre-insertion cast (projection) to reconcile incoming data columns with the outgoing table schema. Columns are matched by position, and casts are inserted to align the two schemas.

When columns aren't correctly aligned, this causes unexpected results. I ran into it through an incorrect {{partitionBy}} call (addressed by SPARK-14459), which produced an error message that an int could not be cast to an array. However, if the columns are loosely compatible, for example string and float, then no error or warning is produced and the data is written to the wrong columns through unexpected casts (string -> bigint -> float).
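
To make the failure mode concrete, a minimal reproduction sketch (hypothetical src/dst tables, Spark 1.x sqlContext API):

{code}
sqlContext.sql("CREATE TABLE src (id bigint, count int, total bigint)")
sqlContext.sql("CREATE TABLE dst (id bigint, total bigint, count int)")

// insertInto matches columns by position: src.count lines up with
// dst.total and src.total with dst.count, so both values land in the
// wrong columns through silently inserted casts instead of an error.
sqlContext.table("src").write.insertInto("dst")
{code}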

A real-world case that will hit this is when a table definition changes by adding a column in the middle of the table. Spark SQL statements that copy from that table to a destination table will then map the columns differently, but the inserted casts mask the problem. The last column's data will be dropped without any reliable warning for the user.
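
For example, a hypothetical evolution of two such tables:

{code}
// Suppose dst was created as (id bigint, total bigint) and src, which
// originally matched, later gained a count column in the middle:
//   src: (id bigint, count int, total bigint)
// This existing copy still analyzes: dst.total is now fed from
// src.count through an inserted cast, and src.total, the last column,
// is silently dropped.
sqlContext.sql("INSERT INTO TABLE dst SELECT * FROM src")
{code}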

This highlights a few problems:
* Too many or too few incoming data columns should cause an AnalysisException to be thrown
* Only "safe" casts should be inserted automatically, like int -> long, using UpCast
* Pre-insertion casts currently ignore extra columns by using zip (see the sketch after this list)
* The pre-insertion cast logic differs between Hive's MetastoreRelation and LogicalRelation
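
The zip behavior is plain Scala; a standalone sketch (not Spark's actual code) of why extra columns disappear:

{code}
val incoming    = Seq("id", "count", "total", "extra") // 4 incoming columns
val tableSchema = Seq("id", "count", "total")          // 3 table columns

// zip stops at the shorter sequence, so "extra" is silently dropped
// instead of raising an error about the column-count mismatch.
val matched = incoming.zip(tableSchema)
// matched == Seq(("id","id"), ("count","count"), ("total","total"))
{code}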

Also, I think there should be an option to match input data to output columns by name. The API allows operations on whole tables, which hides the column resolution problem: it's easy to copy from one table to another without listing the columns, and in the API it is common to work with columns by name rather than by position. Matching columns by name is closer to what users expect, so I propose adding something like this:

{code}
CREATE TABLE src (id bigint, count int, total bigint)
CREATE TABLE dst (id bigint, total bigint, count int)

sqlContext.table("src").write.byName.insertInto("dst")
{code}


> SQL/Hive insertInto has unexpected results
> ------------------------------------------
>
>                 Key: SPARK-14543
>                 URL: https://issues.apache.org/jira/browse/SPARK-14543
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
