spark-issues mailing list archives

From "Cheng Lian (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-6644) After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
Date Wed, 01 Apr 2015 12:20:55 GMT

     [ https://issues.apache.org/jira/browse/SPARK-6644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian updated SPARK-6644:
------------------------------
    Description: 
In Hive, the schema of a partition may differ from the table schema. For example, columns may
be added to the table after existing partitions have been imported. Querying a partition whose
schema differs from the table schema with {{spark-sql}} can then cause problems. Some of these
were fixed in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after new column(s)
are added to the table, inserting data into an old partition still leaves the values of the
newly added columns as {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition ")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition (key INT, value STRING) PARTITIONED BY (ds STRING) LOCATION '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS (key1 STRING)")
sql("ALTER TABLE table_with_partition ADD COLUMNS (destlng DOUBLE)") 
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)	 
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}
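
To make the two result sets above easier to compare, here is a minimal, self-contained sketch of how a stale partition schema could produce the {{NULL}} columns. It is only an illustration under assumptions: it assumes that partition {{ds = '1'}} still records the two columns it was created with, and every name in it ({{PartitionSchemaSketch}}, {{readRow}}, {{stalePartitionColumns}}) is hypothetical. It is not Spark's or Hive's actual code path.

```scala
// Hypothetical model of a schema mismatch; none of these names come from Spark or Hive.
object PartitionSchemaSketch {
  // Data columns of the table after both ALTER TABLE ... ADD COLUMNS statements.
  val tableColumns: Seq[String] = Seq("key", "value", "key1", "destlng")

  // Assumption: the old partition still lists only its original columns.
  val stalePartitionColumns: Seq[String] = Seq("key", "value")

  // Builds an output row in table-column order. Columns that are not in
  // `readColumns` are padded with null; the partition value is appended last.
  def readRow(stored: Map[String, Any],
              readColumns: Seq[String],
              partitionValue: String): Seq[Any] = {
    val data = tableColumns.map { c =>
      if (readColumns.contains(c)) stored.getOrElse(c, null) else null
    }
    data :+ partitionValue
  }

  // Formats a row the same way the results above are printed.
  def format(row: Seq[Any]): String = row.mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    // The first row written by the second INSERT OVERWRITE in the reproduction.
    val stored = Map[String, Any](
      "key" -> 1, "value" -> "1", "key1" -> "test", "destlng" -> 1.11)

    // Reading with the stale partition columns drops the new values.
    println(format(readRow(stored, stalePartitionColumns, "1"))) // [1,1,null,null,1]

    // Reading with the table columns returns them.
    println(format(readRow(stored, tableColumns, "1")))          // [1,1,test,1.11,1]
  }
}
```

In this model the inserted values are present in the stored row, and only the set of columns used to interpret the row differs between the actual and the expected output.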

  was:
In Hive, the schema of a partition may differ from the table schema. For example, columns may
be added to the table after existing partitions have been imported. Querying a partition whose
schema differs from the table schema with {{spark-sql}} can then cause problems. Some of these
were fixed in [PR #4289|https://github.com/apache/spark/pull/4289]. However, after new column(s)
are added to the table, inserting data into an old partition still leaves the values of the
newly added columns as {{NULL}}.

The following snippet can be used to reproduce this issue:
{code}
case class TestData(key: Int, value: String)

val testData = TestHive.sparkContext.parallelize((1 to 2).map(i => TestData(i, i.toString))).toDF()
testData.registerTempTable("testData")

sql("DROP TABLE IF EXISTS table_with_partition ")
sql(s"CREATE TABLE IF NOT EXISTS table_with_partition(key int, value string) PARTITIONED by (ds string) location '${tmpDir.toURI.toString}'")
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value FROM testData")

// Add new columns to the table
sql("ALTER TABLE table_with_partition ADD COLUMNS(key1 string)")
sql("ALTER TABLE table_with_partition ADD COLUMNS(destlng double)") 
sql("INSERT OVERWRITE TABLE table_with_partition PARTITION (ds = '1') SELECT key, value, 'test', 1.11 FROM testData")

sql("SELECT * FROM table_with_partition WHERE ds = '1'").collect().foreach(println)	 
{code}
Actual result:
{noformat}
[1,1,null,null,1]
[2,2,null,null,1]
{noformat}
Expected result:
{noformat}
[1,1,test,1.11,1]
[2,2,test,1.11,1]
{noformat}


> After adding new columns to a partitioned table and inserting data to an old partition, data of newly added columns are all NULL
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6644
>                 URL: https://issues.apache.org/jira/browse/SPARK-6644
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: dongxu



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
