spark-issues mailing list archives

From "kobefeng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-20663) Data missing after insert overwrite table partition which is created on specific location
Date Tue, 09 May 2017 00:48:04 GMT

     [ https://issues.apache.org/jira/browse/SPARK-20663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

kobefeng updated SPARK-20663:
-----------------------------
    Description: 
Use Spark SQL to create a partitioned table, add a partition at a specific location with ALTER TABLE, then INSERT OVERWRITE into that partition from a SELECT. The inserted data goes missing, unlike the same sequence run in Hive.

-- create partition table first
$ hadoop fs -mkdir /user/kofeng/partitioned_table
$ /apache/spark-2.1.0-bin-hadoop2.7/bin/spark-sql
spark-sql> create table kofeng.partitioned_table(
         >     id bigint,
         >     name string,
         >     dt string
         > ) using parquet options ('compression'='snappy', 'path'='hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table')
         > partitioned by (dt);
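
-- not in the original report: a sanity check, assuming the table just created above, that Spark registered it as partitioned and that it starts with no partitions
spark-sql> describe formatted kofeng.partitioned_table;
spark-sql> show partitions kofeng.partitioned_table;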

-- add partition with specific location
spark-sql> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';

$ hadoop fs -ls /user/kofeng/partitioned_table
drwxr-xr-x   - kofeng kofeng          0 2017-05-08 17:00 /user/kofeng/partitioned_table/20170507
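
-- not part of the original reproduction: a hedged check of the location the metastore recorded for the new partition (the "describe ... partition" syntax is assumed to be available in this build; Hive's "describe formatted ... partition" can serve the same purpose)
spark-sql> describe formatted kofeng.partitioned_table partition(dt='20170507');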

-- insert overwrite this partition: the folder at the specific location is gone and the data is missing, yet the job reports success by writing _SUCCESS
spark-sql> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name;

$ hadoop fs -ls /user/kofeng/partitioned_table
-rw-r--r--   3 kofeng hdmi-technology          0 2017-05-08 17:06 /user/kofeng/partitioned_table/_SUCCESS
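
-- not part of the original report: whether the row written by the insert overwrite is readable at all can be checked from SQL; only the check is suggested here, the outcome is not asserted
spark-sql> select * from kofeng.partitioned_table where dt='20170507';
spark-sql> show partitions kofeng.partitioned_table;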

----
----
-- Then drop this partition, use Hive to add the partition and insert overwrite its data, and verify:
spark-sql> alter table kofeng.partitioned_table drop if exists partition(dt='20170507');
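
-- not in the original report: confirming the drop before switching to Hive (the partition should no longer be listed, assuming the same table as above)
spark-sql> show partitions kofeng.partitioned_table;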

hive> alter table kofeng.partitioned_table add if not exists partition(dt='20170507') location '/user/kofeng/partitioned_table/20170507';
OK

-- Hive also drops the specific location, but the data is preserved in the auto-created partition folder

hive> insert overwrite table kofeng.partitioned_table partition(dt='20170507') select 123 as id, "kofeng" as name from kofeng.test;
	Loading data to table kofeng.partitioned_table partition (dt=20170507)
	Moved: 'hdfs://ares-lvs-nn-ha/user/kofeng/partitioned_table/dt=20170507/000000_0' to trash at: hdfs://ares-lvs-nn-ha/user/kofeng/.Trash/Current
	Partition kofeng.partitioned_table{dt=20170507} stats: [numFiles=1, numRows=1, totalSize=338, rawDataSize=2]
	MapReduce Jobs Launched:
	Stage-Stage-1: Map: 2   Cumulative CPU: 10.61 sec   HDFS Read: 9767 HDFS Write: 577 SUCCESS
	Stage-Stage-3: Map: 1   Cumulative CPU: 12.36 sec   HDFS Read: 3635 HDFS Write: 338 SUCCESS

hive> select * from kofeng.partitioned_table;
OK
123	kofeng	20170507

$ hadoop fs -ls /user/kofeng/partitioned_table/dt=20170507
-rwxr-xr-x   3 kofeng hdmi-technology        338 2017-05-08 17:26 /user/kofeng/partitioned_table/dt=20170507/000000_0
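
-- not in the original report: a final cross-check worth running, assuming the session above, to see whether Spark SQL reads the row Hive wrote into the auto-created dt=20170507 folder; the result is not asserted by this report
spark-sql> select id, name, dt from kofeng.partitioned_table where dt='20170507';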

  was:Use Spark SQL to create a partitioned table, add a partition at a specific location with ALTER TABLE, then INSERT OVERWRITE into that partition from a SELECT. The inserted data goes missing, unlike the same sequence run in Hive.


> Data missing after insert overwrite table partition which is created on specific location
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-20663
>                 URL: https://issues.apache.org/jira/browse/SPARK-20663
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.1.1
>            Reporter: kobefeng
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

