hive-issues mailing list archives

From "Sahil Takiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15199) INSERT INTO data on S3 is replacing the old rows with the new ones
Date Tue, 15 Nov 2016 18:21:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667862#comment-15667862 ]

Sahil Takiar commented on HIVE-15199:
-------------------------------------

The code block Sergio posted looks like it could have some major inefficiencies, even for
HDFS. If my understanding is correct, the code basically tries to rename the data with the
suffix {{... + "_copy_" + counter}}; if the rename fails (because the file already exists), it
increments the counter and tries again. This doesn't sound like a scalable solution: if there
are 1000 files under the directory, any insert has to explicitly check for the existence of
files from {{... + "_copy_0"}} through {{... + "_copy_1000"}}. On HDFS, and especially on S3,
that doesn't seem to be a very efficient approach (it would be good to confirm this behavior).
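
A rough sketch of that pattern, using the Hadoop FileSystem API (the class and method names
here are illustrative, not the actual Hive code):

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopySuffixRename {
  // Probes destination names file, file_copy_1, file_copy_2, ... until one is free.
  // Every exists() probe is a metadata call; on S3 each probe is a remote request,
  // so N existing copies mean N+1 round trips for each file being moved.
  static Path renameWithCounter(FileSystem fs, Path src, Path destDir) throws IOException {
    String name = src.getName();
    Path dest = new Path(destDir, name);
    for (int counter = 1; fs.exists(dest); counter++) {
      dest = new Path(destDir, name + "_copy_" + counter);
    }
    if (!fs.rename(src, dest)) {
      throw new IOException("Failed to rename " + src + " to " + dest);
    }
    return dest;
  }
}
{noformat}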

If the logic above is indeed what happens, there are a few different ways to fix this (a rough
sketch of both options follows the list):

1. Append a UUID to the end of the file name rather than using a counter; since UUIDs are
globally unique, there should be no chance of a conflict.
2. Append the query_id plus a synchronized counter ({{private synchronized long counter}}) to
the file name.
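
A minimal sketch of both options (illustrative only; the helper names are assumptions, and an
{{AtomicLong}} stands in for the synchronized counter mentioned in option 2):

{noformat}
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.fs.Path;

public class UniqueCopyNames {
  private static final AtomicLong COUNTER = new AtomicLong();

  // Option 1: a UUID suffix is globally unique, so no existence probing is needed.
  static Path uuidName(Path destDir, String name) {
    return new Path(destDir, name + "_copy_" + UUID.randomUUID());
  }

  // Option 2: query id plus a process-local counter; the query id distinguishes
  // concurrent inserts, the counter distinguishes files written by the same query.
  static Path queryIdName(Path destDir, String name, String queryId) {
    return new Path(destDir, name + "_" + queryId + "_" + COUNTER.incrementAndGet());
  }
}
{noformat}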

> INSERT INTO data on S3 is replacing the old rows with the new ones
> ------------------------------------------------------------------
>
>                 Key: HIVE-15199
>                 URL: https://issues.apache.org/jira/browse/HIVE-15199
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Sergio Peña
>            Assignee: Sergio Peña
>            Priority: Critical
>
> Any INSERT INTO statement run against an S3 table, when the scratch directory is also on
> S3, deletes the old rows of the table.
> {noformat}
> hive> set hive.blobstore.use.blobstore.as.scratchdir=true;
> hive> create table t1 (id int, name string) location 's3a://spena-bucket/t1';
> hive> insert into table t1 values (1,'name1');
> hive> select * from t1;
> 1       name1
> hive> insert into table t1 values (2,'name2');
> hive> select * from t1;
> 2       name2
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
