hadoop-hive-dev mailing list archives

From "Ashish Thusoo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-718) Load data inpath into a new partition without overwrite does not move the file
Date Fri, 11 Sep 2009 09:05:57 GMT

    [ https://issues.apache.org/jira/browse/HIVE-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754058#action_12754058
] 

Ashish Thusoo commented on HIVE-718:
------------------------------------

Apologies for not following this earlier. It caught my attention when Todd asked whether we
should get this into the 0.4.0 release, since it would be a regression compared to 0.3.0. I
checked the code in 0.3.0 and it seems to be the same as that in 0.4.0, so I am not sure this
is a regression. If it is not a regression, then potentially we can go out with 0.4.0 without
a fix and document the behavior?

As is evident from this discussion, LOAD INTO and its cousin INSERT INTO (when we have it) are
very tricky. Almost all our code has been written with overwrite semantics. Appending
new data to an existing partition would need more work to get right, and I feel we should punt
on it and document that insert into is not reliable - I think it has never been reliable.

To implement the INSERT INTO and LOAD INTO semantics safely, one approach is to introduce
a notion of versions on the DML commands, encoded in the directory structure, i.e.

instead of storing things as

xyz/part-0000

we store the files as

xyz/v1/part-0000

and so on. We store the latest created version in the metastore entry for that table.
When a reader comes in, it first looks at this entry and then reads the corresponding version
directory in the table. Old versions could be garbage collected by deleting version
directories that are older than some configurable duration, and this could be done either
lazily by the writer on the table or by an active garbage collector in the background.
These are of course somewhat involved changes, but they would solve the isolation and atomicity
problems. The latter because v1 is a directory, so moving data to that directory would be a
rename and hence atomic. Thoughts?
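
To make the proposal concrete, here is a rough Python sketch of the versioned layout. It is
not Hive code: the local filesystem stands in for HDFS, a plain dict stands in for the
metastore, and every function name is made up for illustration. It also assumes that an
append publishes a new version holding the old files plus the new ones, which the proposal
above does not spell out.

```python
import os
import shutil
import tempfile
import time


def current_version_dir(table_dir, metastore, table):
    """Return the directory a reader should scan, or None if nothing is published."""
    version = metastore.get(table)
    return None if version is None else os.path.join(table_dir, f"v{version}")


def append_files(table_dir, metastore, table, new_files):
    """Publish a new version: old files plus new_files (a dict of name -> bytes)."""
    old_dir = current_version_dir(table_dir, metastore, table)
    next_version = metastore.get(table, 0) + 1
    # Build the new version in a staging directory that readers never look at.
    staging = tempfile.mkdtemp(prefix="_staging-", dir=table_dir)
    if old_dir is not None:
        for name in os.listdir(old_dir):
            shutil.copy2(os.path.join(old_dir, name), os.path.join(staging, name))
    for name, data in new_files.items():
        with open(os.path.join(staging, name), "wb") as f:
            f.write(data)
    # One directory rename makes the whole version appear at once.
    os.rename(staging, os.path.join(table_dir, f"v{next_version}"))
    # Only now point readers at the new version.
    metastore[table] = next_version
    return next_version


def read_table(table_dir, metastore, table):
    """Read every file of the version currently recorded in the metastore."""
    version_dir = current_version_dir(table_dir, metastore, table)
    if version_dir is None:
        return {}
    contents = {}
    for name in sorted(os.listdir(version_dir)):
        with open(os.path.join(version_dir, name), "rb") as f:
            contents[name] = f.read()
    return contents


def collect_garbage(table_dir, metastore, table, max_age_seconds, now=None):
    """Delete non-current version directories older than max_age_seconds."""
    now = time.time() if now is None else now
    current = metastore.get(table)
    removed = []
    for name in sorted(os.listdir(table_dir)):
        if not (name.startswith("v") and name[1:].isdigit()):
            continue  # skip staging directories and anything else
        version = int(name[1:])
        path = os.path.join(table_dir, name)
        if version != current and now - os.path.getmtime(path) > max_age_seconds:
            shutil.rmtree(path)
            removed.append(version)
    return removed
```

A reader that resolved v1 before a writer published v2 keeps reading v1 undisturbed until the
garbage collector removes it, which is why the retention duration has to exceed the longest
expected read. Copying the old files forward is the simplest way to get append semantics in
this sketch; a real implementation would likely want something cheaper.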


> Load data inpath into a new partition without overwrite does not move the file
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-718
>                 URL: https://issues.apache.org/jira/browse/HIVE-718
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>         Attachments: HIVE-718.1.patch, HIVE-718.2.patch, hive-718.txt
>
>
> The bug can be reproduced as follows. Note that it only happens for partitioned tables. The select after the first load returns nothing, while the second returns the data correctly.
> insert.txt in the current local directory contains 3 lines: "a", "b" and "c".
> {code}
> > create table tmp_insert_test (value string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test;
> > select * from tmp_insert_test;
> a
> b
> c
> > create table tmp_insert_test_p (value string) partitioned by (ds string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds = '2009-08-01';
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds = '2009-08-01';
> a       2009-08-01
> b       2009-08-01
> c       2009-08-01
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

