hadoop-hive-dev mailing list archives

From "Todd Lipcon (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-718) Load data inpath into a new partition without overwrite does not move the file
Date Tue, 04 Aug 2009 20:38:15 GMT

    [ https://issues.apache.org/jira/browse/HIVE-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739146#action_12739146 ]

Todd Lipcon commented on HIVE-718:
----------------------------------

In looking through this code, I've found a few more issues:

- In isolation, it looks like copyFiles/replaceFiles are supposed to be able to handle a srcf
like "/foo/*" with a directory layout like:

/foo/subdir1/part-00000
/foo/subdir2/part-00000

I'm assuming this because it first does fs.globStatus on srcf, and then for each of the results
of the glob, it calls fs.listStatus (implying that they are directories).

However, given the example above, this would actually fail, since both files are named part-00000
and the code would attempt to rename both to tmpdir/part-00000.

- In fact, using the tmpdir like this is consistent from the view of an outside observer,
but not atomic. If the renamer crashes in the middle of the operation, the files will have
been moved out of the original location and into the tmpdir, but the tmpdir has not been renamed
into the destination. Is this OK? I feel like the solution would be to make dstdir/_staging_<timestamp>,
move the files one-by-one into there, and then rename _staging_<timestamp> to the destination.
This way if there is a failure in the middle, the client can at least determine where their
files went without looking through a temporary directory.
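
To make the two points above concrete, here is a minimal sketch in Python against the
local filesystem. It is not Hive code and does not use the Hadoop FileSystem API. The
function names (find_name_collisions, staged_move), the "<subdir>_<file>" naming scheme,
and the choice to create _staging_<timestamp> next to the destination are assumptions
made for illustration; it also assumes the destination does not exist yet (a new
partition).

```python
import os
import tempfile
import time
from pathlib import Path


def find_name_collisions(src_dirs):
    """Return {basename: [paths]} for names found in more than one source dir."""
    seen = {}
    for d in src_dirs:
        for f in sorted(Path(d).iterdir()):
            if f.is_file():
                seen.setdefault(f.name, []).append(f)
    return {name: paths for name, paths in seen.items() if len(paths) > 1}


def staged_move(src_dirs, dst_dir):
    """Move files one by one into a staging dir, then rename it to dst_dir."""
    dst_dir = Path(dst_dir)
    # Assumption: the staging directory lives next to the destination.
    staging = dst_dir.parent / f"_staging_{int(time.time() * 1000)}"
    staging.mkdir(parents=True)
    for d in src_dirs:
        d = Path(d)
        for f in sorted(d.iterdir()):
            if f.is_file():
                # Hypothetical scheme: prefix with the source subdirectory
                # name so that two part-00000 files do not collide.
                os.rename(f, staging / f"{d.name}_{f.name}")
    # A crash before this line leaves every moved file under _staging_*,
    # close to where the client expects the data to end up.
    os.rename(staging, dst_dir)
    return dst_dir


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    for sub in ("subdir1", "subdir2"):
        (root / "foo" / sub).mkdir(parents=True)
        (root / "foo" / sub / "part-00000").write_text(sub)
    sources = sorted((root / "foo").iterdir())
    print(sorted(find_name_collisions(sources)))
    dst = staged_move(sources, root / "warehouse" / "ds=2009-08-01")
    print(sorted(p.name for p in dst.iterdir()))
```

The final directory rename is the only step that makes the data visible at the
destination, so an outside observer sees either nothing or the full set of files. The
operation as a whole is still not atomic with respect to the source directories.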

> Load data inpath into a new partition without overwrite does not move the file
> ------------------------------------------------------------------------------
>
>                 Key: HIVE-718
>                 URL: https://issues.apache.org/jira/browse/HIVE-718
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Zheng Shao
>         Attachments: HIVE-718.1.patch, HIVE-718.2.patch, hive-718.txt
>
>
> The bug can be reproduced as follows. Note that it only happens for partitioned tables. The select after the first load returns nothing, while the second returns the data correctly.
> insert.txt in the current local directory contains 3 lines: "a", "b" and "c".
> {code}
> > create table tmp_insert_test (value string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test;
> > select * from tmp_insert_test;
> a
> b
> c
> > create table tmp_insert_test_p ( value string) partitioned by (ds string) stored as textfile;
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds= '2009-08-01';
> > load data local inpath 'insert.txt' into table tmp_insert_test_p partition (ds = '2009-08-01');
> > select * from tmp_insert_test_p where ds= '2009-08-01';
> a       2009-08-01
> b       2009-08-01
> c       2009-08-01
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

