hive-dev mailing list archives

From "He Yongqiang (JIRA)" <>
Subject [jira] Updated: (HIVE-1515) archive is not working when multiple partitions inside one table are archived.
Date Fri, 13 Aug 2010 01:12:16 GMT


He Yongqiang updated HIVE-1515:

    Attachment: hive-1515.2.patch

Attached a possible fix.

Talked with Namit and Paul this afternoon about this issue. There is actually a config that
can disable the FileSystem cache: fs.%s.impl.disable.cache, where %s is the filesystem scheme.
For archives, the scheme is har.

So if you set "fs.har.impl.disable.cache" to true, the archive will automatically work. This
should be the clean way to fix this issue.
In order to do this, you need to apply the corresponding Hadoop patch if
your Hadoop does not include the code to disable the FileSystem cache.
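
As a rough illustration of the naming pattern above, the sketch below derives the property name from a path's scheme and checks it against a plain map. It is not Hadoop code: the class DisableCacheKey, its methods, the map-based configuration, and the shortened har path are all hypothetical, and it assumes that a value of "true" means the cache is bypassed.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop or Hive: models how a per-scheme
// "disable cache" switch can be derived from a path's URI.
class DisableCacheKey {
    // Builds the property name from the pattern quoted in the message.
    static String keyFor(URI uri) {
        return String.format("fs.%s.impl.disable.cache", uri.getScheme());
    }

    // Returns true when the (assumed) configuration map disables caching
    // for this URI's scheme; a missing entry means caching stays enabled.
    static boolean cacheDisabled(Map<String, String> conf, URI uri) {
        return Boolean.parseBoolean(conf.getOrDefault(keyFor(uri), "false"));
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.har.impl.disable.cache", "true");
        // Shortened, made-up archive path used only for illustration.
        URI har = URI.create("har:/warehouse/t/ds=2010-08-03/hr=00/data.har");
        System.out.println(keyFor(har));
        System.out.println(cacheDisabled(conf, har));
    }
}
```

Because the switch is per scheme, only har paths would skip the cache under this model; other filesystems keep their cached instances.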

> archive is not working when multiple partitions inside one table are archived.
> ------------------------------------------------------------------------------
>                 Key: HIVE-1515
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.7.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1515.1.patch, hive-1515.2.patch
> set hive.exec.compress.output = true;
> set;
> set mapred.min.split.size=256;
> set mapred.min.split.size.per.node=256;
> set mapred.min.split.size.per.rack=256;
> set mapred.max.split.size=256;
> set hive.archive.enabled = true;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int, value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="00");
> ALTER TABLE combine_3_srcpart_seq_rc ARCHIVE PARTITION (ds="2010-08-03", hr="001");
> select key, value, ds, hr from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key, hr limit 30;
> drop table combine_3_srcpart_seq_rc;
> The query above will fail with:
> Invalid file name: har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001 in har:/data/users/heyongqiang/hive-trunk-clean/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason it fails is that there are 2 input paths (one for each partition) for the above query:
> 1): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00
> 2): har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001/data.har/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=001
> But when calling path.getFileSystem() for these 2 input paths, both calls return the same file system instance, which points to the archive of the first caller, in this case har:/Users/heyongqiang/Documents/workspace/Hive-Index/build/ql/test/data/warehouse/combine_3_srcpart_seq_rc/ds=2010-08-03/hr=00/data.har
> The reason is that Hadoop's FileSystem has a global cache, and when loading a FileSystem instance for a given path, it only uses the path's scheme and username to look up the cache. So when we call Path.getFileSystem for the second har path, it actually returns the file system handle for the first path.
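
The cache behavior described in the quoted report can be sketched with a small stand-alone model. This is a simplified stand-in, not Hadoop's FileSystem implementation: FsCacheModel, ToyFs, the "hive" user name, and the shortened har paths are all hypothetical, and the cache key is built only from scheme and user, as the description states.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for a file system cache; NOT Hadoop's implementation.
class FsCacheModel {
    // A toy "file system" that remembers which archive URI initialized it.
    static final class ToyFs {
        final URI root;
        ToyFs(URI root) { this.root = root; }
    }

    private final Map<String, ToyFs> cache = new HashMap<>();
    private final boolean cacheDisabled;

    FsCacheModel(boolean cacheDisabled) { this.cacheDisabled = cacheDisabled; }

    // The key uses only scheme and user; the path is ignored, so two
    // different har archives collapse onto the same cache entry.
    ToyFs get(URI uri, String user) {
        if (cacheDisabled) {
            return new ToyFs(uri); // fresh instance bound to this archive
        }
        String key = uri.getScheme() + "|" + user;
        return cache.computeIfAbsent(key, k -> new ToyFs(uri));
    }

    public static void main(String[] args) {
        // Shortened, made-up archive paths used only for illustration.
        URI hr00 = URI.create("har:/warehouse/t/ds=2010-08-03/hr=00/data.har");
        URI hr001 = URI.create("har:/warehouse/t/ds=2010-08-03/hr=001/data.har");

        FsCacheModel cached = new FsCacheModel(false);
        cached.get(hr00, "hive");
        // With caching on, the second lookup is bound to the hr=00 archive.
        System.out.println(cached.get(hr001, "hive").root);

        FsCacheModel uncached = new FsCacheModel(true);
        uncached.get(hr00, "hive");
        // With caching off, the second lookup is bound to its own archive.
        System.out.println(uncached.get(hr001, "hive").root);
    }
}
```

In this model, the second partition's lookup resolves against the first partition's archive whenever caching is on, which matches the "Invalid file name ... in ..." failure shape reported above.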

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
