hive-dev mailing list archives

From "Ning Zhang (JIRA)" <>
Subject [jira] Commented: (HIVE-1510) HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
Date Wed, 04 Aug 2010 21:24:16 GMT


Ning Zhang commented on HIVE-1510:

It's fine with me if you feel strongly about it. My concern (besides har+CHIF support)
is the performance implication when using CHIF to merge a large number of small files inside
a partition. Siying has a use case where pathToPartitionInfo is very large and the number of
files in the splits is also very large. Determining the partitionDesc for each input path
takes a long time. Your patch adds another HashMap keyed on the path part of pathToPartitionInfo
(which trades memory for speed), but it also introduces another loop that compares the parents of paths.
It would be nice (better performance) if you could avoid this loop by simply appending '/'
at the end. But if it doesn't hurt performance, or if appending '/' doesn't work, the current
patch is fine with me too.
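
To make the options concrete, here is a minimal sketch of the three lookup strategies
discussed above. It is not the code in hive-1510.1.patch: the class name, the method names,
the warehouse paths, and the use of plain strings in place of partitionDesc objects are all
assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names and paths are illustrative, not taken from Hive.
final class PrefixMatchSketch {

    // Naive prefix matching: returns the value of the first registered
    // directory that is a plain string prefix of the file path.
    static String naiveLookup(Map<String, String> pathToPartitionInfo, String file) {
        for (Map.Entry<String, String> e : pathToPartitionInfo.entrySet()) {
            if (file.startsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Same scan, but a trailing '/' is appended to each directory first,
    // so ".../hr=00/" no longer matches files under ".../hr=001/".
    static String slashLookup(Map<String, String> pathToPartitionInfo, String file) {
        for (Map.Entry<String, String> e : pathToPartitionInfo.entrySet()) {
            String dir = e.getKey().endsWith("/") ? e.getKey() : e.getKey() + "/";
            if (file.startsWith(dir)) {
                return e.getValue();
            }
        }
        return null;
    }

    // Parent walk: strips one path component at a time and does an exact
    // hash lookup on each ancestor directory.
    static String parentWalkLookup(Map<String, String> pathToPartitionInfo, String file) {
        String p = file;
        int idx;
        while ((idx = p.lastIndexOf('/')) > 0) {
            p = p.substring(0, idx);
            String desc = pathToPartitionInfo.get(p);
            if (desc != null) {
                return desc;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Insertion order is kept so that hr=00 is always examined first.
        Map<String, String> m = new LinkedHashMap<>();
        m.put("/warehouse/t/ds=2010-08-03/hr=00", "sequencefile");
        m.put("/warehouse/t/ds=2010-08-03/hr=001", "rcfile");
        String file = "/warehouse/t/ds=2010-08-03/hr=001/000000_0";

        System.out.println(naiveLookup(m, file));      // prints "sequencefile" (wrong partition)
        System.out.println(slashLookup(m, file));      // prints "rcfile"
        System.out.println(parentWalkLookup(m, file)); // prints "rcfile"
    }
}
```

In this sketch the '/' variant still scans the map entries for each file, while the parent
walk does one hash lookup per ancestor directory; which is cheaper in practice would depend
on the size of pathToPartitionInfo and the depth of the paths.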

As an aside, we should find out why pathToPartitionInfo in some cases contains only paths
rather than full URIs. Ideally it should always contain the full URI so
that we don't rely on heuristics. But this could be another JIRA.
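
As a rough illustration of the kind of heuristic meant here, the sketch below reduces both
key forms to their path component so that they compare equal. The class name, the method
name, and the example host and paths are hypothetical; this is not how Hive normalizes its
keys.

```java
import java.net.URI;

// Hypothetical sketch; not Hive's implementation.
final class PathKeySketch {

    // Returns the path component of a key that may be either a full URI
    // (scheme and authority included) or a bare path.
    static String pathOnly(String key) {
        return URI.create(key).getPath();
    }

    public static void main(String[] args) {
        String full = "hdfs://namenode:8020/warehouse/t/ds=2010-08-03/hr=00";
        String bare = "/warehouse/t/ds=2010-08-03/hr=00";
        // Both calls yield the same path, so the two keys can be matched,
        // at the cost of ignoring the scheme and authority.
        System.out.println(pathOnly(full).equals(pathOnly(bare)));
    }
}
```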

> HiveCombineInputFormat should not use prefix matching to find the partitionDesc for a given path
> ------------------------------------------------------------------------------------------------
>                 Key: HIVE-1510
>                 URL:
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-1510.1.patch
> set;
> drop table combine_3_srcpart_seq_rc;
> create table combine_3_srcpart_seq_rc (key int, value string) partitioned by (ds string, hr string) stored as sequencefile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="00") select * from src;
> alter table combine_3_srcpart_seq_rc set fileformat rcfile;
> insert overwrite table combine_3_srcpart_seq_rc partition (ds="2010-08-03", hr="001") select * from src;
> desc extended combine_3_srcpart_seq_rc partition(ds="2010-08-03", hr="00");
> desc extended combine_3_srcpart_seq_rc partition(ds="2010-08-03", hr="001");
> select * from combine_3_srcpart_seq_rc where ds="2010-08-03" order by key;
> drop table combine_3_srcpart_seq_rc;
> will fail.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
