hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vaibhav Aggarwal (Resolved) (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (HIVE-2088) Improve Hive Query split calculation performance over large partitions (potentially 5x speedup)
Date Mon, 24 Oct 2011 18:36:33 GMT

     [ https://issues.apache.org/jira/browse/HIVE-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Vaibhav Aggarwal resolved HIVE-2088.
------------------------------------

    Resolution: Duplicate
      Assignee: Vaibhav Aggarwal
    
> Improve Hive Query split calculation performance over large partitions (potentially 5x
speedup)
> -----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-2088
>                 URL: https://issues.apache.org/jira/browse/HIVE-2088
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Vaibhav Aggarwal
>            Assignee: Vaibhav Aggarwal
>
> While working on improving start up time for query spanning over large number of partitions
I found a particular inefficiency in CombineHiveInputFormat.
> It seems that for each input partition we create a CombineFilter to make sure that files
are combined within a particular partition.
> CombineHiveInputFormat:
>   Path filterPath = path;
>   if (!poolSet.contains(filterPath)) {
>         LOG.info("CombineHiveInputSplit creating pool for " + path +
>             "; using filter path " + filterPath);
>         combine.createPool(job, new CombineFilter(filterPath));
>         poolSet.add(filterPath);
>   }
> These filters are passed then passed to CombineFileInputFormat along with all the input
paths.
> CombineFileInputFormat computes a list of all the files in the input paths.
> It them loops over each filter and then checks whether a particular file belongs to a
particular filter.
> ConbineFileInputFormat:
> for (MultiPathFilter onepool : pools) {
>       ArrayList<Path> myPaths = new ArrayList<Path>();
>       
>       // pick one input path. If it matches all the filters in a pool,
>       // add it to the output set
>       for (int i = 0; i < paths.length; i++) {
>       ...
> This results in computation of the order O(p*f) where p is the number of partitions and
f is the total number of files in all partitions.
> For a case of 10,000 partitions with 10 files in each partition, this results in 1,000,000,000
iterations.
> We can replace this code with processing splits for one input path at a time without
passing any filters at all.
> Do you happen to see a case where this approach will not work?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message