hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ning Zhang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-908) optimize limit
Date Thu, 29 Oct 2009 21:43:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771638#action_12771638
] 

Ning Zhang commented on HIVE-908:
---------------------------------

I agree that for most cases, if the limit number if small, we should reduce the number of
mappers by increasing the split size. This is particularly true when the limit can be pushed
down to the TableScan operator. However if the the query has joins or group-by, it could be
more complicated.

I think a more general solution would be to introduce a limit operator and a set of rewrite
rules to push the limit operator down as much as possible. In case of reduce-side joins and
groupby, we cannot push the limit operator down to the map side and it has to be on the reduce
side. There are techniques that make join and groupby limit-aware in the top-k query processing
techniques (the ranking function for limit is just a constant function). A survey can be found
at http://www.cs.uwaterloo.ca/~ilyas/papers/IlyasTopkSurvey.pdf.

> optimize limit
> --------------
>
>                 Key: HIVE-908
>                 URL: https://issues.apache.org/jira/browse/HIVE-908
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Namit Jain
>             Fix For: 0.5.0
>
>
> If there is a limit, all the mappers have to finish and create 'limit' number of rows
- this can be pretty expensive for a large file.
> The following optimizations can be performed in this area:
> 1. Start fewer mappers if there is a limit - before submitting a job, the compiler knows
that there is a limit - so, it might be useful to increase the split size, thereby reducing
the number of mappers.
> 2. A counter is maintained for the total outputs rows - the mappers can look at those
counters and decide to exit instead of emitting 'limit' number of rows themselves.
> 2. may lead to some bugs because of bugs in counters, but 1. should definitely help

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message