hadoop-hive-dev mailing list archives

From "Adam Kramer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-588) LIMIT n is slower than it needs to be
Date Wed, 15 Jul 2009 04:27:14 GMT

    [ https://issues.apache.org/jira/browse/HIVE-588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12731263#action_12731263 ]

Adam Kramer commented on HIVE-588:

This is because * lets Hive output whole rows at a time, while specifying columns requires
that each row be split and certain indices returned, hence the map job. That's reasonable,
but really, straight-up selects with no transform could be optimized this way as well.
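To illustrate the distinction (a hypothetical sketch, not Hive internals): selecting * can stream whole serialized rows untouched, while selecting named columns forces each row to be split and projected, which is exactly the per-row work a map stage does.

```python
# Hypothetical sketch of the difference described above (not Hive's code).

def select_star(rows, limit):
    # Whole rows pass through untouched; no parsing or projection needed.
    return rows[:limit]

def select_columns(rows, indices, limit):
    # Each row must be split into fields and the requested column
    # indices projected back out -- the per-row work a map job performs.
    out = []
    for row in rows:
        fields = row.split("\t")
        out.append("\t".join(fields[i] for i in indices))
        if len(out) == limit:
            break
    return out
```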

But at the very least, once any mapper has printed 10 rows, Hive should return those 10
rows and kill the rest of the job.
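The early-exit semantics being requested can be sketched as follows (a minimal illustration, not Hive's implementation): stop consuming input, and thereby stop the job, the moment n rows have been emitted, rather than processing everything and discarding the excess.

```python
# Minimal sketch of LIMIT-n early termination (hypothetical, not Hive code):
# stop pulling rows from the input once n rows have been emitted.

def limited(row_iter, n):
    count = 0
    for row in row_iter:
        yield row
        count += 1
        if count == n:
            return  # "Dying intentionally; LIMIT has been reached."

def expensive(x):
    # Stand-in for the per-row function in SELECT function(a) FROM t.
    return x * x

# Only the first 10 inputs are ever processed; the remaining
# 999,990 rows are never touched because the generator is lazy.
results = list(limited((expensive(i) for i in range(10**6)), 10))
```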

> LIMIT n is slower than it needs to be
> -------------------------------------
>                 Key: HIVE-588
>                 URL: https://issues.apache.org/jira/browse/HIVE-588
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Adam Kramer
> ...simply prints the output of the first 10 lines of the first file in the database.
That's good.
> However,
> SELECT function(a) FROM t LIMIT 10;
> appears to send all of t to the mappers, run the function, and then return the
first 10 rows from whichever mapper(s) finish first. This is very slow in some cases!
> Appropriate behavior for LIMIT would be to use ONE mapper, and to push files from the
table into that mapper, and then auto-kill the mapper once it has output 10 rows...just take
the first 10 rows and kill the whole task if necessary. On dying, throw some informative error
message like, "Dying intentionally; LIMIT has been reached." This should be the case even
for TRANSFORMs in the mapper...the TRANSFORM could spit out 20 rows, but once it has spit
out 10, the whole task should die and the 10 should be returned immediately.
> The purpose of LIMIT is not just to return "only one response"; it's also to speed
up queries a whole lot. Running the function over the entire table is a big waste.
> Obviously, when a reduce step is necessary, the whole table will have to be pushed through
mappers, then copied, then sorted--but even in those cases, once 10 total rows have been
output by the reducer(s), all reduce tasks should be killed.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
