hive-dev mailing list archives

From "Owen O'Malley" <>
Subject Re: non map-reduce for simple queries
Date Tue, 31 Jul 2012 15:53:52 GMT

On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain <> wrote:

> That would be difficult. The % done can be estimated from the data already
> read.

I'm confused. Wouldn't the maximum size of the data remaining over the
maximum size of the original query give a reasonable approximation of the
fraction of work left, and therefore of the amount of work done?
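
A minimal sketch of that estimate, assuming the engine can report an upper
bound on the bytes the query must scan and on the bytes still unread. The
function and parameter names are hypothetical and are not part of Hive:

```python
def fraction_done(total_bytes: int, remaining_bytes: int) -> float:
    """Estimate progress as 1 - remaining/total (hypothetical helper)."""
    if total_bytes <= 0:
        # Nothing to scan, so treat the query as complete.
        return 1.0
    # Clamp so a stale or oversized estimate cannot leave the [0, 1] range.
    remaining = min(max(remaining_bytes, 0), total_bytes)
    return 1.0 - remaining / total_bytes
```

Because both inputs are upper bounds, the result is only an approximation:
filters or limits that finish early would make the real progress higher.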

> It might be simpler to have a check like: if the query isn't done in
> the first 5 seconds of running locally, you switch to mapreduce.

There are three problems I see:
  * If the query is 95% done at 5 seconds, it is a shame to kill it and
start over again at 0% on mapreduce with a much longer latency. (Instead of
spending an additional 0.25 seconds, you spend an additional 60+ seconds.)
  * You can't print anything until you know whether you are going to kill
it or not. (The mapreduce results might come back in a different order.)
With user-facing programs, it is much better to start printing early
rather than late, since it gives faster feedback to the user.
  * It isn't predictable how the query will run. That makes it very hard to
build applications on top of Hive.
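
The arithmetic behind the first point can be sketched as follows. This
assumes progress is roughly linear in time and takes the 60-second mapreduce
latency from the figure above. The function names are hypothetical, and this
is an illustration of the trade-off rather than anything Hive implements:

```python
def remaining_local_seconds(elapsed: float, fraction_done: float) -> float:
    """Project the time left locally, assuming linear progress."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Total projected time is elapsed / fraction_done; subtract what is spent.
    return elapsed * (1.0 - fraction_done) / fraction_done


def should_fall_back(elapsed: float, fraction_done: float,
                     mapreduce_latency: float = 60.0) -> bool:
    """Switch to mapreduce only if finishing locally would take longer."""
    return remaining_local_seconds(elapsed, fraction_done) > mapreduce_latency


# 95% done after 5 seconds: roughly a quarter second of local work remains,
# so killing the query to pay 60+ seconds of mapreduce latency is a loss.
print(round(remaining_local_seconds(5.0, 0.95), 2))
print(should_fall_back(5.0, 0.95))
```

A fixed 5-second cutoff ignores the progress estimate entirely, which is why
it makes the wrong call in the 95% case.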

Do those make sense?
