From Namit Jain <>
Subject Re: non map-reduce for simple queries
Date Tue, 31 Jul 2012 17:47:18 GMT

On 7/31/12 9:23 PM, "Owen O'Malley" <> wrote:

>On Mon, Jul 30, 2012 at 11:38 PM, Namit Jain <> wrote:
>> That would be difficult. The % done can be estimated from the data
>> read.
>I'm confused. Wouldn't the maximum size of the data remaining over the
>maximum size of the original query give a reasonable approximation of the
>amount of work done?

Yes and no. Filter behavior can vary a lot from row to row, so the data
read does not map exactly to the work done.
But yes, that is the best approximation we can have.
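
As a rough sketch of that approximation (the function and parameter names
are made up for illustration, they are not anything in Hive), assuming we
know the total input size up front:

```python
def estimate_progress(bytes_read: int, total_input_bytes: int) -> float:
    """Fraction of the input consumed so far, clamped to [0.0, 1.0].

    Hypothetical helper: this only measures data read, so a selective
    filter can make the real work done differ from this number.
    """
    if total_input_bytes <= 0:
        # Nothing to read: treat the scan as already complete.
        return 1.0
    return max(0.0, min(1.0, bytes_read / total_input_bytes))
```

For example, 950 bytes read out of 1000 gives 0.95.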

>> It might be simpler to have a check like: if the query isn't done in
>> the first 5 seconds of running locally, you switch to mapreduce.
>There are three problems I see:
>  * If the query is 95% done at 5 seconds, it is a shame to kill it and
>start over again at 0% on mapreduce with a much longer latency. (Instead
>of spending the additional 0.25 seconds, you spend an additional 60+.)
>  * You can't print anything until you know whether you are going to kill
>it or not. (The mapreduce results might come back in a different
>With user-facing programs, it is much better to start printing early
>instead of later since it gives faster feedback to the user.

We cannot do this with either of the above approaches.

>  * It isn't predictable how the query will run. That makes it very hard
>to build applications on top of Hive.
>Do those make sense?
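
To make the first point concrete, here is a minimal sketch of a check that
uses the progress estimate at the 5-second mark instead of killing the
local run unconditionally. Everything in it is an assumption for
illustration: the function name, the default 60-second mapreduce overhead
(taken from the "60+" figure above), and the idea that the local run keeps
going at a constant rate, which the filter behavior may well break.

```python
def should_switch_to_mapreduce(elapsed_s: float,
                               progress: float,
                               mr_overhead_s: float = 60.0) -> bool:
    """Decide at the deadline whether to abandon the local run.

    Hypothetical heuristic: extrapolate the remaining local time from the
    progress so far and switch only if it exceeds the assumed mapreduce
    overhead.
    """
    if progress >= 1.0:
        return False  # already finished locally
    if progress <= 0.0:
        return True   # no measurable progress, nothing to extrapolate from
    # Constant-rate extrapolation: time left = elapsed * (work left / work done).
    remaining_s = elapsed_s * (1.0 - progress) / progress
    return remaining_s > mr_overhead_s
```

With a query that is 95% done at 5 seconds, the extrapolated remaining time
is about 0.26 seconds, so this check would let the local run finish.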

