hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zheng Shao <>
Subject Re: Non-MR execution plan for simple WHERE
Date Tue, 19 Jan 2010 01:42:04 GMT
There is already a JIRA for it:
Please add your thoughts to that JIRA.

On Mon, Jan 18, 2010 at 10:19 AM, Raghu Murthy <> wrote:

> Currently, only SELECT * with a partition predicate does not create an MR
> job. The client directly streams the files corresponding to the
> table/partition. I think there is a JIRA to also do simple projections and
> filters in the client.
> For now, if you know which queries are small, you could run MR locally on
> the client via by providing the following options before running the query.
> set mapred.job.tracker=local;
> set mapred.local.dir=/tmp;
> On 1/18/10 10:10 AM, "Eric Sammer" <> wrote:
> > All:
> >
> > I'm using Hive as a back end for a multilevel aggregation reporting
> > system. There is a core fact table that is currently multiple millions
> > of rows, but will quickly hit billions. A web service accepts user
> > queries, turns them into Hive queries, dynamically builds Hive tables
> > containing different aggregations that are commonly requested by front
> > end users if they don't exist, and eventually returns the results. This
> > web service is async, initially giving the UI a ticket which then must
> > poll for results. These "summary" tables, which are usually 1 million or
> > less in row count, are then queried instead of the raw fact table.
> >
> > In many cases, very simple WHERE clauses are applied to the summary
> > tables which causes a MR job to be spawned. In cases where we know the
> > row count will be low, it would be very nice to avoid the MR overhead
> > and simply apply the filter inline and stream the results back
> > immediately. By simple, I mean a query containing only equality criteria
> > with no nested conditional groups, no GROUP BY clause, and no
> > complicated UDFs. The goal would be to reduce the time to begin
> > streaming results at the expensive of data locality and the
> > centralization of the query execution.
> >
> > The summary tables generally have a block count of only 3 or so, so the
> > degree of parallelism being sacrificed is minimal (compared to true back
> > end queries that have no user waiting on results). I know Hive's stated
> > goals indicate that low latency isn't in the cards, but I don't consider
> > this to be low latency in the RDBMS sense. I'm thinking of reducing the
> > start of result streaming from a number of minutes to seconds.
> >
> > Obviously this is somewhat specialized, but I wanted to get people's
> > ideas on a keyword that can indicate this type of execution plan, the
> > feasibility within the code base, and the suitability of this being
> > within Hive's court. The alternatives of building the results via Hive
> > and then having another (albeit simple) query layer to comb through the
> > files is cumbersome to maintain. The notion that one could do something
> > within Hive is appealing because then, based on a threshold of row
> > count, you could resort to doing a MR job if necessary even if you think
> > you're querying a small table. Or, maybe Hive can fail with some
> > indication that centralized query exceeds the configured threshold.
> >
> > I wanted to throw this out to the list before filing a feature request
> > in JIRA for obvious reasons.
> >
> > Thoughts? Comments?
> > --
> > Eric Sammer
> >
> >


View raw message