hadoop-hive-dev mailing list archives

From "E. Sammer (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HIVE-887) Allow SELECT <col> without a mapreduce job
Date Tue, 19 Jan 2010 04:44:55 GMT

    [ https://issues.apache.org/jira/browse/HIVE-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802089#action_12802089 ]

E. Sammer commented on HIVE-887:

I would also like this kind of functionality, though I would add WHERE clause support to the
request, as there are cases where you know a table will be small. What would be really ideal
is to be able to define a threshold on the projected row count: if the query execution engine
thinks there may be more rows than that, it resorts to an MR job; if fewer, it performs a
client-side fetch and filter. The expectation is that GROUP BY, joins, ORDER / SORT / CLUSTER BY
and related operations would always cause an MR job.


SELECT a, b FROM t WHERE c = 'foo' FETCH n;

where n is an upper limit on the projected number of rows for which a fetch should be done.
If row-count projection is not yet available in Hive (I haven't looked at the internals),
FETCH n could instead act like a fetch + limit operation. Alternatively, n could simply be a
global configuration parameter, although that seems too inflexible.
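
To make the proposed rule concrete, here is a rough Python sketch of the decision. It is
purely illustrative and does not reflect Hive's internals: QueryShape, choose_plan and
fetch_threshold are hypothetical names, the row estimate is assumed to come from somewhere
else, and treating "exactly n rows" as still eligible for a fetch is an assumption.

```python
from dataclasses import dataclass


@dataclass
class QueryShape:
    """Hypothetical summary of a parsed query; not a Hive data structure."""
    estimated_rows: int          # projected row count (assumed to be available)
    has_group_by: bool = False
    has_join: bool = False
    has_ordering: bool = False   # ORDER / SORT / CLUSTER BY


def choose_plan(query: QueryShape, fetch_threshold: int) -> str:
    """Return "fetch" for a client-side fetch and filter, else "mapreduce"."""
    # Aggregations, joins and ordering always go to MapReduce.
    if query.has_group_by or query.has_join or query.has_ordering:
        return "mapreduce"
    # Simple SELECT ... WHERE: fetch only if the projection is at or under n.
    if query.estimated_rows <= fetch_threshold:
        return "fetch"
    return "mapreduce"
```

With fetch_threshold playing the role of n in FETCH n, a plain filtered select projected
at 500 rows against n = 1000 would be fetched client side, while the same query with a
GROUP BY would still run as an MR job.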

For me, Hive has been excellent for storing raw parsed log data which can be queried into
summary tables of around 1 million rows. These summary tables containing aggregations are
then queried by a UI for visualization. This "fetch" functionality would allow for the UI
load times to go from minutes to seconds and reduce contention for task slots in a production
Hadoop cluster.

> Allow SELECT <col> without a mapreduce job
> ------------------------------------------
>                 Key: HIVE-887
>                 URL: https://issues.apache.org/jira/browse/HIVE-887
>             Project: Hadoop Hive
>          Issue Type: New Feature
>         Environment: All
>            Reporter: Eric Sun
>            Assignee: Ning Zhang
> I often find myself needing to take a quick look at a particular column of a Hive table.
> I usually do this by doing a 
> SELECT * from <table> LIMIT 20;
> from the CLI.  Doing this is pretty fast since it doesn't require a mapreduce job.  However,
> it's tough to examine just 1 or 2 columns when the table is very wide.
> So, I might do
> SELECT <col> from <table> LIMIT 20;
> but it's much slower since it requires a map-reduce.  It'd be really convenient if a
> map-reduce wasn't necessary.
> Currently a good work around is to do
> hive -e "select * from table" | cut -f n
> but it'd be more convenient if it were built in since it alleviates the need for column

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
