hadoop-common-dev mailing list archives

From "Ashish Thusoo (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3601) Hive as a contrib project
Date Thu, 21 Aug 2008 07:40:45 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624278#action_12624278 ]

Ashish Thusoo commented on HADOOP-3601:

Hi Tenaali,

By default we do the group by in 2 stages. In the first stage we generate partial aggregates,
and in the second stage we generate the final aggregates. For example, take a query of the form:

select t.c1, count(DISTINCT t.c2) from t group by t.c1

We would first run a map-reduce job with the key as (c1, c2) to generate the partial aggregates
c1, count(DISTINCT c2).
We would then follow this up with a second-stage map-reduce job, with the key as c1, on the output
of the previous stage, and generate the final aggregate as c1, sum(partial aggregates).
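
The two stages can be sketched as a toy Python simulation (this is not Hive code; the sample
rows and variable names are made up for illustration):

```python
from collections import defaultdict

# Toy rows of table t as (c1, c2) pairs -- made-up sample data.
rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)]

# Stage 1: the map key is (c1, c2), so each reduce group holds exactly one
# distinct c2 value, and the partial aggregate for that group is 1.
stage1_groups = defaultdict(list)
for c1, c2 in rows:
    stage1_groups[(c1, c2)].append((c1, c2))
partials = [(c1, 1) for (c1, c2) in stage1_groups]

# Stage 2: the map key is c1; the reducer sums the partial aggregates to
# produce the final count(DISTINCT c2) per c1.
stage2_groups = defaultdict(int)
for c1, partial in partials:
    stage2_groups[c1] += partial

result = sorted(stage2_groups.items())
# result == [("a", 2), ("b", 1)]
```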

In case DISTINCT is absent, e.g.

select t.c1, count(t.c2) from t group by t.c1

we would run the first map-reduce job by randomly distributing the rows to the reducers in order
to generate the partial aggregates c1, count(c2),
and then we would generate the final aggregates in the same way as the DISTINCT case above.
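
A toy Python sketch of the non-DISTINCT case (again, not Hive code; the row data and the number
of reducers are assumed for illustration). Because the final stage sums the partials, the result
is the same no matter how the rows are scattered in stage 1:

```python
import random
from collections import defaultdict

rows = [("a", 1), ("a", 1), ("a", 2), ("b", 3), ("b", 3)]
NUM_REDUCERS = 2  # assumed value for the sketch

# Stage 1: rows are distributed randomly, so a given c1 may land on several
# reducers; each reducer emits a partial c1 -> count(c2).
buckets = defaultdict(list)
for row in rows:
    buckets[random.randrange(NUM_REDUCERS)].append(row)

partials = []
for bucket in buckets.values():
    counts = defaultdict(int)
    for c1, c2 in bucket:
        if c2 is not None:  # SQL count(c2) skips NULLs
            counts[c1] += 1
    partials.extend(counts.items())

# Stage 2: the key is c1; summing the partial counts gives the final count.
final = defaultdict(int)
for c1, partial in partials:
    final[c1] += partial

result = sorted(final.items())
# result == [("a", 3), ("b", 2)] regardless of the random distribution
```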

We only support DISTINCT on one column right now.

For joins, each join is done in a map-reduce job and the result from one stage is fed into
the next join, and so on. The join keys are used as the map keys, and the join is done as a
cartesian product of the values that arrive for the different tables in the reducer. We do
optimizations so that we can join multiple tables in the same map-reduce task (in case the
join key for the tables is the same, e.g. a.c1 = b.c1 and b.c1 = c.c1)
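
The reduce-side join described above can be sketched in Python as follows (a toy simulation,
not Hive code; the two tables and their rows are invented for the example):

```python
from collections import defaultdict

# Toy rows for two tables, as (c1, payload) pairs -- made-up sample data.
a_rows = [(1, "a1"), (2, "a2")]
b_rows = [(1, "b1"), (1, "b2"), (3, "b3")]

# Map phase: emit each row under its join key, tagged with its table of origin.
shuffled = defaultdict(lambda: defaultdict(list))
for c1, val in a_rows:
    shuffled[c1]["a"].append(val)
for c1, val in b_rows:
    shuffled[c1]["b"].append(val)

# Reduce phase: for each join key, take the cartesian product of the values
# that arrived from the different tables. Keys present in only one table
# produce no output (an inner join).
joined = []
for key, by_table in sorted(shuffled.items()):
    for av in by_table["a"]:
        for bv in by_table["b"]:
            joined.append((key, av, bv))
# joined == [(1, "a1", "b1"), (1, "a1", "b2")]
```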

We will be doing many more optimizations for both of these, and we will put all of this
information up on the wiki very soon.

> Hive as a contrib project
> -------------------------
>                 Key: HADOOP-3601
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3601
>             Project: Hadoop Core
>          Issue Type: Wish
>    Affects Versions: 0.17.0
>         Environment: N/A
>            Reporter: Joydeep Sen Sarma
>            Priority: Minor
>         Attachments: HiveTutorial.pdf
>   Original Estimate: 1080h
>  Remaining Estimate: 1080h
> Hive is a data warehouse built on top of flat files (stored primarily in HDFS). It includes:
> - Data Organization into Tables with logical and hash partitioning
> - A Metastore to store metadata about Tables/Partitions etc
> - A SQL like query language over object data stored in Tables
> - DDL commands to define and load external data into tables
> Hive's query language is executed using Hadoop map-reduce as the execution engine. Queries
> can use either single-stage or multi-stage map-reduce. Hive has a native format for tables
> - but can handle any data set (for example json/thrift/xml) using an IO library framework.
> Hive uses Antlr for query parsing, Apache JEXL for expression evaluation, and may use
> Apache Derby as an embedded database for the MetaStore. Antlr has a BSD license and should
> be compatible with the Apache license.
> We are currently thinking of contributing to the 0.17 branch as a contrib project (since
> that is the version under which it will get tested internally) - but looking for advice on
> the best release path.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
