hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3601) Hive as a contrib project
Date Tue, 22 Jul 2008 06:18:32 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12615533#action_12615533 ]

Owen O'Malley commented on HADOOP-3601:

The straight-to-subproject path is only available if the code base is from a single organization.
Non-Apache projects that want to become Apache projects need to go through the incubator.
Getting out of the incubator takes a fair amount of effort.

Another serious advantage of the HBase approach was that the HBase contributors got trained
in the way the Hadoop process and community work. That didn't happen for Pig, and the
training took longer. HBase had its first release after 2 months and Pig hasn't released yet.
Also, the process and infrastructure overhead was much lower for creating HBase than for Pig
or ZooKeeper. It would take an hour to create Hive as a contrib module and a month to create
it as a subproject. I agree with the disadvantage, though: if the project gets busy, it
can start to swamp the Hadoop JIRAs and mailing lists. Certainly, we would have pushed HBase
to a subproject much sooner if Hadoop hadn't been a subproject of Lucene at the time.

If we are going to take Hive into contrib, I think we probably should disengage our process
a bit from the current model. In particular, I don't think we should run the contrib unit
tests for our patches. The only downside to that is that we should probably promote streaming
and data_join into map/reduce, which will take some cleanup.

> Hive as a contrib project
> -------------------------
>                 Key: HADOOP-3601
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3601
>             Project: Hadoop Core
>          Issue Type: New Feature
>    Affects Versions: 0.17.0
>            Reporter: Joydeep Sen Sarma
>            Priority: Minor
>         Attachments: HiveTutorial.pdf
>   Original Estimate: 1080h
>  Remaining Estimate: 1080h
> Hive is a data warehouse built on top of flat files (stored primarily in HDFS). It includes:
> - Data Organization into Tables with logical and hash partitioning
> - A Metastore to store metadata about Tables/Partitions etc
> - A SQL like query language over object data stored in Tables
> - DDL commands to define and load external data into tables
> Hive's query language is executed using Hadoop map-reduce as the execution engine. Queries
> can use either single stage or multi-stage map-reduce. Hive has a native format for tables -
> but can handle any data set (for example json/thrift/xml) using an IO library framework.
> Hive uses Antlr for query parsing, Apache JEXL for expression evaluation and may use
> Apache Derby as an embedded database for MetaStore. Antlr has a BSD license and should be
> compatible with Apache license.
> We are currently thinking of contributing to the 0.17 branch as a contrib project (since
> that is the version under which it will get tested internally) - but looking for advice on
> the best release path.

