From: Apache Wiki
To: core-commits@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Thu, 02 Apr 2009 00:37:29 -0000
Message-ID: <20090402003729.2823.88439@aurora.apache.org>
Subject: [Hadoop Wiki] Trivial Update of "Hive/Tutorial" by NeilConway

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.
The following page has been changed by NeilConway:
http://wiki.apache.org/hadoop/Hive/Tutorial

------------------------------------------------------------------------------
  = Concepts =

  == What is Hive ==
- Hive is the next generation infrastructure made with the goal of providing tools to enable easy data summarization, adhoc querying and analysis of detail data. In addition it also provides a simple query language called QL which is based on SQL and which enables users familiar with SQL to do adhoc querying, summarization and data analysis. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built in capabilities of the language.
+ Hive is the next generation infrastructure made with the goal of providing tools to enable easy data summarization, adhoc querying and analysis of detail data. In addition it also provides a simple query language called QL which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis. At the same time, this language also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built in capabilities of the language.

  == What is NOT Hive ==
- Hive is based on hadoop which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.
+ Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does not and cannot promise low latencies on queries. The paradigm here is strictly of submitting jobs and being notified when the jobs are completed as opposed to real time queries. As a result it should not be compared with systems like Oracle where analysis is done on a significantly smaller amount of data but the analysis proceeds much more iteratively with the response times between iterations being less than a few minutes. For Hive queries response times for even the smallest jobs can be of the order of 5-10 minutes and for larger jobs this may even run into hours.

  In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables and partitions (which are very similar to what you would find in a traditional relational database) and then illustrate the capabilities of the language with the help of some examples