hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/Tutorial" by JoydeepSensarma
Date Sun, 20 Jun 2010 07:53:10 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/Tutorial" page has been changed by JoydeepSensarma.
http://wiki.apache.org/hadoop/Hive/Tutorial?action=diff&rev1=27&rev2=28

--------------------------------------------------

  
  = Concepts =
  == What is Hive ==
- Hive is the next generation infrastructure designed with the goals of providing data processing
systems to enable easy data summarization, adhoc querying and analysis of large volumes of
data. In addition it also provides a simple query language called QL, which is based on SQL
and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis
easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able
to plug in their custom mappers and reducers to do more sophisticated analysis that may not
be supported by the built-in capabilities the language.
+ Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale-out
and fault-tolerance capabilities for data storage and processing (using the map-reduce
programming paradigm) on commodity hardware.
+ 
+ Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large
volumes of data. It provides a simple query language called Hive QL, which is based on SQL
and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis
easily. At the same time, Hive QL also allows traditional map/reduce programmers to plug in
their custom mappers and reducers to do more sophisticated analysis that may not be supported
by the built-in capabilities of the language.
  
  == What is NOT Hive ==
- Hive is based on Hadoop, which is a batch processing system. Accordingly, this system does
not and cannot promise low latencies on queries. The paradigm here is strictly of submitting
jobs and being notified when the jobs are completed as opposed to real time queries. As a
result it should not be compared with systems such as Oracle where analyses are conducted
on a significantly smaller amount of data but the analyses proceed much more iteratively with
the response times between iterations being less than a few minutes. A typical Hive query's
response time is usually greater than a couple of minutes. For large jobs they may even run
into hours. What Hive provides is a fault-tolerant and an scale-out option, where more commodity
boxes can be added to the Hadoop cluster as the data size and/or workload increases and Hive
will automatically benefit from that.
+ Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur
substantial overhead in job submission and scheduling. As a result, latency for Hive queries
is generally very high (minutes), even when the data sets involved are very small (say, a few
hundred megabytes). Hive therefore cannot be compared with systems such as Oracle, where analyses
are conducted on a significantly smaller amount of data but proceed much more iteratively,
with response times between iterations of less than a few minutes. Hive aims to provide
acceptable (but not optimal) latency for interactive data browsing, queries over small data
sets, or test queries.
+ 
+ Hive is not designed for online transaction processing and does not offer real-time queries
or row-level updates. It is best used for batch jobs over large sets of immutable data (like
web logs).
  
  In the following sections we provide a tutorial on the capabilities of the system. We start
by describing the concepts of data types, tables and partitions (which are very similar to
what you would find in a traditional relational DBMS) and then illustrate the capabilities
of Hive QL with the help of some examples.
  
