hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive" by JoydeepSensarma
Date Mon, 21 Jun 2010 06:22:22 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive" page has been changed by JoydeepSensarma.
http://wiki.apache.org/hadoop/Hive?action=diff&rev1=60&rev2=61

--------------------------------------------------

  Hive does not mandate that data read or written be in a "Hive format"; there is no such thing.
Hive works equally well on Thrift, control-delimited, or your own specialized data formats.  Please
see [[/DeveloperGuide#File_Formats|File Format]] and [[http://www.slideshare.net/ragho/hive-user-meeting-august-2009-facebook|SerDe]]
in [[/DeveloperGuide|Developer Guide]] for details.
  
  = What Hive is NOT =
- Hive is based on Hadoop, which is a batch processing system. As a result, Hive does not
and cannot promise low latencies on queries. The paradigm is strictly one of submitting jobs
and being notified when they complete, as opposed to real-time queries. In contrast
to systems such as Oracle, where analysis is run on a significantly smaller amount of data
but proceeds much more iteratively, with response times between iterations
of less than a few minutes, Hive query response times for even the smallest jobs can
be on the order of several minutes. If your input data is small, you can execute a query in
a shorter time. For example, if a table has 100 rows you can 'set mapred.reduce.tasks=1' and
'set mapred.map.tasks=1' and the query time will be around a dozen seconds.  However, larger
jobs (e.g., jobs processing terabytes of data) may in general run for hours. 
+ Hadoop is a batch processing system, and Hadoop jobs tend to have high latency and incur
substantial overheads in job submission and scheduling. As a result, latency for Hive queries
is generally very high (minutes) even when the data sets involved are very small (say, a few hundred
megabytes). Hive therefore cannot be compared with systems such as Oracle, where analyses are
conducted on a significantly smaller amount of data but proceed much more iteratively,
with response times between iterations of less than a few minutes. Hive aims to provide
acceptable (but not optimal) latency for interactive data browsing, queries over small data
sets, or test queries. Hive also does not provide any sort of data or query cache to make repeated
queries over the same data set faster.
  
- In summary, low-latency performance is not the top priority among Hive's design principles.
What Hive values most are scalability (scaling out as more machines are added dynamically to the
Hadoop cluster), extensibility (with the MapReduce framework and UDF/UDAF/UDTF), fault tolerance,
and loose coupling with its input formats.
+ Hive is not designed for online transaction processing and does not offer real-time queries
or row-level updates. It is best used for batch jobs over large sets of immutable data (like
web logs). What Hive values most are scalability (scaling out as more machines are added dynamically
to the Hadoop cluster), extensibility (with the MapReduce framework and UDF/UDAF/UDTF), fault tolerance,
and loose coupling with its input formats.
  
  = Information =
   * General information about Hive
