hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GettingStarted" by JoydeepSensarma
Date Thu, 12 Aug 2010 08:53:15 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GettingStarted" page has been changed by JoydeepSensarma.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=34&rev2=35

--------------------------------------------------

        this sets the variables x1 and x2 to y1 and y2 respectively
      * By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2"
which does the same as above
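
As an illustration of the HIVE_OPTS format described above, here is a minimal Python sketch that renders a set of variables as repeated `-hiveconf` flags. The helper name `build_hive_opts` is hypothetical and not part of Hive, and the sketch assumes names and values contain no whitespace or shell metacharacters.

```python
def build_hive_opts(settings):
    """Render a dict of settings as repeated -hiveconf name=value flags.

    Hypothetical helper for illustration only; it is not part of Hive.
    Assumes names and values contain no whitespace or shell metacharacters.
    """
    return " ".join(f"-hiveconf {name}={value}" for name, value in settings.items())


# The same example variables used in the text above.
hive_opts = build_hive_opts({"x1": "y1", "x2": "y2"})
print(hive_opts)  # prints: -hiveconf x1=y1 -hiveconf x2=y2
```

The resulting string is what would be assigned to the HIVE_OPTS environment variable before starting the CLI.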
  
+ === Runtime configuration ===
+ 
+   * Hive queries are executed as map-reduce jobs and, therefore, the behavior 
+   of such queries can be controlled by the hadoop configuration variables.
+ 
+   * The CLI command 'SET' can be used to set any hadoop (or hive) configuration variable.
For example:
+ 
+ {{{
+     hive> SET mapred.job.tracker=myhost.mycompany.com:50030
+     hive> SET -v 
+ }}}
+ 
+   The latter shows all the current settings. Without the -v option, only the 
+   variables that differ from the base hadoop configuration are displayed.
+ 
+ === Hive, Map-Reduce and Local-Mode ===
+ 
+ The Hive compiler generates map-reduce jobs for most queries. These jobs are then submitted
to the Map-Reduce cluster indicated by the variable:
+ {{{ 
+   mapred.job.tracker
+ }}}
+ 
+ While this usually points to a map-reduce cluster with multiple nodes, Hadoop also offers
an option to run map-reduce jobs locally on the user's workstation. This can be very
useful for running queries over small data sets; in such cases local mode execution is usually
significantly faster than submitting jobs to a large cluster. Data is accessed transparently
from HDFS. On the other hand, local mode only runs with one reducer and can be very slow when
processing larger data sets. 
+ 
+ Starting with release 0.7, Hive fully supports local mode execution. To enable it, set
the following option:
+ {{{
+   hive> SET mapred.job.tracker=local;
+ }}}
+ In addition, mapred.local.dir should point to a path that's valid on the local machine (for
example /tmp/<username>/mapred/local). Otherwise, the user will get an exception when allocating
local disk space.
+ 
+ Starting with release 0.7, Hive can also run map-reduce jobs in local mode automatically.
The relevant option, shown here with its default value, is:
+ {{{
+   hive> SET hive.exec.mode.local.auto=false;
+ }}}
+ 
+ Note that this feature is ''disabled'' by default. If enabled, Hive analyzes the size of
each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
+   * The total input size of the job is lower than: ''hive.exec.mode.local.auto.inputbytes.max''
(128MB by default)
+   * The total number of map-tasks is less than: ''hive.exec.mode.local.auto.tasks.max''
(4 by default)
+   * The total number of reduce tasks required is 1 or 0.
+ 
+ So for queries over small data sets, or for queries with multiple map-reduce jobs where
the input to subsequent jobs is substantially smaller (because of reduction/filtering in the
prior job), jobs may be run locally. Note that there may be differences in the runtime environment
of hadoop server nodes and the machine running the hive client (because of different jvm versions
or different software libraries). This can cause unexpected behavior/errors while running
in local mode.
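
The three thresholds listed above can be sketched as a single check. This is a minimal Python illustration of the rule as described in this section, not Hive's actual implementation; the names `LocalAutoThresholds` and `may_run_locally` are hypothetical, and the defaults are the values quoted above (128MB and 4).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalAutoThresholds:
    """Thresholds as quoted in the text; the class itself is hypothetical."""
    input_bytes_max: int = 128 * 1024 * 1024  # hive.exec.mode.local.auto.inputbytes.max
    tasks_max: int = 4                        # hive.exec.mode.local.auto.tasks.max


def may_run_locally(input_bytes, map_tasks, reduce_tasks,
                    auto_enabled=False, limits=LocalAutoThresholds()):
    """Return True if a job satisfies all thresholds described in the text.

    auto_enabled stands in for hive.exec.mode.local.auto (false by default).
    """
    if not auto_enabled:
        return False
    return (
        input_bytes < limits.input_bytes_max  # total input size is lower than the limit
        and map_tasks < limits.tasks_max      # fewer map tasks than the limit
        and reduce_tasks <= 1                 # 1 or 0 reduce tasks
    )


# A 10MB job with 2 map tasks and 1 reducer qualifies once the feature is enabled.
print(may_run_locally(10 * 1024 * 1024, 2, 1, auto_enabled=True))
# A 200MB job exceeds the input size threshold.
print(may_run_locally(200 * 1024 * 1024, 2, 1, auto_enabled=True))
```

Because the check is made per job, a later job in a multi-job query can qualify even when the first job does not, which matches the filtering case mentioned above.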
+ 
  === Error Logs ===
  Hive uses log4j for logging. By default logs are not emitted to the 
  console by the CLI. The default logging level is WARN and the logs are stored in the folder:
@@ -218, +260 @@

  Note that loading data from HDFS will result in moving the file/directory. As a result,
the operation is almost instantaneous.
  
  == SQL Operations ==
- === Runtime configuration ===
- 
-   * Hive queries are executed as map-reduce jobs and, therefore, the behavior 
-   of such queries can be controlled by the hadoop configuration variables.
- 
-   * The CLI command 'SET' can be used to set any hadoop (or hive) configuration variable.
For example:
- 
- {{{
-     hive> SET mapred.job.tracker=myhost.mycompany.com:50030
-     hive> SET -v 
- }}}
- 
-   The latter shows all the current settings. Without the -v option only the 
-   variables that differ from the base hadoop configuration are displayed
-   * In particular, the number of reducers should be set to a reasonable number 
-   to get good performance (the default is 1!)
- 
- 
  === Example Queries ===
  
  Some example queries are shown below. They are available in build/dist/examples/queries.
