hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hive/GettingStarted" by CarlSteinbach
Date Wed, 20 Oct 2010 03:08:17 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hive/GettingStarted" page has been changed by CarlSteinbach.
http://wiki.apache.org/hadoop/Hive/GettingStarted?action=diff&rev1=37&rev2=38

--------------------------------------------------

    examples/ (sample input and query files)
  }}}
  
- In the rest of the page, we use build/dist and <install-dir> interchangeably.
+ In the rest of the page, we use `build/dist` and `<install-dir>` interchangeably.
  
  === Running Hive ===
  
  Hive uses hadoop, which means:
    * you must have hadoop in your path OR
-   * export HADOOP_HOME=<hadoop-install-dir>
+   * `export HADOOP_HOME=<hadoop-install-dir>`
  
- In addition, you must create /tmp and /user/hive/warehouse 
+ In addition, you must create `/tmp` and `/user/hive/warehouse` 
- (aka hive.metastore.warehouse.dir) and set them chmod g+w in 
+ (aka `hive.metastore.warehouse.dir`) and make them group-writable (`chmod g+w`) in 
  HDFS before a table can be created in Hive. 
  
  Commands to perform this setup:
@@ -56, +56 @@

    $ $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
  }}}
  
- I also find it useful but not necessary to set HIVE_HOME
+ It is also useful, though not necessary, to set `HIVE_HOME`:
  {{{
    $ export HIVE_HOME=<hive-install-dir>
  }}}
@@ -68, +68 @@

  
  === Configuration management overview ===
  
- - hive default configuration is stored in <install-dir>/conf/hive-default.xml
+ - hive default configuration is stored in `<install-dir>/conf/hive-default.xml`
-     Configuration variables can be changed by (re-)defining them in <install-dir>/conf/hive-site.xml
+     Configuration variables can be changed by (re-)defining them in `<install-dir>/conf/hive-site.xml`
- - log4j configuration is stored in <install-dir>/conf/hive-log4j.properties
+ - log4j configuration is stored in `<install-dir>/conf/hive-log4j.properties`
  
  - hive configuration is an overlay on top of hadoop, meaning the hadoop configuration variables are inherited by default.
  
@@ -78, +78 @@

  * editing `hive-site.xml` and defining any desired variables (including hadoop variables) in it (see the sketch after this list)
    * from the cli using the set command (see below)
    * by invoking hive using the syntax:
-     * $ bin/hive -hiveconf x1=y1 -hiveconf x2=y2
+     * `$ bin/hive -hiveconf x1=y1 -hiveconf x2=y2`
      this sets the variables x1 and x2 to y1 and y2, respectively
-     * By setting the HIVE_OPTS environment variable to "-hiveconf x1=y1 -hiveconf x2=y2"
which does the same as above
+     * by setting the `HIVE_OPTS` environment variable to "-hiveconf x1=y1 -hiveconf x2=y2", which does the same as above
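
+ As an illustration of the first method, here is a minimal sketch of a `hive-site.xml` entry that overrides a single variable (the variable shown, `hive.exec.scratchdir`, is just an example):
+ {{{
+ <configuration>
+   <property>
+     <name>hive.exec.scratchdir</name>
+     <value>/tmp/hive-scratch</value>
+     <description>Scratch space for Hive jobs</description>
+   </property>
+ </configuration>
+ }}}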
  
  === Runtime configuration ===
  
@@ -94, +94 @@

      hive> SET -v 
  }}}
  
-   The latter shows all the current settings. Without the -v option only the 
+   The latter shows all the current settings. Without the `-v` option, only the 
  variables that differ from the base hadoop configuration are displayed.
  
  === Hive, Map-Reduce and Local-Mode ===
@@ -110, +110 @@

  {{{
    hive> SET mapred.job.tracker=local;
  }}}
- In addition, mapred.local.dir should point to a path that's valid on the local machine (for
example /tmp/<username>/mapred/local). (Otherwise, the user will get an exception allocating
local disk space). 
+ In addition, `mapred.local.dir` should point to a path that's valid on the local machine (for example `/tmp/<username>/mapred/local`); otherwise, the user will get an exception allocating local disk space. 
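
+ For example, from the CLI (the path is a placeholder):
+ {{{
+   hive> SET mapred.local.dir=/tmp/<username>/mapred/local;
+ }}}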
  
  Starting with v0.7, Hive also supports a mode to run map-reduce jobs in local-mode automatically. The relevant options are:
  {{{
@@ -118, +118 @@

  }}}
  
  Note that this feature is ''disabled'' by default. If enabled, Hive analyzes the size of each map-reduce job in a query and may run it locally if the following thresholds are satisfied:
-   * The total input size of the job is lower than: ''hive.exec.mode.local.auto.inputbytes.max''
(128MB by default)
+   * The total input size of the job is lower than: `hive.exec.mode.local.auto.inputbytes.max`
(128MB by default)
-   * The total number of map-tasks is less than: ''hive.exec.mode.local.auto.tasks.max''
(4 by default)
+   * The total number of map-tasks is less than: `hive.exec.mode.local.auto.tasks.max` (4
by default)
    * The total number of reduce tasks required is 1 or 0.
  
  So for queries over small data sets, or for queries with multiple map-reduce jobs where
the input to subsequent jobs is substantially smaller (because of reduction/filtering in the
prior job), jobs may be run locally. 
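
+ As a sketch, the feature can be enabled for a session from the CLI (assuming the option name `hive.exec.mode.local.auto`, which controls this behavior):
+ {{{
+   hive> SET hive.exec.mode.local.auto=true;
+ }}}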
  
- Note that there may be differences in the runtime environment of hadoop server nodes and
the machine running the hive client (because of different jvm versions or different software
libraries). This can cause unexpected behavior/errors while running in local mode. Also note
that local mode execution is done in a separate, child jvm (of the hive client). If the user
so wishes, the maximum amount of memory for this child jvm can be controlled via the option
''hive.mapred.local.mem''. By default, it's set to zero, in which case Hive lets Hadoop determine
the default memory limits of the child jvm.
+ Note that there may be differences in the runtime environment of hadoop server nodes and
the machine running the hive client (because of different jvm versions or different software
libraries). This can cause unexpected behavior/errors while running in local mode. Also note
that local mode execution is done in a separate, child jvm (of the hive client). If the user
so wishes, the maximum amount of memory for this child jvm can be controlled via the option
`hive.mapred.local.mem`. By default, it's set to zero, in which case Hive lets Hadoop determine
the default memory limits of the child jvm.
  
  === Error Logs ===
  Hive uses log4j for logging. By default, logs are not emitted to the 
- console by the CLI. The default logging level is WARN and the logs are stored in the folder:
+ console by the CLI. The default logging level is `WARN` and the logs are stored in the folder:
-  * /tmp/{user.name}/hive.log
+  * `/tmp/{user.name}/hive.log`
  
  If the user wishes, the logs can be emitted to the console by adding 
  the arguments shown below:
-  * bin/hive -hiveconf hive.root.logger=INFO,console
+  * `bin/hive -hiveconf hive.root.logger=INFO,console`
  
  Alternatively, the user can change the logging level only by using:
-  * bin/hive -hiveconf hive.root.logger=INFO,DRFA
+  * `bin/hive -hiveconf hive.root.logger=INFO,DRFA`
  
- Note that setting hive.root.logger via the 'set' command does not 
+ Note that setting `hive.root.logger` via the 'set' command does not 
  change logging properties since they are determined at initialization time.
  
  Logging during Hive execution on a Hadoop cluster is controlled by Hadoop configuration. Usually Hadoop will produce one log file per map and reduce task, stored on the cluster machine(s) where the task was executed. The log files can be obtained by clicking through to the Task Details page from the Hadoop JobTracker Web UI.
  
- When using local mode (using mapred.job.tracker=local), Hadoop/Hive execution logs are produced
on the client machine itself. Starting v-0.6 - Hive uses the hive-exec-log4j.properties (falling
back to hive-log4j.properties only if it's missing) to determine where these logs are delivered
by default. The default configuration file produces one log file per query executed in local
mode and stores it under /tmp/{user.name}. The intent of providing a separate configuration
file is to enable administrators to centralize execution log capture if desired (on a NFS
file server for example). Execution logs are invaluable for debugging run-time errors.
+ When using local mode (via `mapred.job.tracker=local`), Hadoop/Hive execution logs are produced on the client machine itself. Starting with v0.6, Hive uses `hive-exec-log4j.properties` (falling back to `hive-log4j.properties` only if it's missing) to determine where these logs are delivered by default. The default configuration file produces one log file per query executed in local mode and stores it under `/tmp/{user.name}`. The intent of providing a separate configuration file is to enable administrators to centralize execution log capture if desired (on an NFS file server, for example). Execution logs are invaluable for debugging run-time errors.
  
- Error logs are very useful to debug problems. Please send them with any bugs (of which there
are many!) to hive-dev@hadoop.apache.org.
+ Error logs are very useful to debug problems. Please send them with any bugs (of which there
are many!) to `hive-dev@hadoop.apache.org`.
  
  == DDL Operations ==
  
@@ -184, +184 @@

  expressions. Check out this link for documentation: 
  http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
  
+ {{{
  hive> DESCRIBE invites;
+ }}}
  
  shows the list of columns.
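
+ As a rough sketch, assuming `invites` has columns `foo` and `bar` plus the partition column `ds` (as in the query examples below), the output would look something like:
+ {{{
+   foo   int
+   bar   string
+   ds    string
+ }}}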
  
@@ -205, +207 @@

  === Metadata Store ===
  
  Metadata is in an embedded Derby database whose disk storage location is determined by the

- hive configuration variable named javax.jdo.option.ConnectionURL. By default 
+ hive configuration variable named `javax.jdo.option.ConnectionURL`. By default 
- (see conf/hive-default.xml), this location is ./metastore_db
+ (see `conf/hive-default.xml`), this location is `./metastore_db`.
  
  Right now, in the default configuration, this metadata can only be seen by 
  one user at a time. 
  
  The metastore can be stored in any database that is supported by JPOX. The 
  location and the type of the RDBMS can be controlled by the two variables 
- 'javax.jdo.option.ConnectionURL' and 'javax.jdo.option.ConnectionDriverName'. 
+ `javax.jdo.option.ConnectionURL` and `javax.jdo.option.ConnectionDriverName`. 
  Refer to JDO (or JPOX) documentation for more details on supported databases. 
- The database schema is defined in JDO metadata annotations file package.jdo 
+ The database schema is defined in the JDO metadata annotations file `package.jdo` 
- at src/contrib/hive/metastore/src/model.
+ at `src/contrib/hive/metastore/src/model`.
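
+ For example, a minimal sketch of the two overrides in `hive-site.xml` needed to point the metastore at a MySQL server (host and database names are placeholders; the MySQL JDBC driver must be on the classpath):
+ {{{
+ <property>
+   <name>javax.jdo.option.ConnectionURL</name>
+   <value>jdbc:mysql://myhost/metastore_db</value>
+ </property>
+ <property>
+   <name>javax.jdo.option.ConnectionDriverName</name>
+   <value>com.mysql.jdbc.Driver</value>
+ </property>
+ }}}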
  
  In the future, the metastore itself can be a standalone server.
  
@@ -241, +243 @@

  NOTES:
    * NO verification of data against the schema is performed by the load command.
    * If the file is in hdfs, it is moved into the Hive-controlled file system namespace.

-   The root of the Hive directory is specified by the option 'hive.metastore.warehouse.dir'

+   The root of the Hive directory is specified by the option `hive.metastore.warehouse.dir`

-   in hive-default.xml. We advise users to create this directory before 
+   in `hive-default.xml`. We advise users to create this directory before 
    trying to create tables via Hive.
  
  {{{
@@ -263, +265 @@

  == SQL Operations ==
  === Example Queries ===
  
- Some example queries are shown below. They are available in build/dist/examples/queries.
+ Some example queries are shown below. They are available in `build/dist/examples/queries`.
- More are available in the hive sources at ql/src/test/queries/positive
+ More are available in the hive sources at `ql/src/test/queries/positive`.
  
  ==== SELECTS and FILTERS ====
  
  {{{
-   hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
+   hive> SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';
  }}}
  
- selects column 'foo' from all rows of partition <DATE> of invites table. The results
are not
+ selects column 'foo' from all rows of partition `ds=2008-08-15` of the `invites` table.
The results are not
  stored anywhere, but are displayed on the console.
  
- Note that in all the examples that follow, INSERT (into a hive table, local 
+ Note that in all the examples that follow, `INSERT` (into a hive table, local 
  directory or HDFS directory) is optional. 
  
  {{{
-   hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
+   hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
  }}}
  
- selects all rows from partition <DATE> OF invites table into an HDFS  directory. The
result data 
+ selects all rows from partition `ds=2008-08-15` of the `invites` table into an HDFS directory.
The result data 
  is in files (depending on the number of mappers) in that directory.
  NOTE: partition columns, if any, are selected by the use of *. They can also 
  be specified explicitly in the projection clauses (see the example below).
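
+ For instance, to name the partition column explicitly in the projection (an illustrative variant of the query above):
+ {{{
+   hive> SELECT a.foo, a.ds FROM invites a WHERE a.ds='2008-08-15';
+ }}}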
  
- Partitioned tables must always have a partition selected in the WHERE clause of the statement.
+ Partitioned tables must always have a partition selected in the `WHERE` clause of the statement.
  
  {{{
    hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
@@ -300, +302 @@

    hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;

    hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
    hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles
a;
-   hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE
a.ds='<DATE>';
+   hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE
a.ds='2008-08-15';
    hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
    hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
  }}}
  
- Sum of a column. avg, min, max can also be used. Note that for versions of Hive which don't
include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use COUNT(1)
in place of COUNT(*).
+ Sums a column; `avg`, `min` and `max` can also be used. Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use `COUNT(1)` in place of `COUNT(*)`.
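
+ For instance, the same pattern with `AVG` in place of `SUM` (an illustrative variant, not one of the original examples):
+ {{{
+   hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/avg' SELECT AVG(a.pc) FROM pc1 a;
+ }}}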
  
  ==== GROUP BY ====
  
@@ -314, +316 @@

    hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo
> 0 GROUP BY a.bar;
  }}}
  
- Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]],
you'll need to use COUNT(1) in place of COUNT(*).
+ Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]],
you'll need to use `COUNT(1)` in place of `COUNT(*)`.
  
  ==== JOIN ====
  
@@ -337, +339 @@

    hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS
(oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
  }}}
  
- This streams the data in the map phase through the script /bin/cat (like hadoop streaming).

+ This streams the data in the map phase through the script `/bin/cat` (like hadoop streaming).

  Similarly, streaming can be used on the reduce side, as sketched below (please see the Hive Tutorial for more examples).
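
+ A hypothetical sketch of reduce-side streaming (not from the original guide): the inner query transforms and clusters the rows, so the outer `TRANSFORM` runs in the reduce phase; `/tmp/reduce_out` is a placeholder directory.
+ {{{
+   hive> FROM (FROM invites a SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' CLUSTER BY oof) m INSERT OVERWRITE DIRECTORY '/tmp/reduce_out' SELECT TRANSFORM(m.oof, m.rab) AS (key, value) USING '/bin/cat';
+ }}}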
  
  
@@ -375, +377 @@

  
  Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]], you'll need to use `COUNT(1)` in place of `COUNT(*)`.
  
- Now we can do some complex data analysis on the table u_data:
+ Now we can do some complex data analysis on the table `u_data`:
  
- Create weekday_mapper.py:
+ Create `weekday_mapper.py`:
  {{{
  import sys
  import datetime
@@ -413, +415 @@

  GROUP BY weekday;
  }}}
  
- Note that for versions of Hive which don't include [[https://issues.apache.org/jira/browse/HIVE-287|HIVE-287]],
you'll need to use COUNT(1) in place of COUNT(*).
+ Note that if you're using Hive 0.5.0 or earlier, you will need to use `COUNT(1)` in place of `COUNT(*)`.
  
  === Apache Weblog Data ===
  
