accumulo-commits mailing list archives

From mwa...@apache.org
Subject [accumulo-website] branch master updated: More updates to MapReduce docs (#142)
Date Mon, 07 Jan 2019 19:22:08 GMT
This is an automated email from the ASF dual-hosted git repository.

mwalch pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/accumulo-website.git


The following commit(s) were added to refs/heads/master by this push:
     new d70ec3b  More updates to MapReduce docs (#142)
d70ec3b is described below

commit d70ec3bbe3bd16c9b644bd59f74a453ef17a17e4
Author: Mike Walch <mwalch@apache.org>
AuthorDate: Mon Jan 7 14:22:04 2019 -0500

    More updates to MapReduce docs (#142)
---
 _docs-2/administration/upgrading.md |  8 ++++-
 _docs-2/development/mapreduce.md    | 72 +++++++++++++++++++++++++++++++++----
 2 files changed, 72 insertions(+), 8 deletions(-)

diff --git a/_docs-2/administration/upgrading.md b/_docs-2/administration/upgrading.md
index 18c5714..f054509 100644
--- a/_docs-2/administration/upgrading.md
+++ b/_docs-2/administration/upgrading.md
@@ -42,7 +42,7 @@ Below are some changes in 2.0 that you should be aware of:
     - `log4j-service.properties` for all Accumulo services (except monitor)
     - `log4j-monitor.properties` for Accumulo monitor
     - `log4j.properties` for Accumulo clients and commands
-* [New Hadoop configuration is required]({% durl development/mapreduce#configuration %}) when reading or writing to Accumulo using MapReduce.
+* MapReduce jobs that read/write from Accumulo [must configure their dependencies differently]({% durl development/mapreduce#configure-dependencies-for-your-mapreduce-job %}).
 * Run the command `accumulo shell` to access the shell using configuration in `conf/accumulo-client.properties`
 
 When your Accumulo 2.0 installation is properly configured, stop Accumulo 1.8/9 and start Accumulo 2.0:
@@ -78,6 +78,12 @@ Below is a list of recommended client API changes:
 * The API for [creating Accumulo clients]({% durl getting-started/clients#creating-an-accumulo-client %}) has changed in 2.0.
   * The old API using [ZooKeeperInstance], [Connector], [Instance], and [ClientConfiguration] has been deprecated.
   * [Connector] objects can be created from an [AccumuloClient] object using [Connector.from()]
+* Accumulo's [MapReduce API]({% durl development/mapreduce %}) has changed in 2.0.
+  * A new API has been introduced in the `org.apache.accumulo.hadoop` package of the `accumulo-hadoop-mapreduce` jar.
+  * The old API in the `org.apache.accumulo.core.client` package of the `accumulo-core` jar has been deprecated and will
+    eventually be removed.
+  * For both the old and new API, you must [configure dependencies differently]({% durl development/mapreduce#configure-dependencies-for-your-mapreduce-job %})
+    when creating your MapReduce job.
 
 ## Upgrading from 1.7 to 1.8
 
diff --git a/_docs-2/development/mapreduce.md b/_docs-2/development/mapreduce.md
index 7687ae8..3295b6f 100644
--- a/_docs-2/development/mapreduce.md
+++ b/_docs-2/development/mapreduce.md
@@ -8,10 +8,42 @@ Accumulo tables can be used as the source and destination of MapReduce jobs.
 
 ## General MapReduce configuration
 
-Since 2.0.0, Accumulo no longer has the same dependency versions (i.e Guava, etc) as Hadoop.
-When launching a MapReduce job that reads or writes to Accumulo, you should build a shaded jar
-with all of your dependencies and complete the following steps so YARN only includes Hadoop code
-(and not all of Hadoop dependencies) when running your MapReduce job:
+### Add Accumulo's MapReduce API to your dependencies
+
+If you are using Maven, add the following dependency to your `pom.xml` to use Accumulo's MapReduce API:
+
+```xml
+<dependency>
+  <groupId>org.apache.accumulo</groupId>
+  <artifactId>accumulo-hadoop-mapreduce</artifactId>
+  <version>{{ page.latest_release }}</version>
+</dependency>
+```
+
+The MapReduce API consists of the following classes:
+
+* If using Hadoop's **mapreduce** API:
+  * {% jlink -f org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+  * {% jlink -f org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
+  * {% jlink -f org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
+* If using Hadoop's **mapred** API:
+  * {% jlink -f org.apache.accumulo.hadoop.mapred.AccumuloInputFormat %}
+  * {% jlink -f org.apache.accumulo.hadoop.mapred.AccumuloOutputFormat %}
+  * {% jlink -f org.apache.accumulo.hadoop.mapred.AccumuloFileOutputFormat %}
+
+Before 2.0, the MapReduce API resided in the `org.apache.accumulo.core.client` package of the `accumulo-core` jar.
+While this old API still exists and can be used, it has been deprecated and will be removed eventually.
+
+### Configure dependencies for your MapReduce job
+
+Before 2.0, Accumulo used the same versions for dependencies (such as Guava) as Hadoop. This allowed
+MapReduce jobs to run with both Accumulo's & Hadoop's dependencies on the classpath.
+
+Since 2.0, Accumulo no longer has the same versions for dependencies as Hadoop. While this allows
+Accumulo to update its dependencies more frequently, it can cause problems if both Accumulo's &
+Hadoop's dependencies are on the classpath of the MapReduce job. When launching a MapReduce job that
+uses Accumulo, you should build a shaded jar with all of your dependencies and complete the following
+steps so YARN only includes Hadoop code (and not all of Hadoop's dependencies) when running your MapReduce job:
 
 1. Set `export HADOOP_USE_CLIENT_CLASSLOADER=true` in your environment before submitting
    your job with `yarn` command.
@@ -38,7 +70,7 @@ Follow the steps below to create a MapReduce job that reads from an Accumulo tab
 2. Configure your MapReduce job to use [AccumuloInputFormat].
 
     ```java
-    Job job = Job.getInstance(getConf());
+    Job job = Job.getInstance();
     job.setInputFormatClass(AccumuloInputFormat.class);
     Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
                             .as("user", "passwd").build();
@@ -61,7 +93,7 @@ Follow the steps below to create a MapReduce job that reads from an Accumulo tab
     ```
     [AccumuloInputFormat] can also be configured to read from multiple Accumulo tables.
     ```java
-    Job job = Job.getInstance(getConf());
+    Job job = Job.getInstance();
     job.setInputFormatClass(AccumuloInputFormat.class);
     Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
                             .as("user", "passwd").build();
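The snippets above configure the job's input; the Mapper itself then receives Accumulo `Key`/`Value` pairs. Below is a minimal sketch of such a Mapper — the class name and the per-row counting logic are illustrative, not part of this commit:

```java
import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input types match what AccumuloInputFormat emits: one Accumulo Key/Value
// pair per map() call. This sketch emits (row, 1) so a reducer could count
// entries per row.
class RowCountMapper extends Mapper<Key, Value, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(Key key, Value value, Context context)
      throws IOException, InterruptedException {
    // Key.getRow() returns the row portion of the Accumulo key as a Text
    context.write(key.getRow(), ONE);
  }
}
```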
@@ -106,7 +138,7 @@ Follow the steps below to write to an Accumulo table from a MapReduce job.
 
 2. Configure your MapReduce job to use [AccumuloOutputFormat].
     ```java
-    Job job = Job.getInstance(getConf());
+    Job job = Job.getInstance();
     job.setOutputFormatClass(AccumuloOutputFormat.class);
     Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
                             .as("user", "passwd").build();
@@ -114,8 +146,34 @@ Follow the steps below to write to an Accumulo table from a MapReduce job.
         .defaultTable("mytable").store(job);
     ```
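With [AccumuloOutputFormat] configured as above, the job writes by emitting `Mutation` objects; the `Text` output key names the destination table, and a `null` key falls back to the configured default table. A hedged sketch — the reducer name and the column family/qualifier are made up:

```java
import java.io.IOException;

import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Builds one Mutation per input key and writes it to the default table.
class WriteReducer extends Reducer<Text, Text, Text, Mutation> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Mutation m = new Mutation(key.toString());
    for (Text v : values) {
      // column family "cf" and qualifier "cq" are illustrative names
      m.put("cf", "cq", new Value(v.toString().getBytes()));
    }
    context.write(null, m); // null table name -> the configured default table
  }
}
```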
 
+## Write output to RFiles in HDFS
+
+Follow the steps below to have a MapReduce job output to RFiles in HDFS. These files
+can then be bulk imported into Accumulo:
+
+1. Create a Mapper or Reducer with `Key` & `Value` as output parameters.
+    ```java
+    class MyReducer extends Reducer<WritableComparable, Writable, Key, Value> {
+        public void reduce(WritableComparable key, Iterable<Writable> values, Context c)
+                throws IOException, InterruptedException {
+            // create outputKey & outputValue based on input
+            Key outputKey = null;
+            Value outputValue = null;
+            c.write(outputKey, outputValue);
+        }
+    }
+    ```
+
+2. Configure your MapReduce job to use [AccumuloFileOutputFormat].
+    ```java
+    Job job = Job.getInstance();
+    job.setOutputFormatClass(AccumuloFileOutputFormat.class);
+    AccumuloFileOutputFormat.configure()
+        .outputPath(new Path("hdfs://localhost:8020/myoutput/")).store(job);
+    ```
+
 The [MapReduce example][mapred-example] contains a complete example of using MapReduce with Accumulo.
 
 [mapred-example]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
 [AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}

