accumulo-commits mailing list archives

From mwa...@apache.org
Subject [accumulo-website] branch master updated: Updated MapReduce docs with 2.0 changes (#140)
Date Fri, 04 Jan 2019 14:23:12 GMT
This is an automated email from the ASF dual-hosted git repository.

mwalch pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/accumulo-website.git


The following commit(s) were added to refs/heads/master by this push:
     new 36b89f4  Updated MapReduce docs with 2.0 changes (#140)
36b89f4 is described below

commit 36b89f42ae4b9d7f2a110c98d4cff78c52aaecee
Author: Mike Walch <mwalch@apache.org>
AuthorDate: Fri Jan 4 09:23:08 2019 -0500

    Updated MapReduce docs with 2.0 changes (#140)
---
 _docs-2/development/high_speed_ingest.md |   4 +-
 _docs-2/development/mapreduce.md         | 245 +++++++++++--------------------
 _docs-2/development/sampling.md          |  10 +-
 _docs-2/development/summaries.md         |   5 +-
 _docs-2/security/kerberos.md             |   8 +-
 _docs-2/security/on-disk-encryption.md   |   6 +-
 _plugins/links.rb                        |   4 +-
 7 files changed, 107 insertions(+), 175 deletions(-)

diff --git a/_docs-2/development/high_speed_ingest.md b/_docs-2/development/high_speed_ingest.md
index ecf458b..46fee58 100644
--- a/_docs-2/development/high_speed_ingest.md
+++ b/_docs-2/development/high_speed_ingest.md
@@ -112,7 +112,7 @@ on how use to use MapReduce with Accumulo, see the [MapReduce documentation][map
 and the [MapReduce example code][mapred-code].
 
 [bulk-example]: https://github.com/apache/accumulo-examples/blob/master/docs/bulkIngest.md
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
-[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloFileOutputFormat %}
+[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
 [mapred-docs]: {% durl development/mapreduce %}
 [mapred-code]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
diff --git a/_docs-2/development/mapreduce.md b/_docs-2/development/mapreduce.md
index ee6a93a..7687ae8 100644
--- a/_docs-2/development/mapreduce.md
+++ b/_docs-2/development/mapreduce.md
@@ -4,18 +4,11 @@ category: development
 order: 2
 ---
 
-Accumulo tables can be used as the source and destination of MapReduce jobs. To
-use an Accumulo table with a MapReduce job, configure the job parameters to use
-the [AccumuloInputFormat] and [AccumuloOutputFormat]. Accumulo specific parameters
-can be set via these two format classes to do the following:
+Accumulo tables can be used as the source and destination of MapReduce jobs.
 
-* Authenticate and provide user credentials for the input
-* Restrict the scan to a range of rows
-* Restrict the input to a subset of available columns
+## General MapReduce configuration
 
-## Configuration
-
-Since 2.0.0, Accumulo no longer has the same versions of dependencies (i.e Guava, etc) as Hadoop.
+Since 2.0.0, Accumulo no longer has the same dependency versions (i.e Guava, etc) as Hadoop.
 When launching a MapReduce job that reads or writes to Accumulo, you should build a shaded jar
 with all of your dependencies and complete the following steps so YARN only includes Hadoop code
 (and not all of Hadoop dependencies) when running your MapReduce job:
@@ -28,163 +21,101 @@ with all of your dependencies and complete the following steps so YARN only incl
     job.getConfiguration().set("mapreduce.job.classloader", "true");
     ```
 
-## Mapper and Reducer classes
+## Read input from an Accumulo table
 
-To read from an Accumulo table create a Mapper with the following class
-parameterization and be sure to configure the [AccumuloInputFormat].
+Follow the steps below to create a MapReduce job that reads from an Accumulo table:
 
-```java
-class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
-    public void map(Key k, Value v, Context c) {
-        // transform key and value data here
-    }
-}
-```
-
-To write to an Accumulo table, create a Reducer with the following class
-parameterization and be sure to configure the [AccumuloOutputFormat]. The key
-emitted from the Reducer identifies the table to which the mutation is sent. This
-allows a single Reducer to write to more than one table if desired. A default table
-can be configured using the AccumuloOutputFormat, in which case the output table
-name does not have to be passed to the Context object within the Reducer.
-
-```java
-class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
-    public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
-        Mutation m;
-        // create the mutation based on input key and value
-        c.write(new Text("output-table"), m);
+1. Create a Mapper with the following class parameterization.
+
+    ```java
+    class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+        public void map(Key k, Value v, Context c) {
+            // transform key and value data here
+        }
     }
-}
-```
+    ```
 
-The Text object passed as the output should contain the name of the table to which
-this mutation should be applied. The Text can be null in which case the mutation
-will be applied to the default table name specified in the [AccumuloOutputFormat]
-options.
-
-## AccumuloInputFormat options
-
-The following code shows how to set up Accumulo
-
-```java
-Job job = new Job(getConf());
-ClientInfo info = Accumulo.newClient().to("myinstance","zoo1,zoo2")
-                        .as("user", "passwd").info()
-AccumuloInputFormat.setClientInfo(job, info);
-AccumuloInputFormat.setInputTableName(job, table);
-AccumuloInputFormat.setScanAuthorizations(job, new Authorizations());
-```
-
-**Optional Settings:**
-
-To restrict Accumulo to a set of row ranges:
-
-```java
-ArrayList<Range> ranges = new ArrayList<Range>();
-// populate array list of row ranges ...
-AccumuloInputFormat.setRanges(job, ranges);
-```
-
-To restrict Accumulo to a list of columns:
-
-```java
-ArrayList<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
-// populate list of columns
-AccumuloInputFormat.fetchColumns(job, columns);
-```
-
-To use a regular expression to match row IDs:
-
-```java
-IteratorSetting is = new IteratorSetting(30, RexExFilter.class);
-RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
-AccumuloInputFormat.addIterator(job, is);
-```
-
-## AccumuloMultiTableInputFormat options
-
-The [AccumuloMultiTableInputFormat] allows the scanning over multiple tables
-in a single MapReduce job. Separate ranges, columns, and iterators can be
-used for each table.
-
-```java
-InputTableConfig tableOneConfig = new InputTableConfig();
-InputTableConfig tableTwoConfig = new InputTableConfig();
-```
-
-To set the configuration objects on the job:
-
-```java
-Map<String, InputTableConfig> configs = new HashMap<String,InputTableConfig>();
-configs.put("table1", tableOneConfig);
-configs.put("table2", tableTwoConfig);
-AccumuloMultiTableInputFormat.setInputTableConfigs(job, configs);
-```
-
-**Optional settings:**
-
-To restrict to a set of ranges:
-
-```java
-ArrayList<Range> tableOneRanges = new ArrayList<Range>();
-ArrayList<Range> tableTwoRanges = new ArrayList<Range>();
-// populate array lists of row ranges for tables...
-tableOneConfig.setRanges(tableOneRanges);
-tableTwoConfig.setRanges(tableTwoRanges);
-```
-
-To restrict Accumulo to a list of columns:
-
-```java
-ArrayList<Pair<Text,Text>> tableOneColumns = new ArrayList<Pair<Text,Text>>();
-ArrayList<Pair<Text,Text>> tableTwoColumns = new ArrayList<Pair<Text,Text>>();
-// populate lists of columns for each of the tables ...
-tableOneConfig.fetchColumns(tableOneColumns);
-tableTwoConfig.fetchColumns(tableTwoColumns);
-```
-
-To set scan iterators:
-
-```java
-List<IteratorSetting> tableOneIterators = new ArrayList<IteratorSetting>();
-List<IteratorSetting> tableTwoIterators = new ArrayList<IteratorSetting>();
-// populate the lists of iterator settings for each of the tables ...
-tableOneConfig.setIterators(tableOneIterators);
-tableTwoConfig.setIterators(tableTwoIterators);
-```
-
-The name of the table can be retrieved from the input split:
-
-```java
-class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
-    public void map(Key k, Value v, Context c) {
-        RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
-        String tableName = split.getTableName();
-        // do something with table name
+2. Configure your MapReduce job to use [AccumuloInputFormat].
+
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setInputFormatClass(AccumuloInputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+                            .as("user", "passwd").build();
+    AccumuloInputFormat.configure().clientProperties(props).table(table).store(job);
+    ```
+    [AccumuloInputFormat] has optional settings.
+    ```java
+    List<Range> ranges = new ArrayList<Range>();
+    List<Pair<Text,Text>> columns = new ArrayList<Pair<Text,Text>>();
+    // populate ranges & columns
+    IteratorSetting is = new IteratorSetting(30, RexExFilter.class);
+    RegExFilter.setRegexs(is, ".*suffix", null, null, null, true);
+
+    AccumuloInputFormat.configure().clientProperties(props).table(table)
+        .auths(Authorizations.EMPTY) // optional: default to user's auths if not set
+        .ranges(ranges)              // optional: only read specified ranges
+        .fetchColumns(columns)       // optional: only read specified columns
+        .addIterator(is)             // optional: add iterator that matches row IDs
+        .store(job);
+    ```
+    [AccumuloInputFormat] can also be configured to read from multiple Accumulo tables.
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setInputFormatClass(AccumuloInputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+                            .as("user", "passwd").build();
+    AccumuloInputFormat.configure().clientProperties(props)
+        .table("table1").auths(Authorizations.EMPTY).ranges(tableOneRanges)
+        .table("table2").auths(Authorizations.EMPTY).ranges(tableTwoRanges)
+        .store(job);
+    ```
+    If reading from multiple tables, the table name can be retrieved from the input split:
+    ```java
+    class MyMapper extends Mapper<Key,Value,WritableComparable,Writable> {
+        public void map(Key k, Value v, Context c) {
+            RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
+            String tableName = split.getTableName();
+            // do something with table name
+        }
     }
-}
-```
+    ```
 
-## AccumuloOutputFormat options
+## Write output to an Accumulo table
 
-```java
-ClientInfo info = Accumulo.newClient().to("myinstance","zoo1,zoo2")
-                        .as("user", "passwd").info()
-AccumuloOutputFormat.setClientInfo(job, info);
-AccumuloOutputFormat.setDefaultTableName(job, "mytable");
-```
+Follow the steps below to write to an Accumulo table from a MapReduce job.
 
-**Optional Settings:**
+1. Create a Reducer with the following class parameterization. The key emitted from
+    the Reducer identifies the table to which the mutation is sent. This allows a single
+    Reducer to write to more than one table if desired. A default table can be configured
+    using the [AccumuloOutputFormat], in which case the output table name does not have to
+    be passed to the Context object within the Reducer.
+    ```java
+    class MyReducer extends Reducer<WritableComparable, Writable, Text, Mutation> {
+        public void reduce(WritableComparable key, Iterable<Text> values, Context c) {
+            Mutation m;
+            // create the mutation based on input key and value
+            c.write(new Text("output-table"), m);
+        }
+    }
+    ```
+    The Text object passed as the output should contain the name of the table to which
+    this mutation should be applied. The Text can be null in which case the mutation
+    will be applied to the default table name specified in the [AccumuloOutputFormat]
+    options.
 
-```java
-AccumuloOutputFormat.setMaxLatency(job, 300000); // milliseconds
-AccumuloOutputFormat.setMaxMutationBufferSize(job, 50000000); // bytes
-```
+2. Configure your MapReduce job to use [AccumuloOutputFormat].
+    ```java
+    Job job = Job.getInstance(getConf());
+    job.setOutputFormatClass(AccumuloOutputFormat.class);
+    Properties props = Accumulo.newClientProperties().to("myinstance","zoo1,zoo2")
+                            .as("user", "passwd").build();
+    AccumuloOutputFormat.configure().clientProperties(props)
+        .defaultTable("mytable").store(job);
+    ```
 
 The [MapReduce example][mapred-example] contains a complete example of using MapReduce with Accumulo.
 
 [mapred-example]: https://github.com/apache/accumulo-examples/blob/master/docs/mapred.md
-[AccumuloInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloInputFormat %}
-[AccumuloMultiTableInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloMultiTableInputFormat %}
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
+[AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+[AccumuloOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloOutputFormat %}
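The updated `AccumuloInputFormat` example in this diff adds a `RegExFilter` iterator configured with a row pattern such as `.*suffix`, meaning only keys whose entire row ID matches the pattern are returned. A minimal plain-Java sketch of that row-matching behavior (`java.util.regex` only, no Accumulo dependency; the class and method names here are illustrative, not Accumulo API):

```java
import java.util.regex.Pattern;

public class RowRegexSketch {
    // Mirrors the row matching the doc example configures: the regex must
    // match the whole row ID, not just a substring of it.
    static boolean rowMatches(String rowId, String regex) {
        return Pattern.matches(regex, rowId);
    }

    public static void main(String[] args) {
        String regex = ".*suffix"; // same pattern as the doc example
        System.out.println(rowMatches("row1suffix", regex)); // true
        System.out.println(rowMatches("row1prefix", regex)); // false
    }
}
```

With this pattern, a scan would keep `row1suffix` but skip `row1prefix`, which is the filtering the `addIterator(is)` call in the example wires into the job.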
diff --git a/_docs-2/development/sampling.md b/_docs-2/development/sampling.md
index cde4642..4d586d3 100644
--- a/_docs-2/development/sampling.md
+++ b/_docs-2/development/sampling.md
@@ -52,8 +52,8 @@ Sample data can also be scanned from within an Accumulo [SortedKeyValueIterator]
 To see how to do this, look at the example iterator referenced in the [sampling example][example].
 Also, consult the javadoc on [IteratorEnvironment.cloneWithSamplingEnabled()][clone-sampling].
 
-Map reduce jobs using the [AccumuloInputFormat] can also read sample data.  See
-the javadoc for the `setSamplerConfiguration()` method of [AccumuloInputFormat].
+MapReduce jobs using the [AccumuloInputFormat] can also read sample data.  See the javadoc
+for `samplerConfiguration()` in the `configure()` method of [AccumuloInputFormat].
 
 Scans over sample data will throw a [SampleNotPresentException] in the following cases :
 
@@ -67,7 +67,7 @@ generated with the same configuration.
 ## Bulk import
 
 When generating rfiles to bulk import into Accumulo, those rfiles can contain
-sample data.  To use this feature, look at the javadoc of the `setSampler(...)`
+sample data.  To use this feature, look at the javadoc of `sampler()` in the `configure()`
 method of [AccumuloFileOutputFormat].
 
 [example]: https://github.com/apache/accumulo-examples/blob/master/docs/sample.md
@@ -75,8 +75,8 @@ method of [AccumuloFileOutputFormat].
 [sample-package]: {% jurl org.apache.accumulo.core.client.sample %}
 [skv-iterator]: {% jurl org.apache.accumulo.core.iterators.SortedKeyValueIterator %}
 [clone-sampling]: {% jurl org.apache.accumulo.core.iterators.IteratorEnvironment#cloneWithSamplingEnabled-- %}
-[AccumuloInputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloInputFormat %}
-[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloFileOutputFormat %}
+[AccumuloInputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloInputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
 [SampleNotPresentException]: {% jurl org.apache.accumulo.core.client.SampleNotPresentException %}
 [BatchScanner]: {% jurl org.apache.accumulo.core.client.BatchScanner %}
 [Scanner]: {% jurl org.apache.accumulo.core.client.Scanner %}
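The sampling doc changed above relies on samples being deterministic: every scan (and every MapReduce job using `samplerConfiguration()`) must see the same subset of rows for a given sampler configuration. A toy sketch of that idea in plain Java (Accumulo's RowSampler actually hashes with murmur3 over the row; `hashCode()` below is a stand-in, and all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class RowSamplerSketch {
    // A row is "in the sample" when its hash modulo the configured value is 0.
    // Because the decision depends only on the row ID, repeated scans agree.
    static boolean inSample(String rowId, int modulus) {
        return Math.floorMod(rowId.hashCode(), modulus) == 0;
    }

    // Filter a list of row IDs down to the sampled subset.
    static List<String> sample(List<String> rows, int modulus) {
        List<String> kept = new ArrayList<>();
        for (String r : rows) {
            if (inSample(r, modulus)) {
                kept.add(r);
            }
        }
        return kept;
    }
}
```

The `SampleNotPresentException` cases listed in the doc arise exactly when data on disk was written with a different modulus (or hasher) than the one a scan requests, so the stored subset no longer corresponds to the requested one.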
diff --git a/_docs-2/development/summaries.md b/_docs-2/development/summaries.md
index d68a570..40f6c1e 100644
--- a/_docs-2/development/summaries.md
+++ b/_docs-2/development/summaries.md
@@ -63,8 +63,8 @@ requires a special permission.  User must have the table permission
 ## Bulk import
 
 When generating RFiles to bulk import into Accumulo, those RFiles can contain
-summary data.  To use this feature, look at the javadoc on the
-`AccumuloFileOutputFormat.setSummarizers(...)` method.  Also, the {% jlink org.apache.accumulo.core.client.rfile.RFile %}
+summary data.  To use this feature, look at the javadoc of `summarizers()` in the `configure()` method
+of AccumuloFileOutputFormat.  Also, the {% jlink org.apache.accumulo.core.client.rfile.RFile %}
 class has options for creating RFiles with embedded summary data.
 
 ## Examples
@@ -218,3 +218,4 @@ root@uno summary_test> summaries
 root@uno summary_test>   
 ```
 
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
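The `summarizers()` option referenced in the summaries doc attaches Summarizer implementations that fold per-file statistics into an RFile as it is written. A toy stand-in for that accumulation step (plain Java, no Accumulo types; the class and method names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class SummarizerSketch {
    // Toy counting summarizer: as key/value pairs are written, bump a named
    // counter; the finished map is what would be embedded alongside the
    // file's data for later `summaries` queries.
    private final Map<String, Long> counts = new HashMap<>();

    void accept(String columnFamily) {
        counts.merge("cf:" + columnFamily, 1L, Long::sum);
    }

    Map<String, Long> summary() {
        return counts;
    }
}
```

Because the statistics are computed while writing, reading a summary later never requires rescanning the file's key/value data.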
diff --git a/_docs-2/security/kerberos.md b/_docs-2/security/kerberos.md
index 716f630..2535935 100644
--- a/_docs-2/security/kerberos.md
+++ b/_docs-2/security/kerberos.md
@@ -390,14 +390,14 @@ KerberosToken kt = new KerberosToken();
 AccumuloClient client = Accumulo.newClient().to("myinstance", "zoo1,zoo2")
                           .as(principal, kt).build();
 DelegationToken dt = client.securityOperations().getDelegationToken();
-AccumuloClient client2 = client.changeUser(principal, dt);
-ClientInfo info2 = client2.info();
+Properties props = Accumulo.newClientProperties().from(client.properties())
+                          .as(principal, dt).build();
 
 // Reading from Accumulo
-AccumuloInputFormat.setClientInfo(job, info2);
+AccumuloInputFormat.configure().clientProperties(props).store(job);
 
 // Writing to Accumulo
-AccumuloOutputFormat.setClientInfo(job, info2);
+AccumuloOutputFormat.configure().clientProperties(props).store(job);
 ```
 
 Users must have the `DELEGATION_TOKEN` system permission to call the `getDelegationToken`
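The kerberos change above swaps the old `ClientInfo` handoff for a `Properties`-based builder: start `.from(...)` an existing client's properties, then layer the delegation token credentials on with `.as(principal, dt)`. A rough stand-in using plain `java.util.Properties` (the property keys below are invented for illustration and are not the real Accumulo client property names):

```java
import java.util.Properties;

public class DelegationPropsSketch {
    // Clone a base set of client properties, then replace the authentication
    // entries, mimicking how new credentials are layered over an existing
    // configuration before being handed to the input/output formats.
    static Properties withAuth(Properties base, String principal, String token) {
        Properties p = new Properties();
        p.putAll(base);                       // start "from" the existing config
        p.setProperty("auth.principal", principal);
        p.setProperty("auth.token", token);   // e.g. a serialized delegation token
        return p;
    }
}
```

The point of the pattern is that connection settings (instance, ZooKeepers) survive unchanged while only the credentials differ between the Kerberos client and the job's delegation-token client.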
diff --git a/_docs-2/security/on-disk-encryption.md b/_docs-2/security/on-disk-encryption.md
index e7be37b..7046767 100644
--- a/_docs-2/security/on-disk-encryption.md
+++ b/_docs-2/security/on-disk-encryption.md
@@ -78,8 +78,8 @@ its the additional data that gets encrypted on disk that could be exposed in a l
 
 ### Bulk Import
 
-There are 2 ways to create RFiles for bulk ingest: with the [RFile API][rfile] and during Map Reduce using [AccumuloOutputFormat].  
-The [RFile API][rfile] allows passing in the configuration properties for encryption mentioned above.  The [AccumuloOutputFormat] does 
+There are 2 ways to create RFiles for bulk ingest: with the [RFile API][rfile] and during Map Reduce using [AccumuloFileOutputFormat].  
+The [RFile API][rfile] allows passing in the configuration properties for encryption mentioned above.  The [AccumuloFileOutputFormat] does 
 not allow for encryption of RFiles so any data bulk imported through this process will be unencrypted.
 
 ### Zookeeper
@@ -104,4 +104,4 @@ As you can see, there is a significant performance hit when running without the
 [Kerberos]: {% durl security/kerberos %}
 [design]: {% durl getting-started/design#rfile %}
 [rfile]: {% jurl org.apache.accumulo.core.client.rfile.RFile %}
-[AccumuloOutputFormat]: {% jurl org.apache.accumulo.core.client.mapred.AccumuloOutputFormat %}
+[AccumuloFileOutputFormat]: {% jurl org.apache.accumulo.hadoop.mapreduce.AccumuloFileOutputFormat %}
diff --git a/_plugins/links.rb b/_plugins/links.rb
index 2f9dc3f..f227890 100755
--- a/_plugins/links.rb
+++ b/_plugins/links.rb
@@ -43,8 +43,8 @@ def render_javadoc(context, text, url_only)
   jmodule = 'accumulo-' + clz.split('.')[3]
   if clz.start_with?('org.apache.accumulo.server')
     jmodule = 'accumulo-server-base'
-  elsif clz.start_with?('org.apache.accumulo.core.client.mapred')
-    jmodule = 'accumulo-client-mapreduce'
+  elsif clz.start_with?('org.apache.accumulo.hadoop.mapred')
+    jmodule = 'accumulo-hadoop-mapreduce'
   elsif clz.start_with?('org.apache.accumulo.iteratortest')
     jmodule = 'accumulo-iterator-test-harness'
   end
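The `render_javadoc` change above repoints the site's javadoc links: a class's Maven module is derived from the fourth package segment unless a known prefix overrides it. A Java transcription of that selection logic (the site plugin itself is Ruby; this class is illustrative only):

```java
public class JavadocModuleSketch {
    // Pick the Maven module hosting a class's javadoc, mirroring the
    // prefix checks in links.rb: default to "accumulo-" plus the fourth
    // package segment, with explicit overrides for special packages.
    static String moduleFor(String clz) {
        String module = "accumulo-" + clz.split("\\.")[3];
        if (clz.startsWith("org.apache.accumulo.server")) {
            module = "accumulo-server-base";
        } else if (clz.startsWith("org.apache.accumulo.hadoop.mapred")) {
            module = "accumulo-hadoop-mapreduce";
        } else if (clz.startsWith("org.apache.accumulo.iteratortest")) {
            module = "accumulo-iterator-test-harness";
        }
        return module;
    }
}
```

Note the `hadoop.mapred` prefix also matches `org.apache.accumulo.hadoop.mapreduce.*` classes, which is why the single branch covers both the mapred and mapreduce variants of the new format classes.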

