hbase-commits mailing list archives

From mi...@apache.org
Subject [07/11] hbase git commit: HBASE-12922 Post-asciidoc conversion fix-ups part 2 <Lars Francke>
Date Thu, 29 Jan 2015 22:59:07 GMT
http://git-wip-us.apache.org/repos/asf/hbase/blob/a7816d88/src/main/asciidoc/_chapters/mapreduce.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/mapreduce.adoc b/src/main/asciidoc/_chapters/mapreduce.adoc
index 1228f57..a008a4f 100644
--- a/src/main/asciidoc/_chapters/mapreduce.adoc
+++ b/src/main/asciidoc/_chapters/mapreduce.adoc
@@ -29,48 +29,48 @@
 
 Apache MapReduce is a software framework used to analyze large amounts of data, and is the
framework used most often with link:http://hadoop.apache.org/[Apache Hadoop].
 MapReduce itself is out of the scope of this document.
-A good place to get started with MapReduce is link:http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.
-MapReduce version 2 (MR2)is now part of link:http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/[YARN].

+A good place to get started with MapReduce is http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html.
+MapReduce version 2 (MR2) is now part of link:http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/[YARN].
 
 This chapter discusses specific configuration steps you need to take to use MapReduce on
data within HBase.
-In addition, it discusses other interactions and issues between HBase and MapReduce jobs.

+In addition, it discusses other interactions and issues between HBase and MapReduce jobs.
 
-.mapred and mapreduce
+.`mapred` and `mapreduce`
 [NOTE]
 ====
 There are two mapreduce packages in HBase, as in MapReduce itself: _org.apache.hadoop.hbase.mapred_
     and _org.apache.hadoop.hbase.mapreduce_.
 The former uses the old-style API and the latter the new style.
 The latter has more facilities, though you can usually find an equivalent in the older package.
-Pick the package that goes with your mapreduce deploy.
+Pick the package that goes with your MapReduce deploy.
 When in doubt or starting over, pick _org.apache.hadoop.hbase.mapreduce_.
-In the notes below, we refer to o.a.h.h.mapreduce but replace with the o.a.h.h.mapred if
that is what you are using. 
-====  
+In the notes below, we refer to o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if
that is what you are using.
+====
 
 [[hbase.mapreduce.classpath]]
 == HBase, MapReduce, and the CLASSPATH
 
 By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the
HBase configuration under `$HBASE_CONF_DIR` or the HBase classes.
 
-To give the MapReduce jobs the access they need, you could add _hbase-site.xml_ to the _$HADOOP_HOME/conf/_
directory and add the HBase JARs to the _`$HADOOP_HOME`/conf/_        directory, then copy
these changes across your cluster.
-You could add hbase-site.xml to `$HADOOP_HOME`/conf and add HBase jars to the $HADOOP_HOME/lib.
-You would then need to copy these changes across your cluster or edit _`$HADOOP_HOME`/conf/hadoop-env.sh_
and add them to the `HADOOP_CLASSPATH` variable.
+To give the MapReduce jobs the access they need, you could add _hbase-site.xml_ to the _$HADOOP_HOME/conf/_
directory and add the HBase JARs to the _$HADOOP_HOME/lib/_ directory.
+You would then need to copy these changes across your cluster, or edit _$HADOOP_HOME/conf/hadoop-env.sh_
and add them to the `HADOOP_CLASSPATH` variable.
 However, this approach is not recommended because it will pollute your Hadoop install with
HBase references.
 It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.
 
 Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself.
 The dependencies only need to be available on the local `CLASSPATH`.
-The following example runs the bundled HBase link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
       MapReduce job against a table named [systemitem]+usertable+ If you have not set the
environment variables expected in the command (the parts prefixed by a `$` sign and curly
braces), you can use the actual system paths instead.
+The following example runs the bundled HBase link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
MapReduce job against a table named `usertable`. If you have not set the environment variables
expected in the command (the parts prefixed by a `$` sign and curly braces), you can use the
actual system paths instead.
 Be sure to use the correct version of the HBase JAR for your system.
-The backticks (``` symbols) cause ths shell to execute the sub-commands, setting the CLASSPATH
as part of the command.
-This example assumes you use a BASH-compatible shell. 
+The backticks (`` ` `` symbols) cause the shell to execute the sub-commands, setting the `CLASSPATH`
as part of the command.
+This example assumes you use a BASH-compatible shell.
 
 [source,bash]
 ----
 $ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar
rowcounter usertable
 ----
 
-When the command runs, internally, the HBase JAR finds the dependencies it needs for zookeeper,
guava, and its other dependencies on the passed `HADOOP_CLASSPATH`        and adds the JARs
to the MapReduce job configuration.
-See the source at TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for
how this is done. 
+When the command runs, internally, the HBase JAR finds the dependencies it needs for ZooKeeper,
Guava, and its other dependencies on the passed `HADOOP_CLASSPATH` and adds the JARs to the
MapReduce job configuration.
+See the source at `TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job)`
for how this is done.
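If you are writing your own job driver rather than using the bundled JAR, a minimal sketch of
the same idea might look like the following (the job name is only an example;
`TableMapReduceUtil.initTableMapperJob` already calls this helper for you):

[source,java]
----
// Illustrative driver fragment: ship the HBase dependency JARs with the job.
Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "my-hbase-job");  // hypothetical job name
// Adds HBase, ZooKeeper, Guava, and the other HBase dependencies found on the
// local CLASSPATH to the job's distributed classpath.
TableMapReduceUtil.addDependencyJars(job);
----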
 
 [NOTE]
 ====
@@ -89,10 +89,10 @@ $ HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSH
 ----
 ====
 
-.Notice to Mapreduce users of HBase 0.96.1 and above
+.Notice to MapReduce users of HBase 0.96.1 and above
 [CAUTION]
 ====
-Some mapreduce jobs that use HBase fail to launch.
+Some MapReduce jobs that use HBase fail to launch.
 The symptom is an exception similar to the following:
 
 ----
@@ -125,15 +125,15 @@ Exception in thread "main" java.lang.IllegalAccessError: class
 ...
 ----
 
-This is caused by an optimization introduced in link:https://issues.apache.org/jira/browse/HBASE-9867[HBASE-9867]
that inadvertently introduced a classloader dependency. 
+This is caused by an optimization introduced in link:https://issues.apache.org/jira/browse/HBASE-9867[HBASE-9867]
that inadvertently introduced a classloader dependency.
 
 This affects both jobs using the `-libjars` option and "fat jar" jobs, which package their
runtime dependencies in a nested `lib` folder.
 
-In order to satisfy the new classloader requirements, hbase-protocol.jar must be included
in Hadoop's classpath.
-See <<hbase.mapreduce.classpath,hbase.mapreduce.classpath>> for current recommendations
for resolving classpath errors.
+In order to satisfy the new classloader requirements, `hbase-protocol.jar` must be included
in Hadoop's classpath.
+See <<hbase.mapreduce.classpath>> for current recommendations for resolving classpath
errors.
 The following is included for historical purposes.
 
-This can be resolved system-wide by including a reference to the hbase-protocol.jar in hadoop's
lib directory, via a symlink or by copying the jar into the new location.
+This can be resolved system-wide by including a reference to the `hbase-protocol.jar` in
Hadoop's lib directory, via a symlink or by copying the jar into the new location.
 
 This can also be achieved on a per-job launch basis by including it in the `HADOOP_CLASSPATH`
environment variable at job submission time.
 When launching jobs that package their dependencies, all three of the following job launching
commands satisfy this requirement:
@@ -162,7 +162,7 @@ This functionality was lost due to a bug in HBase 0.95 (link:https://issues.apac
 The priority order for choosing the scanner caching is as follows:
 
 . Caching settings which are set on the scan object.
-. Caching settings which are specified via the configuration option +hbase.client.scanner.caching+,
which can either be set manually in _hbase-site.xml_ or via the helper method `TableMapReduceUtil.setScannerCaching()`.
+. Caching settings which are specified via the configuration option `hbase.client.scanner.caching`,
which can either be set manually in _hbase-site.xml_ or via the helper method `TableMapReduceUtil.setScannerCaching()`.
 . The default value `HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING`, which is set to `100`.
 
 Optimizing the caching settings is a balance between the time the client waits for a result
and the number of sets of results the client needs to receive.
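For example, the highest-priority setting, on the `Scan` object itself, might be applied as in
the following sketch (the caching value, table name, mapper class, and `job` variable are
placeholders, not part of the HBase API examples elsewhere in this chapter):

[source,java]
----
// Illustrative only: set scanner caching directly on the Scan passed to the job.
Scan scan = new Scan();
scan.setCaching(500);        // rows fetched per RPC; tune for your row size
scan.setCacheBlocks(false);  // recommended for MR jobs, as in the examples below
TableMapReduceUtil.initTableMapperJob(
  "myTable",        // hypothetical source table
  scan,
  MyMapper.class,   // hypothetical mapper class
  null,             // mapper output key
  null,             // mapper output value
  job);
----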
@@ -176,7 +176,7 @@ See the API documentation for link:https://hbase.apache.org/apidocs/org/apache/h
 
 == Bundled HBase MapReduce Jobs
 
-The HBase JAR also serves as a Driver for some bundled mapreduce jobs.
+The HBase JAR also serves as a Driver for some bundled MapReduce jobs.
 To learn about the bundled MapReduce jobs, run the following command.
 
 [source,bash]
@@ -202,35 +202,35 @@ $ ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar
rowcounte
 
 == HBase as a MapReduce Job Data Source and Data Sink
 
-HBase can be used as a data source, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat],
and data sink, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]
       or link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html[MultiTableOutputFormat],
for MapReduce jobs.
+HBase can be used as a data source, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat],
and data sink, link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]
or link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html[MultiTableOutputFormat],
for MapReduce jobs.
 When writing MapReduce jobs that read or write HBase, it is advisable to subclass link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html[TableMapper]
and/or link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html[TableReducer].
-See the do-nothing pass-through classes link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html[IdentityTableMapper]
       and link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html[IdentityTableReducer]
       for basic usage.
-For a more involved example, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
       or review the `org.apache.hadoop.hbase.mapreduce.TestTableMapReduce` unit test. 
+See the do-nothing pass-through classes link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html[IdentityTableMapper]
and link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html[IdentityTableReducer]
for basic usage.
+For a more involved example, see link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
or review the `org.apache.hadoop.hbase.mapreduce.TestTableMapReduce` unit test.
 
 If you run MapReduce jobs that use HBase as source or sink, you need to specify source and sink
table and column names in your configuration.
 
 When you read from HBase, the `TableInputFormat` requests the list of regions from HBase
and makes a map, which is either a `map-per-region` or `mapreduce.job.maps` map, whichever
is smaller.
 If your job only has two maps, raise `mapreduce.job.maps` to a number greater than the number
of regions.
-Maps will run on the adjacent TaskTracker if you are running a TaskTracer and RegionServer
per node.
+Maps will run on the adjacent TaskTracker/NodeManager if you are running a TaskTracker/NodeManager
and RegionServer per node.
 When writing to HBase, it may make sense to avoid the Reduce step and write back into HBase
from within your map.
 This approach works when your job does not need the sort and collation that MapReduce does
on the map-emitted data.
 On insert, HBase 'sorts' so there is no point double-sorting (and shuffling data around your
MapReduce cluster) unless you need to.
-If you do not need the Reduce, you myour map might emit counts of records processed for reporting
at the end of the jobj, or set the number of Reduces to zero and use TableOutputFormat.
+If you do not need the Reduce, your map might emit counts of records processed for reporting
at the end of the job, or set the number of Reduces to zero and use TableOutputFormat.
 If running the Reduce step makes sense in your case, you should typically use multiple reducers
so that load is spread across the HBase cluster.
 
 A new HBase partitioner, the link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html[HRegionPartitioner],
can run as many reducers as there are existing regions.
 The HRegionPartitioner is suitable when your table is large and your upload will not greatly
alter the number of existing regions upon completion.
-Otherwise use the default partitioner. 
+Otherwise use the default partitioner.
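As a rough sketch (reusing the `targetTable`, `MyTableReducer`, and `job` names from the examples
later in this chapter; `regionCount` is a hypothetical variable), the partitioner can be passed to
`TableMapReduceUtil` when the reducer is wired up:

[source,java]
----
// Illustrative: partition reduce output so each reducer feeds one region.
TableMapReduceUtil.initTableReducerJob(
  targetTable,                // output table
  MyTableReducer.class,       // reducer class
  job,
  HRegionPartitioner.class);  // use HRegionPartitioner instead of the default
job.setNumReduceTasks(regionCount);  // hypothetical count; at most the number of regions
----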
 
 == Writing HFiles Directly During Bulk Import
 
 If you are importing into a new table, you can bypass the HBase API and write your content
directly to the filesystem, formatted into HBase data files (HFiles). Your import will run
faster, perhaps an order of magnitude faster.
-For more on how this mechanism works, see <<arch.bulk.load,arch.bulk.load>>.
+For more on how this mechanism works, see <<arch.bulk.load>>.
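As a rough sketch of the job wiring (the `conf`, `table`, and `regionLocator` handles are
placeholders, and the exact `configureIncrementalLoad` signature varies between HBase versions,
so check the `HFileOutputFormat2` API for yours):

[source,java]
----
// Illustrative bulk-load job setup: the job writes HFiles instead of issuing Puts.
Job job = Job.getInstance(conf, "bulk-import");          // hypothetical job name
job.setMapOutputKeyClass(ImmutableBytesWritable.class);  // row key
job.setMapOutputValueClass(Put.class);                   // or KeyValue
// Configures the reducer, partitioner, and output format so the generated HFiles
// line up with the target table's region boundaries.
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
// After the job completes, hand the HFiles to HBase with the completebulkload
// tool (LoadIncrementalHFiles).
----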
 
 == RowCounter Example
 
-The included link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
       MapReduce job uses `TableInputFormat` and does a count of all rows in the specified
table.
-To run it, use the following command: 
+The included link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html[RowCounter]
MapReduce job uses `TableInputFormat` and does a count of all rows in the specified table.
+To run it, use the following command:
 
 [source,bash]
 ----
@@ -239,9 +239,9 @@ $ ./bin/hadoop jar hbase-X.X.X.jar
 
 This will invoke the HBase MapReduce Driver class.
 Select `rowcounter` from the choice of jobs offered.
-This will print rowcouner usage advice to standard output.
+This will print rowcounter usage advice to standard output.
 Specify the tablename, column to count, and output directory.
-If you have classpath errors, see <<hbase.mapreduce.classpath,hbase.mapreduce.classpath>>.
+If you have classpath errors, see <<hbase.mapreduce.classpath>>.
 
 [[splitter]]
 == Map-Task Splitting
@@ -249,14 +249,14 @@ If you have classpath errors, see <<hbase.mapreduce.classpath,hbase.mapreduce.cl
 [[splitter.default]]
 === The Default HBase MapReduce Splitter
 
-When link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat]
         is used to source an HBase table in a MapReduce job, its splitter will make a map
task for each region of the table.
+When link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html[TableInputFormat]
is used to source an HBase table in a MapReduce job, its splitter will make a map task for
each region of the table.
 Thus, if there are 100 regions in the table, there will be 100 map-tasks for the job - regardless
of how many column families are selected in the Scan.
 
 [[splitter.custom]]
 === Custom Splitters
 
 For those interested in implementing custom splitters, see the method `getSplits` in link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html[TableInputFormatBase].
-That is where the logic for map-task assignment resides. 
+That is where the logic for map-task assignment resides.
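As a hedged illustration (not taken from the HBase codebase), a custom input format might start
from the default per-region splits and post-process them:

[source,java]
----
// Illustrative subclass: adjust the splits produced by TableInputFormatBase.
public class MyTableInputFormat extends TableInputFormat {

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> splits = super.getSplits(context);  // one split per region by default
    // Custom logic could merge, filter, or further subdivide the splits here.
    return splits;
  }
}
----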
 
 [[mapreduce.example]]
 == HBase MapReduce Examples
@@ -325,16 +325,16 @@ scan.setCacheBlocks(false);  // don't set to true for MR jobs
 // set other scan attrs
 
 TableMapReduceUtil.initTableMapperJob(
-	sourceTable,      // input table
-	scan,	          // Scan instance to control CF and attribute selection
-	MyMapper.class,   // mapper class
-	null,	          // mapper output key
-	null,	          // mapper output value
-	job);
+  sourceTable,      // input table
+  scan,             // Scan instance to control CF and attribute selection
+  MyMapper.class,   // mapper class
+  null,             // mapper output key
+  null,             // mapper output value
+  job);
 TableMapReduceUtil.initTableReducerJob(
-	targetTable,      // output table
-	null,             // reducer class
-	job);
+  targetTable,      // output table
+  null,             // reducer class
+  job);
 job.setNumReduceTasks(0);
 
 boolean b = job.waitForCompletion(true);
@@ -343,45 +343,45 @@ if (!b) {
 }
 ----
 
-An explanation is required of what `TableMapReduceUtil` is doing, especially with the reducer.
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]
         is being used as the outputFormat class, and several parameters are being set on
the config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
to `ImmutableBytesWritable` and reducer value to `Writable`.
+An explanation is required of what `TableMapReduceUtil` is doing, especially with the reducer.
link:http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html[TableOutputFormat]
is being used as the outputFormat class, and several parameters are being set on the config
(e.g., `TableOutputFormat.OUTPUT_TABLE`), as well as setting the reducer output key to `ImmutableBytesWritable`
and reducer value to `Writable`.
 These could be set by the programmer on the job and conf, but `TableMapReduceUtil` tries
to make things easier.
 
-The following is the example mapper, which will create a `Put`          and matching the
input `Result` and emit it.
-Note: this is what the CopyTable utility does. 
+The following is the example mapper, which will create a `Put` matching the input `Result`
and emit it.
+Note: this is what the CopyTable utility does.
 
 [source,java]
 ----
 public static class MyMapper extends TableMapper<ImmutableBytesWritable, Put>  {
 
-	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException,
InterruptedException {
-		// this example is just copying the data from the source table...
-   		context.write(row, resultToPut(row,value));
-   	}
-
-  	private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException
{
-  		Put put = new Put(key.get());
- 		for (KeyValue kv : result.raw()) {
-			put.add(kv);
-		}
-		return put;
-   	}
+  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException,
InterruptedException {
+    // this example is just copying the data from the source table...
+    context.write(row, resultToPut(row,value));
+  }
+
+  private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException
{
+    Put put = new Put(key.get());
+    for (KeyValue kv : result.raw()) {
+      put.add(kv);
+    }
+    return put;
+  }
 }
 ----
 
-There isn't actually a reducer step, so `TableOutputFormat` takes care of sending the `Put`
to the target table. 
+There isn't actually a reducer step, so `TableOutputFormat` takes care of sending the `Put`
to the target table.
 
-This is just an example, developers could choose not to use `TableOutputFormat` and connect
to the target table themselves. 
+This is just an example, developers could choose not to use `TableOutputFormat` and connect
to the target table themselves.
 
 [[mapreduce.example.readwrite.multi]]
 === HBase MapReduce Read/Write Example With Multi-Table Output
 
-TODO: example for `MultiTableOutputFormat`. 
+TODO: example for `MultiTableOutputFormat`.
 
 [[mapreduce.example.summary]]
 === HBase MapReduce Summary to HBase Example
 
 The following example uses HBase as a MapReduce source and sink with a summarization step.
-This example will count the number of distinct instances of a value in a table and write
those summarized counts in another table. 
+This example will count the number of distinct instances of a value in a table and write
those summarized counts in another table.
 
 [source,java]
 ----
@@ -395,72 +395,71 @@ scan.setCacheBlocks(false);  // don't set to true for MR jobs
 // set other scan attrs
 
 TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
+  sourceTable,        // input table
+  scan,               // Scan instance to control CF and attribute selection
+  MyMapper.class,     // mapper class
+  Text.class,         // mapper output key
+  IntWritable.class,  // mapper output value
+  job);
 TableMapReduceUtil.initTableReducerJob(
-	targetTable,        // output table
-	MyTableReducer.class,    // reducer class
-	job);
+  targetTable,        // output table
+  MyTableReducer.class,    // reducer class
+  job);
 job.setNumReduceTasks(1);   // at least one, adjust as required
 
 boolean b = job.waitForCompletion(true);
 if (!b) {
-	throw new IOException("error with job!");
+  throw new IOException("error with job!");
 }
-----          
+----
 
 In this example mapper a column with a String-value is chosen as the value to summarize upon.
-This value is used as the key to emit from the mapper, and an `IntWritable` represents an
instance counter. 
+This value is used as the key to emit from the mapper, and an `IntWritable` represents an
instance counter.
 
 [source,java]
 ----
 public static class MyMapper extends TableMapper<Text, IntWritable>  {
-	public static final byte[] CF = "cf".getBytes();
-	public static final byte[] ATTR1 = "attr1".getBytes();
-
-	private final IntWritable ONE = new IntWritable(1);
-   	private Text text = new Text();
+  public static final byte[] CF = "cf".getBytes();
+  public static final byte[] ATTR1 = "attr1".getBytes();
 
-   	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException,
InterruptedException {
-        	String val = new String(value.getValue(CF, ATTR1));
-          	text.set(val);     // we can only emit Writables...
+  private final IntWritable ONE = new IntWritable(1);
+  private Text text = new Text();
 
-        	context.write(text, ONE);
-   	}
+  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException,
InterruptedException {
+    String val = new String(value.getValue(CF, ATTR1));
+    text.set(val);     // we can only emit Writables...
+    context.write(text, ONE);
+  }
 }
-----          
+----
 
-In the reducer, the "ones" are counted (just like any other MR example that does this), and
then emits a `Put`. 
+In the reducer, the "ones" are counted (just like any other MR example that does this), and
then a `Put` is emitted.
 
 [source,java]
 ----
 public static class MyTableReducer extends TableReducer<Text, IntWritable, ImmutableBytesWritable>
 {
-	public static final byte[] CF = "cf".getBytes();
-	public static final byte[] COUNT = "count".getBytes();
-
- 	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
-    		int i = 0;
-    		for (IntWritable val : values) {
-    			i += val.get();
-    		}
-    		Put put = new Put(Bytes.toBytes(key.toString()));
-    		put.add(CF, COUNT, Bytes.toBytes(i));
-
-    		context.write(null, put);
-   	}
+  public static final byte[] CF = "cf".getBytes();
+  public static final byte[] COUNT = "count".getBytes();
+
+  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
+    int i = 0;
+    for (IntWritable val : values) {
+      i += val.get();
+    }
+    Put put = new Put(Bytes.toBytes(key.toString()));
+    put.add(CF, COUNT, Bytes.toBytes(i));
+
+    context.write(null, put);
+  }
 }
-----        
+----
 
 [[mapreduce.example.summary.file]]
 === HBase MapReduce Summary to File Example
 
 This is very similar to the summary example above, with the exception that it uses HBase as
a MapReduce source but HDFS as the sink.
 The differences are in the job setup and in the reducer.
-The mapper remains the same. 
+The mapper remains the same.
 
 [source,java]
 ----
@@ -474,19 +473,19 @@ scan.setCacheBlocks(false);  // don't set to true for MR jobs
 // set other scan attrs
 
 TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
+  sourceTable,        // input table
+  scan,               // Scan instance to control CF and attribute selection
+  MyMapper.class,     // mapper class
+  Text.class,         // mapper output key
+  IntWritable.class,  // mapper output value
+  job);
 job.setReducerClass(MyReducer.class);    // reducer class
 job.setNumReduceTasks(1);    // at least one, adjust as required
 FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories
as required
 
 boolean b = job.waitForCompletion(true);
 if (!b) {
-	throw new IOException("error with job!");
+  throw new IOException("error with job!");
 }
 ----
 
@@ -497,68 +496,68 @@ As for the Reducer, it is a "generic" Reducer instead of extending TableMapper
a
 ----
 public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable>
 {
 
-	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
-		int i = 0;
-		for (IntWritable val : values) {
-			i += val.get();
-		}
-		context.write(key, new IntWritable(i));
-	}
+  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
+    int i = 0;
+    for (IntWritable val : values) {
+      i += val.get();
+    }
+    context.write(key, new IntWritable(i));
+  }
 }
 ----
 
 [[mapreduce.example.summary.noreducer]]
 === HBase MapReduce Summary to HBase Without Reducer
 
-It is also possible to perform summaries without a reducer - if you use HBase as the reducer.

+It is also possible to perform summaries without a reducer - if you use HBase as the reducer.
 
 An HBase target table would need to exist for the job summary.
 The Table method `incrementColumnValue` would be used to atomically increment values.
-From a performance perspective, it might make sense to keep a Map of values with their values
to be incremeneted for each map-task, and make one update per key at during the `cleanup`
method of the mapper.
-However, your milage may vary depending on the number of rows to be processed and unique
keys. 
+From a performance perspective, it might make sense to keep a Map of values with their counts
to be incremented for each map-task, and make one update per key during the `cleanup` method
of the mapper.
+However, your mileage may vary depending on the number of rows to be processed and unique
keys.
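The following is a hedged sketch of that pattern (the connection setup, table name, column
family, and qualifiers are placeholders, and error handling is omitted):

[source,java]
----
// Illustrative map-only summary: count locally, then apply atomic increments in cleanup().
public static class MySummaryMapper extends TableMapper<Text, IntWritable> {
  public static final byte[] CF = "cf".getBytes();
  public static final byte[] ATTR1 = "attr1".getBytes();
  private Connection connection;
  private Table summaryTable;                              // hypothetical target table
  private Map<String, Long> counts = new HashMap<String, Long>();

  public void setup(Context context) throws IOException {
    connection = ConnectionFactory.createConnection(context.getConfiguration());
    summaryTable = connection.getTable(TableName.valueOf("summary"));  // placeholder name
  }

  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    String val = new String(value.getValue(CF, ATTR1));
    Long current = counts.get(val);
    counts.put(val, current == null ? 1L : current + 1);   // count locally, emit nothing
  }

  public void cleanup(Context context) throws IOException {
    for (Map.Entry<String, Long> e : counts.entrySet()) {  // one increment per distinct value
      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF,
          Bytes.toBytes("count"), e.getValue());
    }
    summaryTable.close();
    connection.close();
  }
}
----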
 
-In the end, the summary results are in HBase. 
+In the end, the summary results are in HBase.
 
 [[mapreduce.example.summary.rdbms]]
 === HBase MapReduce Summary to RDBMS
 
 Sometimes it is more appropriate to generate summaries to an RDBMS.
 For these cases, it is possible to generate summaries directly to an RDBMS via a custom reducer.
-The `setup` method can connect to an RDBMS (the connection information can be passed via
custom parameters in the context) and the cleanup method can close the connection. 
+The `setup` method can connect to an RDBMS (the connection information can be passed via
custom parameters in the context) and the `cleanup` method can close the connection.
 
 It is critical to understand that the number of reducers for the job affects the summarization
implementation, and you'll have to design this into your reducer.
 Specifically, whether it is designed to run as a singleton (one reducer) or multiple reducers.
 Neither is right or wrong; it depends on your use case.
-Recognize that the more reducers that are assigned to the job, the more simultaneous connections
to the RDBMS will be created - this will scale, but only to a point. 
+Recognize that the more reducers that are assigned to the job, the more simultaneous connections
to the RDBMS will be created - this will scale, but only to a point.
 
 [source,java]
 ----
- public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable>
 {
+public static class MyRdbmsReducer extends Reducer<Text, IntWritable, Text, IntWritable>
 {
 
-	private Connection c = null;
+  private Connection c = null;
 
-	public void setup(Context context) {
-  		// create DB connection...
-  	}
+  public void setup(Context context) {
+    // create DB connection...
+  }
 
-	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
-		// do summarization
-		// in this example the keys are Text, but this is just an example
-	}
+  public void reduce(Text key, Iterable<IntWritable> values, Context context) throws
IOException, InterruptedException {
+    // do summarization
+    // in this example the keys are Text, but this is just an example
+  }
 
-	public void cleanup(Context context) {
-  		// close db connection
-  	}
+  public void cleanup(Context context) {
+    // close db connection
+  }
 
 }
 ----
 
-In the end, the summary results are written to your RDBMS table/s. 
+In the end, the summary results are written to your RDBMS table(s).
 
 [[mapreduce.htable.access]]
 == Accessing Other HBase Tables in a MapReduce Job
 
-Although the framework currently allows one HBase table as input to a MapReduce job, other
HBase tables can be accessed as lookup tables, etc., in a MapReduce job via creating an Table
instance in the setup method of the Mapper. 
+Although the framework currently allows one HBase table as input to a MapReduce job, other
HBase tables can be accessed as lookup tables, etc., by creating a Table instance in the setup
method of the Mapper.
 [source,java]
 ----
 public class MyMapper extends TableMapper<Text, LongWritable> {
@@ -571,16 +570,16 @@ public class MyMapper extends TableMapper<Text, LongWritable>
{
   }
 
   public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException,
InterruptedException {
-	// process Result...
-	// use 'myOtherTable' for lookups
+    // process Result...
+    // use 'myOtherTable' for lookups
   }
-----      
+----
 
 [[mapreduce.specex]]
 == Speculative Execution
 
 It is generally advisable to turn off speculative execution for MapReduce jobs that use HBase
as a source.
 This can either be done on a per-Job basis through properties, or on the entire cluster.
-Especially for longer running jobs, speculative execution will create duplicate map-tasks
which will double-write your data to HBase; this is probably not what you want. 
+Especially for longer running jobs, speculative execution will create duplicate map-tasks
which will double-write your data to HBase; this is probably not what you want.
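On a per-job basis, for example, the standard Hadoop 2 properties can be set on the configuration
before the job is created (a minimal sketch; the job name is only an example):

[source,java]
----
// Illustrative per-job settings that disable speculative execution.
Configuration conf = HBaseConfiguration.create();
conf.setBoolean("mapreduce.map.speculative", false);     // no duplicate map attempts
conf.setBoolean("mapreduce.reduce.speculative", false);  // no duplicate reduce attempts
Job job = Job.getInstance(conf, "hbase-sourced-job");    // hypothetical job name
----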
 
-See <<spec.ex,spec.ex>> for more information. 
+See <<spec.ex>> for more information.

