hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/MapReduce" by allenday
Date Thu, 21 Aug 2008 05:03:00 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by allenday:
http://wiki.apache.org/hadoop/Hbase/MapReduce

------------------------------------------------------------------------------
  When running mapreduce jobs that have hbase as a source or sink, you'll need to specify the source/sink table and column names in your configuration.
  
  When reading from hbase, the !TableInputFormat asks hbase for the list of regions and makes one map per region.  When writing, it may make sense to skip the reduce step and write back into hbase from inside your map.  You'd do this when your job does not need the sort and collation that MR does in its reduce; hbase sorts on insert, so there is no point double-sorting (and shuffling data around your MR cluster) unless you need to.  If you do not need the reduce, you might have your map emit only counts of records processed, so the framework can still emit its report of records processed when the job is done.  See the example code below.  If running the reduce step makes sense in your case, it's better to have lots of reducers so load is spread across the hbase cluster.
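
The point about skipping the reduce can be illustrated with a toy model in plain Java. This is only a sketch and does not use the HBase API: the `Sink` class is a hypothetical stand-in for a table that orders rows by key on insert (a `TreeMap` here, with string keys standing in for HBase's byte-ordered row keys), and `copy` plays the role of a map-only job that writes directly and returns a record count.

```java
import java.util.TreeMap;

// Toy stand-in for a sorted table; NOT the HBase API. It only illustrates
// why a separate sort phase adds nothing when the sink sorts on insert.
class SortedSinkSketch {
    // Hypothetical sink: rows are kept ordered by key at insert time.
    static final class Sink {
        private final TreeMap<String, String> rows = new TreeMap<>();

        void put(String rowKey, String value) {
            rows.put(rowKey, value);
        }

        // Row keys in the order the sink stores them.
        String orderedKeys() {
            return String.join(",", rows.keySet());
        }
    }

    // "Map-only" copy: write each record straight to the sink and return
    // the number of records processed, as a job counter would.
    static int copy(String[][] records, Sink sink) {
        int processed = 0;
        for (String[] record : records) {
            sink.put(record[0], record[1]);
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        // Input arrives in no particular order, as map output would.
        String[][] unsorted = { {"row3", "c"}, {"row1", "a"}, {"row2", "b"} };
        Sink sink = new Sink();
        int n = copy(unsorted, sink);
        System.out.println(n + " records, keys: " + sink.orderedKeys());
    }
}
```

The keys come out ordered even though nothing sorted them beforehand, which is the situation described above: the shuffle and sort of a reduce phase would only repeat work the sink does anyway.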
+ 
+ == Example to map rows/column families between two HTables ==
+ 
+ Here's some sample code from [http://spicylogic.com/allenday/blog Allen Day] that iterates over all rows in one table for the specified column families and inserts those rows/columns into a second table.
+ 
+ {{{
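+ import java.io.IOException;
+ import java.util.Map;
+ 
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.fs.Path;
+ import org.apache.hadoop.hbase.HBaseConfiguration;
+ import org.apache.hadoop.hbase.client.HTable;
+ import org.apache.hadoop.hbase.io.BatchUpdate;
+ import org.apache.hadoop.hbase.io.Cell;
+ import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
+ import org.apache.hadoop.hbase.io.RowResult;
+ import org.apache.hadoop.hbase.mapred.TableMap;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.JobClient;
+ import org.apache.hadoop.mapred.JobConf;
+ import org.apache.hadoop.mapred.OutputCollector;
+ import org.apache.hadoop.mapred.Reporter;
+ import org.apache.hadoop.mapred.lib.IdentityReducer;
+ import org.apache.hadoop.util.Tool;
+ import org.apache.hadoop.util.ToolRunner;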
+ 
+ public class BulkCopy extends TableMap<Text, Text> implements Tool {
+   static final String NAME = "bulkcopy";  
+   private Configuration conf;
+   
+   public void map(ImmutableBytesWritable row, RowResult value,
+       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
+     // Opens the output table named by the "output.table" job setting. A failure
+     // surfaces as an IOException from the constructor; new never returns null.
+     HTable table = new HTable(new HBaseConfiguration(), conf.get("output.table"));
+ 
+     BatchUpdate bu = new BatchUpdate( row.get() );
+ 
+     // Copy every non-empty cell of the input row into the update.
+     for (Map.Entry<byte [], Cell> e: value.entrySet()) {
+       Cell cell = e.getValue();
+       if (cell != null && cell.getValue().length > 0) {
+         bu.put(e.getKey(), cell.getValue());
+       }
+     }
+     table.commit( bu );
+   }
+ 
+   public JobConf createSubmittableJob(String[] args) throws IOException {
+     JobConf c = new JobConf(getConf(), BulkCopy.class);
+     //table = new HTable(new HBaseConfiguration(), args[2]);
+     c.set("output.table", args[2]);
+     c.setJobName(NAME);
+     // Columns are space delimited
+     StringBuilder sb = new StringBuilder();
+     final int columnoffset = 3;
+     for (int i = columnoffset; i < args.length; i++) {
+       if (i > columnoffset) {
+         sb.append(" ");
+       }
+       sb.append(args[i]);
+     }
+     // Second argument is the input table name.
+     TableMap.initJob(args[1], sb.toString(), this.getClass(),
+     Text.class, Text.class, c);
+     c.setReducerClass(IdentityReducer.class);
+     // First arg is the output directory.
+     c.setOutputPath(new Path(args[0]));
+     return c;
+   }
+   
+   static int printUsage() {
+     System.out.println(NAME + " <outputdir> <input tablename> <output tablename> "
+         + "<column1> [<column2>...]");
+     return -1;
+   }
+   
+   public int run(final String[] args) throws Exception {
+     // Make sure there are at least 4 parameters: output dir, input table,
+     // output table, and at least one column.
+     if (args.length < 4) {
+       System.err.println("ERROR: Wrong number of parameters: " + args.length);
+       return printUsage();
+     }
+     JobClient.runJob(createSubmittableJob(args));
+     return 0;
+   }
+ 
+   public Configuration getConf() {
+     return this.conf;
+   }
+ 
+   public void setConf(final Configuration c) {
+     this.conf = c;
+   }
+ 
+   public static void main(String[] args) throws Exception {
+     //String[] aa = {"/tmp/foobar", "M2", "M3", "R:"};
+     int errCode = ToolRunner.run(new HBaseConfiguration(), new BulkCopy(), args);
+     System.exit(errCode);
+   }
+ }
+ }}}
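
The argument handling in the listing above can be tried in isolation, without a cluster. The following is a minimal sketch in plain Java; the class and method names (`BulkCopyArgsSketch`, `columns`, `hasRequiredArgs`) are hypothetical and not part of the wiki code, and it assumes, as the usage string suggests, that at least one column must be given.

```java
import java.util.Arrays;

// Stand-alone sketch of the BulkCopy argument layout:
//   <outputdir> <input tablename> <output tablename> <column1> [<column2>...]
// Names here are illustrative only; no Hadoop or HBase classes are used.
class BulkCopyArgsSketch {
    // Index of the first column argument, matching columnoffset in the listing.
    static final int COLUMN_OFFSET = 3;

    // True when the three fixed arguments and at least one column are present.
    static boolean hasRequiredArgs(String[] args) {
        return args.length > COLUMN_OFFSET;
    }

    // Joins the column arguments into the space-delimited string that the
    // listing passes to TableMap.initJob. Returns "" if no columns were given.
    static String columns(String[] args) {
        if (!hasRequiredArgs(args)) {
            return "";
        }
        return String.join(" ", Arrays.copyOfRange(args, COLUMN_OFFSET, args.length));
    }

    public static void main(String[] args) {
        // Same sample arguments as the commented-out line in the listing's main().
        String[] sample = {"/tmp/foobar", "M2", "M3", "R:"};
        System.out.println("output dir:   " + sample[0]);
        System.out.println("input table:  " + sample[1]);
        System.out.println("output table: " + sample[2]);
        System.out.println("columns:      " + columns(sample));
    }
}
```

Checking the argument count before building the column string keeps a missing column from reaching the job setup as an empty column list.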
+ 
  
  == Sample running HBase inserts out of Map Task ==
  Here's sample code from Andrew Purtell that does HBase inserts inside the mapper rather than via TableReduce.
