hbase-user mailing list archives

From Stuart Smith <stu24m...@yahoo.com>
Subject RE: Run MR job when my data stays in hbase?
Date Mon, 19 Jul 2010 18:16:32 GMT
Hello,

  You can ignore this if you're already rock solid on writing M/R jobs, but just in case you're
as new to this as I am: 

Be careful that you have all your dependencies lined up in the jar you package your M/R job
in. If you're using Eclipse, this means selecting "Extract required libraries into generated
JAR".

Without this you get strange "Map class not found" errors, similar to what you see when you
forget to make your map class static or forget to call setJarByClass() on your job.
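
If a single giant jar isn't appealing, an alternative is to ship the extra jars at submit
time with Hadoop's -libjars option (this assumes your driver runs through ToolRunner so the
generic options get parsed; the jar and class names below are placeholders):

	hadoop jar myjob.jar com.example.MyMain -libjars $HBASE_HOME/hbase-<version>.jar,$HBASE_HOME/lib/zookeeper-<version>.jar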

All the examples I saw that used the *new API* were a little more complicated than needed.
A stripped-down example with the new API:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public static class Mapper extends TableMapper<Text,IntWritable>
{
    @Override
    public void map( ImmutableBytesWritable key, Result value, Context context )
    throws IOException, InterruptedException
    {
        //Don't forget to make sure to load this as UTF-8. Bytes.toString()
        //decodes UTF-8 and respects the writable's offset/length, which
        //new String( key.get(), "UTF-8" ) does not.
        String sha256 = Bytes.toString( key.get(), key.getOffset(), key.getLength() );
        //just calling value.value() will NOT give you what you want
        byte[] valueBuffer = value.getValue( Bytes.toBytes(/*family*/), Bytes.toBytes(/*qualifier*/) );

        /**Do stuff**/
        context.write( [some text], [some int] );
    }
}

public static class Reduce extends TableReducer<Text,IntWritable,Text>
{
    @Override
    public void reduce( Text key, Iterable<IntWritable> values, Context context )
    throws IOException, InterruptedException
    {
        //e.g. sum the values for this key
        int count = 0;
        for ( IntWritable v : values )
            count += v.get();

        /**output of a reduce job needs to be a [something],Put object pair*/
        Put outputRow = new Put( Bytes.toBytes("row key") );
        outputRow.add( Bytes.toBytes(/*output family*/), Bytes.toBytes(/*output qualifier*/), Bytes.toBytes(count) );
        //the out-key must match the third type parameter above (Text here)
        context.write( key, outputRow );
    }
}

public static void main(String[] argv) throws Exception
{
    //HBaseConfiguration picks up hbase-site.xml from the classpath
    Configuration configuration = new HBaseConfiguration();
    Job validateJob = new Job( configuration, /*job name*/ );
    //don't forget this!
    validateJob.setJarByClass(/*main class*/.class);

    //don't add anything, and it will scan everything (according to docs)
    Scan scan = new Scan();
    scan.addColumn( Bytes.toBytes(/*input family*/), Bytes.toBytes(/*input qualifier*/) );

    TableMapReduceUtil.initTableMapperJob( /*input tablename*/, scan, Mapper.class,
        Text.class, IntWritable.class, validateJob );
    TableMapReduceUtil.initTableReducerJob( /*output table name*/, Reduce.class, validateJob );

    validateJob.waitForCompletion(true);
}

But look at the examples! I just thought some simple highlights might help. Don't forget that
you can issue Put()'s from your Map() tasks, if you already have the data you need assembled
(just open a connection in the map constructor):

	//inside the mapper's constructor (setup() works too):
	super();
	this.hbaseConfiguration = new HBaseConfiguration();
	//point the client at your cluster, if it isn't in hbase-site.xml already
	this.hbaseConfiguration.set("hbase.master", "ubuntu-namenode:60000");
	this.fileMetadataTable = new HTable( hbaseConfiguration, /*tableName*/ );

and issue the Put() in your map() method. This can take the load off your reduce() tasks,
which may speed things up a bit.
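
For instance, here's a rough sketch of a mapper that writes its Puts directly (the table,
family, and qualifier names are made up, and it assumes the same-era client API as the
example above, plus an import of org.apache.hadoop.hbase.client.HTable):

public static class DirectPutMapper extends TableMapper<Text,IntWritable>
{
    private HTable outputTable;

    @Override
    protected void setup( Context context ) throws IOException
    {
        //setup() runs once per task; an alternative to the constructor approach
        outputTable = new HTable( new HBaseConfiguration(), "output_table" /*made-up name*/ );
        //buffer puts client-side instead of round-tripping each one
        outputTable.setAutoFlush( false );
    }

    @Override
    public void map( ImmutableBytesWritable key, Result value, Context context )
    throws IOException, InterruptedException
    {
        Put put = new Put( key.get() );
        put.add( Bytes.toBytes("family"), Bytes.toBytes("qualifier"), value.value() );
        //goes straight to HBase; nothing passes through the reduce phase
        outputTable.put( put );
    }

    @Override
    protected void cleanup( Context context ) throws IOException
    {
        //flushes any buffered puts
        outputTable.close();
    }
}

If you drop the reducer entirely this way, you can also call setNumReduceTasks(0) on the
job so no reduce phase runs at all.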

Caveat emptor:
I just started on all this stuff. ;)

Hope it helps.

Take care,
  -stu



--- On Mon, 7/19/10, Hegner, Travis <THegner@trilliumit.com> wrote:

> From: Hegner, Travis <THegner@trilliumit.com>
> Subject: RE: Run MR job when my data stays in hbase?
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Date: Monday, July 19, 2010, 11:55 AM
> Also make sure that the
> $HBASE_HOME/hbase-<version>.jar,
> $HBASE_HOME/lib/zookeeper-<version>.jar, and the
> $HBASE_HOME/conf/ are all on the classpath in your
> $HADOOP_HOME/conf/hadoop-env.sh file. That configuration
> must be cluster wide.
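> 
> For example, the relevant line in hadoop-env.sh might look
> something like this (just a sketch; substitute real versions
> for the <version> placeholders):
> 
> export HADOOP_CLASSPATH="$HBASE_HOME/hbase-<version>.jar:$HBASE_HOME/lib/zookeeper-<version>.jar:$HBASE_HOME/conf:$HADOOP_CLASSPATH"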
> 
> With that, your map and reduce tasks can access zookeeper
> and hbase objects. You can then use the TableInputFormat
> with TableOutputFormat, or you can use TableInputFormat and
> have your reduce tasks write data directly back into HBase.
> Your problem, and your dataset, will dictate which of those
> methods is more efficient.
> 
> Travis Hegner
> http://www.travishegner.com/

> 
> -----Original Message-----
> From: Andrey Stepachev [mailto:octo47@gmail.com]
> Sent: Monday, July 19, 2010 9:28 AM
> To: user@hbase.apache.org
> Subject: Re: Run MR job when my data stays in hbase?
> 
> 2010/7/19 elton sky <eltonsky9404@gmail.com>:
> 
> > My question is: if I wanna run the background process
> > as a MR job, can I get data from hbase, rather than
> > hdfs, with hadoop? How do I do that? I'd appreciate it
> > if anyone can provide some simple example code.
> 
> Look at the org.apache.hadoop.hbase.mapreduce package in
> the hbase sources, and as a real example:
> org.apache.hadoop.hbase.mapreduce.RowCounter
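> 
> (You can run that one straight from the hbase jar; the exact
> form depends on your version, but something like:
> hadoop jar $HBASE_HOME/hbase-<version>.jar rowcounter <tablename> )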


      
