hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Best Way to Insert data into Hbase using Map Reduce
Date Fri, 05 Nov 2010 12:05:49 GMT

Shuja,

Just did a quick glance.

What is it that you want to do exactly?

Here's how we do it... (at a high level.)

Input is an XML file, and we want to store the raw XML records in HBase, one record per row.

Instead of emitting normal output from the map() method, we write each raw record in via a single
put(), so the map() output key is just a NullWritable.
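
Roughly, the mapper looks something like the sketch below. This is only an illustration of the
idea, not our actual code; the "raw"/"xml" column family and qualifier and the choice of row key
are placeholders.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

public class RawXmlMapper extends
    Mapper<LongWritable, Text, NullWritable, Writable> {

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // One raw XML record per row. The row key is whatever makes sense for
    // your data; the input offset here is just a placeholder.
    Put put = new Put(Bytes.toBytes(offset.get()));
    // Store the record as the bytes of the string, not a serialized object.
    put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"),
        Bytes.toBytes(record.toString()));
    // One put() per record; the map output key is a NullWritable.
    context.write(NullWritable.get(), put);
  }
}

You run it with TableOutputFormat and zero reducers, much like the driver in your mail below.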

It's pretty fast. However, fast is relative.

Another thing... we store the XML record as a string (converted to bytes) rather than as a
serialized object.

Then you can break it down into individual fields in a second batch job.
(You can start with a DOM parser and later move to a StAX parser. Depending on which DOM
parser you have and the size of the record, it should be 'fast enough'. A good StAX
implementation tends to be recursive/re-entrant code, which is harder to maintain.)
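
The second pass over the table might look roughly like this. Again, just a sketch under
assumptions: a TableMapper reading the "raw" family written by the first job and writing one
column per child element into a made-up "fields" family.

import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class XmlFieldMapper extends TableMapper<NullWritable, Writable> {

  private DocumentBuilder builder;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    } catch (ParserConfigurationException e) {
      throw new IOException(e);
    }
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // Fetch the raw XML bytes stored by the first job.
    byte[] xml = value.getValue(Bytes.toBytes("raw"), Bytes.toBytes("xml"));
    try {
      Document doc = builder.parse(new ByteArrayInputStream(xml));
      Put put = new Put(row.get());
      // Write each child element of the record out as its own column.
      NodeList children = doc.getDocumentElement().getChildNodes();
      for (int i = 0; i < children.getLength(); i++) {
        Node n = children.item(i);
        if (n.getNodeType() == Node.ELEMENT_NODE) {
          Element field = (Element) n;
          put.add(Bytes.toBytes("fields"), Bytes.toBytes(field.getTagName()),
              Bytes.toBytes(field.getTextContent()));
        }
      }
      context.write(NullWritable.get(), put);
    } catch (SAXException e) {
      throw new IOException(e);
    }
  }
}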

HTH

-Mike


> Date: Fri, 5 Nov 2010 16:13:02 +0500
> Subject: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> Hi
> 
> I am reading data from raw XML files and inserting the data into HBase using
> TableOutputFormat in a MapReduce job, but due to the heavy volume of Put
> statements it takes many hours to process the data. Here is my sample code:
> 
> conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> conf.set("xmlinput.start", "<adc>");
> conf.set("xmlinput.end", "</adc>");
> conf.set("io.serializations",
>     "org.apache.hadoop.io.serializer.JavaSerialization,"
>         + "org.apache.hadoop.io.serializer.WritableSerialization");
> 
> Job job = new Job(conf, "Populate Table with Data");
> 
> FileInputFormat.setInputPaths(job, input);
> job.setJarByClass(ParserDriver.class);
> job.setMapperClass(MyParserMapper.class);
> job.setNumReduceTasks(0);
> job.setInputFormatClass(XmlInputFormat.class);
> job.setOutputFormatClass(TableOutputFormat.class);
> 
> 
> And the mapper code:
> 
> public class MyParserMapper extends
>     Mapper<LongWritable, Text, NullWritable, Writable> {
> 
>   @Override
>   public void map(LongWritable key, Text value1, Context context)
>       throws IOException, InterruptedException {
>     // ... doing some processing ...
>     while (rItr.hasNext()) {
>       // this put statement runs 132,622,560 times to insert the data
>       context.write(NullWritable.get(),
>           new Put(rowId).add(Bytes.toBytes("CounterValues"),
>               Bytes.toBytes(counter.toString()),
>               Bytes.toBytes(rElement.getTextTrim())));
>     }
>   }
> }
> 
> Is there any other way of doing this task so I can improve the performance?
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>