hbase-user mailing list archives

From "Buttler, David" <buttl...@llnl.gov>
Subject RE: Best Way to Insert data into Hbase using Map Reduce
Date Fri, 05 Nov 2010 17:46:15 GMT
Good points.
Before we can make any rational suggestion, we need to know where the bottleneck is, so we
can make suggestions to move it elsewhere.  I personally favor Michael's suggestion to split
the ingest and the parsing parts of your job, and to switch to a parser that is faster than
a DOM parser (SAX or Stax). But, without knowing what the bottleneck actually is, all of these
suggestions are shots in the dark.  
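
For illustration, a pull-parser along those lines might look roughly like this (a minimal sketch using the standard javax.xml.stream StAX API; the record handling and printed output are made up for the example, not taken from Shuja's job):

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxRecordParser {

    // Walks one XML record event-by-event instead of building a DOM tree,
    // printing each leaf element's name and text content.
    public static void parse(String recordXml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(recordXml));
        String current = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                current = reader.getLocalName();
            } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
                String text = reader.getText().trim();
                if (!text.isEmpty()) {
                    System.out.println(current + " = " + text);
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                current = null;
            }
        }
        reader.close();
    }
}

Nothing is held in memory beyond the current event, so the cost per record stays flat no matter how large the records get.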

What is the network load, the CPU load, the disk load?  Have you at least installed Ganglia
or some equivalent so that you can see what the load is across the cluster?

Dave


-----Original Message-----
From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Friday, November 05, 2010 9:49 AM
To: user@hbase.apache.org
Subject: RE: Best Way to Insert data into Hbase using Map Reduce


I don't think using the buffered client is going to help a lot with performance.
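
For reference, the buffered client Dave suggested below looks roughly like this (a sketch against the 0.90-era HTable client; the table name comes from Shuja's config, but the buffer size is just an assumed value):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedPutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");
        table.setAutoFlush(false);                    // don't send one RPC per Put
        table.setWriteBufferSize(12L * 1024 * 1024);  // assumed 12 MB client-side buffer
        for (long i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("CounterValues"), Bytes.toBytes("c"), Bytes.toBytes(i));
            table.put(put);                           // queued until the buffer fills
        }
        table.flushCommits();                         // push whatever is still buffered
        table.close();
    }
}

It cuts down on round trips, but if the mapper is CPU-bound on parsing it won't move the needle much.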

I'm a little confused, because it doesn't sound like Shuja is using the map/reduce job to parse
the file.
That is... he says he parses the file into a DOM tree. Usually your job splits out each record,
and then you parse the record in the mapper.
Within our m/r job we don't parse out the fields in the records, because we do additional processing
which 'dedupes' the data so we don't have to process it further.
The second job only has to parse a portion of the original records.

So assuming that Shuja is actually using a map/reduce job, and each xml record is being parsed
within the mapper, there are a couple of things...
1) Reduce the number of column families that you are using. (Each column family is written
to a separate file.)
2) Set up the HTable instance in Mapper.setup(). (See the sketch after this list.)
3) Switch to a different DOM implementation (not all Java parsers are equal) or switch to Stax.
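
A rough sketch of point 2, with the HTable created once per mapper in setup() and puts written directly instead of through context.write() (hypothetical class, column family and rowkey choices; not Shuja's actual job):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DirectPutMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    // One HTable per map task, not one per record.
    table = new HTable(HBaseConfiguration.create(context.getConfiguration()), "mytable");
    table.setAutoFlush(false);        // let the client buffer the puts
  }

  @Override
  public void map(LongWritable key, Text value, Context context) throws IOException {
    // Store the raw record keyed by file offset (rowkey scheme is an assumption).
    Put put = new Put(Bytes.toBytes(key.get()));
    put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"), Bytes.toBytes(value.toString()));
    table.put(put);                   // nothing goes through context.write()
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();             // flush buffered puts before the task exits
    table.close();
  }
}

This is also roughly what I mean further down by writing the raw record via a single put() with a NullWritable map output.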




> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Fri, 5 Nov 2010 08:28:07 -0700
> Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> 
> Have you tried turning off auto flush, and managing the flush in your own code (say, every 1000 puts)?
> Dave
> 
> 
> -----Original Message-----
> From: Shuja Rehman [mailto:shujamughal@gmail.com] 
> Sent: Friday, November 05, 2010 8:04 AM
> To: user@hbase.apache.org
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> 
> Michael
> 
> Hmm... so you are storing the xml record in hbase and parsing it in a second
> job, but in my case I am also parsing in the first phase. What I do is: I get
> the xml file, parse it using JDOM, and then put the data into hbase, so
> parsing + putting are both in one phase, in the mapper code.
> 
> My actual problem is that after parsing the file, I need to use the put statement
> millions of times, and I think it connects to hbase for each statement and then
> inserts it, which might be the reason for the slow processing. So I am trying to
> figure out some way I can first buffer the data and then insert it in batch
> fashion; it means that in one put statement I can insert many records, and I
> think if I do it this way the process will be very fast.
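
(One concrete way to read that "batch fashion" idea is a client-side list of puts, roughly like this; a sketch assuming the HTable.put(List<Put>) overload in the 0.90 client, not code from this thread:)

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");
    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 1000000; i++) {
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("CounterValues"), Bytes.toBytes("c"), Bytes.toBytes(i));
      batch.add(put);
      if (batch.size() == 1000) {   // send every 1000 puts rather than one at a time
        table.put(batch);           // one client call for the whole list
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);             // flush the remainder
    }
    table.close();
  }
}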
> 
> Secondly, what does this mean: "we write the raw record in via a single put()
> so the map() method is a null writable"?
> 
> Can you explain it more?
> 
> Thanks
> 
> 
> On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> 
> >
> > Shuja,
> >
> > Just did a quick glance.
> >
> > What is it that you want to do exactly?
> >
> > Here's how we do it... (at a high level.)
> >
> > Input is an XML file where we want to store the raw XML records in hbase,
> > one record per row.
> >
> > Instead of using the output of the map() method, we write the raw record in
> > via a single put() so the map() method is a null writable.
> >
> > It's pretty fast. However, fast is relative.
> >
> > Another thing... we store the xml record as a string (converted to
> > bytes) rather than a serialized object.
> >
> > Then you can break it down in to individual fields in a second batch job.
> > (You can start with a DOM parser, and later move to a Stax parser.
> > Depending on which DOM parser you have and the size of the record, it should
> > be 'fast enough'. A good implementation of Stax tends to be
> > recursive/re-entrant code which is harder to maintain.)
> >
> > HTH
> >
> > -Mike
> >
> >
> > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > From: shujamughal@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi
> > >
> > > I am reading data from raw xml files and inserting the data into hbase using
> > > TableOutputFormat in a map reduce job, but due to heavy put statements it
> > > takes many hours to process the data. Here is my sample code.
> > >
> > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > conf.set("xmlinput.start", "<adc>");
> > > conf.set("xmlinput.end", "</adc>");
> > > conf.set("io.serializations",
> > >     "org.apache.hadoop.io.serializer.JavaSerialization,"
> > >     + "org.apache.hadoop.io.serializer.WritableSerialization");
> > >
> > > Job job = new Job(conf, "Populate Table with Data");
> > >
> > > FileInputFormat.setInputPaths(job, input);
> > > job.setJarByClass(ParserDriver.class);
> > > job.setMapperClass(MyParserMapper.class);
> > > job.setNumReduceTasks(0);
> > > job.setInputFormatClass(XmlInputFormat.class);
> > > job.setOutputFormatClass(TableOutputFormat.class);
> > >
> > >
> > > and the mapper code:
> > >
> > > public class MyParserMapper extends
> > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >
> > >   @Override
> > >   public void map(LongWritable key, Text value1, Context context)
> > >       throws IOException, InterruptedException {
> > >     // doing some processing
> > >     while (rItr.hasNext()) {
> > >       // this put statement runs 132,622,560 times to insert the data
> > >       context.write(NullWritable.get(),
> > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
> > >               Bytes.toBytes(counter.toString()),
> > >               Bytes.toBytes(rElement.getTextTrim())));
> > >     }
> > >   }
> > > }
> > >
> > > Is there any other way of doing this task so I can improve the performance?
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
> 
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
