hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Best Way to Insert data into Hbase using Map Reduce
Date Mon, 08 Nov 2010 15:50:39 GMT

Switch out JDOM for a StAX parser.

Ok, having said that...
You said you have a single record per file. Ok, that means you have a lot of fields.
Because you have 1 record, this isn't a map/reduce problem. You're better off writing a
single-threaded app to read the file, parse it using StAX, and then write the fields to HBase.

I'm not sure why you have millions of put()s.
Do you have millions of fields in this one record?

Writing a good StAX parser and then mapping the fields to your HBase column(s) will help.

HTH

-Mike
P.S. A good StAX implementation will be a recursive/re-entrant piece of code.
While the code may look simple, it takes a skilled developer to write and maintain.
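
A rough skeleton of that kind of single-threaded StAX loader (just a sketch, assuming a
0.89/0.90-era client; the table name, column family, and row-key choice below are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StaxLoader {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "mytable"); // placeholder table
        table.setAutoFlush(false);                     // buffer writes client-side
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream(args[0]));
        byte[] rowId = Bytes.toBytes(args[0]);         // placeholder row key (the file name)
        String element = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                element = reader.getLocalName();       // remember which field we're in
            } else if (event == XMLStreamConstants.CHARACTERS) {
                String text = reader.getText().trim();
                if (element != null && text.length() > 0) {
                    Put put = new Put(rowId);
                    put.add(Bytes.toBytes("CounterValues"), // family name borrowed from the code later in this thread
                            Bytes.toBytes(element),         // qualifier = element name (an assumption)
                            Bytes.toBytes(text));
                    table.put(put);                         // buffered; autoFlush is off
                }
            }
        }
        reader.close();
        table.flushCommits();
        table.close();
    }
}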


> Date: Mon, 8 Nov 2010 14:36:34 +0500
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> Hi
> 
> I have used the JDOM library to parse the XML in the mapper, and in my case one
> single file consists of 1 record, so I give one complete file to the map process
> and extract the information from it that I need. I have only 2 column
> families in my schema, and the bottleneck was the put statements, which run
> millions of times for each file. When I comment out the put statement, the job
> completes within minutes, but with the put statement it was taking about 7 hours
> to complete the same job. Anyhow, I have changed the code according to the
> suggestion given by Michael and am now using the Java API to dump data instead of
> TableOutputFormat, building up a list of puts and flushing them every
> 1000 records, and it reduces the time significantly. Now the whole job
> completes in about 1 hour and 45 minutes, but still not in minutes. So is there
> anything left which I might apply to increase performance?
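>
> Here is roughly what the batching code looks like now (a sketch only; the table name,
> column family, and element handling are simplified):
>
>     // Sketch: buffer the puts and flush them every 1000 records (old HTable client API).
>     // Table, family, and the element iteration mirror the code later in this thread.
>     private void writeRecord(HTable table, byte[] rowId, List<Element> elements,
>                              String counter) throws IOException {
>         List<Put> buffer = new ArrayList<Put>();
>         for (Element rElement : elements) {
>             Put put = new Put(rowId);
>             put.add(Bytes.toBytes("CounterValues"),
>                     Bytes.toBytes(counter),
>                     Bytes.toBytes(rElement.getTextTrim()));
>             buffer.add(put);
>             if (buffer.size() >= 1000) {
>                 table.put(buffer);        // one batched call instead of 1000 separate RPCs
>                 buffer.clear();
>             }
>         }
>         if (!buffer.isEmpty()) table.put(buffer);
>         table.flushCommits();
>     }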
> 
> Thanks
> 
> On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <buttler1@llnl.gov> wrote:
> 
> > Good points.
> > Before we can make any rational suggestion, we need to know where the
> > bottleneck is, so we can make suggestions to move it elsewhere.  I
> > personally favor Michael's suggestion to split the ingest and the parsing
> > parts of your job, and to switch to a parser that is faster than a DOM
> > parser (SAX or Stax). But, without knowing what the bottleneck actually is,
> > all of these suggestions are shots in the dark.
> >
> > What is the network load, the CPU load, the disk load?  Have you at least
> > installed Ganglia or some equivalent so that you can see what the load is
> > across the cluster?
> >
> > Dave
> >
> >
> > -----Original Message-----
> > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > Sent: Friday, November 05, 2010 9:49 AM
> > To: user@hbase.apache.org
> > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >
> >
> > I don't think using the buffered client is going to help a lot with performance.
> >
> > I'm a little confused, because it doesn't sound like Shuja is using a
> > map/reduce job to parse the file.
> > That is... he says he parses the file into a DOM tree. Usually your map
> > job reads each record, and then in the mapper you parse out the record.
> > Within our m/r job we don't parse out the fields in the records, because we
> > do additional processing which 'dedupes' the data, so we don't have to
> > further process the data.
> > The second job only has to parse a portion of the original records.
> >
> > So, assuming that Shuja is actually using a map/reduce job and each XML
> > record is being parsed within the mapper(), there are a couple of things...
> > 1) Reduce the number of column families that you are using. (Each column
> > family is written to a separate file.)
> > 2) Set up the HTable instance in Mapper.setup().
> > 3) Switch to a different DOM class (not all Java implementations are equal) or
> > switch to StAX.
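> >
> > For (2), roughly this (the table name is a placeholder, and it assumes you write to the
> > HTable directly in the mapper rather than through TableOutputFormat):
> >
> > public class MyParserMapper extends Mapper<LongWritable, Text, NullWritable, Writable> {
> >     private HTable table;
> >
> >     @Override
> >     protected void setup(Context context) throws IOException {
> >         // Create the HTable once per mapper, not once per record.
> >         table = new HTable(HBaseConfiguration.create(), "mytable");
> >         table.setAutoFlush(false);   // buffer writes client-side
> >     }
> >
> >     @Override
> >     protected void cleanup(Context context) throws IOException {
> >         table.flushCommits();        // push anything still buffered
> >         table.close();
> >     }
> > }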
> >
> >
> >
> >
> > > From: buttler1@llnl.gov
> > > To: user@hbase.apache.org
> > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > >
> > > Have you tried turning off auto flush, and managing the flush in your own
> > > code (say, every 1000 puts)?
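> > >
> > > Roughly (old client API; just a sketch -- 'puts' here stands for whatever collection
> > > of Puts your code builds):
> > >
> > > table.setAutoFlush(false);        // buffer puts client-side instead of one RPC per put
> > > int count = 0;
> > > for (Put put : puts) {
> > >     table.put(put);
> > >     if (++count % 1000 == 0) {
> > >         table.flushCommits();     // flush every 1000 puts
> > >     }
> > > }
> > > table.flushCommits();             // flush the remainder
> > >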
> > > Dave
> > >
> > >
> > > -----Original Message-----
> > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > > Sent: Friday, November 05, 2010 8:04 AM
> > > To: user@hbase.apache.org
> > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > >
> > > Michael
> > >
> > > Hmm... so you are storing the XML record in HBase and parsing it in a second
> > > job, but in my case I am parsing it in the first phase as well. What I do is
> > > get the XML file, parse it using JDOM, and then put the data into HBase, so
> > > the parsing + putting operations are both in one phase, in the mapper code.
> > >
> > > My actual problem is that after parsing the file, I need to use the put statement
> > > millions of times, and I think for each statement it connects to HBase and
> > > then inserts, and this might be the reason for the slow processing. So I am
> > > trying to figure out some way I can first buffer the data and then insert it in
> > > batch fashion. It means that in one put statement I can insert many records,
> > > and I think if I do it this way then the process will be very fast.
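> > >
> > > Something like this is what I am hoping for (just pseudo-code to show what I mean):
> > >
> > > List<Put> batch = new ArrayList<Put>();
> > > // ... add one Put per parsed field while walking the JDOM tree ...
> > > table.put(batch);         // HTable.put(List<Put>) sends them as a batch
> > > table.flushCommits();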
> > >
> > > Secondly, what does this mean: "we write the raw record in via a single put()
> > > so the map() method is a null writable"?
> > >
> > > Can you explain it a bit more?
> > >
> > > Thanks
> > >
> > >
> > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
> > >
> > > >
> > > > Shuja,
> > > >
> > > > Just did a quick glance.
> > > >
> > > > What is it that you want to do exactly?
> > > >
> > > > Here's how we do it... (at a high level.)
> > > >
> > > > Input is an XML file where we want to store the raw XML records in HBase,
> > > > one record per row.
> > > >
> > > > Instead of using the output of the map() method, we write the raw record in
> > > > via a single put() so the map() method is a null writable.
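> > > >
> > > > Roughly, the mapper looks like this (the row key, family, and qualifier here are
> > > > placeholders, not our actual schema):
> > > >
> > > > public void map(LongWritable key, Text value, Context context)
> > > >         throws IOException, InterruptedException {
> > > >     // One put() per record: the whole raw XML record goes into a single cell,
> > > >     // and the map output key is just a NullWritable.
> > > >     Put put = new Put(Bytes.toBytes(key.get()));              // placeholder row key
> > > >     put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"),       // placeholder family/qualifier
> > > >             Bytes.toBytes(value.toString()));                 // the raw record as bytes
> > > >     context.write(NullWritable.get(), put);
> > > > }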
> > > >
> > > > It's pretty fast. However, fast is relative.
> > > >
> > > > Another thing... we store the XML record as a string (converted to
> > > > bytes) rather than a serialized object.
> > > >
> > > > Then you can break it down into individual fields in a second batch job.
> > > > (You can start with a DOM parser, and later move to a StAX parser.
> > > > Depending on which DOM parser you have and the size of the record, it should
> > > > be 'fast enough'. A good implementation of StAX tends to be
> > > > recursive/re-entrant code which is harder to maintain.)
> > > >
> > > > HTH
> > > >
> > > > -Mike
> > > >
> > > >
> > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > > > From: shujamughal@gmail.com
> > > > > To: user@hbase.apache.org
> > > > >
> > > > > Hi
> > > > >
> > > > > I am reading data from raw XML files and inserting data into HBase using
> > > > > TableOutputFormat in a map/reduce job, but due to heavy put statements it
> > > > > takes many hours to process the data. Here is my sample code.
> > > > >
> > > > >     conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > > >     conf.set("xmlinput.start", "<adc>");
> > > > >     conf.set("xmlinput.end", "</adc>");
> > > > >     conf.set("io.serializations",
> > > > >         "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > > >
> > > > >     Job job = new Job(conf, "Populate Table with Data");
> > > > >
> > > > >     FileInputFormat.setInputPaths(job, input);
> > > > >     job.setJarByClass(ParserDriver.class);
> > > > >     job.setMapperClass(MyParserMapper.class);
> > > > >     job.setNumReduceTasks(0);
> > > > >     job.setInputFormatClass(XmlInputFormat.class);
> > > > >     job.setOutputFormatClass(TableOutputFormat.class);
> > > > >
> > > > >
> > > > > *and mapper code*
> > > > >
> > > > > public class MyParserMapper extends
> > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > > >
> > > > >   @Override
> > > > >   public void map(LongWritable key, Text value1, Context context)
> > > > >       throws IOException, InterruptedException {
> > > > >     *//doing some processing*
> > > > >     while (rItr.hasNext()) {
> > > > >       *//and this put statement runs 132,622,560 times to insert the data.*
> > > > >       context.write(NullWritable.get(),
> > > > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > > >               Bytes.toBytes(counter.toString()),
> > > > >               Bytes.toBytes(rElement.getTextTrim())));
> > > > >     }
> > > > >   }
> > > > > }
> > > > >
> > > > > Is there any other way of doing this task so I can improve the performance?
> > > > >
> > > > >
> > > > > --
> > > > > Regards
> > > > > Shuja-ur-Rehman Baig
> > > > > <http://pk.linkedin.com/in/shujamughal>
> > > >
> > >
> > >
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
> >
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>