hbase-user mailing list archives

From Shuja Rehman <shujamug...@gmail.com>
Subject Re: Best Way to Insert data into Hbase using Map Reduce
Date Fri, 05 Nov 2010 16:13:53 GMT
Not yet. Can you explain in more detail how to do it?
Thanks

On Fri, Nov 5, 2010 at 8:28 PM, Buttler, David <buttler1@llnl.gov> wrote:

> Have you tried turning off auto-flush and managing the flush in your own
> code (say, every 1000 puts)?
> Dave
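
A minimal sketch of that suggestion, assuming the plain HTable client API: auto-flush is turned off so puts accumulate in the client-side write buffer, and flushCommits() is called explicitly every 1000 puts. The table name, buffer size, and the helper method itself are placeholders for illustration, not anything taken from this thread.

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

public class BufferedPutExample {

  // Pushes the given Puts through the client-side write buffer,
  // flushing explicitly every 1000 puts instead of once per put().
  public static void writeBuffered(List<Put> puts) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");     // table name is a placeholder
    table.setAutoFlush(false);                      // stop flushing on every put()
    table.setWriteBufferSize(12 * 1024 * 1024);     // optional: enlarge the write buffer

    int count = 0;
    for (Put put : puts) {
      table.put(put);                               // buffered locally, no RPC yet
      if (++count % 1000 == 0) {
        table.flushCommits();                       // one batched round-trip per 1000 puts
      }
    }
    table.flushCommits();                           // flush whatever is left in the buffer
    table.close();
  }
}
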
>
>
> -----Original Message-----
> From: Shuja Rehman [mailto:shujamughal@gmail.com]
> Sent: Friday, November 05, 2010 8:04 AM
> To: user@hbase.apache.org
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>
> Michael
>
> Hmm... so you are storing the XML record in HBase and parsing it in a second
> job, but in my case I am parsing it in the first phase as well. What I do: I
> get the XML file, parse it using JDOM, and then put the data into HBase. So
> parsing and putting both happen in one phase, in the mapper code.
>
> My actual problem is that after parsing the file, I need to issue the put
> statement millions of times, and I think each statement connects to HBase and
> inserts separately, which might be the reason for the slow processing. So I am
> trying to figure out some way to first buffer the data and then insert it in
> batch fashion, meaning that in one put call I can insert many records; I think
> the process would be much faster that way.
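
For the buffer-then-batch idea, the HTable client also accepts a list of Puts in a single call. A rough sketch, with the table name, qualifier, and sample data all invented for illustration (the "CounterValues" family matches the mapper code quoted further down):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");        // table name is a placeholder

    List<Put> batch = new ArrayList<Put>();
    for (int i = 0; i < 5000; i++) {                   // stand-in for the parsed records
      Put put = new Put(Bytes.toBytes("row-" + i));
      put.add(Bytes.toBytes("CounterValues"),          // family as in the quoted mapper
          Bytes.toBytes("counter"),                    // qualifier placeholder
          Bytes.toBytes("value-" + i));                // value placeholder
      batch.add(put);
      if (batch.size() == 1000) {
        table.put(batch);                              // 1000 records in one call
        batch.clear();
      }
    }
    if (!batch.isEmpty()) {
      table.put(batch);                                // send the remainder
    }
    table.flushCommits();
    table.close();
  }
}
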
>
> Secondly, what does this mean: "we write the raw record in via a single put()
> so the map() method is a null writable"?
>
> Can you explain it further?
>
> Thanks
>
>
> On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
> >
> > Shuja,
> >
> > Just did a quick glance.
> >
> > What is it that you want to do exactly?
> >
> > Here's how we do it... (at a high level.)
> >
> > Input is an XML file where we want to store the raw XML records in hbase,
> > one record per row.
> >
> > Instead of using the output of the map() method, we write the raw record
> in
> > via a single put() so the map() method is a null writable.
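
One way to read that description (a sketch under that assumption, not Michael's actual code; the class, table, family, and qualifier names are invented): the mapper opens an HTable itself, writes each raw record with a single put(), and emits nothing through the normal map output, so the map output types are effectively NullWritable.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stores each raw XML record as-is in one row; parsing happens in a later job.
public class RawXmlStoreMapper
    extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

  private HTable table;

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable(HBaseConfiguration.create(context.getConfiguration()),
        "mytable");                                  // table name is a placeholder
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Put put = new Put(Bytes.toBytes(key.get()));     // row key choice is up to you
    put.add(Bytes.toBytes("raw"),                    // family placeholder
        Bytes.toBytes("xml"),                        // qualifier placeholder
        Bytes.toBytes(value.toString()));            // the raw record as bytes
    table.put(put);                                  // a single put() per record
    // no context.write(): the map output is effectively a NullWritable
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    table.flushCommits();
    table.close();
  }
}
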
> >
> > It's pretty fast. However, fast is relative.
> >
> > Another thing... we store the XML record as a string (converted to bytes)
> > rather than a serialized object.
> >
> > Then you can break it down into individual fields in a second batch job.
> > (You can start with a DOM parser and later move to a StAX parser. Depending
> > on which DOM parser you have and the size of the record, it should be 'fast
> > enough'. A good StAX implementation tends to be recursive/re-entrant code,
> > which is harder to maintain.)
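
For what that second-job parsing might look like, here is a tiny StAX sketch using the standard javax.xml.stream API; the sample XML and element handling are invented, and in practice the string would come from the HBase row rather than a literal:

import java.io.StringReader;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Walks a stored XML string and prints element name / text pairs.
public class StaxFieldExtractor {
  public static void main(String[] args) throws Exception {
    String storedXml = "<adc><counter>42</counter></adc>";   // placeholder; read from HBase in practice
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(storedXml));
    String current = null;
    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT) {
        current = reader.getLocalName();
      } else if (event == XMLStreamConstants.CHARACTERS && current != null) {
        String text = reader.getText().trim();
        if (text.length() > 0) {
          System.out.println(current + " = " + text);        // field name and value
        }
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        current = null;
      }
    }
    reader.close();
  }
}
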
> >
> > HTH
> >
> > -Mike
> >
> >
> > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > From: shujamughal@gmail.com
> > > To: user@hbase.apache.org
> > >
> > > Hi
> > >
> > > I am reading data from raw XML files and inserting the data into HBase
> > > using TableOutputFormat in a MapReduce job, but due to the heavy volume of
> > > put statements it takes many hours to process the data. Here is my sample
> > > code:
> > >
> > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > conf.set("xmlinput.start", "<adc>");
> > > conf.set("xmlinput.end", "</adc>");
> > > conf.set("io.serializations",
> > >     "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > >
> > > Job job = new Job(conf, "Populate Table with Data");
> > >
> > > FileInputFormat.setInputPaths(job, input);
> > > job.setJarByClass(ParserDriver.class);
> > > job.setMapperClass(MyParserMapper.class);
> > > job.setNumReduceTasks(0);
> > > job.setInputFormatClass(XmlInputFormat.class);
> > > job.setOutputFormatClass(TableOutputFormat.class);
> > >
> > >
> > > And the mapper code:
> > >
> > > public class MyParserMapper extends
> > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > >
> > >   @Override
> > >   public void map(LongWritable key, Text value1, Context context)
> > >       throws IOException, InterruptedException {
> > >     // doing some processing
> > >     while (rItr.hasNext()) {
> > >       // this put statement runs 132,622,560 times to insert the data
> > >       context.write(NullWritable.get(),
> > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
> > >               Bytes.toBytes(counter.toString()),
> > >               Bytes.toBytes(rElement.getTextTrim())));
> > >     }
> > >   }
> > > }
> > >
> > > Is there any other way of doing this task so I can improve the
> > > performance?
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> >
>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
>



-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>
