hbase-user mailing list archives

From Shuja Rehman <shujamug...@gmail.com>
Subject Re: Best Way to Insert data into Hbase using Map Reduce
Date Mon, 08 Nov 2010 16:59:22 GMT
One more thing I want to ask: I have found that people set the following buffer size.

  table.setWriteBufferSize(1024*1024*24);
  table.setAutoFlush(false);

Is there any specific reason for choosing that buffer size, and how much RAM does it
require? I have given 4 GB to each region server, and I can see the used heap of the
region servers keep climbing until they crash.
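
For context, here is roughly how I understand that client-side write buffer is meant
to be used (a minimal sketch against the HTable client API of this era; the table name
"mytable" and the 24 MB value are just taken from the snippet above, not a
recommendation, and Record is a placeholder for my parsed data):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  Configuration conf = HBaseConfiguration.create();
  HTable table = new HTable(conf, "mytable");

  // Stop sending one RPC per put(); puts collect in a client-side buffer instead.
  table.setAutoFlush(false);
  // The buffer is flushed automatically once it holds roughly 24 MB of puts.
  table.setWriteBufferSize(1024 * 1024 * 24);

  for (Record r : records) {                  // 'records' is a placeholder
    Put put = new Put(Bytes.toBytes(r.getRowId()));
    put.add(Bytes.toBytes("CounterValues"),
            Bytes.toBytes(r.getQualifier()),
            Bytes.toBytes(r.getValue()));
    table.put(put);                           // buffered, not sent immediately
  }

  table.flushCommits();                       // push whatever is still buffered
  table.close();

If I understand correctly, that buffer lives on the client side (in my mapper), so the
24 MB is client memory, separate from the region server heap.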

On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <shujamughal@gmail.com> wrote:

> Ok
> Well... I am getting hundreds of files daily which all need to be processed; that is
> why I am using Hadoop, so it manages the distribution of processing itself.
> Yes, one record has millions of fields
>
> Thanks for comments.
>
>
> On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>>
>> Switch out the JDOM for a Stax parser.
>>
>> Ok, having said that...
>> You said you have a single record per file. Ok that means you have a lot
>> of fields.
>> Because you have 1 record, this isn't a map/reduce problem. You're better
>> off writing a single threaded app
>> to read the file, parse the file using Stax, and then write the fields to
>> HBase.
>>
>> I'm not sure why you have millions of put()s.
>> Do you have millions of fields in this one record?
>>
>> Writing a good stax parser and then mapping the fields to your hbase
>> column(s) will help.
>>
>> HTH
>>
>> -Mike
>> PS. A good stax implementation would be a recursive/re-entrant piece of
>> code.
>> While the code may look simple, it takes a skilled developer to write and
>> maintain.
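>>
>> To make "Stax parser" concrete, here is a minimal, flat cursor-style sketch (not the
>> recursive/re-entrant version I mentioned; the file name is made up and the HBase
>> write is left as a comment):
>>
>>   import java.io.FileInputStream;
>>   import javax.xml.stream.XMLInputFactory;
>>   import javax.xml.stream.XMLStreamConstants;
>>   import javax.xml.stream.XMLStreamReader;
>>
>>   XMLInputFactory factory = XMLInputFactory.newInstance();
>>   XMLStreamReader reader =
>>       factory.createXMLStreamReader(new FileInputStream("record.xml"));
>>
>>   String currentElement = null;
>>   while (reader.hasNext()) {
>>     switch (reader.next()) {
>>       case XMLStreamConstants.START_ELEMENT:
>>         currentElement = reader.getLocalName();   // which field we are inside
>>         break;
>>       case XMLStreamConstants.CHARACTERS:
>>         if (!reader.isWhiteSpace()) {
>>           String value = reader.getText();
>>           // map (currentElement, value) to an HBase column and buffer a Put here
>>         }
>>         break;
>>       case XMLStreamConstants.END_ELEMENT:
>>         currentElement = null;
>>         break;
>>     }
>>   }
>>   reader.close();
>>
>> Unlike a DOM tree, nothing is held in memory beyond the current event, which is the
>> main reason it scales to very large records.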
>>
>>
>> > Date: Mon, 8 Nov 2010 14:36:34 +0500
>> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>> > From: shujamughal@gmail.com
>> > To: user@hbase.apache.org
>> >
>> > Hi,
>> >
>> > I have used the JDOM library to parse the XML in the mapper. In my case, one
>> > single file consists of 1 record, so I give one complete file to the map process
>> > and extract the information I need from it. I have only 2 column
>> > families in my schema, and the bottleneck was the put statements, which run
>> > millions of times for each file. When I comment out the put statement, the job
>> > completes within minutes, but with the put statement it was taking about 7 hours
>> > to complete the same job. Anyhow, I have changed the code according to the
>> > suggestion given by Michael and am now using the Java API to dump data instead
>> > of TableOutputFormat, building a list of puts and flushing them every
>> > 1000 records, which reduced the time significantly. Now the whole job
>> > completes in about 1 hour and 45 minutes, but still not in minutes. Is there
>> > anything left that I could apply to increase performance further?
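>> >
>> > For reference, this is roughly what that part of my mapper looks like now (a
>> > simplified sketch; rItr, rowId, counter and table come from my surrounding code,
>> > and the parsing is omitted):
>> >
>> >   List<Put> puts = new ArrayList<Put>();
>> >   int count = 0;
>> >   while (rItr.hasNext()) {
>> >     Element rElement = (Element) rItr.next();
>> >     Put put = new Put(rowId);
>> >     put.add(Bytes.toBytes("CounterValues"),
>> >             Bytes.toBytes(counter.toString()),
>> >             Bytes.toBytes(rElement.getTextTrim()));
>> >     puts.add(put);
>> >     if (++count % 1000 == 0) {
>> >       table.put(puts);          // one batched call instead of 1000 separate ones
>> >       puts.clear();
>> >     }
>> >   }
>> >   if (!puts.isEmpty()) {
>> >     table.put(puts);            // flush the remainder
>> >   }
>> >   table.flushCommits();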
>> >
>> > Thanks
>> >
>> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <buttler1@llnl.gov> wrote:
>> >
>> > > Good points.
>> > > Before we can make any rational suggestion, we need to know where the
>> > > bottleneck is, so we can make suggestions to move it elsewhere.  I
>> > > personally favor Michael's suggestion to split the ingest and the parsing
>> > > parts of your job, and to switch to a parser that is faster than a DOM
>> > > parser (SAX or Stax). But, without knowing what the bottleneck actually is,
>> > > all of these suggestions are shots in the dark.
>> > >
>> > > What is the network load, the CPU load, the disk load?  Have you at least
>> > > installed Ganglia or some equivalent so that you can see what the load is
>> > > across the cluster?
>> > >
>> > > Dave
>> > >
>> > >
>> > > -----Original Message-----
>> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
>> > > Sent: Friday, November 05, 2010 9:49 AM
>> > > To: user@hbase.apache.org
>> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>> > >
>> > >
>> > > I don't think using the buffered client is going to help a lot with
>> > > performance.
>> > >
>> > > I'm a little confused because it doesn't sound like Shuja is using a
>> > > map/reduce job to parse the file.
>> > > That is... he says he parses the file into a DOM tree. Usually your map
>> > > job reads each record and then in the mapper you parse out that record.
>> > > Within the m/r job we don't parse out the fields in the records because we
>> > > do additional processing which 'dedupes' the data so we don't have to
>> > > further process the data.
>> > > The second job only has to parse a portion of the original records.
>> > >
>> > > So assuming that Shuja is actually using a map reduce job, and each xml
>> > > record is being parsed within the mapper(), there are a couple of things...
>> > > 1) Reduce the number of column families that you are using. (Each column
>> > > family is written to a separate file.)
>> > > 2) Set up the HTable instance in Mapper.setup(). (See the sketch below.)
>> > > 3) Switch to a different DOM class (not all java classes are equal) or
>> > > switch to Stax.
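>> > >
>> > > For (2), something along these lines if he writes to HBase directly from the
>> > > mapper instead of going through TableOutputFormat (a rough sketch; the table
>> > > name is made up):
>> > >
>> > > public class MyParserMapper
>> > >     extends Mapper<LongWritable, Text, NullWritable, Writable> {
>> > >
>> > >   private HTable table;
>> > >
>> > >   @Override
>> > >   protected void setup(Context context) throws IOException {
>> > >     // One HTable per map task, created once instead of once per record/put.
>> > >     table = new HTable(context.getConfiguration(), "mytable");
>> > >     table.setAutoFlush(false);
>> > >   }
>> > >
>> > >   // map() parses the record and calls table.put(...) as it goes
>> > >
>> > >   @Override
>> > >   protected void cleanup(Context context) throws IOException {
>> > >     table.flushCommits();   // push anything still sitting in the write buffer
>> > >     table.close();
>> > >   }
>> > > }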
>> > >
>> > >
>> > >
>> > >
>> > > > From: buttler1@llnl.gov
>> > > > To: user@hbase.apache.org
>> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
>> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
>> > > >
>> > > > Have you tried turning off auto flush, and managing the flush in your own
>> > > > code (say every 1000 puts)?
>> > > > Dave
>> > > >
>> > > >
>> > > > -----Original Message-----
>> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
>> > > > Sent: Friday, November 05, 2010 8:04 AM
>> > > > To: user@hbase.apache.org
>> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
>> > > >
>> > > > Michael
>> > > >
>> > > > Hmm... so you are storing the XML record in HBase and parsing it in a second
>> > > > job, but in my case I am parsing it in the first phase as well. What I do is
>> > > > get the XML file, parse it using JDOM, and then put the data into HBase, so
>> > > > parsing + putting are both in 1 phase, in the mapper code.
>> > > >
>> > > > My actual problem is that after parsing the file, I need to use the put
>> > > > statement millions of times, and I think each statement connects to HBase and
>> > > > then inserts, which might be the reason for the slow processing. So I am
>> > > > trying to figure out some way I can first buffer the data and then insert it
>> > > > in batch fashion; that is, in one put call I can insert many records, and I
>> > > > think if I do it this way then the process will be very fast.
>> > > >
>> > > > Secondly, what does this mean: "we write the raw record in via a single put()
>> > > > so the map() method is a null writable"?
>> > > >
>> > > > Can you explain it more?
>> > > >
>> > > > Thanks
>> > > >
>> > > >
>> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>> > > >
>> > > > >
>> > > > > Shuja,
>> > > > >
>> > > > > Just did a quick glance.
>> > > > >
>> > > > > What is it that you want to do exactly?
>> > > > >
>> > > > > Here's how we do it... (at a high level.)
>> > > > >
>> > > > > Input is an XML file where we want to store the raw XML records in hbase,
>> > > > > one record per row.
>> > > > >
>> > > > > Instead of using the output of the map() method, we write the raw record in
>> > > > > via a single put(), so the map() method is a null writable.
>> > > > >
>> > > > > It's pretty fast. However, fast is relative.
>> > > > >
>> > > > > Another thing... we store the xml record as a string (converted to bytes)
>> > > > > rather than a serialized object.
>> > > > >
>> > > > > Then you can break it down into individual fields in a second batch job.
>> > > > > (You can start with a DOM parser, and later move to a Stax parser.
>> > > > > Depending on which DOM parser you have and the size of the record, it should
>> > > > > be 'fast enough'. A good implementation of Stax tends to be
>> > > > > recursive/re-entrant code which is harder to maintain.)
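>> > > > >
>> > > > > In rough outline, the first (ingest) job's mapper looks something like this
>> > > > > (a sketch, not our actual code; the table/column names and makeRowKey() are
>> > > > > made up for illustration):
>> > > > >
>> > > > > public class RawXmlMapper
>> > > > >     extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
>> > > > >
>> > > > >   private HTable table;
>> > > > >
>> > > > >   @Override
>> > > > >   protected void setup(Context context) throws IOException {
>> > > > >     table = new HTable(context.getConfiguration(), "raw_records");
>> > > > >   }
>> > > > >
>> > > > >   @Override
>> > > > >   public void map(LongWritable key, Text value, Context context)
>> > > > >       throws IOException {
>> > > > >     String xml = value.toString();   // the raw record, untouched
>> > > > >     Put put = new Put(Bytes.toBytes(makeRowKey(xml)));   // hypothetical key builder
>> > > > >     put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"), Bytes.toBytes(xml));
>> > > > >     table.put(put);                  // one put per record
>> > > > >     // nothing is written to the map output; it is effectively a NullWritable
>> > > > >   }
>> > > > >
>> > > > >   @Override
>> > > > >   protected void cleanup(Context context) throws IOException {
>> > > > >     table.close();
>> > > > >   }
>> > > > > }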
>> > > > >
>> > > > > HTH
>> > > > >
>> > > > > -Mike
>> > > > >
>> > > > >
>> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
>> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
>> > > > > > From: shujamughal@gmail.com
>> > > > > > To: user@hbase.apache.org
>> > > > > >
>> > > > > > Hi
>> > > > > >
>> > > > > > I am reading data from raw XML files and inserting the data into HBase
>> > > > > > using TableOutputFormat in a map reduce job, but due to the heavy put
>> > > > > > statements, it takes many hours to process the data. Here is my sample code.
>> > > > > >
>> > > > > >     conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
>> > > > > >     conf.set("xmlinput.start", "<adc>");
>> > > > > >     conf.set("xmlinput.end", "</adc>");
>> > > > > >     conf.set("io.serializations",
>> > > > > >         "org.apache.hadoop.io.serializer.JavaSerialization," +
>> > > > > >         "org.apache.hadoop.io.serializer.WritableSerialization");
>> > > > > >
>> > > > > >     Job job = new Job(conf, "Populate Table with Data");
>> > > > > >
>> > > > > >     FileInputFormat.setInputPaths(job, input);
>> > > > > >     job.setJarByClass(ParserDriver.class);
>> > > > > >     job.setMapperClass(MyParserMapper.class);
>> > > > > >     job.setNumReduceTasks(0);
>> > > > > >     job.setInputFormatClass(XmlInputFormat.class);
>> > > > > >     job.setOutputFormatClass(TableOutputFormat.class);
>> > > > > >
>> > > > > >
>> > > > > > *and mapper code*
>> > > > > >
>> > > > > > public class MyParserMapper extends
>> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
>> > > > > >
>> > > > > >   @Override
>> > > > > >   public void map(LongWritable key, Text value1, Context context)
>> > > > > >       throws IOException, InterruptedException {
>> > > > > >     // doing some processing
>> > > > > >     while (rItr.hasNext()) {
>> > > > > >       // and this put statement runs 132,622,560 times to insert the data
>> > > > > >       context.write(NullWritable.get(),
>> > > > > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
>> > > > > >               Bytes.toBytes(counter.toString()),
>> > > > > >               Bytes.toBytes(rElement.getTextTrim())));
>> > > > > >     }
>> > > > > >   }
>> > > > > > }
>> > > > > >
>> > > > > > Is there any other way of doing this task so I can improve the performance?
>> > > > > >
>> > > > > >
>> > > > > > --
>> > > > > > Regards
>> > > > > > Shuja-ur-Rehman Baig
>> > > > > > <http://pk.linkedin.com/in/shujamughal>
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Regards
>> > > > Shuja-ur-Rehman Baig
>> > > > <http://pk.linkedin.com/in/shujamughal>
>> > >
>> > >
>> >
>> >
>> > --
>> > Regards
>> > Shuja-ur-Rehman Baig
>> > <http://pk.linkedin.com/in/shujamughal>
>>
>>
>
>
>
> --
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>
>
>


-- 
Regards
Shuja-ur-Rehman Baig
<http://pk.linkedin.com/in/shujamughal>
