Date: Tue, 9 Nov 2010 12:32:18 +0200
Subject: Re: Best Way to Insert data into Hbase using Map Reduce
From: Oleg Ruchovets <oruchovets@gmail.com>
To: user@hbase.apache.org

Hi,

Do you use HTablePool? Changing the code to use HTablePool gives me a
significant performance benefit:

    HBaseConfiguration conf = new HBaseConfiguration();
    HTablePool pool = new HTablePool(conf, 10);
    HTable table = pool.getTable(name);

Actually, disabling the WAL, increasing the pool size and rewriting the code
to use the write buffer gives me a good improvement.
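
Roughly, that combination looks like the following. This is only a minimal
sketch against the 0.20-era client API; the table, column and row names and
the buffer size are placeholders, and skipping the WAL trades durability for
speed:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPutExample {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            // The table could equally come from the HTablePool above; "mytable" is a placeholder.
            HTable table = new HTable(conf, "mytable");

            table.setAutoFlush(false);                    // buffer puts on the client side
            table.setWriteBufferSize(1024L * 1024 * 12);  // 12 MB write buffer -- tune to your heap

            Put put = new Put(Bytes.toBytes("row-1"));    // row key is a placeholder
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
            put.setWriteToWAL(false);                     // skip the WAL: faster, but the edit is
                                                          // lost if the region server crashes
            table.put(put);                               // goes into the client-side buffer

            table.flushCommits();                         // actually send the buffered puts
            table.close();
        }
    }
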
I wonder: how can I check that my insertion process is optimized? I mean, if
the insertion took X time -- is that good or not, and how can I check it?

Thanks
Oleg.

On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <shujamughal@gmail.com> wrote:

> One more thing which I want to ask: I have found that people have given
> the following buffer size:
>
>   table.setWriteBufferSize(1024*1024*24);
>   table.setAutoFlush(false);
>
> Is there any specific reason for giving such a buffer size, and how much
> RAM is required for it? I have given 4 GB to each region server, and I can
> see the used heap value for the region servers increasing and increasing,
> and then the region servers crash.
>
> On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <shujamughal@gmail.com> wrote:
>
> > Ok.
> > Well... I am getting hundreds of files daily which all need to be
> > processed; that's why I am using hadoop, so it manages the distribution
> > of the processing itself.
> > Yes, one record has millions of fields.
> >
> > Thanks for the comments.
> >
> > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel
> > <michael_segel@hotmail.com> wrote:
> >
> >> Switch out the JDOM for a Stax parser.
> >>
> >> Ok, having said that...
> >> You said you have a single record per file. Ok, that means you have a
> >> lot of fields. Because you have 1 record, this isn't a map/reduce
> >> problem. You're better off writing a single-threaded app to read the
> >> file, parse it using Stax, and then write the fields to HBase.
> >>
> >> I'm not sure why you have millions of put()s.
> >> Do you have millions of fields in this one record?
> >>
> >> Writing a good Stax parser and then mapping the fields to your HBase
> >> column(s) will help.
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. A good Stax implementation would be a recursive/re-entrant piece
> >> of code. While the code may look simple, it takes a skilled developer
> >> to write and maintain.
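
As a rough illustration of the Stax approach Michael suggests, something
along these lines with the standard javax.xml.stream API would replace the
JDOM parsing. The <field name="..."> element layout and the "cf" column
family are made up here and would have to match the real XML and schema:

    import java.io.InputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StaxRecordParser {

        // Streams through one XML record and maps each <field name="..."> element
        // to a column in the "cf" family, without building a DOM tree.
        public Put parse(byte[] rowKey, InputStream in) throws Exception {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            Put put = new Put(rowKey);
            String currentField = null;
            StringBuilder text = new StringBuilder();

            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("field".equals(reader.getLocalName())) {
                            currentField = reader.getAttributeValue(null, "name");
                            text.setLength(0);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (currentField != null) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("field".equals(reader.getLocalName()) && currentField != null) {
                            put.add(Bytes.toBytes("cf"),
                                    Bytes.toBytes(currentField),
                                    Bytes.toBytes(text.toString().trim()));
                            currentField = null;
                        }
                        break;
                }
            }
            reader.close();
            return put;
        }
    }
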
> >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > From: shujamughal@gmail.com
> >> > To: user@hbase.apache.org
> >> >
> >> > Hi
> >> >
> >> > I have used the JDOM library to parse the xml in the mapper, and in
> >> > my case one single file consists of 1 record, so I give one complete
> >> > file to the map process and extract the information I need from it.
> >> > I have only 2 column families in my schema, and the bottleneck was
> >> > the put statements, which run millions of times for each file. When
> >> > I comment out the put statement, the job completes within minutes,
> >> > but with the put statement it was taking about 7 hours to complete
> >> > the same job. Anyhow, I have changed the code according to the
> >> > suggestion given by Michael and am now using the java api to dump
> >> > the data instead of the table output format, collecting the puts in
> >> > a list and flushing them every 1000 records, and it reduces the time
> >> > significantly. Now the whole job takes about 1 hour and 45 minutes,
> >> > but still not minutes. So is there anything left which I might apply
> >> > to increase performance?
> >> >
> >> > Thanks
> >> >
> >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <buttler1@llnl.gov>
> >> > wrote:
> >> >
> >> > > Good points.
> >> > > Before we can make any rational suggestion, we need to know where
> >> > > the bottleneck is, so we can make suggestions to move it elsewhere.
> >> > > I personally favor Michael's suggestion to split the ingest and the
> >> > > parsing parts of your job, and to switch to a parser that is faster
> >> > > than a DOM parser (SAX or Stax). But, without knowing what the
> >> > > bottleneck actually is, all of these suggestions are shots in the
> >> > > dark.
> >> > >
> >> > > What is the network load, the CPU load, the disk load? Have you at
> >> > > least installed Ganglia or some equivalent so that you can see what
> >> > > the load is across the cluster?
> >> > >
> >> > > Dave
> >> > >
> >> > > -----Original Message-----
> >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> >> > > Sent: Friday, November 05, 2010 9:49 AM
> >> > > To: user@hbase.apache.org
> >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > >
> >> > > I don't think using the buffered client is going to help a lot with
> >> > > performance.
> >> > >
> >> > > I'm a little confused, because it doesn't sound like Shuja is using
> >> > > a map/reduce job to parse the file. That is... he says he parses
> >> > > the file into a DOM tree. Usually your map job parses each record,
> >> > > and then in the mapper you parse out the record. Within the m/r job
> >> > > we don't parse out the fields in the records, because we do
> >> > > additional processing which 'dedupes' the data, so we don't have to
> >> > > further process the data. The second job only has to parse a
> >> > > portion of the original records.
> >> > >
> >> > > So assuming that Shuja is actually using a map reduce job, and each
> >> > > xml record is being parsed within the mapper(), there are a couple
> >> > > of things...
> >> > > 1) Reduce the number of column families that you are using. (Each
> >> > > column family is written to a separate file.)
> >> > > 2) Set up the HTable instance in Mapper.setup().
> >> > > 3) Switch to a different DOM class (not all java classes are equal)
> >> > > or switch to Stax.
> >> > >
> >> > > > From: buttler1@llnl.gov
> >> > > > To: user@hbase.apache.org
> >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Have you tried turning off auto flush, and managing the flush in
> >> > > > your own code (say every 1000 puts)?
> >> > > > Dave
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> >> > > > Sent: Friday, November 05, 2010 8:04 AM
> >> > > > To: user@hbase.apache.org
> >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Michael
> >> > > >
> >> > > > hum.... so u are storing the xml record in hbase and parsing it
> >> > > > in a second job, but in my case I am parsing it also in the first
> >> > > > phase. What I do: I get the xml file, I parse it using jdom, and
> >> > > > then put the data in hbase. So parsing+putting are both in 1
> >> > > > phase, in the mapper code.
> >> > > >
> >> > > > My actual problem is that after parsing the file, I need to use
> >> > > > the put statement millions of times, and I think for each
> >> > > > statement it connects to hbase and then inserts it, and this
> >> > > > might be the reason for the slow processing. So I am trying to
> >> > > > figure out some way where I can first buffer the data and then
> >> > > > insert it in batch fashion. It means in one put statement I can
> >> > > > insert many records, and I think if I do it this way then the
> >> > > > process will be very fast.
> >> > > >
> >> > > > Secondly, what does this mean? "we write the raw record in via a
> >> > > > single put() so the map() method is a null writable."
> >> > > >
> >> > > > Can u explain it more?
> >> > > >
> >> > > > Thanks
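
A minimal sketch of that batching idea (collect the Puts in a list, hand them
to the client every 1000 records, with auto-flush off) might look like the
following; the table and column names, the row keys and the loop are
placeholders standing in for the real parse:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedLoader {
        private static final int FLUSH_EVERY = 1000;    // flush threshold, tune as needed

        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "mytable"); // placeholder name
            table.setAutoFlush(false);                  // don't send each Put individually

            List<Put> batch = new ArrayList<Put>();
            for (int i = 0; i < 100000; i++) {          // stand-in for the real parse loop
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                batch.add(put);

                if (batch.size() >= FLUSH_EVERY) {      // hand 1000 puts to the client at once
                    table.put(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                table.put(batch);                       // whatever is left over
            }
            table.flushCommits();                       // push anything still in the write buffer
            table.close();
        }
    }
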
> >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel
> >> > > > <michael_segel@hotmail.com> wrote:
> >> > > >
> >> > > > > Suja,
> >> > > > >
> >> > > > > Just did a quick glance.
> >> > > > >
> >> > > > > What is it that you want to do exactly?
> >> > > > >
> >> > > > > Here's how we do it... (at a high level.)
> >> > > > >
> >> > > > > Input is an XML file where we want to store the raw XML records
> >> > > > > in hbase, one record per row.
> >> > > > >
> >> > > > > Instead of using the output of the map() method, we write the
> >> > > > > raw record in via a single put(), so the map() method is a null
> >> > > > > writable.
> >> > > > >
> >> > > > > It's pretty fast. However, fast is relative.
> >> > > > >
> >> > > > > Another thing... we store the xml record as a string (converted
> >> > > > > to bytes) rather than as a serialized object.
> >> > > > >
> >> > > > > Then you can break it down into individual fields in a second
> >> > > > > batch job. (You can start with a DOM parser, and later move to
> >> > > > > a Stax parser. Depending on which DOM parser you have and the
> >> > > > > size of the record, it should be 'fast enough'. A good
> >> > > > > implementation of Stax tends to be recursive/re-entrant code,
> >> > > > > which is harder to maintain.)
> >> > > > >
> >> > > > > HTH
> >> > > > >
> >> > > > > -Mike
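
A minimal sketch of the pattern Michael describes: the mapper opens its own
HTable in setup(), writes each raw XML record into HBase with a single Put,
and emits nothing to the job output, so there is no reducer and no
TableOutputFormat. The table and family names and the row key below are
placeholders only:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RawXmlMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        private HTable table;

        @Override
        protected void setup(Context context) throws IOException {
            // One HTable per mapper, created once, with client-side buffering on.
            table = new HTable(new HBaseConfiguration(), "rawxml");  // placeholder table name
            table.setAutoFlush(false);
            table.setWriteBufferSize(1024L * 1024 * 12);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException {
            // Row key choice is up to you; the file offset is only an illustration.
            Put put = new Put(Bytes.toBytes("record-" + key.get()));
            put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"),
                    Bytes.toBytes(value.toString()));
            table.put(put);
            // Nothing is written to the job output -- the map output is effectively null.
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            table.flushCommits();   // push whatever is still in the write buffer
            table.close();
        }
    }
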
> >> > > > >
> >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> >> > > > > > From: shujamughal@gmail.com
> >> > > > > > To: user@hbase.apache.org
> >> > > > > >
> >> > > > > > Hi
> >> > > > > >
> >> > > > > > I am reading data from raw xml files and inserting the data
> >> > > > > > into hbase using TableOutputFormat in a map reduce job, but
> >> > > > > > due to the heavy put statements, it takes many hours to
> >> > > > > > process the data. Here is my sample code:
> >> > > > > >
> >> > > > > >   conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >> > > > > >   conf.set("xmlinput.start", "");
> >> > > > > >   conf.set("xmlinput.end", "");
> >> > > > > >   conf.set("io.serializations",
> >> > > > > >       "org.apache.hadoop.io.serializer.JavaSerialization,"
> >> > > > > >       + "org.apache.hadoop.io.serializer.WritableSerialization");
> >> > > > > >
> >> > > > > >   Job job = new Job(conf, "Populate Table with Data");
> >> > > > > >
> >> > > > > >   FileInputFormat.setInputPaths(job, input);
> >> > > > > >   job.setJarByClass(ParserDriver.class);
> >> > > > > >   job.setMapperClass(MyParserMapper.class);
> >> > > > > >   job.setNumReduceTasks(0);
> >> > > > > >   job.setInputFormatClass(XmlInputFormat.class);
> >> > > > > >   job.setOutputFormatClass(TableOutputFormat.class);
> >> > > > > >
> >> > > > > > *and mapper code*
> >> > > > > >
> >> > > > > >   public class MyParserMapper extends
> >> > > > > >       Mapper<LongWritable, Text, NullWritable, Put> {
> >> > > > > >
> >> > > > > >     @Override
> >> > > > > >     public void map(LongWritable key, Text value1, Context context)
> >> > > > > >         throws IOException, InterruptedException {
> >> > > > > >       // doing some processing
> >> > > > > >       while (rItr.hasNext()) {
> >> > > > > >         // and this put statement runs 132,622,560 times to
> >> > > > > >         // insert the data
> >> > > > > >         context.write(NullWritable.get(),
> >> > > > > >             new Put(rowId).add(Bytes.toBytes("CounterValues"),
> >> > > > > >                 Bytes.toBytes(counter.toString()),
> >> > > > > >                 Bytes.toBytes(rElement.getTextTrim())));
> >> > > > > >       }
> >> > > > > >     }
> >> > > > > >   }
> >> > > > > >
> >> > > > > > Is there any other way of doing this task so I can improve
> >> > > > > > the performance?
> >> > > > > >
> >> > > > > > --
> >> > > > > > Regards
> >> > > > > > Shuja-ur-Rehman Baig