hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Best Way to Insert data into Hbase using Map Reduce
Date Tue, 09 Nov 2010 16:25:30 GMT

OK,

I responded to a different question ...

You don't need to use a pool in this case.
In setup() you can create a single instance of HTable and then use it in your map() method.
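
Something like this is what I have in mind -- just a rough sketch, assuming a table called
"myTable" and the org.apache.hadoop.hbase.mapreduce API; the buffer size, row id and
column family/qualifier are placeholders you'd swap for your own:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyParserMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private HTable table;

    @Override
    protected void setup(Context context) throws IOException {
        // one HTable per map task, created once
        HBaseConfiguration config = new HBaseConfiguration();
        table = new HTable(config, "myTable");
        table.setAutoFlush(false);                  // buffer puts on the client side
        table.setWriteBufferSize(1024 * 1024 * 12); // flush roughly every 12 MB
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // parse the record here, then for each field build a Put and hand it
        // to the table; with autoFlush off it goes into the write buffer:
        //   Put put = new Put(rowId);
        //   put.add(family, qualifier, cellValue);
        //   table.put(put);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        table.flushCommits();   // push whatever is still buffered
        table.close();
    }
}

With autoFlush off, table.put() just fills the client-side write buffer, and flushCommits()
in cleanup() pushes whatever is left at the end of the task.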

Also, I'd stay away from turning off the WAL.

Having said that... yes writing to WAL means you incur some overhead when compared to *not*
writing to WAL.

However this is a map/reduce job and disabling the WAL is probably one of the last things
I would do to improve performance.
The reason is that if you don't write to the WAL you risk losing data. Thinking down the road...
couldn't you use the WAL to do log shipping to a different cluster/cloud?

But I digress. 

Turning off the WAL increases your potential risk for data loss if something goes wrong. 
There are a lot of options that could improve performance, including a design change, that
could get you to a point where the batch process occurs 'fast enough'.

And that's a crucial point.

How fast do you want to go and how much are you willing to spend?

To give you an extreme example... suppose I have a job that takes 2 hours to run. I figure
I can shave 10 minutes off the job but it would take 40 hours of work. Does it make sense
to do this work if my SLA (Service Level Agreement) with my users is that the job has to run
within 3 hours?

The point is that my job runs 'fast enough' that I can't justify the hours required to get
a marginal performance improvement. Of course I may still want to do the work if it means
improving the quality of code and reducing my ongoing maintenance costs... but that's a different
argument.

Specific to your example... switching from a DOM parser to a StAX implementation will do more
to improve your performance and memory footprint than turning off the WAL will.
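
For example, a bare-bones StAX loop might look something like the sketch below.
(javax.xml.stream ships with Java 6; the <field name="...">value</field> shape is just an
assumption here -- adjust the element and attribute names to your actual schema, and hook the
Put building in where the comment sits.)

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxFieldReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new FileInputStream(args[0]));
        String currentField = null;
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "field".equals(reader.getLocalName())) {
                // remember which field we're in; nothing else is kept in memory
                currentField = reader.getAttributeValue(null, "name");
            } else if (event == XMLStreamConstants.CHARACTERS && currentField != null) {
                String value = reader.getText().trim();
                // this is where you'd build the Put for (currentField, value)
                // instead of materializing the whole document as a DOM tree
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                currentField = null;
            }
        }
        reader.close();
    }
}

The point is that the parser only ever holds the current element, so memory stays flat no
matter how big the record gets -- unlike a DOM tree of the whole file.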

JMHO... HTH

-Mike


> Date: Tue, 9 Nov 2010 20:46:02 +0500
> Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> From: shujamughal@gmail.com
> To: user@hbase.apache.org
> 
> Hi Oleg,
> 
> Yes, I have used HTablePool. Here is my basic code skeleton:
> 
>  public void setup(Context context)
>  {
>      HBaseConfiguration config = new HBaseConfiguration();
>      config.set("hbase.zookeeper.quorum", Constants.HBASE_ZOOKEEPER_QUORUM);
>      config.set("hbase.zookeeper.property.clientPort",
>                 Constants.HBASE_ZOOKEEPER_PROPERTY_CLIENTPORT);
>      HTablePool tablePool = new HTablePool(config, 50);
>      table = (HTable) tablePool.getTable("myTable");
>  }
> 
>  public void map(LongWritable key, Text value1, Context context)
>  {
>      List<Put> puts = new ArrayList<Put>();
>      table.setWriteBufferSize(1024 * 1024 * 24);
>      table.setAutoFlush(false);
> 
>      while (rItr.hasNext())      // iterate over the parsed xml elements
>      {
>          Put put = new Put(rowId);
>          put.add(...);           // family/qualifier/value elided here
>          put.setWriteToWAL(false);
>          puts.add(put);
> 
>          if (cnt % 500 == 0)
>          {
>              table.getWriteBuffer().addAll(puts);
>              table.flushCommits();
>              puts.clear();
>          }
>          cnt++;
>      }//while
> 
>      if (puts.size() > 0)
>      {
>          table.getWriteBuffer().addAll(puts);
>          table.flushCommits();
>          puts.clear();
>      }
>  }//map
> 
> 
> On Tue, Nov 9, 2010 at 3:32 PM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
> 
> > Hi ,
> > Do you use HTablePool?
> > Changing the code to use HTablePool gives me a significant performance
> > benefit.
> >
> >
> > HBaseConfiguration conf = new HBaseConfiguration();
> > HTablePool pool = new HTablePool(conf, 10);
> > HTable table = pool.getTable(name);
> >
> > Actually disabling the WAL, increasing the pool size and rewriting the code
> > to use the write buffer gives me a good improvement.
> >
> What do you mean by *rewriting code to use the write buffer*?
> 
> 
> 
> > I wonder: how can I check that my insertion process is optimized?
> > I mean if insertion took X time -- is it good or not? And how can I check
> > it.
> >
> *I am also not sure about it.*
> 
> >
> > Thanks Oleg.
> >
> >
> >
> > On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <shujamughal@gmail.com>
> > wrote:
> >
> > > One more thing which I want to ask: I have found that people have given
> > > the following buffer size:
> > >
> > >  table.setWriteBufferSize(1024*1024*24);
> > >  table.setAutoFlush(false);
> > >
> > > Is there any specific reason for giving such a buffer size? And how much
> > > RAM is required for it? I have given 4 GB to each region server and I can
> > > see that the used heap value for the region servers keeps increasing and
> > > then the region servers crash.
> > >
> > > On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <shujamughal@gmail.com>
> > > wrote:
> > >
> > > > Ok
> > > > Well... I am getting hundreds of files daily which all need to be
> > > > processed, that's why I am using Hadoop, so it manages the distribution
> > > > of the processing itself.
> > > > Yes, one record has millions of fields.
> > > >
> > > > Thanks for the comments.
> > > >
> > > >
> > > > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel <michael_segel@hotmail.com>
> > > > wrote:
> > > >
> > > >>
> > > >> Switch out the JDOM for a Stax parser.
> > > >>
> > > >> Ok, having said that...
> > > >> You said you have a single record per file. Ok that means you have a
> > > >> lot of fields.
> > > >> Because you have 1 record, this isn't a map/reduce problem. You're
> > > >> better off writing a single threaded app to read the file, parse the
> > > >> file using Stax, and then write the fields to HBase.
> > > >>
> > > >> I'm not sure why you have millions of put()s.
> > > >> Do you have millions of fields in this one record?
> > > >>
> > > >> Writing a good stax parser and then mapping the fields to your hbase
> > > >> column(s) will help.
> > > >>
> > > >> HTH
> > > >>
> > > >> -Mike
> > > >> PS. A good stax implementation would be a recursive/re-entrant piece
> > > >> of code. While the code may look simple, it takes a skilled developer
> > > >> to write and maintain.
> > > >>
> > > >>
> > > >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> > > >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > > >> > From: shujamughal@gmail.com
> > > >> > To: user@hbase.apache.org
> > > >> >
> > > >> > Hi
> > > >> >
> > > >> > I have used the JDOM library to parse the xml in the mapper, and in my
> > > >> > case one single file consists of 1 record, so I give one complete file
> > > >> > to the map process and extract the information from it which I need. I
> > > >> > have only 2 column families in my schema and the bottleneck was the put
> > > >> > statements, which run millions of times for each file. When I comment
> > > >> > out this put statement the job completes within minutes, but with the
> > > >> > put statement it was taking about 7 hours to complete the same job.
> > > >> > Anyhow I have changed the code according to the suggestion given by
> > > >> > Michael and now use the java api to dump data instead of table output
> > > >> > format, and use a list of puts and then flush them every 1000 records,
> > > >> > and that reduces the time significantly. Now the whole job is processed
> > > >> > in about 1 hour and 45 min, but still not in minutes. So is there
> > > >> > anything left which I might apply to increase performance?
> > > >> >
> > > >> > Thanks
> > > >> >
> > > >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <buttler1@llnl.gov>
> > > >> wrote:
> > > >> >
> > > >> > > Good points.
> > > >> > > Before we can make any rational suggestion, we need to know where the
> > > >> > > bottleneck is, so we can make suggestions to move it elsewhere. I
> > > >> > > personally favor Michael's suggestion to split the ingest and the
> > > >> > > parsing parts of your job, and to switch to a parser that is faster
> > > >> > > than a DOM parser (SAX or Stax). But, without knowing what the
> > > >> > > bottleneck actually is, all of these suggestions are shots in the dark.
> > > >> > >
> > > >> > > What is the network load, the CPU load, the disk load? Have you at
> > > >> > > least installed Ganglia or some equivalent so that you can see what
> > > >> > > the load is across the cluster?
> > > >> > >
> > > >> > > Dave
> > > >> > >
> > > >> > >
> > > >> > > -----Original Message-----
> > > >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> > > >> > > Sent: Friday, November 05, 2010 9:49 AM
> > > >> > > To: user@hbase.apache.org
> > > >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > > >> > >
> > > >> > >
> > > >> > > I don't think using the buffered client is going to help a lot with
> > > >> > > performance.
> > > >> > >
> > > >> > > I'm a little confused because it doesn't sound like Shuja is using a
> > > >> > > map/reduce job to parse the file.
> > > >> > > That is... he says he parses the file into a dom tree. Usually your
> > > >> > > map job parses each record and then in the mapper you parse out the
> > > >> > > record. Within the m/r job we don't parse out the fields in the
> > > >> > > records because we do additional processing which 'dedupes' the data
> > > >> > > so we don't have to further process the data.
> > > >> > > The second job only has to parse a portion of the original records.
> > > >> > >
> > > >> > > So assuming that Shuja is actually using a map reduce job, and each
> > > >> > > xml record is being parsed within the mapper(), there are a couple of
> > > >> > > things...
> > > >> > > 1) Reduce the number of column families that you are using. (Each
> > > >> > > column family is written to a separate file.)
> > > >> > > 2) Set up the HTable instance in Mapper.setup().
> > > >> > > 3) Switch to a different dom class (not all java classes are equal)
> > > >> > > or switch to Stax.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > > From: buttler1@llnl.gov
> > > >> > > > To: user@hbase.apache.org
> > > >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> > > >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> > > >> > > >
> > > >> > > > Have you tried turning off auto flush, and managing the flush in
> > > >> > > > your own code (say every 1000 puts)?
> > > >> > > > Dave
> > > >> > > >
> > > >> > > >
> > > >> > > > -----Original Message-----
> > > >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> > > >> > > > Sent: Friday, November 05, 2010 8:04 AM
> > > >> > > > To: user@hbase.apache.org
> > > >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> > > >> > > >
> > > >> > > > Michael
> > > >> > > >
> > > >> > > > Hmm... so you are storing the xml record in hbase and in a second
> > > >> > > > job you are parsing it. But in my case I am parsing it also in the
> > > >> > > > first phase. What I do is: I get the xml file and parse it using
> > > >> > > > jdom and then put the data in hbase, so parsing + putting are both
> > > >> > > > in one phase, in the mapper code.
> > > >> > > >
> > > >> > > > My actual problem is that after parsing the file, I need to use the
> > > >> > > > put statement millions of times, and I think for each statement it
> > > >> > > > connects to hbase and then inserts it, and this might be the reason
> > > >> > > > for the slow processing. So I am trying to figure out some way that
> > > >> > > > I can first buffer the data and then insert it in batch fashion. It
> > > >> > > > means in one put statement I can insert many records, and I think if
> > > >> > > > I do it this way then the process will be very fast.
> > > >> > > >
> > > >> > > > Secondly, what does this mean? "we write the raw record in via a
> > > >> > > > single put() so the map() method is a null writable."
> > > >> > > >
> > > >> > > > Can you explain it more?
> > > >> > > >
> > > >> > > > Thanks
> > > >> > > >
> > > >> > > >
> > > >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel
> > > >> > > > <michael_segel@hotmail.com> wrote:
> > > >> > > >
> > > >> > > > >
> > > >> > > > > Shuja,
> > > >> > > > >
> > > >> > > > > Just did a quick glance.
> > > >> > > > >
> > > >> > > > > What is it that you want to do exactly?
> > > >> > > > >
> > > >> > > > > Here's how we do it... (at a high level.)
> > > >> > > > >
> > > >> > > > > Input is an XML file where we want to store the raw XML records
> > > >> > > > > in hbase, one record per row.
> > > >> > > > >
> > > >> > > > > Instead of using the output of the map() method, we write the raw
> > > >> > > > > record in via a single put() so the map() method is a null
> > > >> > > > > writable.
> > > >> > > > >
> > > >> > > > > It's pretty fast. However fast is relative.
> > > >> > > > >
> > > >> > > > > Another thing... we store the xml record as a string (converted
> > > >> > > > > to bytes) rather than a serialized object.
> > > >> > > > >
> > > >> > > > > Then you can break it down into individual fields in a second
> > > >> > > > > batch job. (You can start with a DOM parser, and later move to a
> > > >> > > > > Stax parser. Depending on which DOM parser you have and the size
> > > >> > > > > of the record, it should be 'fast enough'. A good implementation
> > > >> > > > > of Stax tends to be recursive/re-entrant code which is harder to
> > > >> > > > > maintain.)
> > > >> > > > >
> > > >> > > > > HTH
> > > >> > > > >
> > > >> > > > > -Mike
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> > > >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> > > >> > > > > > From: shujamughal@gmail.com
> > > >> > > > > > To: user@hbase.apache.org
> > > >> > > > > >
> > > >> > > > > > Hi
> > > >> > > > > >
> > > >> > > > > > I am reading data from raw xml files and inserting the data
> > > >> > > > > > into hbase using TableOutputFormat in a map reduce job, but due
> > > >> > > > > > to heavy put statements it takes many hours to process the
> > > >> > > > > > data. Here is my sample code:
> > > >> > > > > >
> > > >> > > > > > conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> > > >> > > > > > conf.set("xmlinput.start", "<adc>");
> > > >> > > > > > conf.set("xmlinput.end", "</adc>");
> > > >> > > > > > conf.set("io.serializations",
> > > >> > > > > >     "org.apache.hadoop.io.serializer.JavaSerialization,org.apache.hadoop.io.serializer.WritableSerialization");
> > > >> > > > > >
> > > >> > > > > > Job job = new Job(conf, "Populate Table with Data");
> > > >> > > > > >
> > > >> > > > > > FileInputFormat.setInputPaths(job, input);
> > > >> > > > > > job.setJarByClass(ParserDriver.class);
> > > >> > > > > > job.setMapperClass(MyParserMapper.class);
> > > >> > > > > > job.setNumReduceTasks(0);
> > > >> > > > > > job.setInputFormatClass(XmlInputFormat.class);
> > > >> > > > > > job.setOutputFormatClass(TableOutputFormat.class);
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > *and mapper code*
> > > >> > > > > >
> > > >> > > > > > public class MyParserMapper extends
> > > >> > > > > >     Mapper<LongWritable, Text, NullWritable, Writable> {
> > > >> > > > > >
> > > >> > > > > >   @Override
> > > >> > > > > >   public void map(LongWritable key, Text value1, Context context)
> > > >> > > > > >       throws IOException, InterruptedException {
> > > >> > > > > >     *//doing some processing*
> > > >> > > > > >     while(rItr.hasNext())
> > > >> > > > > >     {
> > > >> > > > > >       *// and this put statement runs 132,622,560 times to insert the data*
> > > >> > > > > >       context.write(NullWritable.get(),
> > > >> > > > > >           new Put(rowId).add(Bytes.toBytes("CounterValues"),
> > > >> > > > > >               Bytes.toBytes(counter.toString()),
> > > >> > > > > >               Bytes.toBytes(rElement.getTextTrim())));
> > > >> > > > > >     }
> > > >> > > > > >   }
> > > >> > > > > > }
> > > >> > > > > >
> > > >> > > > > > Is there any other way of doing this task so I can improve the
> > > >> > > > > > performance?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > --
> > > >> > > > > > Regards
> > > >> > > > > > Shuja-ur-Rehman Baig
> > > >> > > > > > <http://pk.linkedin.com/in/shujamughal>
> > > >> > > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > >
> > > >> > > > --
> > > >> > > > Regards
> > > >> > > > Shuja-ur-Rehman Baig
> > > >> > > > <http://pk.linkedin.com/in/shujamughal>
> > > >> > >
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Regards
> > > >> > Shuja-ur-Rehman Baig
> > > >> > <http://pk.linkedin.com/in/shujamughal>
> > > >>
> > > >>
> > > >
> > > >
> > > >
> > > > --
> > > > Regards
> > > > Shuja-ur-Rehman Baig
> > > > <http://pk.linkedin.com/in/shujamughal>
> > > >
> > > >
> > >
> > >
> > > --
> > > Regards
> > > Shuja-ur-Rehman Baig
> > > <http://pk.linkedin.com/in/shujamughal>
> > >
> >
> 
> 
> 
> -- 
> Regards
> Shuja-ur-Rehman Baig
> <http://pk.linkedin.com/in/shujamughal>