Date: Tue, 9 Nov 2010 12:32:18 +0200
Subject: Re: Best Way to Insert data into Hbase using Map Reduce
From: Oleg Ruchovets <oruchovets@gmail.com>
To: user@hbase.apache.org

Hi,

Do you use HTablePool? Changing the code to use HTablePool gives me a
significant performance benefit:

    HBaseConfiguration conf = new HBaseConfiguration();
    HTablePool pool = new HTablePool(conf, 10);
    HTable table = pool.getTable(name);

Actually, disabling the WAL, increasing the pool size and rewriting the code
to use the write buffer gives me a good improvement.
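
Roughly, that combination looks like the following. This is only a minimal
sketch against the 0.20-era client API; the table, column and row names and
the buffer size are placeholders, and skipping the WAL trades durability for
speed:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedPutExample {
        public static void main(String[] args) throws Exception {
            HBaseConfiguration conf = new HBaseConfiguration();
            // The table could equally come from the HTablePool above; "mytable" is a placeholder.
            HTable table = new HTable(conf, "mytable");

            table.setAutoFlush(false);                    // buffer puts on the client side
            table.setWriteBufferSize(1024L * 1024 * 12);  // 12 MB write buffer -- tune to your heap

            Put put = new Put(Bytes.toBytes("row-1"));    // row key is a placeholder
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
            put.setWriteToWAL(false);                     // skip the WAL: faster, but the edit is
                                                          // lost if the region server crashes
            table.put(put);                               // goes into the client-side buffer

            table.flushCommits();                         // actually send the buffered puts
            table.close();
        }
    }
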
I wonder: how can I check that my insertion process is optimized? I mean, if
the insertion took X time -- is that good or not, and how can I check it?

Thanks
Oleg.

On Mon, Nov 8, 2010 at 6:59 PM, Shuja Rehman <shujamughal@gmail.com> wrote:

> One more thing which I want to ask: I have found that people have given
> the following buffer size:
>
>   table.setWriteBufferSize(1024*1024*24);
>   table.setAutoFlush(false);
>
> Is there any specific reason for giving such a buffer size, and how much
> RAM is required for it? I have given 4 GB to each region server, and I can
> see the used heap value for the region servers increasing and increasing,
> and then the region servers crash.
>
> On Mon, Nov 8, 2010 at 9:26 PM, Shuja Rehman <shujamughal@gmail.com> wrote:
>
> > Ok.
> > Well... I am getting hundreds of files daily which all need to be
> > processed; that's why I am using hadoop, so it manages the distribution
> > of the processing itself.
> > Yes, one record has millions of fields.
> >
> > Thanks for the comments.
> >
> > On Mon, Nov 8, 2010 at 8:50 PM, Michael Segel
> > <michael_segel@hotmail.com> wrote:
> >
> >> Switch out the JDOM for a Stax parser.
> >>
> >> Ok, having said that...
> >> You said you have a single record per file. Ok, that means you have a
> >> lot of fields. Because you have 1 record, this isn't a map/reduce
> >> problem. You're better off writing a single-threaded app to read the
> >> file, parse it using Stax, and then write the fields to HBase.
> >>
> >> I'm not sure why you have millions of put()s.
> >> Do you have millions of fields in this one record?
> >>
> >> Writing a good Stax parser and then mapping the fields to your HBase
> >> column(s) will help.
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. A good Stax implementation would be a recursive/re-entrant piece
> >> of code. While the code may look simple, it takes a skilled developer
> >> to write and maintain.
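
As a rough illustration of the Stax approach Michael suggests, something
along these lines with the standard javax.xml.stream API would replace the
JDOM parsing. The <field name="..."> element layout and the "cf" column
family are made up here and would have to match the real XML and schema:

    import java.io.InputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StaxRecordParser {

        // Streams through one XML record and maps each <field name="..."> element
        // to a column in the "cf" family, without building a DOM tree.
        public Put parse(byte[] rowKey, InputStream in) throws Exception {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(in);
            Put put = new Put(rowKey);
            String currentField = null;
            StringBuilder text = new StringBuilder();

            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        if ("field".equals(reader.getLocalName())) {
                            currentField = reader.getAttributeValue(null, "name");
                            text.setLength(0);
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (currentField != null) {
                            text.append(reader.getText());
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        if ("field".equals(reader.getLocalName()) && currentField != null) {
                            put.add(Bytes.toBytes("cf"),
                                    Bytes.toBytes(currentField),
                                    Bytes.toBytes(text.toString().trim()));
                            currentField = null;
                        }
                        break;
                }
            }
            reader.close();
            return put;
        }
    }
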
> >> > Date: Mon, 8 Nov 2010 14:36:34 +0500
> >> > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > From: shujamughal@gmail.com
> >> > To: user@hbase.apache.org
> >> >
> >> > Hi
> >> >
> >> > I have used the JDOM library to parse the xml in the mapper, and in
> >> > my case one single file consists of 1 record, so I give one complete
> >> > file to the map process and extract the information I need from it.
> >> > I have only 2 column families in my schema, and the bottleneck was
> >> > the put statements, which run millions of times for each file. When
> >> > I comment out the put statement, the job completes within minutes,
> >> > but with the put statement it was taking about 7 hours to complete
> >> > the same job. Anyhow, I have changed the code according to the
> >> > suggestion given by Michael and am now using the java api to dump
> >> > the data instead of the table output format, collecting the puts in
> >> > a list and flushing them every 1000 records, and it reduces the time
> >> > significantly. Now the whole job takes about 1 hour and 45 minutes,
> >> > but still not minutes. So is there anything left which I might apply
> >> > to increase performance?
> >> >
> >> > Thanks
> >> >
> >> > On Fri, Nov 5, 2010 at 10:46 PM, Buttler, David <buttler1@llnl.gov>
> >> > wrote:
> >> >
> >> > > Good points.
> >> > > Before we can make any rational suggestion, we need to know where
> >> > > the bottleneck is, so we can make suggestions to move it elsewhere.
> >> > > I personally favor Michael's suggestion to split the ingest and the
> >> > > parsing parts of your job, and to switch to a parser that is faster
> >> > > than a DOM parser (SAX or Stax). But, without knowing what the
> >> > > bottleneck actually is, all of these suggestions are shots in the
> >> > > dark.
> >> > >
> >> > > What is the network load, the CPU load, the disk load? Have you at
> >> > > least installed Ganglia or some equivalent so that you can see what
> >> > > the load is across the cluster?
> >> > >
> >> > > Dave
> >> > >
> >> > > -----Original Message-----
> >> > > From: Michael Segel [mailto:michael_segel@hotmail.com]
> >> > > Sent: Friday, November 05, 2010 9:49 AM
> >> > > To: user@hbase.apache.org
> >> > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > >
> >> > > I don't think using the buffered client is going to help a lot with
> >> > > performance.
> >> > >
> >> > > I'm a little confused, because it doesn't sound like Shuja is using
> >> > > a map/reduce job to parse the file. That is... he says he parses
> >> > > the file into a DOM tree. Usually your map job parses each record,
> >> > > and then in the mapper you parse out the record. Within the m/r job
> >> > > we don't parse out the fields in the records, because we do
> >> > > additional processing which 'dedupes' the data, so we don't have to
> >> > > further process the data. The second job only has to parse a
> >> > > portion of the original records.
> >> > >
> >> > > So assuming that Shuja is actually using a map reduce job, and each
> >> > > xml record is being parsed within the mapper(), there are a couple
> >> > > of things...
> >> > > 1) Reduce the number of column families that you are using. (Each
> >> > > column family is written to a separate file.)
> >> > > 2) Set up the HTable instance in Mapper.setup().
> >> > > 3) Switch to a different DOM class (not all java classes are equal)
> >> > > or switch to Stax.
> >> > >
> >> > > > From: buttler1@llnl.gov
> >> > > > To: user@hbase.apache.org
> >> > > > Date: Fri, 5 Nov 2010 08:28:07 -0700
> >> > > > Subject: RE: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Have you tried turning off auto flush, and managing the flush in
> >> > > > your own code (say every 1000 puts)?
> >> > > > Dave
> >> > > >
> >> > > > -----Original Message-----
> >> > > > From: Shuja Rehman [mailto:shujamughal@gmail.com]
> >> > > > Sent: Friday, November 05, 2010 8:04 AM
> >> > > > To: user@hbase.apache.org
> >> > > > Subject: Re: Best Way to Insert data into Hbase using Map Reduce
> >> > > >
> >> > > > Michael
> >> > > >
> >> > > > hum.... so u are storing the xml record in hbase and parsing it
> >> > > > in a second job, but in my case I am parsing it also in the first
> >> > > > phase. What I do: I get the xml file, I parse it using jdom, and
> >> > > > then put the data in hbase. So parsing+putting are both in 1
> >> > > > phase, in the mapper code.
> >> > > >
> >> > > > My actual problem is that after parsing the file, I need to use
> >> > > > the put statement millions of times, and I think for each
> >> > > > statement it connects to hbase and then inserts it, and this
> >> > > > might be the reason for the slow processing. So I am trying to
> >> > > > figure out some way where I can first buffer the data and then
> >> > > > insert it in batch fashion. It means in one put statement I can
> >> > > > insert many records, and I think if I do it this way then the
> >> > > > process will be very fast.
> >> > > >
> >> > > > Secondly, what does this mean? "we write the raw record in via a
> >> > > > single put() so the map() method is a null writable."
> >> > > >
> >> > > > Can u explain it more?
> >> > > >
> >> > > > Thanks
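
A minimal sketch of that batching idea (collect the Puts in a list, hand them
to the client every 1000 records, with auto-flush off) might look like the
following; the table and column names, the row keys and the loop are
placeholders standing in for the real parse:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchedLoader {
        private static final int FLUSH_EVERY = 1000;    // flush threshold, tune as needed

        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "mytable"); // placeholder name
            table.setAutoFlush(false);                  // don't send each Put individually

            List<Put> batch = new ArrayList<Put>();
            for (int i = 0; i < 100000; i++) {          // stand-in for the real parse loop
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value-" + i));
                batch.add(put);

                if (batch.size() >= FLUSH_EVERY) {      // hand 1000 puts to the client at once
                    table.put(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                table.put(batch);                       // whatever is left over
            }
            table.flushCommits();                       // push anything still in the write buffer
            table.close();
        }
    }
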
> >> > > > On Fri, Nov 5, 2010 at 5:05 PM, Michael Segel
> >> > > > <michael_segel@hotmail.com> wrote:
> >> > > >
> >> > > > > Suja,
> >> > > > >
> >> > > > > Just did a quick glance.
> >> > > > >
> >> > > > > What is it that you want to do exactly?
> >> > > > >
> >> > > > > Here's how we do it... (at a high level.)
> >> > > > >
> >> > > > > Input is an XML file where we want to store the raw XML records
> >> > > > > in hbase, one record per row.
> >> > > > >
> >> > > > > Instead of using the output of the map() method, we write the
> >> > > > > raw record in via a single put(), so the map() method is a null
> >> > > > > writable.
> >> > > > >
> >> > > > > It's pretty fast. However, fast is relative.
> >> > > > >
> >> > > > > Another thing... we store the xml record as a string (converted
> >> > > > > to bytes) rather than as a serialized object.
> >> > > > >
> >> > > > > Then you can break it down into individual fields in a second
> >> > > > > batch job. (You can start with a DOM parser, and later move to
> >> > > > > a Stax parser. Depending on which DOM parser you have and the
> >> > > > > size of the record, it should be 'fast enough'. A good
> >> > > > > implementation of Stax tends to be recursive/re-entrant code,
> >> > > > > which is harder to maintain.)
> >> > > > >
> >> > > > > HTH
> >> > > > >
> >> > > > > -Mike
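
A minimal sketch of the pattern Michael describes: the mapper opens its own
HTable in setup(), writes each raw XML record into HBase with a single Put,
and emits nothing to the job output, so there is no reducer and no
TableOutputFormat. The table and family names and the row key below are
placeholders only:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class RawXmlMapper
            extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

        private HTable table;

        @Override
        protected void setup(Context context) throws IOException {
            // One HTable per mapper, created once, with client-side buffering on.
            table = new HTable(new HBaseConfiguration(), "rawxml");  // placeholder table name
            table.setAutoFlush(false);
            table.setWriteBufferSize(1024L * 1024 * 12);
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException {
            // Row key choice is up to you; the file offset is only an illustration.
            Put put = new Put(Bytes.toBytes("record-" + key.get()));
            put.add(Bytes.toBytes("raw"), Bytes.toBytes("xml"),
                    Bytes.toBytes(value.toString()));
            table.put(put);
            // Nothing is written to the job output -- the map output is effectively null.
        }

        @Override
        protected void cleanup(Context context) throws IOException {
            table.flushCommits();   // push whatever is still in the write buffer
            table.close();
        }
    }
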
> >> > > > >
> >> > > > > > Date: Fri, 5 Nov 2010 16:13:02 +0500
> >> > > > > > Subject: Best Way to Insert data into Hbase using Map Reduce
> >> > > > > > From: shujamughal@gmail.com
> >> > > > > > To: user@hbase.apache.org
> >> > > > > >
> >> > > > > > Hi
> >> > > > > >
> >> > > > > > I am reading data from raw xml files and inserting the data
> >> > > > > > into hbase using TableOutputFormat in a map reduce job, but
> >> > > > > > due to the heavy put statements, it takes many hours to
> >> > > > > > process the data. Here is my sample code:
> >> > > > > >
> >> > > > > >   conf.set(TableOutputFormat.OUTPUT_TABLE, "mytable");
> >> > > > > >   conf.set("xmlinput.start", "");
> >> > > > > >   conf.set("xmlinput.end", "");
> >> > > > > >   conf.set("io.serializations",
> >> > > > > >       "org.apache.hadoop.io.serializer.JavaSerialization,"
> >> > > > > >       + "org.apache.hadoop.io.serializer.WritableSerialization");
> >> > > > > >
> >> > > > > >   Job job = new Job(conf, "Populate Table with Data");
> >> > > > > >
> >> > > > > >   FileInputFormat.setInputPaths(job, input);
> >> > > > > >   job.setJarByClass(ParserDriver.class);
> >> > > > > >   job.setMapperClass(MyParserMapper.class);
> >> > > > > >   job.setNumReduceTasks(0);
> >> > > > > >   job.setInputFormatClass(XmlInputFormat.class);
> >> > > > > >   job.setOutputFormatClass(TableOutputFormat.class);
> >> > > > > >
> >> > > > > > *and mapper code*
> >> > > > > >
> >> > > > > >   public class MyParserMapper extends
> >> > > > > >       Mapper<LongWritable, Text, NullWritable, Put> {
> >> > > > > >
> >> > > > > >     @Override
> >> > > > > >     public void map(LongWritable key, Text value1, Context context)
> >> > > > > >         throws IOException, InterruptedException {
> >> > > > > >       // doing some processing
> >> > > > > >       while (rItr.hasNext()) {
> >> > > > > >         // and this put statement runs 132,622,560 times to
> >> > > > > >         // insert the data
> >> > > > > >         context.write(NullWritable.get(),
> >> > > > > >             new Put(rowId).add(Bytes.toBytes("CounterValues"),
> >> > > > > >                 Bytes.toBytes(counter.toString()),
> >> > > > > >                 Bytes.toBytes(rElement.getTextTrim())));
> >> > > > > >       }
> >> > > > > >     }
> >> > > > > >   }
> >> > > > > >
> >> > > > > > Is there any other way of doing this task so I can improve
> >> > > > > > the performance?
> >> > > > > >
> >> > > > > > --
> >> > > > > > Regards
> >> > > > > > Shuja-ur-Rehman Baig