Date: Thu, 10 May 2012 00:22:46 -0700
Subject: Re: HBase Performance Improvements?
From: Something Something <mailinglists19@gmail.com>
To: user@hbase.apache.org

Thank you Tim & Bryan for the responses. Sorry for the delayed response. Got busy
with other things.

Bryan - I decided to focus on the region split problem first. The challenge here is
to find the correct start key for each region, right? Here are the steps I could
think of:

1) Sort the keys.
2) Count how many keys there are & divide by the # of regions we want to create
   (e.g. 300). This gives us the # of keys in a region (the region size).
3) Loop thru the sorted keys & every time the region size is reached, write down the
   region # & starting key. This info can later be used to create the table. (See
   the rough sketch below.)

Honestly, I am not sure what you mean by "hadoop does this automatically". If you
used a single reducer, did you use secondary sort (setOutputValueGroupingComparator)
to sort the keys? Did you loop thru the *values* to find regions? Would appreciate
it if you would describe this MR job. Thanks.
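To make steps 2 & 3 concrete, this is roughly what I have in mind (a completely
untested sketch; it pretends the sorted keys fit in an in-memory List, whereas in
reality this logic would live in the reducer of an MR job):

import java.util.ArrayList;
import java.util.List;

public class SplitKeyCalculator {

    /**
     * Steps 2 & 3: given the row keys in sorted order, return the start key of
     * every region except the first. N regions only need N-1 split keys; the
     * first region implicitly starts at the empty key. Assumes there are far
     * more keys than regions.
     */
    public static byte[][] calculateSplitKeys(List<byte[]> sortedKeys, int numRegions) {
        int keysPerRegion = sortedKeys.size() / numRegions;   // step 2: region size
        List<byte[]> splitKeys = new ArrayList<byte[]>();
        for (int region = 1; region < numRegions; region++) { // step 3
            splitKeys.add(sortedKeys.get(region * keysPerRegion));
        }
        return splitKeys.toArray(new byte[splitKeys.size()][]);
    }
}

If I'm not mistaken, the resulting array can then be passed to
HBaseAdmin.createTable(tableDescriptor, splitKeys) to pre-create the table with
those regions. Is that roughly what your job does?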
On Wed, May 9, 2012 at 8:25 AM, Bryan Beaudreault wrote:

> I also recently had this problem, trying to index 6+ billion records into
> HBase. The job would take about 4 hours before it brought down the entire
> cluster, at only around 60% complete.
>
> After trying a bunch of things, we went to bulk loading. This is actually
> pretty easy, though the hardest part is that you need to have a table ready
> with the region splits you are going to use. Region splits aside, there
> are 2 steps:
>
> 1) Change your job so that instead of executing your Puts, it just outputs
> them using context.write. Put is Writable. (We used ImmutableBytesWritable
> as the Key, representing the rowKey.)
> 2) Add another job that reads that input and configure it
> using HFileOutputFormat.configureIncrementalLoad(Job job, HTable table);
> This will add the right reducer.
>
> Once those two have run, you can finalize the process using the
> completebulkload tool documented at
> http://hbase.apache.org/bulk-loads.html
>
> For the region splits problem, we created another job which sorted all of
> the puts by the key (hadoop does this automatically) and had a single
> reducer. It stepped through all of the Puts, adding up the total size
> until it reached some threshold. When it did, it recorded the byte array
> and used that for the start of the next region. We used the result of this
> job to create a new table. There is probably a better way to do this, but
> it takes like 20 minutes to write.
>
> This whole process took less than an hour, with the bulk load part only
> taking 15 minutes. Much better!
>
> On Wed, May 9, 2012 at 11:08 AM, Something Something <
> mailinglists19@gmail.com> wrote:
>
> > Hey Oliver,
> >
> > Thanks a "billion" for the response -:) I will take any code you can
> > provide even if it's a hack! I will even send you an Amazon gift card -
> > not that you care or need it -:)
> >
> > Can you share some performance statistics? Thanks again.
> >
> >
> > On Wed, May 9, 2012 at 8:02 AM, Oliver Meyn (GBIF) wrote:
> >
> > > Heya Something,
> > >
> > > I had a similar task recently and by far the best way to go about this
> > > is with bulk loading after pre-splitting your target table. As you know
> > > ImportTsv doesn't understand Avro files so I hacked together my own
> > > ImportAvro class to create the HFiles that I eventually moved into
> > > HBase with completebulkload. I haven't committed my class anywhere
> > > because it's a pretty ugly hack, but I'm happy to share it with you as
> > > a starting point. Doing billions of puts will just drive you crazy.
> > >
> > > Cheers,
> > > Oliver
> > >
> > > On 2012-05-09, at 4:51 PM, Something Something wrote:
> > >
> > > > I ran the following MR job that reads AVRO files & puts them into
> > > > HBase. The files have tons of data (billions of records). We have a
> > > > fairly decent size cluster. When I ran this MR job, it brought down
> > > > HBase. When I commented out the Puts on HBase, the job completed in
> > > > 45 seconds (yes, that's seconds).
> > > >
> > > > Obviously, my HBase configuration is not ideal. I am using all the
> > > > default HBase configurations that come out of Cloudera's
> > > > distribution: 0.90.4+49.
> > > > I am planning to read up on the following two:
> > > >
> > > > http://hbase.apache.org/book/important_configurations.html
> > > > http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/
> > > >
> > > > But can someone quickly take a look and recommend a list of
> > > > priorities, such as "try this first..."? That would be greatly
> > > > appreciated. As always, thanks for the time.
> > > >
> > > > Here's the Mapper. (There's no reducer):
> > > >
> > > > public class AvroProfileMapper extends AvroMapper<GenericData.Record, NullWritable> {
> > > >     private static final Logger logger =
> > > >             LoggerFactory.getLogger(AvroProfileMapper.class);
> > > >
> > > >     final private String SEPARATOR = "*";
> > > >
> > > >     private HTable table;
> > > >
> > > >     private String datasetDate;
> > > >     private String tableName;
> > > >
> > > >     @Override
> > > >     public void configure(JobConf jobConf) {
> > > >         super.configure(jobConf);
> > > >         datasetDate = jobConf.get("datasetDate");
> > > >         tableName = jobConf.get("tableName");
> > > >
> > > >         // Open table for writing
> > > >         try {
> > > >             table = new HTable(jobConf, tableName);
> > > >             table.setAutoFlush(false);
> > > >             table.setWriteBufferSize(1024 * 1024 * 12);
> > > >         } catch (IOException e) {
> > > >             throw new RuntimeException("Failed table construction", e);
> > > >         }
> > > >     }
> > > >
> > > >     @Override
> > > >     public void map(GenericData.Record record, AvroCollector<NullWritable> collector,
> > > >                     Reporter reporter) throws IOException {
> > > >
> > > >         String u1 = record.get("u1").toString();
> > > >
> > > >         GenericData.Array<GenericData.Record> fields =
> > > >                 (GenericData.Array<GenericData.Record>) record.get("bag");
> > > >         for (GenericData.Record rec : fields) {
> > > >             Integer s1 = (Integer) rec.get("s1");
> > > >             Integer n1 = (Integer) rec.get("n1");
> > > >             Integer c1 = (Integer) rec.get("c1");
> > > >             Integer freq = (Integer) rec.get("freq");
> > > >             if (freq == null) {
> > > >                 freq = 0;
> > > >             }
> > > >
> > > >             String key = u1 + SEPARATOR + n1 + SEPARATOR + c1 + SEPARATOR + s1;
> > > >             Put put = new Put(Bytes.toBytes(key));
> > > >             put.setWriteToWAL(false);
> > > >             put.add(Bytes.toBytes("info"), Bytes.toBytes("frequency"),
> > > >                     Bytes.toBytes(freq.toString()));
> > > >             try {
> > > >                 table.put(put);
> > > >             } catch (IOException e) {
> > > >                 throw new RuntimeException("Error while writing to " +
> > > >                         table + " table.", e);
> > > >             }
> > > >         }
> > > >         logger.error("------------ Finished processing user: " + u1);
> > > >     }
> > > >
> > > >     @Override
> > > >     public void close() throws IOException {
> > > >         table.close();
> > > >     }
> > > > }
> > >
> > > --
> > > Oliver Meyn
> > > Software Developer
> > > Global Biodiversity Information Facility (GBIF)
> > > +45 35 32 15 12
> > > http://www.gbif.org
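P.S. Bryan - just to confirm I understand your steps 1 & 2: after changing my mapper
to context.write the Puts instead of calling table.put, is the second job roughly
like the sketch below? (Untested; I'm assuming the first job wrote the Puts out as a
SequenceFile of <ImmutableBytesWritable, Put>, and "profile_table" is just a
placeholder name for the pre-split target table.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HFileGeneratorDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "generate-hfiles");
        job.setJarByClass(HFileGeneratorDriver.class);

        // Input: the <ImmutableBytesWritable, Put> pairs written by the first job.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // The base Mapper just passes the key/value pairs through unchanged.
        job.setMapperClass(Mapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);

        // Adds the right reducer, partitioner and output format based on the
        // pre-split table's region boundaries.
        HTable table = new HTable(conf, "profile_table");
        HFileOutputFormat.configureIncrementalLoad(job, table);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

And then I'd run the completebulkload tool against the output directory and the
table, per the bulk-loads page you linked. Does that match what you did?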