Subject: Re: HBase load distribution vs. scan efficiency
From: James Taylor <jtaylor@salesforce.com>
To: user@hbase.apache.org
Date: Mon, 20 Jan 2014 17:15:54 -0800

Hi William,
Phoenix uses this "bucket mod" solution as well
(http://phoenix.incubator.apache.org/salted.html). For the scan, you have to
run it in every possible bucket. You can still do a range scan; you just have
to prepend the bucket number to the start/stop key of each scan you do, and
then merge-sort the results. Phoenix does all of this transparently for you.
Thanks,
James
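A minimal sketch of the scan pattern James describes, assuming the HBase
client API of that era, a single leading salt byte, and a hypothetical
BUCKETS constant (Phoenix's actual implementation differs in its details):
one range scan per bucket with the salt prepended to the start/stop keys,
and a k-way merge on the unsalted key so rows come back in global order.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedScanSketch {

    private static final int BUCKETS = 8; // hypothetical bucket count

    // One open scanner per bucket, plus the row currently at its head.
    private static class Head {
        final ResultScanner scanner;
        Result current;
        Head(ResultScanner scanner, Result current) {
            this.scanner = scanner;
            this.current = current;
        }
    }

    /**
     * Range-scans [startKey, stopKey) in every salt bucket and merge-sorts
     * the per-bucket streams on the unsalted key. Row keys are assumed to
     * be laid out as [1 salt byte][original key].
     */
    public static List<Result> saltedRangeScan(HTable table,
            byte[] startKey, byte[] stopKey) throws IOException {
        // Order heads by the row key with the leading salt byte stripped.
        PriorityQueue<Head> heads = new PriorityQueue<Head>(BUCKETS,
            new Comparator<Head>() {
                @Override
                public int compare(Head a, Head b) {
                    byte[] ra = a.current.getRow();
                    byte[] rb = b.current.getRow();
                    return Bytes.compareTo(ra, 1, ra.length - 1,
                                           rb, 1, rb.length - 1);
                }
            });
        // One scan per bucket: prepend the salt byte to start/stop keys.
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
            byte[] salt = new byte[] { (byte) bucket };
            Scan scan = new Scan(Bytes.add(salt, startKey),
                                 Bytes.add(salt, stopKey));
            ResultScanner scanner = table.getScanner(scan);
            Result first = scanner.next();
            if (first != null) {
                heads.add(new Head(scanner, first));
            } else {
                scanner.close();
            }
        }
        // K-way merge: repeatedly emit the smallest head row.
        List<Result> merged = new ArrayList<Result>();
        while (!heads.isEmpty()) {
            Head smallest = heads.poll();
            merged.add(smallest.current);
            smallest.current = smallest.scanner.next();
            if (smallest.current != null) {
                heads.add(smallest);
            } else {
                smallest.scanner.close();
            }
        }
        return merged;
    }
}

Opening the per-bucket scanners in parallel, one thread per bucket, can hide
much of the extra latency of fanning out. A companion sketch of the write
side appears after the quoted thread below.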
On Mon, Jan 20, 2014 at 4:51 PM, William Kang wrote:

> Hi,
> Thank you guys. This is an informative email chain.
>
> I have one follow-up question about using the "bucket mod" solution. Once
> you add the bucket number as a prefix to the key, how do you retrieve the
> rows? Do you have to use a RowFilter? Will there be any performance issue
> with using the row filter, since it seems that would amount to a full
> table scan?
>
> Many thanks.
>
> William
>
> On Mon, Jan 20, 2014 at 5:06 AM, Amit Sela wrote:
>
> > The number of scans depends on the number of regions a day's data uses.
> > You need to manage compaction and splitting manually.
> > If a day's data is 100MB and you want regions to be no more than 200MB,
> > then it's two regions to scan per day; if it's 1GB, then 10, etc.
> > Compression will help you maximize the data per region and, as I've
> > recently learned, if your key occupies most of the bytes in the
> > KeyValue (the key is longer than the family, qualifier, and value),
> > compression can be very efficient; I have a case where 100GB is
> > compressed to 7GB.
> >
> > On Mon, Jan 20, 2014 at 6:56 AM, Vladimir Rodionov wrote:
> >
> > > Ted, how does it differ from row key salting?
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: vrodionov@carrieriq.com
> > >
> > > ________________________________________
> > > From: Ted Yu [yuzhihong@gmail.com]
> > > Sent: Sunday, January 19, 2014 6:53 PM
> > > To: user@hbase.apache.org
> > > Subject: Re: HBase load distribution vs. scan efficiency
> > >
> > > Bill:
> > > See
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > FYI
> > >
> > > On Sun, Jan 19, 2014 at 4:02 PM, Bill Q wrote:
> > >
> > > > Hi Amit,
> > > > Thanks for the reply.
> > > >
> > > > If I understand your suggestion correctly, and assuming we have 100
> > > > region servers, I would have to do 100 scans and merge the reads if
> > > > I want to pull any data for a specific date. Is that correct? Is
> > > > running 100 scans the most efficient way to deal with this issue?
> > > >
> > > > Any thoughts?
> > > >
> > > > Many thanks.
> > > >
> > > > Bill
> > > >
> > > > On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela wrote:
> > > >
> > > > > If you'll use bulk load to insert your data, you could use the
> > > > > date as the key prefix and choose the rest of the key in a way
> > > > > that will split each day evenly. You'll have X regions for every
> > > > > day, hence 14X regions for the two-week window.
> > > > >
> > > > > On Jan 19, 2014 8:39 PM, "Bill Q" wrote:
> > > > >
> > > > > > Hi,
> > > > > > I am designing a schema to host a large volume of data on
> > > > > > HBase. We collect daily trading data for some markets, and we
> > > > > > run a moving-window analysis to make predictions based on a
> > > > > > two-week window.
> > > > > >
> > > > > > Since everybody is going to pull the latest two weeks of data
> > > > > > every day, if we put the date in the lead position of the key,
> > > > > > we will have some hot regions. So we can use a bucketing
> > > > > > approach (date mod bucket number) to deal with this situation.
> > > > > > However, if we have 200 buckets, we need to run 200 scans to
> > > > > > extract all the data for the last two weeks.
> > > > > >
> > > > > > My questions are:
> > > > > > 1. What happens when each scan returns its results? Will the
> > > > > > scan results be sent to a sink-like place that collects and
> > > > > > concatenates all of them?
> > > > > > 2. Why might having 200 scans be a bad thing compared to having
> > > > > > only 10 scans?
> > > > > > 3. Any suggestions for the design?
> > > > > >
> > > > > > Many thanks.
> > > > > >
> > > > > > Bill
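On the write side, here is a matching sketch of the "bucket mod" key
construction Bill and William describe, under the same assumptions: one
salt byte, a BUCKETS constant shared with the scan side, an illustrative
date-plus-symbol key layout, and made-up family/qualifier names. Note that
hashing the full key, rather than the date alone, is what spreads a single
day's writes across all buckets; bucketing by date alone would still send
each whole day to one hot region.

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedWriteSketch {

    // Must match the constant used on the scan side.
    private static final int BUCKETS = 8;

    /** Computes hash(key) % BUCKETS and prepends it as one salt byte. */
    public static byte[] saltKey(byte[] originalKey) {
        int bucket =
            (Arrays.hashCode(originalKey) & Integer.MAX_VALUE) % BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, originalKey);
    }

    /**
     * Writes one trading record under a salted key. The unsalted key is
     * [yyyyMMdd][symbol], a hypothetical layout for this thread's data;
     * hashing the whole key spreads one day's rows across all buckets.
     */
    public static void putRecord(HTable table, String yyyyMMdd,
            String symbol, byte[] value) throws IOException {
        byte[] unsalted = Bytes.add(Bytes.toBytes(yyyyMMdd),
                                    Bytes.toBytes(symbol));
        Put put = new Put(saltKey(unsalted));
        // Hypothetical column family "d" and qualifier "v".
        put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), value);
        table.put(put);
    }
}

Because the salt is a deterministic function of the key, a point read can
recompute the bucket and issue a single Get; only range scans have to fan
out across all buckets, as discussed above.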
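Finally, salting pairs naturally with Amit's point about managing splits
yourself: pre-splitting the table at the salt-byte boundaries gives each
bucket its own region from day one. A rough sketch, again with assumed
names (table "trades", family "d") and the admin API of that era:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PresplitSketch {

    private static final int BUCKETS = 8; // same constant as above

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc =
            new HTableDescriptor(TableName.valueOf("trades"));
        desc.addFamily(new HColumnDescriptor("d"));
        // One split point per salt byte: bucket i starts at row {i}, so
        // every bucket begins life in its own region and writes spread
        // evenly from the first day.
        byte[][] splits = new byte[BUCKETS - 1][];
        for (int i = 1; i < BUCKETS; i++) {
            splits[i - 1] = new byte[] { (byte) i };
        }
        admin.createTable(desc, splits);
        admin.close();
    }
}

With 200 buckets, as in Bill's example, that would mean 199 split points
and at least 200 regions from the start, which is one reason to keep the
bucket count closer to the number of region servers than to an arbitrarily
large number.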