Subject: Re: Help with row and column design
From: Software Dev <static.void.dev@gmail.com>
To: user@hbase.apache.org
Date: Wed, 30 Apr 2014 10:28:08 -0700

I did not know of the FuzzyRowFilter... that looks like it may be my best
bet. Anyone know what Sematext's HBaseWD uses to perform efficient scanning?

On Tue, Apr 29, 2014 at 11:31 PM, Liam Slusser wrote:
> I would recommend pre-splitting the tables and then hashing your key and
> putting that in the front, i.e.
>
> [hash(20140429:Country:US)][2014042901:Country:US]  # notice you're not
> hashing the sequence number
>
> Some pseudo Python code:
>
> >>> import hashlib
> >>> key = "2014042901:Country:US"
> >>> ckey = "20140429:Country:US"
> >>> hbase_key = "%s%s" % (hashlib.md5(ckey).hexdigest()[:5], key)
> >>> hbase_key
> '887d82014042901:Country:US'
>
> Now when you want to find something, you can just create the hash ('887d8')
> and use FuzzyRowFilter to find it!
>
> cheers,
> liam
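To make sure I follow the scan side of that suggestion, here is a rough, untested
sketch against the Java client. The "metrics" table name is my own placeholder, and
the salt/hour positions are the ones from Liam's example above. Per the
FuzzyRowFilter docs, a 0 in the mask means the byte must match and a 1 means it can
be anything:

import java.util.Collections;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

public class FuzzyScanSketch {
  public static void main(String[] args) throws Exception {
    // Row layout from the example above: 5 hex chars of md5(day key) followed
    // by "2014042901:Country:US". Template: salt (fuzzy) + day (fixed) +
    // hour (fuzzy) + ":Country:US" (fixed).
    byte[] template = Bytes.toBytes("?????" + "20140429" + "??" + ":Country:US");

    // Mask: 0 = byte must match, 1 = byte can be anything.
    byte[] mask = new byte[template.length];
    for (int i = 0; i < 5; i++) mask[i] = 1;  // unknown/ignored salt prefix
    mask[13] = 1;                             // the two hour digits
    mask[14] = 1;

    Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Collections.singletonList(new Pair<byte[], byte[]>(template, mask))));

    HTable table = new HTable(HBaseConfiguration.create(), "metrics");
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}

Since the prefix here is just md5 of the day key, you could also compute it and pin
those bytes instead of leaving them fuzzy; a multi-day range would just be one
(template, mask) Pair per day in that list.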
> On Tue, Apr 29, 2014 at 8:08 PM, Software Dev wrote:
>
>> Any improvements in the row key design?
>>
>> If I always know we will be querying by country, could/should I prefix
>> the row key with the country to help with hotspotting?
>>
>> FR/2014042901
>> FR/2014042902
>> ....
>> US/2014042901
>> US/2014042902
>> ...
>>
>> Is this preferred over adding it in a column... i.e. 2014042901:Country:US
>>
>> On Tue, Apr 29, 2014 at 8:05 PM, Software Dev wrote:
>> > Ok, didn't know if the sheer number of gets would be a limiting factor.
>> > Thanks
>> >
>> > On Tue, Apr 29, 2014 at 7:57 PM, Ted Yu wrote:
>> >> As I said this afternoon:
>> >> See the following API in HTable for batching Get's:
>> >>
>> >> public Result[] get(List<Get> gets) throws IOException {
>> >>
>> >> Cheers
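Re Ted's pointer above, noting down a rough, untested sketch of what a batch of
daily gets plus client-side summing might look like, following Carlos's pseudocode
further down. The "metrics" table, family "c", qualifier "usa" and the little
md5-salt helper are all placeholders of mine:

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedDailyGets {
  // Placeholder salt: first 5 hex chars of md5(day key), like the Python example.
  static String salt(String dayKey) throws Exception {
    byte[] digest = java.security.MessageDigest.getInstance("MD5")
        .digest(dayKey.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
    return hex.substring(0, 5);
  }

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "metrics");
    byte[] family = Bytes.toBytes("c");
    byte[] qualifier = Bytes.toBytes("usa");

    // One Get per day for ~6 months; hourly rows would simply mean more Gets.
    SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
    Calendar day = Calendar.getInstance();
    List<Get> gets = new ArrayList<Get>();
    for (int i = 0; i < 180; i++) {
      String dayKey = fmt.format(day.getTime());
      Get get = new Get(Bytes.toBytes(salt(dayKey) + dayKey));
      get.addColumn(family, qualifier);
      gets.add(get);
      day.add(Calendar.DAY_OF_MONTH, -1);
    }

    // Single batched call; sum the counter values client-side.
    long totalUsa = 0;
    for (Result r : table.get(gets)) {
      byte[] value = r.getValue(family, qualifier);
      if (value != null) {
        totalUsa += Bytes.toLong(value);  // increment counters are 8-byte longs
      }
    }
    System.out.println("total usa, last 180 days: " + totalUsa);
    table.close();
  }
}

As far as I understand it, the client groups a Get list like this by region server
under the hood, so even a few thousand gets should turn into a handful of multi
calls rather than one RPC each, but that is exactly the part I'd like someone to
confirm.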
>> >> On Tue, Apr 29, 2014 at 7:45 PM, Software Dev <static.void.dev@gmail.com> wrote:
>> >>> Nothing against your code. I just meant that if we are doing a scan
>> >>> say for hourly metrics across a 6 month period we are talking about
>> >>> 4K+ gets. Is that something that can easily be handled?
>> >>>
>> >>> On Tue, Apr 29, 2014 at 5:08 PM, Rendon, Carlos (KBB) wrote:
>> >>> >> Gets a bit hairy when doing say a shitload of gets though... no?
>> >>> >
>> >>> > If by "hairy" you mean the code is ugly, it was written for maximal
>> >>> > clarity. I think you'll find a few sensible loops make it fairly
>> >>> > clean. Otherwise I'm not sure what you mean.
>> >>> >
>> >>> > -----Original Message-----
>> >>> > From: Software Dev [mailto:static.void.dev@gmail.com]
>> >>> > Sent: Tuesday, April 29, 2014 5:02 PM
>> >>> > To: user@hbase.apache.org
>> >>> > Subject: Re: Help with row and column design
>> >>> >
>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>> >>> >> to pre-store every level of aggregation you care about.
>> >>> >
>> >>> > Ok, I think this makes sense. Gets a bit hairy when doing say a
>> >>> > shitload of gets though... no?
>> >>> >
>> >>> > On Tue, Apr 29, 2014 at 4:43 PM, Rendon, Carlos (KBB) <CRendon@kbb.com> wrote:
>> >>> >> You don't do a scan, you do a series of gets, which I believe you
>> >>> >> can batch into one call.
>> >>> >>
>> >>> >> Last 5 days query in pseudocode:
>> >>> >> res1 = Get( hash("2014-04-29") + "2014-04-29")
>> >>> >> res2 = Get( hash("2014-04-28") + "2014-04-28")
>> >>> >> res3 = Get( hash("2014-04-27") + "2014-04-27")
>> >>> >> res4 = Get( hash("2014-04-26") + "2014-04-26")
>> >>> >> res5 = Get( hash("2014-04-25") + "2014-04-25")
>> >>> >>
>> >>> >> For each result you look for the particular column or columns you
>> >>> >> are interested in:
>> >>> >> Total_usa = res1.get("c:usa") + res2.get("c:usa") + res3.get("c:usa") + ...
>> >>> >> Total_female_usa = res1.get("c:usa:sex:f") + ...
>> >>> >>
>> >>> >> "What happens when we add more fields? Do we just keep adding in
>> >>> >> more column qualifiers? If so, how would we filter across columns
>> >>> >> to get an aggregate total?"
>> >>> >>
>> >>> >> Yes. See total_usa vs. total_female_usa above. Basically you have
>> >>> >> to pre-store every level of aggregation you care about.
>> >>> >>
>> >>> >> -----Original Message-----
>> >>> >> From: Software Dev [mailto:static.void.dev@gmail.com]
>> >>> >> Sent: Tuesday, April 29, 2014 4:36 PM
>> >>> >> To: user@hbase.apache.org
>> >>> >> Subject: Re: Help with row and column design
>> >>> >>
>> >>> >>> The downside is it still has a hotspot when inserting, but when
>> >>> >>> reading a range of time it does not
>> >>> >>
>> >>> >> How can you do a scan query between dates when you hash the date?
>> >>> >>
>> >>> >>> Column qualifiers are just the collection of items you are
>> >>> >>> aggregating on. Values are increments. In your case qualifiers
>> >>> >>> might look like c:usa, c:usa:sex:m, c:usa:sex:f, c:italy:sex:m,
>> >>> >>> c:italy:sex:f, c:italy,
>> >>> >>
>> >>> >> What happens when we add more fields? Do we just keep adding in more
>> >>> >> column qualifiers? If so, how would we filter across columns to get
>> >>> >> an aggregate total?
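And one last note-to-self on the write path described at the bottom, where every
aggregation level is pre-stored as its own qualifier and bumped with increments.
Untested sketch, same placeholder table/family/qualifier names as above (I'm
reading the "c:" in Carlos's examples as the column family):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.util.Bytes;

public class IncrementSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "metrics");
    byte[] family = Bytes.toBytes("c");

    // One event on 2014-04-29 with country=US, sex=f: bump every rollup we
    // care about, in one atomic Increment on that day's row.
    byte[] row = Bytes.toBytes(/* salt + */ "2014-04-29");
    Increment inc = new Increment(row);
    inc.addColumn(family, Bytes.toBytes("total"), 1L);      // overall
    inc.addColumn(family, Bytes.toBytes("usa"), 1L);        // per country
    inc.addColumn(family, Bytes.toBytes("usa:sex:f"), 1L);  // country x sex
    table.increment(inc);

    table.close();
  }
}

If I'm reading Carlos right, getting an "aggregate total" then isn't a read-time
filter across columns at all: the coarser total is just another qualifier (c:usa,
or an overall counter) that you Get directly.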