Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8CFE8EE7B for ; Mon, 28 Jan 2013 21:54:51 +0000 (UTC) Received: (qmail 11484 invoked by uid 500); 28 Jan 2013 21:54:49 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 11415 invoked by uid 500); 28 Jan 2013 21:54:49 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 11403 invoked by uid 99); 28 Jan 2013 21:54:49 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jan 2013 21:54:49 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of asaf.mesika@gmail.com designates 209.85.214.180 as permitted sender) Received: from [209.85.214.180] (HELO mail-ob0-f180.google.com) (209.85.214.180) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 28 Jan 2013 21:54:43 +0000 Received: by mail-ob0-f180.google.com with SMTP id ef5so3209077obb.25 for ; Mon, 28 Jan 2013 13:54:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:x-received:in-reply-to:references:date:message-id :subject:from:to:content-type; bh=X6lNEXQhZOiSFVjXSLy+r38JmsoUP322RAMIojsGRGo=; b=0xNOxhSHr5BZj0n+Qqkfu/kcOWBIFGY6Cd9/G3V3R944+3kPdBrkebeSXg768DK+gz rzlao7zMQX8hFAtCA9DgY1Scgdi92+pw2qsWLmEZG3N20MTbHJMKFM12jPYH+naShGPj 8V3Y7OKkQfBhbEW3B5qS4x+4AjossjLbExTB0aewh63UnQaQL9TbAvtgK02yBm7etx3U E37AQg0I8BD/LS+GWPJX9vNsa0zELcIpW/AblPUtmcYdwoN0wfXNslvUkZtHYKPS3yeB VtFJ2VwOJbC73ryKcnIEdxqb+nsoeawnQeezsLbjIodXusuzGaFRE+NrIy9LmPYWrJ1U /Rjg== MIME-Version: 1.0 X-Received: by 10.60.172.229 with SMTP id bf5mr12372823oec.81.1359410062499; Mon, 28 Jan 2013 13:54:22 -0800 (PST) Received: by 10.60.22.194 with HTTP; Mon, 28 Jan 2013 13:54:22 -0800 (PST) In-Reply-To: References: Date: Mon, 28 Jan 2013 23:54:22 +0200 Message-ID: Subject: Re: Tables vs CFs vs Cs From: Asaf Mesika To: "user@hbase.apache.org" Content-Type: multipart/alternative; boundary=bcaec54ee86ce11f5404d4605108 X-Virus-Checked: Checked by ClamAV on apache.org --bcaec54ee86ce11f5404d4605108 Content-Type: text/plain; charset=UTF-8 I would go on using the row-key, on one table. = Row Key Structure = group-depth: 1..4, encoded as 1 byte A-D group; encoded as 1 byte and not as string Examples: <1><192> <2><192><168> <3><192><168><1> <4><192><168><1><10> Column Qualifier: "c" - stands for counters Column Qualifier: "t" - stands for total When you get a request for 192.168.1.10, you need to increase 4 rows, so build 4 Increment objects ands send them to HBase using HTable.batch. Each Increment object will increase the "t" column. When you scan, simply scan for the range based on the group. For example, all 192.168 group can get by fetch rows with prefix of <2><192><168> (each numbers is a byte in the byte array you compose as prefix). You'll get back at most 255 rows. In IPv4 you can have , on a popular site, 6-7 million unique IPs in 10 minutes of traffic. You can enhance it by having a column qualifier for each hour, by converting the epoch of that hour (long) into a byte array, on top of having that all-hours total counter. This way you can filter the traffic by range of dates/hours. On Sun, Jan 27, 2013 at 6:51 PM, Jean-Marc Spaggiari < jean-marc@spaggiari.org> wrote: > Hi, > > Let's imagine this scenario. > > I want to store IPs with counters. And I want to have counters by > groups of IPs. All of that will be calculated with MR jobs and stored > in HBase. > > Let's take some IPs and make sure they are ordered by adding some "0" > when required. > > 037.113.031.119 > 058.022.018.176 > 058.022.159.151 > 109.169.201.076 > 109.169.201.150 > 109.254.019.140 > 122.031.039.016 > 122.224.005.210 > 178.137.167.041 > > I want to have counters for all "levels" of those IPs. Which mean for > those groups. > > Group 1: > 037 > 058 > 109 > 122 > 178 > > Group 2: > > 037.113 > 058.022 > 109.169 > 109.254 > 122.031 > 122.224 > 178.167 > > Group 3: > > 037.113.031 > 058.022.018 > 058.022.159 > 109.169.201 > 109.254.019 > 122.031.039 > 122.224.005 > 178.137.167 > > And group 4 is the complete IPs list. > > Each time I see an IP, I will increment the required values into the 4 > groups. > > What's the bests way to store that knowing that I want to be able to > easily list all the entries (ranged based) from one group. > > Option 1 is to have one table per group. 1CF, 1C > Pros: Very easy to access, retrieve, etc. > Cons: Will generate 4 tables > > Option 2 is to have one table, but 1 CF per group. > Pros: Only one table, easy access. > Cons: Heard that we should try to keep CFs under 3. Might have bad > performances impacts. > > Option 3 is to have one table, one CF and one C per group. > Pros: Only one table, only one CF. > Cons: Access is less easy than option 1 and 2. > > I think Option 2 is the worst one. Option 1 is very easy to implement. > And for option 3, I don't see any benefit compared to option 1. > > So I'm tempted to go with option 1, but I don't like the idea of > multiplying the table. > > Does anyone have any comment on which options might be the best one, > or even proposed another option? > > JM > --bcaec54ee86ce11f5404d4605108--