Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4C00911D70 for ; Sat, 24 May 2014 01:15:30 +0000 (UTC) Received: (qmail 8345 invoked by uid 500); 24 May 2014 01:15:28 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 8278 invoked by uid 500); 24 May 2014 01:15:28 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 8270 invoked by uid 99); 24 May 2014 01:15:28 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 24 May 2014 01:15:28 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of jtaylor@salesforce.com designates 209.85.216.170 as permitted sender) Received: from [209.85.216.170] (HELO mail-qc0-f170.google.com) (209.85.216.170) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 24 May 2014 01:15:24 +0000 Received: by mail-qc0-f170.google.com with SMTP id i8so9437351qcq.29 for ; Fri, 23 May 2014 18:15:03 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=HAejIte4ikeGx+lv4BNd7A5vliMgxeeq94wbjttCeIo=; b=HV+N+hTj/ZSzW+V4N7YmZqiFaFpLjjbT9Gp0FJ6QEN5WbbpVI2B2ECG4MGpF/lRiXe 9GboTP1OqlahDcYPraLgXn+A/JPeU7TuqE+/THgmwS7z3nekArfjXteXYy6NuT7ldKHH j1PxPT/mTue6Qcv8Sz+e0BE89gwRmIyO0wg2YVrVMdBM2+thx/GfIGNJXcwO6TmPpLVV el8te4Dzf4fWUXQg7qVL62nwgrhaQXZRG+1HJjCWAeJ+rYFeSTCT0z7xjS7G65Wgqfh6 bqwuFDjuE8uuzkmSziTIpVScPhwn4AeLbZG8A6/kSMQKSNKkkccnq1lXeJ4C/M6GK8DN hOUg== X-Gm-Message-State: ALoCoQliWKbn2t7YBe+n622QLlmsK0dKiw5DAX0DvIFLd2dl0DxqDpcAiqB7u02hZ2kFVFuGbOLt MIME-Version: 1.0 X-Received: by 10.224.43.148 with SMTP id w20mr12373968qae.26.1400894103645; Fri, 23 May 2014 18:15:03 -0700 (PDT) Received: by 10.96.125.199 with HTTP; Fri, 23 May 2014 18:15:03 -0700 (PDT) In-Reply-To: References: Date: Fri, 23 May 2014 18:15:03 -0700 Message-ID: Subject: Re: Copy some records from Huge hbase table to another table From: James Taylor To: "user@hbase.apache.org" Content-Type: multipart/alternative; boundary=047d7bdc82206a892804fa1b139c X-Virus-Checked: Checked by ClamAV on apache.org --047d7bdc82206a892804fa1b139c Content-Type: text/plain; charset=UTF-8 Hi Riyaz, You can do this with a single SQL command using Apache Phoenix, a SQL engine on top of HBase, and you'll get better performance than if you hand coded it using the HBase client APIs. Depending on your current schema, you may be able to run this command with no change to your data. Let's assume you have an MD5 hash of the website and the date/time in the row key with your website and counts in key values. That schema could be modeled like this in Phoenix: CREATE VIEW WEBSITE_STATS ( WEBSITE_MD5 BINARY(16) NOT NULL, DATE_COLLECTED UNSIGNED_DATE NOT NULL, WEBSITE_URL VARCHAR, HIT_COUNT UNSIGNED_LONG, CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED)); You could issue this create view statement and map directly to your HBase table. I used the UNSIGNED types above as they match the serialization you get when you use the HBase Bytes utility methods. Phoenix normalizes column names by upper casing them, so if your column qualifiers are lower case, you'd want to put the column names above in double quotes. Next, you'd create a table to hold the top10 information: CREATE TABLE WEBSITE_TOP10 ( WEBSITE_URL VARCHAR PRIMARY KEY, TOTAL_HIT_COUNT BIGINT); Then you'd run an UPSERT SELECT command like this: UPSERT INTO WEBSITE_TOP10 SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS GROUP BY WEBSITE_URL ORDER BY SUM(HIT_COUNT) LIMIT 10; Phoenix will run the SELECT part of this in parallel on the client and use a coprocessor on the server side to aggregate over the WEBSITE_URLs returning the distinct set of urls per region with a final merge sort happening o the client to compute the total sum. Then the client will hold on to the top 10 rows it sees and upsert these into the WEBSITE_TOP10 table. HTH, James On Fri, May 23, 2014 at 5:14 PM, Ted Yu wrote: > Would the new HBase table reside in the same cluster as the original table > ? > > See this recent thread: http://search-hadoop.com/m/DHED4uBNqJ1 > > Cheers > > > On Fri, May 23, 2014 at 2:49 PM, Shaikh Ahmed > wrote: > > > Hi, > > > > We have one huge HBase table with billions of rows. This table holds the > > information about websites and number of hits on that site in every 15 > > minutes. > > > > Every website will have multiple records in data with different number of > > hit count and last updated timestamp. > > > > Now, we want to create another Hbase table which will contain information > > about only those TOP 10 websites which are having more number of hits. > > > > We are seeking help from experts to achieve this requirement. > > How we can filter top 10 websites based on hits count from billions of > > records and copy it into our new HBase table? > > > > I will greatly appreciate kind support from group members. > > > > Thanks in advance. > > Regards, > > Riyaz > > > --047d7bdc82206a892804fa1b139c--