Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of jtaylor@salesforce.com
 designates 209.85.216.170 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CALte62yyJyU+u=CY7-A6OfZrgvX7ce_1uAwMfQ=yxQejkuupnQ@mail.gmail.com>
References: 
 <CAJYC72KkrK40v-29R2hkW+vjf4iGqHFo=4nWE20DC9xifkhGfQ@mail.gmail.com>
	<CALte62yyJyU+u=CY7-A6OfZrgvX7ce_1uAwMfQ=yxQejkuupnQ@mail.gmail.com>
Date: Fri, 23 May 2014 18:15:03 -0700
Message-ID: 
 <CAG_TOPAf+pJ0CgdZAMpa6vUtF_=C4yuu5cuX8F-68834hWtRVA@mail.gmail.com>
Subject: Re: Copy some records from Huge hbase table to another table
From: James Taylor <jtaylor@salesforce.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Content-Type: multipart/alternative; boundary=047d7bdc82206a892804fa1b139c

--047d7bdc82206a892804fa1b139c
Content-Type: text/plain; charset=UTF-8

Hi Riyaz,
You can do this with a single SQL command using Apache Phoenix, a SQL
engine on top of HBase, and you'll get better performance than if you hand
coded it using the HBase client APIs. Depending on your current schema, you
may be able to run this command with no change to your data. Let's assume
you have an MD5 hash of the website and the date/time in the row key with
your website and counts in key values. That schema could be modeled like
this in Phoenix:

CREATE VIEW WEBSITE_STATS (
    WEBSITE_MD5 BINARY(16) NOT NULL,
    DATE_COLLECTED UNSIGNED_DATE NOT NULL,
    WEBSITE_URL VARCHAR,
    HIT_COUNT UNSIGNED_LONG,
    CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED));

You could issue this create view statement and map directly to your HBase
table. I used the UNSIGNED types above as they match the serialization you
get when you use the HBase Bytes utility methods. Phoenix normalizes column
names by upper casing them, so if your column qualifiers are lower case,
you'd want to put the column names above in double quotes.

Next, you'd create a table to hold the top10 information:

CREATE TABLE WEBSITE_TOP10 (
    WEBSITE_URL VARCHAR PRIMARY KEY,
    TOTAL_HIT_COUNT BIGINT);

Then you'd run an UPSERT SELECT command like this:

UPSERT INTO WEBSITE_TOP10
SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS
GROUP BY WEBSITE_URL
ORDER BY SUM(HIT_COUNT)
LIMIT 10;

Phoenix will run the SELECT part of this in parallel on the client and use
a coprocessor on the server side to aggregate over the WEBSITE_URLs
returning the distinct set of urls per region with a final merge sort
happening o the client to compute the total sum. Then the client will hold
on to the top 10 rows it sees and upsert these into the WEBSITE_TOP10 table.

HTH,
James


On Fri, May 23, 2014 at 5:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Would the new HBase table reside in the same cluster as the original table
> ?
>
> See this recent thread: http://search-hadoop.com/m/DHED4uBNqJ1
>
> Cheers
>
>
> On Fri, May 23, 2014 at 2:49 PM, Shaikh Ahmed <rnsr.shaikh@gmail.com>
> wrote:
>
> > Hi,
> >
> > We have one huge HBase table with billions of rows. This table holds the
> > information about websites and number of hits on that site in every 15
> > minutes.
> >
> > Every website will have multiple records in data with different number of
> > hit count and last updated timestamp.
> >
> > Now, we want to create another Hbase table which will contain information
> > about only those TOP 10 websites which are having more number of hits.
> >
> > We are seeking help from experts to achieve this requirement.
> > How we can filter top 10 websites based on hits count from billions of
> > records and copy it into our new HBase table?
> >
> > I will greatly appreciate kind support from group members.
> >
> > Thanks in advance.
> > Regards,
> > Riyaz
> >
>

--047d7bdc82206a892804fa1b139c--