hbase-user mailing list archives

From Shaikh Ahmed <rnsr.sha...@gmail.com>
Subject Re: Copy some records from Huge hbase table to another table
Date Sat, 24 May 2014 09:40:43 GMT
Thanks James,

I will investigate Apache Phoenix and get back to group with updates.


Regards,
Riyaz


On Sat, May 24, 2014 at 4:15 AM, James Taylor <jtaylor@salesforce.com> wrote:

> Hi Riyaz,
> You can do this with a single SQL command using Apache Phoenix, a SQL
> engine on top of HBase, and you'll get better performance than if you hand
> coded it using the HBase client APIs. Depending on your current schema, you
> may be able to run this command with no change to your data. Let's assume
> you have an MD5 hash of the website and the date/time in the row key with
> your website and counts in key values. That schema could be modeled like
> this in Phoenix:
>
> CREATE VIEW WEBSITE_STATS (
>     WEBSITE_MD5 BINARY(16) NOT NULL,
>     DATE_COLLECTED UNSIGNED_DATE NOT NULL,
>     WEBSITE_URL VARCHAR,
>     HIT_COUNT UNSIGNED_LONG,
>     CONSTRAINT PK PRIMARY KEY (WEBSITE_MD5, DATE_COLLECTED));
>
> You could issue this create view statement and map directly to your HBase
> table. I used the UNSIGNED types above as they match the serialization you
> get when you use the HBase Bytes utility methods. Phoenix normalizes column
> names by upper casing them, so if your column qualifiers are lower case,
> you'd want to put the column names above in double quotes.
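[Editor's note: the big-endian encoding behind those UNSIGNED types can be checked with plain JDK code. A minimal sketch, with no HBase dependency — `ByteBuffer` in its default big-endian order produces the same 8 bytes as HBase's `Bytes.toBytes(long)`, which is the serialization Phoenix's UNSIGNED_LONG expects:]

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class UnsignedLongEncoding {
    // HBase's Bytes.toBytes(long) writes 8 big-endian bytes;
    // Phoenix's UNSIGNED_LONG maps directly onto that serialization.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    public static void main(String[] args) {
        // A hit count of 42 serializes to [0, 0, 0, 0, 0, 0, 0, 42].
        System.out.println(Arrays.toString(encode(42L)));
    }
}
```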
>
> Next, you'd create a table to hold the top10 information:
>
> CREATE TABLE WEBSITE_TOP10 (
>     WEBSITE_URL VARCHAR PRIMARY KEY,
>     TOTAL_HIT_COUNT BIGINT);
>
> Then you'd run an UPSERT SELECT command like this:
>
> UPSERT INTO WEBSITE_TOP10
> SELECT WEBSITE_URL, SUM(HIT_COUNT) FROM WEBSITE_STATS
> GROUP BY WEBSITE_URL
> ORDER BY SUM(HIT_COUNT) DESC
> LIMIT 10;
>
> Phoenix will run the SELECT part of this in parallel on the client and use
> a coprocessor on the server side to aggregate over the WEBSITE_URLs,
> returning the distinct set of URLs per region, with a final merge sort
> happening on the client to compute the total sums. Then the client will
> hold on to the top 10 rows it sees and upsert them into the WEBSITE_TOP10
> table.
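[Editor's note: the aggregate-then-merge step described above can be sketched in plain Java. This is a simplified illustration, not Phoenix's actual implementation — each map below stands in for the partial sums one region returns, and the hypothetical `top10` method plays the role of the client-side merge that sums and ranks them:]

```java
import java.util.*;
import java.util.stream.*;

public class Top10Merge {
    // Merge per-region partial sums into global totals, then keep
    // the 10 URLs with the highest combined hit counts.
    static List<Map.Entry<String, Long>> top10(List<Map<String, Long>> perRegion) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, Long> region : perRegion) {
            // Same URL may appear in several regions; sum its counts.
            region.forEach((url, hits) -> totals.merge(url, hits, Long::sum));
        }
        return totals.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(10)
            .collect(Collectors.toList());
    }
}
```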
>
> HTH,
> James
>
>
>
>
>
>
> On Fri, May 23, 2014 at 5:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Would the new HBase table reside in the same cluster as the original
> > table?
> >
> > See this recent thread: http://search-hadoop.com/m/DHED4uBNqJ1
> >
> > Cheers
> >
> >
> > On Fri, May 23, 2014 at 2:49 PM, Shaikh Ahmed <rnsr.shaikh@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > We have one huge HBase table with billions of rows. This table holds
> > > information about websites and the number of hits on each site in
> > > every 15-minute window.
> > >
> > > Every website has multiple records in the data, each with a different
> > > hit count and last-updated timestamp.
> > >
> > > Now, we want to create another HBase table which will contain
> > > information about only the TOP 10 websites with the most hits.
> > >
> > > We are seeking help from experts to achieve this requirement.
> > > How can we filter the top 10 websites by hit count from billions of
> > > records and copy them into our new HBase table?
> > >
> > > I will greatly appreciate kind support from group members.
> > >
> > > Thanks in advance.
> > > Regards,
> > > Riyaz
> > >
> >
>
