From: Mohammad Tariq
Date: Mon, 18 Feb 2013 05:00:30 +0530
Subject: Re: Row Key Design in time based application
To: user@hbase.apache.org

Hello Mehmet,

If the ProjectIds are sequential, then it is definitely not a feasible approach. The division is just to make sure that all the regions are evenly loaded. You can create pre-split tables to avoid region hotspotting. Alternatively, hash your rowkeys so that all the regionservers receive an equal load.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com

On Mon, Feb 18, 2013 at 4:48 AM, Michael Segel wrote:

> I'm not sure how a SQL interface above HBase will solve some of the issues
> with regional hot spotting when using time as the key, or the problem with
> always adding data to the right of the last row.
>
> The same would apply to the project id, assuming that it too is a number
> that grows incrementally with each project.
>
> On Feb 17, 2013, at 4:50 PM, James Taylor wrote:
>
> > Hello,
> > Have you considered using Phoenix
> > (https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a
> > SQL layer on top of HBase.
> > For this use case, you'd connect to your cluster like this:
> >
> > Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register driver
> > Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost"); // connect to local HBase
> >
> > Create a table like this (adding any additional columns that you want to
> > measure, like txn_count below):
> >
> > conn.createStatement().execute(
> >     "CREATE TABLE event_log (\n" +
> >     "    project_id INTEGER NOT NULL,\n" +
> >     "    time DATE NOT NULL,\n" +
> >     "    txn_count LONG,\n" +
> >     "    CONSTRAINT pk PRIMARY KEY (project_id, time))");
> >
> > Then to insert data you'd do this:
> >
> > PreparedStatement preparedStmt = conn.prepareStatement(
> >     "UPSERT INTO event_log VALUES(?, ?, 0)");
> >
> > and you'd bind the values in JDBC like this:
> >
> > preparedStmt.setInt(1, projectId);
> > preparedStmt.setDate(2, time);
> > preparedStmt.execute();
> >
> > conn.commit(); // If upserting many values, commit after every 1000-10000 rows or so
> >
> > Then at query time, assuming you want to report on this data by grouping
> > it into different "time buckets", you could do as shown below.
> > Phoenix stores your date values at millisecond granularity, and you can
> > decide at query time how you'd like to roll them up:
> >
> > // Query with a time bucket at hour granularity
> > conn.createStatement().execute(
> >     "SELECT\n" +
> >     "    project_id, TRUNC(time,'HOUR') as time_bucket,\n" +
> >     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
> >     "GROUP BY project_id, TRUNC(time,'HOUR')");
> >
> > // Query with a time bucket at day granularity
> > conn.createStatement().execute(
> >     "SELECT\n" +
> >     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
> >     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
> >     "GROUP BY project_id, TRUNC(time,'DAY')");
> >
> > You could, of course, include a WHERE clause in the query to filter on a
> > range of dates, particular project ids, etc., like this:
> >
> > PreparedStatement preparedStmt = conn.prepareStatement(
> >     "SELECT\n" +
> >     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
> >     "    MIN(txn_count), MAX(txn_count), AVG(txn_count) FROM event_log\n" +
> >     "WHERE project_id IN (?, ?, ?) AND time >= ? AND time < ?\n" +
> >     "GROUP BY project_id, TRUNC(time,'DAY')");
> > preparedStmt.setInt(1, projectId1);
> > preparedStmt.setInt(2, projectId2);
> > preparedStmt.setInt(3, projectId3);
> > preparedStmt.setDate(4, beginDate);
> > preparedStmt.setDate(5, endDate);
> > preparedStmt.execute();
> >
> > HTH.
> >
> > Regards,
> > James
> >
> > On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
> >
> >> Hi,
> >>
> >> I want to hold event log data in HBase, but I couldn't decide on a row
> >> key. I must hold the project id and the time; I will search using the
> >> project id and time combination.
> >>
> >> The row key could be:
> >>
> >> ProjectId+timeInMs
> >>
> >> In a similar application (the open source TSDB), the time is divided by
> >> 1000 to round it. I could use this strategy, but I don't know how to
> >> decide what the divisor should be: 1000 or 10000?
> >>
> >> Why is the time divided by 1000 in that application? Why isn't it
> >> stored without division?
> >> Can you explain this strategy?
>
> Michael Segel | (m) 312.755.9623
> Segel and Associates
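The rowkey hashing Tariq suggests, combined with the TSDB-style divide-by-1000 time bucketing Mehmet asks about, can be sketched as below. This is a minimal illustration, not code from the thread: the salt-bucket count, the 13-byte layout, and the second-granularity bucket are all illustrative assumptions.

```java
import java.nio.ByteBuffer;

public class SaltedRowKey {
    // Illustrative: pick roughly one salt bucket per expected region
    static final int SALT_BUCKETS = 16;

    // rowkey layout (assumed): salt (1 byte) + projectId (4 bytes) + timeBucket (8 bytes).
    // A leading salt derived from the rest of the key spreads sequential
    // projectId/time writes across regionservers instead of hammering one.
    static byte[] rowKey(int projectId, long timeMs) {
        long timeBucket = timeMs / 1000; // TSDB-style: second granularity
        // floorMod keeps the salt non-negative even if the hash is negative
        int salt = Math.floorMod(31 * projectId + Long.hashCode(timeBucket), SALT_BUCKETS);
        return ByteBuffer.allocate(13)
                .put((byte) salt)
                .putInt(projectId)
                .putLong(timeBucket)
                .array();
    }

    public static void main(String[] args) {
        byte[] key = rowKey(42, 1361143870207L);
        System.out.println(key.length);    // 13: 1 salt + 4 id + 8 bucket
        System.out.println(key[0] & 0xff); // deterministic salt in [0, SALT_BUCKETS)
    }
}
```

The trade-off is the usual one: a salted key spreads writes, but a scan over one project's time range must then fan out to all SALT_BUCKETS prefixes and merge the results.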