hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mohammad Tariq <donta...@gmail.com>
Subject Re: Row Key Design in time based aplication
Date Sun, 17 Feb 2013 23:30:30 GMT
Hello Mehmet,

 If ProjectIds are sequential, then it is definitely not a feasible
approach. Division is just to make sure that all the regions are
evenly loaded. You can create pre-splitted tables to avoid the
region hotspotting. Alternatively hash your rowkeys so that all
the regionservers receive equal load.

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Mon, Feb 18, 2013 at 4:48 AM, Michael Segel <michael_segel@hotmail.com>wrote:

> I'm not sure how a SQL interface above HBase will solve some of the issues
> with regional hot spotting when using time as the key. Or the problem with
> always adding data to the right of the last row.
>
> The same would apply with the project id, assuming that it too is a number
> that grows incrementally with each project.
> On Feb 17, 2013, at 4:50 PM, James Taylor <jtaylor@salesforce.com> wrote:
>
> > Hello,
> > Have you considered using Phoenix (
> https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a
> SQL layer on top of HBase. For this use case, you'd connect to your cluster
> like this:
> >
> > Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register
> driver
> > Connection conn =
> DriverManager..getConnection("jdbc:phoenix:localhost"); // connect to local
> HBase
> >
> > Create a table like this (adding additional columns that you want to
> measure, like txn_count below):
> >
> > conn.createStatement().execute(
> >    "CREATE TABLE event_log (\n" +
> >    "     project_id INTEGER NOT NULL, \n" +
> >    "    time DATE NOT NULL,\n" +
> >    "txn_count LONG\n" +
> >    "CONSTRAINT pk PRIMARY KEY (project_id, time))");
> >
> > Then to insert data you'd do this:
> >
> > PreparedStatement preparedStmt = conn.prepareStatement(
> >    "UPSERT INTO event_log VALUES(?,?,0)");
> >
> > and you'd bind the values in JDBC like this:
> >
> > preparedStmt.setInt(1, projectId);
> > preparedStmt.setDate(2, time);
> > preparedStmt.execute();
> >
> > conn.commit(); // If upserting many values, you'd want to commit after
> upserting maybe 1000-10000 rows
> >
> > Then at query data time, assuming you want to report on this data by
> grouping into different "time buckets", you could do as show below. Phoenix
> stores your date values at the millisecond granularity and you can decide a
> query time how you'd like to roll it up:
> >
> > // Query with time bucket at the hour granularity
> > conn.createStatement().execute(
> >   "SELECT\n" +
> >   "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
> >   "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
> >   "GROUP BY project_id, TRUNC(time,'HOUR')");
> >
> > // Query with time bucket at the day granularity
> > conn.createStatement().execute(
> >    "SELECT\n" +
> >    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
> >    "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
> >    "GROUP BY project_id, TRUNC(time,'DAY')");
> >
> > You could, of course include a WHERE clause in the query to filter based
> on the range of dates, particular projectIds, etc. like this:
> >
> > conn.prepareStatement(
> >    "SELECT\n" +
> >    "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
> >    "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
> >    "WHERE project_id IN (?, ?, ?) AND date >= ? AND date < ?\n" +
> >    "GROUP BY project_id, TRUNC(time,'DAY')");
> > preparedStmt.setInt(1, projectId1);
> > preparedStmt.setInt(2, projectId2);
> > preparedStmt.setInt(3, projectId3);
> > preparedStmt.setDate(4, beginDate);
> > preparedStmt.setDate(5, endDate);
> > preparedStmt.execute();
> >
> >
> > HTH.
> >
> > Regards,
> >
> >    James
> >
> > On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
> >> Hi,
> >>
> >> I want to hold event log data in hbase but I couldn't decide row key. I
> must hold project id and time,I will use project ld and time combination
> while searching.
> >>
> >> Row key can be below
> >>
> >> ProjectId+timeInMs
> >>
> >> In similiar application(open source TSDB) time is divided 1000 to round
> in this project.I can use this strategy but I don't know how we decide what
> divider must be?  1000 or 10000.
> >>
> >> Why time is divided 1000 in this application? why didn't be hold
> without division?
> >>
> >> Can you explain this strategy?
> >>
> >
> >
>
> Michael Segel  | (m) 312.755.9623
>
> Segel and Associates
>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message