hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <jtay...@salesforce.com>
Subject Re: Row Key Design in time based aplication
Date Sun, 17 Feb 2013 22:50:47 GMT
Have you considered using Phoenix 
(https://github.com/forcedotcom/phoenix) for this use case? Phoenix is a 
SQL layer on top of HBase. For this use case, you'd connect to your 
cluster like this:

Class.forName("com.salesforce.phoenix.jdbc.PhoenixDriver"); // register 
Connection conn = 
DriverManager..getConnection("jdbc:phoenix:localhost"); // connect to 
local HBase

Create a table like this (adding additional columns that you want to 
measure, like txn_count below):

     "CREATE TABLE event_log (\n" +
     "     project_id INTEGER NOT NULL, \n" +
     "    time DATE NOT NULL,\n" +
     "txn_count LONG\n" +
     "CONSTRAINT pk PRIMARY KEY (project_id, time))");

Then to insert data you'd do this:

PreparedStatement preparedStmt = conn.prepareStatement(
     "UPSERT INTO event_log VALUES(?,?,0)");

and you'd bind the values in JDBC like this:

preparedStmt.setInt(1, projectId);
preparedStmt.setDate(2, time);

conn.commit(); // If upserting many values, you'd want to commit after 
upserting maybe 1000-10000 rows

Then at query data time, assuming you want to report on this data by 
grouping into different "time buckets", you could do as show below. 
Phoenix stores your date values at the millisecond granularity and you 
can decide a query time how you'd like to roll it up:

// Query with time bucket at the hour granularity
    "SELECT\n" +
    "    project_id, TRUNC(time,'HOUR') as time_bucket, \n" +
    "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
    "GROUP BY project_id, TRUNC(time,'HOUR')");

// Query with time bucket at the day granularity
     "SELECT\n" +
     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
     "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
     "GROUP BY project_id, TRUNC(time,'DAY')");

You could, of course include a WHERE clause in the query to filter based 
on the range of dates, particular projectIds, etc. like this:

     "SELECT\n" +
     "    project_id, TRUNC(time,'DAY') as time_bucket,\n" +
     "    MIN(txnCount), MAX(txnCount), AVG(txnCount) FROM event_log\n" +
     "WHERE project_id IN (?, ?, ?) AND date >= ? AND date < ?\n" +
     "GROUP BY project_id, TRUNC(time,'DAY')");
preparedStmt.setInt(1, projectId1);
preparedStmt.setInt(2, projectId2);
preparedStmt.setInt(3, projectId3);
preparedStmt.setDate(4, beginDate);
preparedStmt.setDate(5, endDate);




On 02/17/2013 11:33 AM, Mehmet Simsek wrote:
> Hi,
> I want to hold event log data in hbase but I couldn't decide row key. I must hold project
id and time,I will use project ld and time combination while searching.
> Row key can be below
> ProjectId+timeInMs
> In similiar application(open source TSDB) time is divided 1000 to round in this project.I
can use this strategy but I don't know how we decide what divider must be?  1000 or 10000.
> Why time is divided 1000 in this application? why didn't be hold without division?
> Can you explain this strategy?

View raw message