hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From James Taylor <jtay...@salesforce.com>
Subject Re: Rowkey design and presplit table
Date Thu, 07 Mar 2013 08:42:35 GMT
Another possible solution for you: use Phoenix: 

Phoenix would allow you to model your scenario using SQL through JDBC, 
like this:

Connection conn = DriverManager.connect("jdbc:phoenix:<your zookeeper 
Statement stmt = conn.createStatement(
     "CREATE TABLE article_discussion (" +
     "    category_id BIGINT not null," +
     "    article_id BIGINT not null," +
     "    comment_id BIGINT not null
     "    CONSTRAINT pk PRIMARY KEY (category_id, article_id, 

You could, of course, add other non primary key columns to the above.  
You could also pre-split the table and/or pass through any HBase 
configuration parameters necessary.

Then, you populate your table like this:

PrepareStatement stmt = conn.prepareStatement(
     "UPSERT INTO article_discussion VALUES(?,?,?)");
stmt.setLong("category_id", categoryId);
stmt.setLong("article_id", articleId);
stmt.setLong("comment_id", commentId);

And to query for all comments in a given category, you'd do this:

PrepareStatement stmt1 = conn.prepareStatement(
     "SELECT * FROM article_discussion WHERE category_id = ?");
stmt1.setLong("category_id", categoryId);
ResultSet rs = stmt1.executeQuery();
while (rs.next()) {
         "category_id=" + rs.getLong("category_id") +
         ",article_id=" + rs.getLong("article_id") +
         ",comment_id=" + rs.getLong("comment_id"));

and for all comments for a given category and article:

PrepareStatement stmt1 = conn.prepareStatement(
     "SELECT * FROM article_discussion WHERE category_id = ? AND 
article_id = ?");
stmt1.setLong("category_id", categoryId);
stmt1.setLong("article_id", articleId);

you could use arbitrary where clauses like this:

SELECT * FROM article_discussion
WHERE (article_name LIKE '% baseball %' OR article_name LIKE '% food %')
AND category_id IN (?, ?, ?)
AND published_date >= to_date('2013-01-01 00:00:00')
AND published_date <= to_date('2013-01-31 00:00:00')

or do aggregation like this:

SELECT category_id, count(article_id), count(comment_id)
FROM article_discussion
WHERE published_date > ?
GROUP BY category_id
HAVING count(comment_id) > 100



On 03/06/2013 11:42 PM, Asaf Mesika wrote:
> I would convert each id to long and then use Bytes.toBytes to convert this
> long to a byte array. If it is an int then even better.
> Now, write all 3 longs one after another to one array which will be your
> rowkey.
> This gives you:
> * fixed size
> * small row key - 3*8 bytes if you use long and 3*4 for int.
> Why do you need to use prefix split policy?
> On Monday, March 4, 2013, Lukáš Drbal wrote:
>> Hi,
>> i have one question about rowkey design and presplit table.
>> My usecase:
>> I need store a lot of comments where each comment are for one article and
>> this article has one category.
>> What i need:
>> 1) read one comment by id (where i know commentId, articleId and
>> categoryId)
>> 2) read all coments for article (i know categoryId and articleId)
>> 3) read all comments for category (i know categoryId)
>>  From this read pattern i see one good rowkey:
>> <categoryId>_<articleId>_<commentId>
>> But here i don't have fixed size of rowkey, so i don't know how to define
>> split pattern. How can be this solved?
>> This id's come from external system and grow very fast, so add some like
>> "padding" for each part are hard.
>> Maybe i can use hash function for each part
>> md5(<categoryId>_md5(<articleId>)_md5(<commentId>), but this rowkey
is very
>> long (3*32+2 bytes), i don't have experience with this long rowkeys.
>> Can someone give me a suggestions please?
>> Regards
>> Lukas Drbal

View raw message