hbase-user mailing list archives

From Wilm Schumacher <wilm.schumac...@cawoom.com>
Subject Newbie question: Rowkey design
Date Mon, 16 Dec 2013 15:34:34 GMT

I'm a newbie to HBase and have a question about rowkey design. I
hope this question isn't too newbie-like for this list. It is a question
that cannot be answered by knowledge of the code, only by experience with
large databases, hence this mail.

For the sake of explanation I'll create a small example. Suppose you want
to design a small "blogging" platform. You just want to store the name
of the user and a short text. And of course you want to be able to get all
postings of one user.

Furthermore we have 4 users, let's call them A, B, C and D (and you can trust
that the length of the username is fixed). Now let's say that A, B and C
each have N postings, and D has 6*N postings. BUT: the data of A is fetched
3 times as often as the data of each of the other users!

If you create an HBase cluster with 10 nodes, every node holds N
postings (of course I know that the data is held redundantly, but this
is not so important for the question).

Rowkey design #1:
The i-th posting of user X would have the rowkey "$X$i", e.g. "A003".
The table would just be: "create 'postings', 'text'"
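
To make the key scheme concrete, here is a minimal Python sketch (plain Python, not HBase client code) of how such keys sort and why all postings of one user form one contiguous key range. It relies on the fact that HBase orders rows lexicographically by rowkey, and it assumes a fixed-width, zero-padded counter as in "A003". The helper names rowkey and prefix_scan are made up for illustration.

```python
import bisect

def rowkey(user: str, i: int, width: int = 3) -> str:
    # Zero-pad the counter so that lexicographic order matches numeric order.
    return f"{user}{i:0{width}d}"

def prefix_scan(sorted_keys, prefix):
    # In a sorted key space, all keys sharing a prefix are contiguous.
    # The stop key is the prefix with its last character incremented ("A" -> "B").
    stop_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    start = bisect.bisect_left(sorted_keys, prefix)
    stop = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[start:stop]

# Five postings per user, purely as toy data.
keys = sorted(rowkey(u, i) for u in "ABCD" for i in range(1, 6))
print(prefix_scan(keys, "A"))
```

Fetching all postings of user A is then a single range read from "A" up to (but not including) "B".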

For this rowkey design the first node would hold the data of A, the
second that of B, the third that of C, and the fourth to the tenth node
the data of D.

Fetching the data would be very easy, but half of the traffic would hit
the first node.
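
The "half of the traffic" figure is just arithmetic on the access ratios. A minimal Python sketch, assuming relative fetch weights of 3 for A and 1 for each other user, the node layout described for design #1, and (my simplification) that a user's traffic spreads evenly over the nodes holding that user's rows. All names are illustrative.

```python
from fractions import Fraction

# Relative fetch frequency: A is fetched 3 times as often as each other user.
fetch_weight = {"A": 3, "B": 1, "C": 1, "D": 1}

# Assumed layout for rowkey design #1, nodes numbered 1..10.
nodes_of_user = {"A": [1], "B": [2], "C": [3], "D": list(range(4, 11))}

def traffic_share_per_node(fetch_weight, nodes_of_user):
    total = sum(fetch_weight.values())
    share = {}
    for user, nodes in nodes_of_user.items():
        # Assumption: a user's traffic is split evenly over that user's nodes.
        per_node = Fraction(fetch_weight[user], total * len(nodes))
        for node in nodes:
            share[node] = share.get(node, 0) + per_node
    return share

shares = traffic_share_per_node(fetch_weight, nodes_of_user)
for node in sorted(shares):
    print(f"node {node}: {float(shares[node]):.1%}")
```

Node 1 comes out at 3/6 of all requests, while each of the seven nodes holding D's data sees only 1/42.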

Rowkey design #2:
The rowkey would be random, e.g. a UUID. The table design would now be:
"create 'postings', 'user', 'text'"

Fetching the data would be a "real" MapReduce job: check each row for
the user, emit the matches, etc.

So for every fetch I have to spend more computation cycles and
IO. But in this scenario the traffic would be spread over all 10 servers.
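
The extra work in design #2 can be illustrated with a toy model. This is a plain-Python sketch, not HBase or MapReduce code: the table is a dict keyed by random UUIDs, and the names build_table and fetch_user are hypothetical. Since the rowkey says nothing about the user, there is no key range to seek to, so every row has to be examined.

```python
import uuid

def build_table(postings):
    # postings: iterable of (user, text); each row gets a random UUID rowkey.
    return {str(uuid.uuid4()): {"user": user, "text": text}
            for user, text in postings}

def fetch_user(table, user):
    # Full scan: examine every row and keep those belonging to `user`.
    examined = 0
    hits = []
    for key in sorted(table):  # sorted() mimics a scan in rowkey order
        examined += 1
        row = table[key]
        if row["user"] == user:
            hits.append(row["text"])
    return hits, examined

# Five postings per user, purely as toy data.
postings = [(u, f"post {i} by {u}") for u in "ABCD" for i in range(5)]
table = build_table(postings)
hits, examined = fetch_user(table, "A")
print(len(hits), "matching rows,", examined, "rows examined")
```

With 20 rows in the toy table, all 20 are read to return the 5 that belong to A. That is the cost side of the trade-off; the benefit is that the rows, and hence the reads, are spread evenly over the key space.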

If N (the number of postings) is large enough that disk space becomes
critical, I'm also unable to adjust the key regions so that, e.g., the
data of D sits only on the last server and the key space of A spans the
first 5 nodes. Nor can I make the replication very broad (e.g. 10 times
in this case).

So basically the question is: What's the better plan? Trying to avoid
the computation cycles of MapReduce by getting the key design straight, or
trying to scale the computation at the cost of more IO?

I hope the small example helps to make the question clearer.

Best wishes

