hbase-user mailing list archives

From Tao Xiao <xiaotao.cs....@gmail.com>
Subject Re: Newbie question: Rowkey design
Date Tue, 17 Dec 2013 02:56:52 GMT
Row key design is often a trade-off between load balancing and query
convenience: if you design the row key so that queries are fast and
convenient, the records may not be spread evenly across the nodes; if
you design the row key so that the records are spread evenly across the
nodes, queries may be inconvenient, or it may be impossible to get the record
through row key directly (say you have a random number as the row key's

You could have a look at secondary indexes. A secondary index is very helpful here.
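
One possible middle ground between the two extremes is a "salted" key: a small bucket prefix spreads one user's rows over several key ranges, while the user name stays in the key so that a lookup remains a handful of prefix scans. The following is only a minimal sketch of the key layout, assuming fixed-length user names and three-digit posting indexes as in the quoted example. The names `NUM_BUCKETS`, `salted_key` and `scan_prefixes` are hypothetical, and no HBase API is used.

```python
from typing import List

NUM_BUCKETS = 10  # assumption: one bucket per node of the 10-node example


def salted_key(user: str, index: int, buckets: int = NUM_BUCKETS) -> str:
    """Build a key of the form '<bucket>-<user><index>', e.g. '3-A003'."""
    # The bucket is derived from the posting index, so consecutive
    # postings of the same user land in different key ranges.
    bucket = index % buckets
    # Zero-pad the bucket so that keys sort correctly for any bucket count.
    width = len(str(buckets - 1))
    return f"{bucket:0{width}d}-{user}{index:03d}"


def scan_prefixes(user: str, buckets: int = NUM_BUCKETS) -> List[str]:
    """Return the key prefixes that must be scanned to read all postings of a user."""
    width = len(str(buckets - 1))
    return [f"{b:0{width}d}-{user}" for b in range(buckets)]
```

The cost of this layout is visible in `scan_prefixes`: reading all postings of one user takes one prefix scan per bucket instead of a single scan, which is the load-balance versus query trade-off in miniature.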

2013/12/16 Wilm Schumacher <wilm.schumacher@cawoom.com>

> Hi,
> I'm a newbie to hbase and have a question on the rowkey design and I
> hope this question isn't too newbie-like for this list. I have a question
> which cannot be answered by knowledge of code but by experience with
> large databases, thus this mail.
> For the sake of explanation I'll create a small example. Suppose you want
> to design a small "blogging" platform. You just want to store the name
> of the user and a small text. And of course you want to get all postings
> of one user.
> Furthermore we have 4 users, let's call them A,B,C,D (and you can trust
> that the length of the username is fixed). Now let's say that A, B and C
> each have N postings, and D has 7*N postings. BUT: the data of A is
> fetched 3 times as often as the data of each of the other users!
> If you create an HBase cluster with 10 nodes, every node is holding N
> postings (of course I know that the data is held redundantly, but this
> is not so important for the question).
> Rowkey design #1:
> the i-th posting of user X would have the rowkey: "$X$i", e.g. "A003".
> The table just would be: "create 'postings' , 'text'"
> For this rowkey design the first node would hold the data of A, the
> second of B, the third of C and the fourth to the tenth node the data of D.
> Fetching of data would be very easy, but half of the traffic would hit
> the first node.
> Rowkey design #2:
> the rowkey would be random, e.g. a UUID. The table design would now be:
> "create 'postings' , 'user' , 'text'"
> Fetching the data would be a "real" map-reduce job, checking for
> the user, emitting, etc.
> So, if a fetch takes place, I have to do more computation cycles and
> IO. But in this scenario the traffic would be spread over all 10 servers.
> If N (the number of postings) is large enough that disk
> space is critical, I'm also not able to adjust the key regions in a way
> that e.g. the data of D is only on the last server and the key space of
> A spans the first 5 nodes, or to make replication very broad (e.g.
> 10 times in this case).
> So basically the question is: What's the better plan? Trying to avoid
> computation cycles of map reducing and get the key design straight, or
> trying to scale the computation, but doing more IO?
> I hope that the small example helped to make the question more vivid.
> Best wishes
> Wilm
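
The "half of the traffic would hit the first node" figure for rowkey design #1 can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated in the mail: "3 times more often" is read as a 3:1:1:1 fetch ratio, and reads for D are spread evenly over the nodes holding D's data. The names `fetch_weight`, `node_user` and `node_traffic_share` are hypothetical.

```python
from fractions import Fraction
from typing import Dict

# Relative fetch frequency per user: A is fetched 3 times as often
# as each of the other users (assumed 3:1:1:1 ratio).
fetch_weight = {"A": 3, "B": 1, "C": 1, "D": 1}

# Node -> user under rowkey design #1: A, B and C occupy one node each,
# D occupies nodes 4 to 10.
node_user = {1: "A", 2: "B", 3: "C", **{n: "D" for n in range(4, 11)}}


def node_traffic_share(weights: Dict[str, int],
                       placement: Dict[int, str]) -> Dict[int, Fraction]:
    """Fraction of all reads hitting each node, assuming a user's reads
    are spread evenly over the nodes that hold that user's data."""
    total = sum(weights.values())
    nodes_per_user: Dict[str, int] = {}
    for user in placement.values():
        nodes_per_user[user] = nodes_per_user.get(user, 0) + 1
    return {
        node: Fraction(weights[user], total) / nodes_per_user[user]
        for node, user in placement.items()
    }


shares = node_traffic_share(fetch_weight, node_user)
for node in sorted(shares):
    print(f"node {node:2d} ({node_user[node]}): {shares[node]}")
```

Under these assumptions node 1 receives 1/2 of all reads, nodes 2 and 3 receive 1/6 each, and each of the seven nodes holding D receives 1/42.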
