hbase-user mailing list archives

From Cristofer Weber <cristofer.we...@neogrid.com>
Subject RES: Rowkey hashing to avoid hotspotting
Date Mon, 16 Jul 2012 19:00:12 GMT
Hi Anand,

As usual, the answer is that 'it depends'  :)

I think the main question here is: why are you afraid that this setup would lead to region
server hotspotting? Is it because you don't know what your production data will look like?

Based on what you told us about your rowkey, you will query mostly by providing model no. + serial
no., but:
1 - How are your rowkeys distributed? Are there tons of different modelNumbers AND serialNumbers?
Few modelNumbers and a lot of serialNumbers? Few of both?
2 - Putting modelNumber at the front of your rowkey means that your data will be sorted by modelNumber first.
So, what rule determines how a modelNumber is created? Is it a sequential number that
increases over time? If so, are newer modelNumbers accessed a lot more than older ones?
If not, what drives this number? Is it an encoding rule?
3 - Do you expect more write/read load on a few of these modelNumbers and/or serialNumbers?
Will it be similar to a Pareto distribution? Distributed over what?

Also, two other things got my attention here... 
1 - Why are you filtering with a regex? If your queries are over model no. + serial no., why
don't you just scan starting at your modelNumber+serialNumber and stopping at the next modelNumber+serialNumber?
Or is there another access pattern that your composite rowkey doesn't cover?
2 - Why do you have to add a timestamp to ensure uniqueness?
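
To make the start/stop idea from point 1 concrete, here is a minimal sketch. It does not use the HBase client API; it only simulates HBase's sorted rowkey order with a TreeMap. The key layout (modelNumber|serialNumber|timestamp with a '|' delimiter) and all class and method names are assumptions made for illustration.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Simulates sorted rowkeys with a TreeMap; this is NOT the HBase client API.
class PrefixRangeSketch {

    // Smallest key that sorts after every key starting with `prefix`:
    // bump the last character by one. Assumes a non-empty prefix whose
    // last character is not Character.MAX_VALUE.
    static String stopKeyFor(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Returns every row whose key starts with `prefix` (start inclusive, stop exclusive).
    static NavigableMap<String, String> scanPrefix(NavigableMap<String, String> table,
                                                   String prefix) {
        return table.subMap(prefix, true, stopKeyFor(prefix), false);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = new TreeMap<>();
        // Hypothetical rowkey layout: modelNumber|serialNumber|timestamp
        table.put("M100|S0001|1342465200", "a");
        table.put("M100|S0001|1342468800", "b");
        table.put("M100|S0002|1342465200", "c");
        table.put("M200|S0001|1342465200", "d");

        // All versions of one component (model + serial).
        System.out.println(scanPrefix(table, "M100|S0001|").keySet());
        // Number of rows for one model, without any regex filter.
        System.out.println(scanPrefix(table, "M100|").size());
    }
}
```

In HBase terms, the prefix and the value returned by stopKeyFor would play the role of the scan's start row and stop row. The trailing delimiter in the prefix matters: without it, a scan for model "M10" would also pick up "M100".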

Now, answering your question without more info about your data, you can apply hashing in two
ways:
1 - Generating a hash (MD5 is the most common, as far as I have read) and using only this
hash as your rowkey. Based on what you have told us, this doesn't fit your needs, because
you would not be able to apply your filter anymore.
2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice that the hash portion
must be your rowkey prefix to ensure a reasonably balanced distribution over your region
servers. I'm working on a case that is a bit similar to yours,
and what I'm doing right now is calculating the hashValue of my rowkey and using a Java Formatter
to create a hex string to prepend to my rowkey. Something like String.format("%03x", hashValue).
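
A minimal sketch of the salting approach in point 2, using only the Java standard library. The bucket count, the '|' delimiter, the choice of Arrays.hashCode as the hash, and all names are assumptions for illustration, not a description of any particular production setup. Two details are worth noting: "%03x" only sets a minimum width, so the hash value has to be bounded (here to 0..4095) to get exactly three hex digits, and the salt in this sketch is derived only from the parts of the key that are known at query time (model and serial number, not the timestamp), so a reader can recompute it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SaltedRowKeySketch {
    // Assumption: 4096 buckets, so the salt always fits in three hex digits.
    static final int BUCKETS = 0x1000;

    // Salt computed from the query-time parts of the key only.
    static String salt(String modelNumber, String serialNumber) {
        byte[] stable = (modelNumber + "|" + serialNumber).getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the value in 0..BUCKETS-1 even for negative hash codes.
        int hashValue = Math.floorMod(Arrays.hashCode(stable), BUCKETS);
        return String.format("%03x", hashValue);
    }

    // Full rowkey written to the table: salt|model|serial|timestamp.
    static String rowKey(String modelNumber, String serialNumber, long timestamp) {
        return scanPrefix(modelNumber, serialNumber) + timestamp;
    }

    // Prefix a reader would scan to find every row of one component.
    static String scanPrefix(String modelNumber, String serialNumber) {
        return salt(modelNumber, serialNumber) + "|" + modelNumber + "|" + serialNumber + "|";
    }

    public static void main(String[] args) {
        System.out.println(rowKey("M100", "S0001", 1342465200L));
        System.out.println(rowKey("M100", "S0002", 1342465200L));
    }
}
```

The trade-off of this layout is that rows of one model are no longer contiguous: counting all components of a model means scanning every salt bucket (or salting on the model number alone, which gives up some of the spread).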

In both cases, you still have to split your regions in advance, and it is better to work out
your splits before you start feeding your table with production data.
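
As a rough illustration of working out the splits in advance, the sketch below computes evenly spaced split points over a three-hex-digit salt prefix (000 to fff). The three-digit salt space, the region count, and the names are assumptions; the right number of regions depends on your cluster and data.

```java
import java.util.ArrayList;
import java.util.List;

class SaltSplitSketch {
    // Size of the assumed salt space: three hex digits, 000..fff.
    static final int SALT_SPACE = 0x1000;

    // Returns regions-1 boundaries that cut the salt space into equal ranges.
    static List<String> splitPoints(int regions) {
        if (regions < 2 || regions > SALT_SPACE) {
            throw new IllegalArgumentException("regions must be between 2 and " + SALT_SPACE);
        }
        List<String> points = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            points.add(String.format("%03x", i * SALT_SPACE / regions));
        }
        return points;
    }

    public static void main(String[] args) {
        System.out.println(splitPoints(4)); // prints [400, 800, c00]
    }
}
```

Converted to bytes, boundaries like these are what you would pass as split keys when creating the table, so that the regions exist before the first production write arrives.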

Also, you have to study the consequences that changing your rowkey will bring. It doesn't come
for free.

There are a lot of words and a lot of questions here, so by now I feel I'm starting to shoot in
the dark. Try to understand your production data, and if you have more to share, it will
certainly help!

Regards,
Cristofer

-----Original Message-----
From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
Sent: Monday, July 16, 2012 02:30
To: user@hbase.apache.org
Subject: Rowkey hashing to avoid hotspotting

Hi,

I am using HBase to store data about mechanical components. Each component has a model no., a
serial no., and some other attributes.

I will be querying my data mostly by model no. and serial no., so I created a composite key
from these two attributes and added a timestamp to make it unique.

To filter the data, I use a row filter with a regex string comparator, and it works well with
sample seed data. Now I am afraid that this setup will lead to region server hotspotting
when we load production data into HBase. I read that hashing may solve this problem. Can someone
help me implement hashing of the row key? I also want the row filter to keep working, as I
have to display the number of components on a web page and I use the row filter to implement
that functionality. Any guidance would be of great help.

--
Regards,
Anand
