hbase-user mailing list archives

From Cristofer Weber <cristofer.we...@neogrid.com>
Subject RE: Rowkey hashing to avoid hotspotting
Date Tue, 17 Jul 2012 21:44:53 GMT
So, Anand, there are some things that can help, but again, most of them depend on the
famous access patterns.

Sometimes it is not easy to learn about them in advance, but if you are replacing
another system you can study its data distribution: group for counts, means, changes over
time, etc. You can analyze partial data too, but that is risky because you depend on
the way this partial data was gathered; sample data may not be representative.

Salting your rowkey with a hash calculated over your model# will probably result in a uniform
distribution over a range (if using modulus), and pre-splitting your table will balance your
load over your Region Servers. Also, you will be able to recalculate the hash for a model#
before scanning for it, which lets you scan for specific rowkeys while restricting the scan
with startRow and stopRow. Remember that if your rowkeys share the same prefix they will probably
be located in the same region, and your scan will benefit from this.
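
A minimal sketch of the salting idea above, written in plain Java with no HBase dependency. The bucket count, the "|" delimiter, the key layout and all class and method names are assumptions made for illustration, not something taken from this thread. The salt is computed from the model# only and reduced with a modulus, so it can be recalculated before a scan and always fits the three hex digits of "%03x".

```java
class SaltedRowKey {
    // Hypothetical bucket count; "%03x" leaves room for up to 0x1000 buckets.
    static final int BUCKETS = 16;
    // Hypothetical delimiter; assumed never to occur inside a model or serial number.
    static final char SEP = '|';

    // Salt derived from the model number only, so every row of a model shares one prefix.
    static String salt(String model) {
        // floorMod keeps the bucket non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(model.hashCode(), BUCKETS);
        return String.format("%03x", bucket);
    }

    // Full key layout assumed here: salt|model|serial|timestamp
    static String rowKey(String model, String serial, long timestamp) {
        return salt(model) + SEP + model + SEP + serial + SEP + timestamp;
    }

    // Inclusive start of the key range that holds all rows of one model.
    static String startRow(String model) {
        return salt(model) + SEP + model + SEP;
    }

    // Exclusive end of that range: the start row with its last character incremented.
    // Assumes ASCII keys, where string order matches byte order.
    static String stopRow(String model) {
        return salt(model) + SEP + model + (char) (SEP + 1);
    }

    public static void main(String[] args) {
        System.out.println(rowKey("M100", "SN42", 1342561493000L));
        System.out.println(startRow("M100") + " .. " + stopRow("M100"));
    }
}
```

Converted to bytes, the two boundary strings would be the startRow and stopRow of the scan, so counting the rows of one model becomes a range scan rather than a regex filter over the whole table.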

I'm still curious about your need to add a timestamp after your model#,serial#... I have
some background in manufacturing systems, and usually a serial number is unique. But, of course,
it's just curiosity.  :-)


-----Original Message-----
From: Alex Baranau [mailto:alex.baranov.v@gmail.com]
Sent: Tuesday, July 17, 2012 12:53
To: user@hbase.apache.org
Subject: Re: Rowkey hashing to avoid hotspotting

The most common reason for RS hotspotting while writing data to HBase is writing rows with
monotonically increasing/decreasing row keys. E.g. if you put a timestamp in the first part
of your key, then you are likely to have monotonically increasing row keys. You can find more
info about this issue and how to solve it here: [1], and you may also want to look at an already
implemented salting solution [2].

As for RS hotspotting during reading - it is hard to predict without knowing what the most
common data access patterns are. E.g. putting model # in the first part of a key may seem like a good
distribution, but if your web site is used mostly by Mercedes owners, the majority of the read
load may be directed to just a few regions. Again, salting can help a lot here.

+1 to what Cristofer said on other things, esp: use partial key scans
where possible instead of filters, and pre-split your table.
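
To make the pre-splitting advice concrete, here is a small sketch in plain Java of how split points could be derived when rowkeys start with a fixed-width hex salt. The three-digit "%03x" width matches the format mentioned elsewhere in this thread; the bucket count and the class and method names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

class SaltSplitPoints {
    // One split point per bucket boundary: N buckets need N - 1 split keys.
    static List<String> splitKeys(int buckets) {
        List<String> keys = new ArrayList<>();
        for (int bucket = 1; bucket < buckets; bucket++) {
            // Same fixed-width hex format as the salt prefix of the rowkeys.
            keys.add(String.format("%03x", bucket));
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical choice of 16 buckets, giving 15 split keys and 16 regions.
        System.out.println(splitKeys(16));
    }
}
```

Converted to byte arrays, these keys could be handed to HBase when the table is created (the admin API has a createTable overload that accepts split keys), so each salt bucket starts out in its own region.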

Alex Baranau
Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr

[1] http://bit.ly/HnKjbc
[2] https://github.com/sematext/HBaseWD

On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <ananthu2050@gmail.com> wrote:

> Hi Cristofer,
> Thanks for the elaborate response!
> I do not have much information about production data, as I work with
> partial data. But based on discussions with my project partners, I have
> some answers for you.
> The number of model numbers and serial numbers will be finite. Not so many...
> As far as I know, there is no predefined rule for model number or
> serial number creation.
> I have two access patterns. I count the number of rows for a specific
> model number. I use a rowkey filter for this. Also, I filter the rows
> based on model, serial number and some other columns. I scan the table
> with a column value filter for this case.
> I will evaluate salting as you have explained.
> Regards,
> Anand.C
> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber < 
> cristofer.weber@neogrid.com> wrote:
> > Hi Anand,
> >
> > As usual, the answer is that 'it depends'  :)
> >
> > I think that the main question here is: why are you afraid that this setup
> > would lead to region server hotspotting? Is it because you don't know how your
> > production data will look?
> >
> > Based on what you told about your rowkey, you will query mostly by
> > providing model no. + serial no., but:
> > 1 - How is your rowkey distribution? Are there tons of different
> > modelNumbers AND serialNumbers? Few modelNumbers and a lot of
> > serialNumbers? Few of both?
> > 2 - Putting modelNumber in front of your rowkey means that your data
> > will be sorted by rowkey. So, what is the rule that determines a
> > modelNumber creation? Is it a sequential number that will
> > increase over time? If so, are newer members accessed a lot more than
> > older members? If not, what will drive this number? Is it an encoding rule?
> > 3 - Do you expect more write/read load over a few of these
> > modelNumbers and/or serialNumbers? Will it be similar to a Pareto Distribution?
> > Distributed over what?
> >
> > Also, two other things got my attention here...
> > 1 - Why are you filtering with regex? If your queries are over model no. +
> > serial no., why don't you just scan starting at your
> > modelNumber+SerialNumber, and stopping at your next SerialNumber? Or
> > is there another access pattern that doesn't
> > apply to your composite rowkey?
> > 2 - Why do you have to add a timestamp to ensure uniqueness?
> >
> > Now, answering your question without more info about your data, you
> > can apply a hash in two ways:
> > 1 - Generating a hash (MD5 is the most common as far as I have read)
> > and using only this hash as your rowkey. Based on what you have told,
> > this way doesn't fit your needs, because you would not be able to apply
> > your filter anymore.
> > 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice
> > that the hash portion must be your rowkey prefix to ensure a kind of
> > balanced distribution over something (where something is your region
> > servers). I'm working with a case that is a bit similar to yours, and what
> > I'm doing right now is calculating the hashValue of my rowkey and
> > using a Java Formatter to create a hex string to prepend to my
> > rowkey. Something like String.format("%03x", hashValue)
> >
> > In both cases, you still have to split your regions in advance, and 
> > it will be better to work your splitting before starting to feed 
> > your table with production data.
> >
> > Also, you have to study the consequences that changing your rowkey 
> > will bring. It's not for free.
> >
> > There are a lot of words here and a lot of questions, so by now I feel
> > I started to shoot in the dark. Try to understand your production
> > data, and if you have more to share, for sure it will help!
> >
> > Regards,
> > Cristofer
> >
> > -----Original Message-----
> > From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
> > Sent: Monday, July 16, 2012 02:30
> > To: user@hbase.apache.org
> > Subject: Rowkey hashing to avoid hotspotting
> >
> > Hi,
> >
> > I am using HBase to store data about mechanical components. Each component
> > has a model no. and serial no. and some other attributes.
> >
> > I would be querying my data mostly by model no. and serial no. So I
> > created a composite key with these two attributes and added a
> > timestamp to make it unique.
> >
> > To filter the data, I use a rowkey filter with a regex string comparator
> > and it works well with sample seed data. Now I am afraid that
> > this setup will lead to region server hotspotting when we load
> > production data into HBase. I read that hashing may solve this problem. Can
> > someone help me implement hashing of the row key? Also, I
> > want the row filter to keep working, as I have to display the number of
> > components in a web page and I use the row key filter to implement that
> > functionality. Any guidance would be of great help.
> >
> > --
> > Regards,
> > Anand
> >
> --
> Regards,
> Anand

