hbase-user mailing list archives

From Michel Segel <michael_se...@hotmail.com>
Subject Re: Rowkey hashing to avoid hotspotting
Date Tue, 17 Jul 2012 16:44:45 GMT
Read hotspotting?
Hmm, there's a cache, and I don't see any real use case where it would occur naturally.

Sent from a remote device. Please excuse any typos...

Mike Segel

On Jul 17, 2012, at 10:53 AM, Alex Baranau <alex.baranov.v@gmail.com> wrote:

> The most common cause of RS hotspotting when writing data to HBase is
> writing rows with monotonically increasing/decreasing row keys. E.g. if you
> put a timestamp in the first part of your key, then you are likely to have
> monotonically increasing row keys. You can find more info about this issue
> and how to solve it here: [1]. You may also want to look at an already
> implemented salting solution [2].
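To make the write-hotspot point concrete, here is a minimal sketch in plain Java (no HBase dependency). The four region start keys, the class name `HotspotDemo`, and the helpers `regionFor` and `salted` are all made up for illustration; a real table's region boundaries will differ.

```java
import java.util.TreeMap;

public class HotspotDemo {
    // Hypothetical table with four regions, identified by their start keys.
    // The empty string is the start key of the first region.
    static final TreeMap<String, String> REGIONS = new TreeMap<>();
    static {
        REGIONS.put("", "region-0");
        REGIONS.put("4", "region-1");
        REGIONS.put("8", "region-2");
        REGIONS.put("c", "region-3");
    }

    // A row belongs to the region with the greatest start key <= row key.
    static String regionFor(String rowKey) {
        return REGIONS.floorEntry(rowKey).getValue();
    }

    // Prefix the key with one hex digit derived from its hash (16 buckets).
    static String salted(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), 16);
        return Integer.toHexString(bucket) + "-" + rowKey;
    }

    public static void main(String[] args) {
        long base = 1342543485000L; // an arbitrary epoch-millis timestamp
        for (int i = 0; i < 8; i++) {
            String key = Long.toString(base + i);
            System.out.println(key + " -> " + regionFor(key)
                + " | " + salted(key) + " -> " + regionFor(salted(key)));
        }
    }
}
```

In this sketch the eight consecutive timestamp keys all resolve to the same region, while their salted forms land in more than one region.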
> As for RS hotspotting during reads, it is hard to predict without
> knowing what the most common data access patterns are. E.g. putting the
> model # in the first part of a key may seem to give a good distribution,
> but if your web site is used mostly by Mercedes owners, the majority of
> the read load may be directed to just a few regions. Again, salting can
> help a lot here.
> +1 to what Cristofer said on other things, esp.: use partial key scans
> where possible instead of filters, and pre-split your table.
> Alex Baranau
> ------
> Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch -
> Solr
> [1] http://bit.ly/HnKjbc
> [2] https://github.com/sematext/HBaseWD
> On Tue, Jul 17, 2012 at 10:44 AM, AnandaVelMurugan Chandra Mohan <
> ananthu2050@gmail.com> wrote:
>> Hi Cristofer,
>> Thanks for the elaborate response!
>> I don't have much information about the production data, as I work with
>> partial data. But based on discussions with my project partners, I have
>> some answers for you.
>> The number of model numbers and serial numbers will be finite. Not so many...
>> As far as I know, there is no predefined rule for model number or serial
>> number creation.
>> I have two access patterns. First, I count the number of rows for a
>> specific model number; I use a rowkey filter for this. Second, I filter
>> the rows based on model, serial number and some other columns; I scan the
>> table with a column value filter for this case.
>> I will evaluate salting as you have explained.
>> Regards,
>> Anand.C
>> On Tue, Jul 17, 2012 at 12:30 AM, Cristofer Weber <
>> cristofer.weber@neogrid.com> wrote:
>>> Hi Anand,
>>> As usual, the answer is that 'it depends'  :)
>>> I think that the main question here is: why are you afraid that this setup
>>> would lead to region server hotspotting? Is it because you don't know what
>>> your production data will look like?
>>> Based on what you said about your rowkey, you will query mostly by
>>> providing model no. + serial no., but:
>>> 1 - How is your rowkey distributed? Are there tons of different
>>> modelNumbers AND serialNumbers? Few modelNumbers and a lot of
>>> serialNumbers? Few of both?
>>> 2 - Putting modelNumber at the front of your rowkey means that your data
>>> will be sorted by it, since rows are sorted by rowkey. So, what is the rule
>>> that determines how a modelNumber is created? Is it a sequential number
>>> that increases over time? If so, are newer members accessed a lot more
>>> than older members? If not, what will drive this number? Is it an
>>> encoding rule?
>>> 3 - Do you expect more write/read load over a few of these modelNumbers
>>> and/or serialNumbers? Will it be similar to a Pareto distribution?
>>> Distributed over what?
>>> Also, two other things got my attention here...
>>> 1 - Why are you filtering with regex? If your queries are over model no. +
>>> serial no., why don't you just scan starting at your
>>> modelNumber+SerialNumber and stopping at the next
>>> modelNumber+SerialNumber? Or is there another access pattern that doesn't
>>> fit your composite rowkey?
>>> 2 - Why do you have to add a timestamp to ensure uniqueness?
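The start/stop scan suggested in point 1 can be sketched as follows. This is only an illustration in plain Java: the key layout `model|serial|timestamp` with a `|` separator and the names `PrefixRange`, `startRow` and `stopRow` are assumptions, not something stated in the thread. The two byte arrays are what one would hand to an HBase scan as its start row (inclusive) and stop row (exclusive).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrefixRange {
    // Start row of a prefix scan: the prefix itself (inclusive).
    static byte[] startRow(String prefix) {
        return prefix.getBytes(StandardCharsets.UTF_8);
    }

    // Stop row (exclusive): the smallest key that sorts after every key
    // beginning with the prefix. Trailing 0xFF bytes are dropped, then the
    // last remaining byte is incremented.
    static byte[] stopRow(String prefix) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        int end = p.length;
        while (end > 0 && p[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0]; // empty stop row: no upper bound
        }
        byte[] stop = Arrays.copyOf(p, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        String prefix = "M100|S42|"; // hypothetical model no. + serial no.
        System.out.println(new String(startRow(prefix), StandardCharsets.UTF_8));
        System.out.println(new String(stopRow(prefix), StandardCharsets.UTF_8)); // M100|S42}
    }
}
```

Counting the rows for one model would use the shorter prefix `M100|` in the same way, so only the rows in that range are read instead of filtering the whole table.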
>>> Now, answering your question without more info about your data, you can
>>> apply hashing in two ways:
>>> 1 - Generating a hash (MD5 is the most common, as far as I have read) and
>>> using only this hash as your rowkey. Based on what you have told us, this
>>> way doesn't fit your needs, because you would not be able to apply your
>>> filter anymore.
>>> 2 - Salting, by prefixing your current rowkey with a pinch of hash. Notice
>>> that the hash portion must be your rowkey prefix to ensure a roughly
>>> balanced distribution over something (where something is your region
>>> servers). I'm working with a case that is a bit similar to yours, and what
>>> I'm doing right now is calculating the hashValue of my rowkey and using a
>>> Java Formatter to create a hex string to prepend to my rowkey. Something
>>> like String.format("%03x", hashValue).
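A rough sketch of both options in plain Java, for comparison. The thread does not say how hashValue is computed, so taking the first 12 bits of an MD5 digest is an assumption made here to keep the salt at exactly three hex digits (`%03x` only sets a minimum width, so a full 32-bit hash could print up to eight digits). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedKeys {
    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Option 1: the row key is only the hash (32 hex characters).
    // The natural key can no longer be matched by a prefix or regex.
    static String hashOnlyKey(String naturalKey) {
        StringBuilder sb = new StringBuilder();
        for (byte b : md5(naturalKey)) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Option 2: salting. Three hex digits (4096 buckets) are prepended and
    // the natural key stays readable after the salt.
    static String saltedKey(String naturalKey) {
        byte[] d = md5(naturalKey);
        // Assumption: hashValue = first 12 bits of the digest.
        int hashValue = ((d[0] & 0xFF) << 4) | ((d[1] & 0xFF) >> 4);
        return String.format("%03x", hashValue) + naturalKey;
    }

    public static void main(String[] args) {
        String key = "M100|S42|1342543485000"; // hypothetical composite key
        System.out.println(hashOnlyKey(key));
        System.out.println(saltedKey(key));
    }
}
```

With option 2 the original model and serial numbers are still present in the key, which is why it is the one compatible with row filters.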
>>> In both cases, you still have to split your regions in advance, and it
>>> will be better to work out your splitting before you start feeding your
>>> table with production data.
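For pre-splitting a table whose keys start with a three-hex-digit salt, evenly spaced split keys can be computed as below. This is a sketch under that assumption; `SplitPoints` and `splitKeys` are hypothetical names. The HBase admin API accepts an array of split keys at table creation, so the strings would be converted to bytes before use.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPoints {
    // Evenly spaced split keys over the salt space 000..fff.
    // A table with n regions needs n - 1 split keys.
    static List<String> splitKeys(int regions) {
        if (regions < 1 || regions > 4096) {
            throw new IllegalArgumentException("regions must be in 1..4096");
        }
        List<String> keys = new ArrayList<>();
        for (int i = 1; i < regions; i++) {
            keys.add(String.format("%03x", i * 4096 / regions));
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(splitKeys(4)); // [400, 800, c00]
    }
}
```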
>>> Also, you have to study the consequences that changing your rowkey will
>>> bring. It's not for free.
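One such consequence, sketched under the assumption that the salt is derived from a hash of the whole rowkey: a query by model number no longer maps to one contiguous key range, because matching rows are spread over every salt bucket. A reader then needs one range per bucket. The names below are hypothetical, and the stop-key computation assumes a non-empty ASCII prefix.

```java
import java.util.ArrayList;
import java.util.List;

public class SaltedScanRanges {
    // One [start, stop) key range per salt bucket for a natural-key prefix.
    static List<String[]> ranges(String prefix, int buckets) {
        List<String[]> out = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            String start = String.format("%03x", b) + prefix;
            // Stop key: the start key with its last character bumped by one.
            int last = start.length() - 1;
            String stop = start.substring(0, last) + (char) (start.charAt(last) + 1);
            out.add(new String[] {start, stop});
        }
        return out;
    }

    public static void main(String[] args) {
        // With 3 hex digits of salt there are 4096 buckets to visit.
        for (String[] r : ranges("M100|", 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

Fewer buckets mean fewer ranges per query but a coarser spread of writes, which is part of the trade-off to study before changing the key.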
>>> There are a lot of words and a lot of questions here, so by now I feel I
>>> have started to shoot in the dark. Try to understand your production data,
>>> and if you have more to share, it will surely help!
>>> Regards,
>>> Cristofer
>>> -----Original Message-----
>>> From: AnandaVelMurugan Chandra Mohan [mailto:ananthu2050@gmail.com]
>>> Sent: Monday, July 16, 2012 02:30
>>> To: user@hbase.apache.org
>>> Subject: Rowkey hashing to avoid hotspotting
>>> Hi,
>>> I am using HBase to store data about mechanical components. Each component
>>> has a model no., a serial no. and some other attributes.
>>> I will be querying my data mostly by model no. and serial no., so I
>>> created a composite key from these two attributes and added a timestamp to
>>> make it unique.
>>> To filter the data, I use a rowkey filter with a regex string comparator,
>>> and it works well with sample seed data. Now I am afraid that this setup
>>> will lead to region server hotspotting when we load production data into
>>> HBase. I read that hashing may solve this problem. Can someone help me
>>> implement hashing of the row key? I would also want the row filter to keep
>>> working, as I have to display the number of components on a web page and I
>>> use the row key filter to implement that functionality. Any guidance would
>>> be of great help.
>>> --
>>> Regards,
>>> Anand
>> --
>> Regards,
>> Anand
