From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Timestamp as a key good practice?
Date Thu, 14 Jun 2012 10:39:43 GMT
Wow! This is exactly what I was looking for. So I will read all of that now.

Need to read here at the bottom: https://github.com/sematext/HBaseWD
and here: http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/



2012/6/14, Otis Gospodnetic <otis_gospodnetic@yahoo.com>:
> JM, have a look at https://github.com/sematext/HBaseWD (this comes up
> often.... Doug, maybe you could add it to the Ref Guide?)
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / HBase -
> http://sematext.com/spm
>> From: Jean-Marc Spaggiari <jean-marc@spaggiari.org>
>>To: user@hbase.apache.org
>>Sent: Wednesday, June 13, 2012 12:16 PM
>>Subject: Timestamp as a key good practice?
>>I watched Lars George's video about HBase and read the documentation
>>and it's saying that it's not a good idea to have the timestamp as a
>>key because that will always load the same region until the timestamp
>>reaches a certain value and moves to the next region (hotspotting).
>>I have a table with a unique key, a file path and a "last update" field.
>>I can easily look up the file by its ID and find when it was last updated.
>>But what I need too is to find the files not updated for more than a
>>certain period of time.
>>If I want to retrieve that from this single table, I will have to do a
>>full parsing of the table. Which might take a while.
>>So I thought of building a table to reference that (kind of secondary
>>index). The key is the "last update", with one CF, and each column will have
>>the ID of the file with a dummy content.
>>When a file is updated, I remove its cell from this table, and
>>introduce a new cell with the new timestamp as the key.
>>And so on.
>>With this schema, I can find the files by ID very quickly and I can
>>find the files which need to be updated pretty quickly too. But it's
>>hotspotting one region.
>>From the video (0:45:10) I can see 4 situations.
>>1) Hotspotting.
>>2) Salting.
>>3) Key field swap/promotion
>>4) Randomization.
>>I need to avoid hotspotting, so I looked at the 3 other options.
>>I can do salting, for example by prefixing the timestamp with a number
>>between 0 and 9. That will distribute the load over 10 servers. To find all
>>the files with a timestamp below a specific value, I will need to run
>>10 requests instead of one. But when the load becomes too big for
>>10 servers, I will have to prefix with a byte between 0 and 99? Which
>>means 100 requests? And the more regions I have, the more requests
>>I will have to do. Is that really a good approach?
>>Key field swap is close to salting. I can add the first few bytes from
>>the path before the timestamp, but the issue will remain the same.
>>I looked at randomization, and I can't do that. Otherwise I will have no
>>way to retrieve the information I'm looking for.
>>So the question is: is there a good way to store the data to retrieve
>>them based on the date?
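
For reference, the salting option described in the quoted message can be sketched as below. This is only an illustration under stated assumptions, not HBase client code and not how HBaseWD is implemented: the key layout (`<salt>-<zero-padded timestamp>-<file id>`), the helper names `salted_key`, `scan_ranges` and `find_older_than`, and the use of CRC32 on the file ID to pick the bucket are all hypothetical. A sorted Python list stands in for the lexicographically ordered index table.

```python
import bisect
import zlib

NUM_BUCKETS = 10  # assumption: the "number between 0 and 9" from the message


def _salt_width(buckets: int) -> int:
    # Zero-pad the salt so keys still sort correctly with 100 buckets or more.
    return len(str(buckets - 1))


def salted_key(timestamp: int, file_id: str, buckets: int = NUM_BUCKETS) -> str:
    """Build a hypothetical row key '<salt>-<zero-padded timestamp>-<file id>'."""
    # Assumption: the salt comes from the file ID, so consecutive timestamps
    # are spread over all buckets instead of hitting a single region.
    salt = zlib.crc32(file_id.encode("utf-8")) % buckets
    # 13 digits is enough for non-negative millisecond epoch timestamps.
    return f"{salt:0{_salt_width(buckets)}d}-{timestamp:013d}-{file_id}"


def scan_ranges(cutoff: int, buckets: int = NUM_BUCKETS):
    """Return one (start, stop) key range per bucket for timestamps < cutoff."""
    width = _salt_width(buckets)
    return [(f"{b:0{width}d}-", f"{b:0{width}d}-{cutoff:013d}")
            for b in range(buckets)]


def find_older_than(sorted_keys, cutoff: int, buckets: int = NUM_BUCKETS):
    """Fan out one range lookup per bucket and collect the matching file IDs."""
    found = []
    for start, stop in scan_ranges(cutoff, buckets):
        lo = bisect.bisect_left(sorted_keys, start)
        hi = bisect.bisect_left(sorted_keys, stop)  # stop key is exclusive
        found.extend(key.split("-", 2)[2] for key in sorted_keys[lo:hi])
    return found
```

The sketch makes the cost raised in the message visible: `scan_ranges` always returns as many ranges as there are buckets, so going from 10 to 100 buckets does mean 100 range scans per query. Those scans are independent of each other, so they can in principle be issued in parallel and merged on the client side.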

