hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Timestamp as a key good practice?
Date Thu, 14 Jun 2012 18:47:03 GMT
Hi Michael,

For now this is more a proof of concept than a production application.
But if it works, it should grow a lot, and the database will easily
end up with over 1B rows. Each individual server will have to send its
own information to one centralized server, which will insert it into a
database. That's why it needs to be very fast and that's why I'm
looking in HBase's direction. I tried with some relational databases
with 4M rows in the table, but the insert time is too slow when I have
to introduce entries in bulk. Also, HBase's ability to keep only the
cells that have values will save a lot of disk space (future
projects).

I'm not yet used to HBase and there are still many things I need to
understand, but until I'm able to build a solution and test it, I will
continue to read, learn and try that way. Then at the end I will be
able to compare the two options I have (HBase or relational) and
decide based on the results.

So yes, your reply helped because it gives me a way to achieve this
goal (using co-processors). I don't know yet how this part works, so I
will dig into the documentation for it.
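
To make sure I understand the idea, here is a rough sketch of the
index maintenance such a co-processor (or, to start with, plain client
code) would have to perform on each update. The table and column
family names ("files", "files_by_update", "f") are only examples, and
it uses the 0.94-era client API:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.client.Delete;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class IndexMaintenance {
    // Move a file's entry in the index table from its old last_update
    // row to the new one, then update the main table.
    public static void touchFile(Configuration conf, byte[] fileId,
        long oldLastUpdate, long newLastUpdate) throws IOException {
      HTable files = new HTable(conf, "files");
      HTable index = new HTable(conf, "files_by_update");
      try {
        // Remove the file's cell from the old timestamp row.
        Delete old = new Delete(Bytes.toBytes(oldLastUpdate));
        old.deleteColumns(Bytes.toBytes("f"), fileId);
        index.delete(old);

        // Add the file's cell under the new timestamp row.
        Put idx = new Put(Bytes.toBytes(newLastUpdate));
        idx.add(Bytes.toBytes("f"), fileId, Bytes.toBytes(""));
        index.put(idx);

        // Update last_update in the main table.
        Put main = new Put(fileId);
        main.add(Bytes.toBytes("f"), Bytes.toBytes("last_update"),
            Bytes.toBytes(newLastUpdate));
        files.put(main);
      } finally {
        files.close();
        index.close();
      }
    }
  }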

Thanks,

JM

2012/6/14, Michael Segel <michael_segel@hotmail.com>:
> Jean-Marc,
>
> You do realize that this really isn't a good use case for HBase, assuming
> that what you are describing is a stand-alone system.
> It would be easier and better if you just used a simple relational database.
>
> Then you would have your table with an ID, and a secondary index on the
> timestamp.
> Retrieve the data in ascending order by timestamp and take the top 500 off
> the list.
>
> If you insist on using HBase, yes you will have to have a secondary table.
> Then using co-processors...
> When you update the row in your base table, you
> then get() the row in your index by timestamp, removing the column for that
> rowid.
> Add the new column to the timestamp row.
>
> As you put it.
>
> Now you can just do a partial scan on your index. Because your index table
> is so small... you shouldn't worry about hotspots.
> You may just want to rebuild your index every so often...
>
> HTH
>
> -Mike
>
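
(For reference, a rough sketch of what the partial scan Mike mentions
could look like from the client side, with the same example table and
family names as above; it walks the index from the oldest last_update
and stops after 500 file IDs:)

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class OldestFiles {
    // Return up to 'limit' file IDs whose last_update is older than
    // 'olderThan' (index rows are keyed by last_update).
    public static List<byte[]> oldest(Configuration conf,
        long olderThan, int limit) throws IOException {
      List<byte[]> fileIds = new ArrayList<byte[]>();
      HTable index = new HTable(conf, "files_by_update");
      try {
        Scan scan = new Scan();
        scan.setStopRow(Bytes.toBytes(olderThan));
        ResultScanner scanner = index.getScanner(scan);
        try {
          for (Result row : scanner) {
            for (KeyValue kv : row.raw()) {   // one column per file ID
              fileIds.add(kv.getQualifier());
              if (fileIds.size() >= limit) {
                return fileIds;               // e.g. limit = 500
              }
            }
          }
        } finally {
          scanner.close();
        }
      } finally {
        index.close();
      }
      return fileIds;
    }
  }
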
> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>
>> Hi Michael,
>>
>> Thanks for your feedback. Here are more details to describe what I'm
>> trying to achieve.
>>
>> My goal is to store information about files in the database. I need
>> to check the oldest files in the database to refresh the information.
>>
>> The key is an 8-byte ID of the server on the network hosting the
>> file + the MD5 of the file path. The total is a 24-byte key.
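
(As an illustration only, building such a 24-byte key could look like
this; "serverId" is assumed to already be the 8-byte identifier:)

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import org.apache.hadoop.hbase.util.Bytes;

  public class FileKey {
    // 8-byte server ID + 16-byte MD5 of the file path = 24-byte row key.
    public static byte[] rowKey(byte[] serverId, String filePath)
        throws NoSuchAlgorithmException {
      byte[] pathMd5 = MessageDigest.getInstance("MD5")
          .digest(Bytes.toBytes(filePath));
      return Bytes.add(serverId, pathMd5);
    }
  }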
>>
>> So each time I look at a file and gather the information, I update its
>> row in the database based on that key, including a "last_update" field.
>> I can compute this key for any file on the drives.
>>
>> In order to know which files I need to check on the network, I need to
>> scan the table by the "last_update" field. So the idea is to build another
>> table which contains last_update as the key and the file IDs in
>> columns. (Here is where the hotspotting comes in.)
>>
>> Each time I work on a file, I will have to update the main table by ID,
>> remove the cell from the second table (the index), and put it back
>> under the new "last_update" key.
>>
>> I'm mainly doing 3 operations in the database.
>> 1) I retrieve a list of 500 files which need to be updated
>> 2) I update the information for those 500 files (bulk update; see the
>> sketch below)
>> 3) I load new file references to be checked.
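
(For (2), a minimal sketch of the bulk update using the client's
batched put, with the same example table and family names as before:)

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class BulkUpdate {
    // Update last_update for a batch of file IDs in one client call
    // instead of 500 individual puts.
    public static void bulkTouch(Configuration conf,
        List<byte[]> fileIds, long newLastUpdate) throws IOException {
      HTable files = new HTable(conf, "files");
      try {
        List<Put> puts = new ArrayList<Put>(fileIds.size());
        for (byte[] fileId : fileIds) {
          Put put = new Put(fileId);
          put.add(Bytes.toBytes("f"), Bytes.toBytes("last_update"),
              Bytes.toBytes(newLastUpdate));
          puts.add(put);
        }
        files.put(puts);
      } finally {
        files.close();
      }
    }
  }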
>>
>> For 2 and 3, I use the main table with the file ID as the key. The
>> distribution is almost perfect because I'm using a hash. The prefix is
>> the server ID, but writes don't always go to the same server since they
>> are driven by last_update. Still, this allows quick access to the list of
>> files from one server.
>> For 1, I expected to build this second table with
>> "last_update" as the key.
>>
>> Regarding the frequency, it really depends on the activity on the
>> network, but it should be "often". The faster the database updates
>> are, the more up to date I will be able to keep it.
>>
>> JM
>>
>> 2012/6/14, Michael Segel <michael_segel@hotmail.com>:
>>> Actually I think you should revisit your key design....
>>>
>>> Look at your access path to the data for each of the types of queries
>>> you
>>> are going to run.
>>> From your post:
>>> "I have a table with a uniq key, a file path and a "last update" field.
>>>>>> I can easily find back the file with the ID and find when it has
been
>>>>>> updated.
>>>>>>
>>>>>> But what I need too is to find the files not updated for more than
a
>>>>>> certain period of time.
>>> "
>>> So your primary query is going to be against the key.
>>> Not sure if you meant to say that your key was a composite key or not...
>>> sounds like your key is just the unique key and the rest are columns in
>>> the
>>> table.
>>>
>>> The secondary query or path to the data is to find data where the files
>>> were
>>> not updated for more than a period of time.
>>>
>>> If you make your key temporal, that is adding time as a component of
>>> your
>>> key, you will end up creating new rows of data while the old row still
>>> exists.
>>> Not a good side effect.
>>>
>>> The other nasty side effect of using time as your key is that you not
>>> only have the potential for hot spotting, but you also end up creating
>>> splits that will never grow.
>>>
>>> How often are you going to ask to see the files where they were not
>>> updated in the last couple of days/minutes? If it's infrequent, then you
>>> really shouldn't care if you have to do a complete table scan.
>>>
>>>
>>>
>>>
>>> On Jun 14, 2012, at 5:39 AM, Jean-Marc Spaggiari wrote:
>>>
>>>> Wow! This is exactly what I was looking for. So I will read all of that
>>>> now.
>>>>
>>>> Need to read here at the bottom: https://github.com/sematext/HBaseWD
>>>> and here:
>>>> http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>>
>>>> 2012/6/14, Otis Gospodnetic <otis_gospodnetic@yahoo.com>:
>>>>> JM, have a look at https://github.com/sematext/HBaseWD (this comes up
>>>>> often.... Doug, maybe you could add it to the Ref Guide?)
>>>>>
>>>>> Otis
>>>>> ----
>>>>> Performance Monitoring for Solr / ElasticSearch / HBase -
>>>>> http://sematext.com/spm
>>>>>
>>>>>
>>>>>
>>>>>> ________________________________
>>>>>> From: Jean-Marc Spaggiari <jean-marc@spaggiari.org>
>>>>>> To: user@hbase.apache.org
>>>>>> Sent: Wednesday, June 13, 2012 12:16 PM
>>>>>> Subject: Timestamp as a key good practice?
>>>>>>
>>>>>> I watched Lars George's video about HBase and read the documentation,
>>>>>> and they say that it's not a good idea to have the timestamp as a
>>>>>> key because that will always load the same region until the timestamp
>>>>>> reaches a certain value and moves to the next region (hotspotting).
>>>>>>
>>>>>> I have a table with a unique key, a file path and a "last update"
>>>>>> field.
>>>>>> I can easily find the file back with the ID and find when it has been
>>>>>> updated.
>>>>>>
>>>>>> But what I also need is to find the files not updated for more than a
>>>>>> certain period of time.
>>>>>>
>>>>>> If I want to retrieve that from this single table, I will have to do
>>>>>> a full scan of the table, which might take a while.
>>>>>>
>>>>>> So I thought of building a table to reference that (a kind of secondary
>>>>>> index). The key is the "last update", there is one column family, and
>>>>>> each column qualifier is the ID of a file, with a dummy content.
>>>>>>
>>>>>> When a file is updated, I remove its cell from this table, and
>>>>>> introduce a new cell with the new timestamp as the key.
>>>>>>
>>>>>> And so on.
>>>>>>
>>>>>> With this schema, I can find the files by ID very quickly and I can
>>>>>> find the files which need to be updated pretty quickly too. But it's
>>>>>> hotspotting one region.
>>>>>>
>>>>>> From the video (0:45:10) I can see 4 situations.
>>>>>> 1) Hotspotting.
>>>>>> 2) Salting.
>>>>>> 3) Key field swap/promotion
>>>>>> 4) Randomization.
>>>>>>
>>>>>> I need to avoid hotspotting, so I looked at the 3 other options.
>>>>>>
>>>>>> I can do salting, like prefixing the timestamp with a number between 0
>>>>>> and 9. That will distribute the load over 10 servers. To find all
>>>>>> the files with a timestamp below a specific value, I will need to run
>>>>>> 10 requests instead of one. But when the load becomes too big for
>>>>>> 10 servers, will I have to prefix with a number between 0 and 99? Which
>>>>>> means 100 requests? And the more regions I have, the more requests
>>>>>> I will have to do. Is that really a good approach?
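
(For illustration only, a rough sketch of that salting idea: the salt
byte is derived from the file ID, so the old index cell can still be
located and deleted later, and reading everything older than a given
timestamp means issuing one scan per salt bucket and merging the
results. Table and family names are again just examples:)

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.KeyValue;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class SaltedIndex {
    private static final int BUCKETS = 10;

    // Index row key: 1 salt byte (0-9, derived from the file ID)
    // followed by the 8-byte last_update timestamp.
    public static byte[] saltedKey(byte[] fileId, long lastUpdate) {
      byte salt = (byte) ((Bytes.hashCode(fileId) & 0x7fffffff) % BUCKETS);
      return Bytes.add(new byte[] { salt }, Bytes.toBytes(lastUpdate));
    }

    // One scan per bucket, results merged client-side.
    public static List<byte[]> olderThan(Configuration conf,
        long olderThan) throws IOException {
      List<byte[]> fileIds = new ArrayList<byte[]>();
      HTable index = new HTable(conf, "files_by_update");
      try {
        for (int bucket = 0; bucket < BUCKETS; bucket++) {
          Scan scan = new Scan();
          scan.setStartRow(new byte[] { (byte) bucket });
          scan.setStopRow(Bytes.add(new byte[] { (byte) bucket },
              Bytes.toBytes(olderThan)));
          ResultScanner scanner = index.getScanner(scan);
          try {
            for (Result row : scanner) {
              for (KeyValue kv : row.raw()) {
                fileIds.add(kv.getQualifier());
              }
            }
          } finally {
            scanner.close();
          }
        }
      } finally {
        index.close();
      }
      return fileIds;
    }
  }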
>>>>>>
>>>>>> Key field swap is close to salting. I can add the first few bytes
>>>>>> from
>>>>>> the path before the timestamp, but the issue will remain the same.
>>>>>>
>>>>>> I looked at randomization, and I can't do that; otherwise I would have
>>>>>> no way to retrieve the information I'm looking for.
>>>>>>
>>>>>> So the question is: is there a good way to store the data so they can
>>>>>> be retrieved based on the date?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>
>
>
