hbase-user mailing list archives

From Cristofer Weber <cristofer.we...@neogrid.com>
Subject RE: Schema for sorted results
Date Tue, 24 Jul 2012 15:21:43 GMT
Hi Hari,

Using the date as the column qualifier is nice, but I ran into a drawback in a scenario where I
left the window open: I kept a large range of dates per RowKey, so each row kept growing and the
number of rows per region became lower and lower as the regions split.

You can manage this with a TTL if you don't need the data after some time, using HDFS to store
older data (or even a different table or a different RowKey pattern). You can also keep the date
as part of your RowKey as you showed us before; there's nothing wrong with that, now that you have
realized that the category fits better as the first component of your RowKey. Or you can create a
hybrid, with year+month in your RowKey and days as Column Qualifiers.
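
To make the hybrid option concrete, here is a minimal sketch in plain Java. It only builds the strings and makes no HBase calls. The class name HybridKey, the "|" separator, and the yyyyMM / dd formats are assumptions chosen for illustration, not something prescribed in this thread.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class HybridKey {
    // Assumed formats: fixed-width so that string order matches date order.
    static final DateTimeFormatter YEAR_MONTH = DateTimeFormatter.ofPattern("yyyyMM");
    static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("dd");

    // Row key: category first (spreads writes), then year+month.
    static String rowKey(String category, LocalDate date) {
        return category + "|" + date.format(YEAR_MONTH);
    }

    // Column qualifier: day of month, so one row holds at most 31 day columns.
    static String qualifier(LocalDate date) {
        return date.format(DAY);
    }

    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2012, 7, 24);
        System.out.println(rowKey("news", d) + " / " + qualifier(d));
    }
}
```

With this layout a row stops growing once its month ends, which bounds the row width, while a month of data for one category can still be read with a single Get.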

The way you query your data should be part of your design considerations.


-----Original Message-----
From: Hari Prasanna [mailto:hari@slideshare.com] 
Sent: Tuesday, July 24, 2012 11:50
To: user@hbase.apache.org
Subject: Re: Schema for sorted results

JM - I am searching for the top N urls by date+category, so this rowkey does work well for
my purpose.
Cristofer - I realize that having the raw date at the beginning of the rowkey makes all the
writes in a day rush to the same region server.
Maybe I could have the rowkey start with the category (which is more
distributed) and have the date in the column qualifier.
I just went through the slides. They were very enlightening. Thanks for that.

Thanks again!

On Tue, Jul 24, 2012 at 7:59 PM, Jean-Marc Spaggiari <jean-marc@spaggiari.org> wrote:
> Hi Hari,
> Why do you think it's wasteful?
> Let's imagine this situation.
> Key=<date>|<category>|<padded_visits>|<url> Value = nothing.
> And this one:
> Key=<url> Value = <date>|<category>|<padded_visits>
> Both situations will, in the end, take up almost the same space in the database.
> You can also do something like this:
> Key=<url> ColumnFamilyName=<date> Value=<category>|<padded_visits>
> It's just that the first option will allow you to retrieve the information 
> you are looking for very quickly.
> Now, are you sure that this key is really what you need? What will be 
> the access model for your database? With the key you are using, you 
> will have to search by date first. So if you want to find all the 
> entries for one URL, you will have to scan the entire table, jumping 
> to the next date each time you find it.
> If you are searching by date, then this key is good.
> So you really need to think first about the way you are going to read 
> your data, and then you will be able to design a key to match your 
> needs.
> JM
> 2012/7/24, Minh Duc Nguyen <mdnguyen@gmail.com>:
>> Hari,
>>    According to the HBase book: 
>> http://hbase.apache.org/book.html#dm.sort
>> All data model operations HBase return data in sorted order. First by 
>> row, then by ColumnFamily, followed by column qualifier, and finally 
>> timestamp (sorted in reverse, so newest records are returned first).
>>     ~ Minh
>> On Tue, Jul 24, 2012 at 9:50 AM, Hari Prasanna <hari@slideshare.com> wrote:
>>> Hello -
>>> I'm using HBase for web server log processing and I'm trying to save 
>>> the top N urls per category per day in a sorted manner in HBase. 
>>> From what I've read, the only sortable structure that HBase offers 
>>> is the lexicographic sort in the row keys. So, here is the rowkey 
>>> format I'm currently using <date>|<category>|<padded_visits>|<url>
>>> where,  padded_visits = Long.MAX_VALUE - visits
>>> This seems wasteful because of the long rowkeys. Is there any other 
>>> approach to maintain sorted results in HBase?
>>> Thanks
>>> Hari Prasanna
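
For reference, the rowkey format discussed in this thread can be sketched as follows. This is plain Java with no HBase calls. The class name TopUrlKeys, the sample values, and the 19-digit zero padding are assumptions added for illustration. The padding matters because Long.MAX_VALUE has 19 decimal digits, and lexicographic order only matches numeric order when every number has the same width. String sorting stands in for HBase's byte-wise rowkey ordering here, which is equivalent for ASCII keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TopUrlKeys {
    // Builds <date>|<category>|<padded_visits>|<url>,
    // where padded_visits = Long.MAX_VALUE - visits, zero-padded to 19 digits.
    static String rowKey(String date, String category, long visits, String url) {
        long inverted = Long.MAX_VALUE - visits; // more visits -> smaller number
        String padded = String.format("%019d", inverted);
        return date + "|" + category + "|" + padded + "|" + url;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(rowKey("20120724", "news", 15, "/a"));
        keys.add(rowKey("20120724", "news", 1200, "/b"));
        keys.add(rowKey("20120724", "news", 300, "/c"));
        // Ascending string order puts the most-visited URL first.
        Collections.sort(keys);
        for (String k : keys) {
            System.out.println(k);
        }
    }
}
```

Sorting the three sample keys lists /b (1200 visits) first, then /c (300), then /a (15), so a scan over the date+category prefix that stops after N rows returns the top N urls.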

