hbase-user mailing list archives

From Hari Prasanna <h...@slideshare.com>
Subject Re: Schema for sorted results
Date Tue, 24 Jul 2012 14:50:08 GMT
JM - I am searching for top N urls in date+category, so this rowkey
does work well for my purpose.
Cristofer - I realize that having the raw date at the beginning of the
rowkey makes all the writes in a day rush to the same region server.
Maybe I could have the rowkey start with the category (which is more
evenly distributed) and put the date in the column qualifier.
I just went through the slides. They were very enlightening. Thanks for that.

Thanks again!
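
For anyone reading this in the archive, here is a minimal sketch of the
"top N urls in date+category" read against the rowkey format quoted below.
It does not use the HBase client API: a TreeMap stands in for HBase's sorted
row keys, and the class name, method names, and sample data are all
hypothetical. It also assumes ASCII keys, so that Java's String ordering
matches HBase's byte-wise ordering.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap stands in for an HBase table sorted by row key.
class TopUrlsSketch {
    // Builds <date>|<category>|<padded_visits>|<url>,
    // with padded_visits = Long.MAX_VALUE - visits, zero-padded to 19 digits.
    static String rowKey(String date, String category, long visits, String url) {
        long padded = Long.MAX_VALUE - visits;
        return String.format(Locale.ROOT, "%s|%s|%019d|%s", date, category, padded, url);
    }

    // Walks the keys starting at <date>|<category>| and stops after n rows
    // or at the first key outside that prefix (a simulated prefix scan).
    static List<String> topN(TreeMap<String, String> table, String date, String category, int n) {
        String prefix = date + "|" + category + "|";
        List<String> urls = new ArrayList<>();
        for (String key : table.tailMap(prefix, true).keySet()) {
            if (!key.startsWith(prefix) || urls.size() == n) {
                break;
            }
            // Skip the prefix, the 19-digit padded count, and the separator.
            urls.add(key.substring(prefix.length() + 20));
        }
        return urls;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("20120724", "news", 5, "a.com"), "");
        table.put(rowKey("20120724", "news", 120, "b.com"), "");
        table.put(rowKey("20120724", "news", 42, "c.com"), "");
        table.put(rowKey("20120724", "sports", 999, "d.com"), "");
        System.out.println(topN(table, "20120724", "news", 2)); // prints [b.com, c.com]
    }
}
```

Because the most visited URL has the smallest padded value, the first N rows
under a date+category prefix are already the top N, so the read can stop early
instead of sorting on the client.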

On Tue, Jul 24, 2012 at 7:59 PM, Jean-Marc Spaggiari
<jean-marc@spaggiari.org> wrote:
> Hi Hari,
> Why do you think it's wasteful?
> Let's imagine this situation.
> Key=<date>|<category>|<padded_visits>|<url> Value = nothing.
> And this one:
> Key=<url> Value = <date>|<category>|<padded_visits>
> Both layouts will, in the end, take up almost the same space in the database.
> You can also do something like this:
> Key=<url> ColumnFamilyName=<date> Value=<category>|<padded_visits>
> The difference is that the first option will let you retrieve the
> information you are looking for very quickly.
> Now, are you sure that this key is really what you need? What will be
> the access model for your database? With the key you are using, you
> will have to search by date first. So if you want to find all the
> entries for one URL, you will have to scan the entire table, jumping
> to the next date each time you find it.
> If you are searching by date, then this key is good.
> So you really need to think first about how you are going to read
> your data, and then you will be able to design a key that matches
> your needs.
> JM
> 2012/7/24, Minh Duc Nguyen <mdnguyen@gmail.com>:
>> Hari,
>>    According to the HBase book: http://hbase.apache.org/book.html#dm.sort
>> All data model operations in HBase return data in sorted order. First by row,
>> then by ColumnFamily, followed by column qualifier, and finally timestamp
>> (sorted in reverse, so newest records are returned first).
>>     ~ Minh
>> On Tue, Jul 24, 2012 at 9:50 AM, Hari Prasanna <hari@slideshare.com> wrote:
>>> Hello -
>>> I'm using HBase for web server log processing and I'm trying to save
>>> the top N urls per category per day in a sorted manner in HBase. From
>>> what I've read, the only sortable structure that HBase offers is the
>>> lexicographic sort in the row keys. So, here is the rowkey format I'm
>>> currently using
>>> <date>|<category>|<padded_visits>|<url>
>>> where,  padded_visits = Long.MAX_VALUE - visits
>>> This seems wasteful because of the long rowkeys. Is there any other
>>> approach to maintain sorted results in HBase?
>>> Thanks
>>> Hari Prasanna
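
The padded_visits trick in the original question can be sketched as follows.
This is only an illustration under stated assumptions: the class and method
names are hypothetical, the value is rendered as a 19-digit zero-padded
decimal string (the width of Long.MAX_VALUE), and visit counts are assumed
to be non-negative.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical sketch of the padded_visits component of the rowkey.
class PaddedVisitsSketch {
    // Long.MAX_VALUE - visits inverts the order, so larger counts sort first.
    // Zero-padding to 19 digits keeps string order equal to numeric order.
    static String paddedVisits(long visits) {
        if (visits < 0) {
            throw new IllegalArgumentException("visits must be non-negative");
        }
        return String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - visits);
    }

    public static void main(String[] args) {
        long[] visits = {7, 1500, 42};
        String[] keys = new String[visits.length];
        for (int i = 0; i < visits.length; i++) {
            // <date>|<category>|<padded_visits>|<url>, with made-up sample values
            keys[i] = "20120724|news|" + paddedVisits(visits[i]) + "|url" + visits[i];
        }
        Arrays.sort(keys); // lexicographic, like HBase row keys (ASCII assumed)
        for (String key : keys) {
            System.out.println(key);
        }
    }
}
```

After sorting, the key for 1500 visits comes first, then 42, then 7. Without
the fixed width, a plain decimal string would sort "9" after "10", which is
why the padding matters as much as the subtraction.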

