hbase-user mailing list archives

From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject Re: Schema for sorted results
Date Tue, 24 Jul 2012 14:29:27 GMT
Hi Hari,

Why do you think it's wasteful?

Let's imagine this situation.
Key=<date>|<category>|<padded_visits>|<url> Value = nothing.

And this one:
Key=<url> Value = <date>|<category>|<padded_visits>

Both layouts will, in the end, take up almost the same space in the database.

You can also do something like this:
Key=<url> ColumnFamilyName=<date> Value=<category>|<padded_visits>

It's just that the first option will allow you to retrieve the information
you are looking for very quickly.
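
To make the first option concrete, here is a minimal sketch in plain Java (no HBase client involved). It assumes the date is written as yyyyMMdd and that all fields are ASCII, so Java's String ordering matches HBase's byte-wise row ordering. The class RowKeySketch, the method names and the sample values are made up for illustration; a TreeSet stands in for the sorted table.

```java
import java.util.TreeSet;

public class RowKeySketch {
    // Builds <date>|<category>|<padded_visits>|<url>.
    // padded_visits = Long.MAX_VALUE - visits, zero-padded to 19 digits
    // (the width of Long.MAX_VALUE) so that string order equals numeric order.
    static String rowKey(String date, String category, long visits, String url) {
        String paddedVisits = String.format("%019d", Long.MAX_VALUE - visits);
        return date + "|" + category + "|" + paddedVisits + "|" + url;
    }

    // Returns the last field of the key; the limit of 4 keeps any '|'
    // inside the URL itself intact.
    static String urlOf(String rowKey) {
        return rowKey.split("\\|", 4)[3];
    }

    public static void main(String[] args) {
        // A TreeSet keeps its elements sorted, like rows in an HBase table.
        TreeSet<String> table = new TreeSet<>();
        table.add(rowKey("20120724", "news", 5, "/a"));
        table.add(rowKey("20120724", "news", 120, "/b"));
        table.add(rowKey("20120724", "news", 42, "/c"));

        // Within one date and category, the most visited URL comes first.
        for (String key : table) {
            System.out.println(key);
        }
    }
}
```

With these sample values the keys come out in the order /b, /c, /a, that is, by descending visit count, so reading the first N rows for a date and category gives the top N URLs.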

Now, are you sure that this key is really what you need? What will be
the access pattern for your database? With the key you are using, you
will have to search by date first. So if you want to find all the
entries for one URL, you will have to scan the entire table, jumping
to the next date each time you find it.

If you are searching by date, then this key is good.
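
Here is a rough sketch of the difference between the two reads, again in plain Java with a sorted set standing in for the table. It assumes the same yyyyMMdd date format and ASCII keys; AccessPatternSketch, its methods and the sample rows are hypothetical. The lookup by URL is written as a plain full pass over the keys, which is simpler than the date-by-date jumping described above but touches the same range.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class AccessPatternSketch {
    // Same key layout as in the thread: <date>|<category>|<padded_visits>|<url>.
    static String rowKey(String date, String category, long visits, String url) {
        return date + "|" + category + "|"
                + String.format("%019d", Long.MAX_VALUE - visits) + "|" + url;
    }

    // Exclusive upper bound for all keys starting with the prefix: the prefix
    // with its last character incremented. Assumes a non-empty prefix whose
    // last character is not Character.MAX_VALUE.
    static String stopKeyFor(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Search by date: the matching keys form one contiguous range.
    static List<String> keysForDate(NavigableSet<String> table, String date) {
        String start = date + "|";
        return new ArrayList<>(table.subSet(start, true, stopKeyFor(start), false));
    }

    // Search by URL: the URL is the last field, so every key has to be examined.
    static List<String> keysForUrl(NavigableSet<String> table, String url) {
        List<String> hits = new ArrayList<>();
        for (String key : table) {
            if (key.split("\\|", 4)[3].equals(url)) {
                hits.add(key);
            }
        }
        return hits;
    }

    // Small made-up data set covering two days.
    static NavigableSet<String> sampleTable() {
        NavigableSet<String> table = new TreeSet<>();
        table.add(rowKey("20120723", "news", 10, "/a"));
        table.add(rowKey("20120724", "news", 7, "/a"));
        table.add(rowKey("20120724", "sport", 3, "/b"));
        return table;
    }

    public static void main(String[] args) {
        NavigableSet<String> table = sampleTable();
        System.out.println(keysForDate(table, "20120724")); // range read
        System.out.println(keysForUrl(table, "/a"));        // full pass
    }
}
```

The read by date only touches the rows whose key starts with that date, while the read by URL has to walk over every row in the table.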

So you really need to think first about the way you are going to read
your data, and then, you will be able to design a key to match your


2012/7/24, Minh Duc Nguyen <mdnguyen@gmail.com>:
> Hari,
>    According to the HBase book: http://hbase.apache.org/book.html#dm.sort
> All data model operations HBase return data in sorted order. First by row,
> then by ColumnFamily, followed by column qualifier, and finally timestamp
> (sorted in reverse, so newest records are returned first).
>     ~ Minh
> On Tue, Jul 24, 2012 at 9:50 AM, Hari Prasanna <hari@slideshare.com> wrote:
>> Hello -
>> I'm using HBase for web server log processing and I'm trying to save
>> the top N urls per category per day in a sorted manner in HBase. From
>> what I've read, the only sortable structure that HBase offers is the
>> lexicographic sort in the row keys. So, here is the rowkey format I'm
>> currently using
>> <date>|<category>|<padded_visits>|<url>
>> where,  padded_visits = Long.MAX_VALUE - visits
>> This seems wasteful because of the long rowkeys. Is there any other
>> approach to maintain sorted results in HBase?
>> Thanks
>> Hari Prasanna
