hbase-user mailing list archives

From Cristofer Weber <cristofer.we...@neogrid.com>
Subject RE: Schema for sorted results
Date Tue, 24 Jul 2012 14:19:14 GMT
Hello Hari!

Just for the sake of maintaining sorted results, that's it. You have to keep it in lexicographic
order. An alternative, for example, could be to keep date|category as the RowKey and store your
N URLs as columns of a Column Family, where padded_visits is the Column Qualifier and the
URL is the value. In the end, it will depend on how you need to access your log data.
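
The alternative layout above can be sketched in plain Java, without the HBase client. This is only
a model under stated assumptions: the class and method names are hypothetical, the date format and
category value are made up, and a TreeMap with an unsigned byte comparator stands in for the way
qualifiers are ordered inside a row. It assumes the qualifier is the 8-byte big-endian form of
Long.MAX_VALUE - visits.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Plain-Java model of one "wide" row; no HBase client calls are used.
// Requires Java 9+ for Arrays.compareUnsigned.
class WideRowSketch {

    // Row key: date|category, e.g. "20120724|sports" (example values are made up).
    static byte[] rowKey(String date, String category) {
        return (date + "|" + category).getBytes(StandardCharsets.UTF_8);
    }

    // Qualifier: 8 big-endian bytes of Long.MAX_VALUE - visits, so that
    // higher visit counts sort first under unsigned byte comparison.
    static byte[] qualifier(long visits) {
        return ByteBuffer.allocate(Long.BYTES).putLong(Long.MAX_VALUE - visits).array();
    }

    // Stand-in for the columns of a single row, kept in unsigned
    // lexicographic byte order (an assumption of this model).
    static TreeMap<byte[], String> newRow() {
        return new TreeMap<byte[], String>(Arrays::compareUnsigned);
    }

    // URLs in the order the model stores them: most visited first.
    static List<String> urlsInStoredOrder(TreeMap<byte[], String> row) {
        return new ArrayList<>(row.values());
    }

    public static void main(String[] args) {
        byte[] key = rowKey("20120724", "sports");
        TreeMap<byte[], String> row = newRow();
        row.put(qualifier(42L), "/mid");
        row.put(qualifier(900L), "/high");
        row.put(qualifier(5L), "/low");
        System.out.println(new String(key, StandardCharsets.UTF_8)
                + " -> " + urlsInStoredOrder(row));
    }
}
```

In this model two URLs with the same visit count would land on the same qualifier and overwrite
each other, so a real schema would need some tie-breaker.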

Wasteful is relative... if you have to keep all those fields, storing them as part of your RowKey,
Column Qualifier or value will have the same 'physical' result, which is that all these values
will be repeated for every row. I don't know if my last sentence is clear, but Lars George made
a good diagram to explain this. It's in his HBase book, and also in this presentation:
http://www.slideshare.net/larsgeorge/hbase-advanced-schema-design-berlin-buzzwords-june-2012
(check slides 14 and 15).

If storage is a hard constraint, you can try to work with reduced data... one or two bytes can
represent a good number of distinct categories, and if you know a theoretical limit for the total
number of visits you can probably work with a type smaller than a Long.
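
As a rough illustration of the reduced encoding, the sketch below packs a category id into 2 bytes
and the inverted visit count into 4 bytes. Everything here is an assumption for the example: the
class and method names are hypothetical, categories are assumed to be mapped to numeric ids
somewhere else, and visits are assumed to stay below Integer.MAX_VALUE.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical compact encoding: 2-byte category id + 4-byte inverted visits.
// Requires Java 9+ for Arrays.compareUnsigned.
class CompactFieldsSketch {

    static byte[] encode(int categoryId, int visits) {
        // Two bytes cover ids 0..65535.
        if (categoryId < 0 || categoryId > 0xFFFF) {
            throw new IllegalArgumentException("category id does not fit in 2 bytes");
        }
        if (visits < 0) {
            throw new IllegalArgumentException("visits must be non-negative");
        }
        return ByteBuffer.allocate(Short.BYTES + Integer.BYTES)
                .putShort((short) categoryId)          // keeps the low 16 bits, big-endian
                .putInt(Integer.MAX_VALUE - visits)    // more visits -> smaller value -> sorts first
                .array();
    }

    public static void main(String[] args) {
        byte[] busy = encode(7, 1000);
        byte[] quiet = encode(7, 10);
        System.out.println("encoded length: " + busy.length + " bytes");
        // A negative result means 'busy' sorts before 'quiet' in unsigned byte order.
        System.out.println("busy before quiet: " + (Arrays.compareUnsigned(busy, quiet) < 0));
    }
}
```

The two fields take 6 bytes here, compared with 8 bytes for the Long alone plus however many bytes
the category name needs.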

Also, are you aware of the effect of having a raw date as the start of your RowKey? 

Regards,
Cristofer

-----Original Message-----
From: Hari Prasanna [mailto:hari@slideshare.com]
Sent: Tuesday, July 24, 2012 10:51
To: user@hbase.apache.org
Subject: Schema for sorted results

Hello -

I'm using HBase for web server log processing and I'm trying to save the top N URLs per category
per day in a sorted manner in HBase. From what I've read, the only sortable structure that
HBase offers is the lexicographic sort of the row keys. So, here is the rowkey format I'm
currently using: <date>|<category>|<padded_visits>|<url>
where padded_visits = Long.MAX_VALUE - visits
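
A minimal sketch of why this key format yields "most visited first" under a lexicographic sort. It
assumes padded_visits is written as a zero-padded 19-digit decimal string (the message does not say
how it is encoded), uses made-up date and category values, and uses a TreeSet of strings as a
stand-in for sorted row keys; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Model of the <date>|<category>|<padded_visits>|<url> row key as a string.
class SortedRowKeySketch {

    // Long.MAX_VALUE has 19 decimal digits, so padding to 19 digits makes
    // string order agree with numeric order for non-negative visit counts.
    static String paddedVisits(long visits) {
        return String.format("%019d", Long.MAX_VALUE - visits);
    }

    static String rowKey(String date, String category, long visits, String url) {
        return date + "|" + category + "|" + paddedVisits(visits) + "|" + url;
    }

    // Returns the keys in ascending string order, like a scan over sorted row keys.
    static List<String> sorted(List<String> keys) {
        return new ArrayList<>(new TreeSet<>(keys));
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add(rowKey("20120724", "sports", 10L, "/c"));
        keys.add(rowKey("20120724", "sports", 2500L, "/a"));
        keys.add(rowKey("20120724", "sports", 300L, "/b"));
        for (String key : sorted(keys)) {
            System.out.println(key);
        }
    }
}
```

Because larger visit counts give smaller padded values, the URL with the most visits comes first
within each date|category prefix.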

This seems wasteful because of the long rowkeys. Is there any other approach to maintain sorted
results in HBase?

Thanks
Hari Prasanna
