hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alwin Roosen <drbrain...@gmail.com>
Subject new to HBase/NoSQL, need some help with database design
Date Fri, 16 Dec 2011 16:26:05 GMT

I have been suggested to use HBase for a project, but after reading
some manuals/guidelines, I am still not sure how to design the
database and getting more confused by the minute. I am new to any form
of NoSQL database and having a hard time figuring this one out.

I am hoping that someone can suggest a HBase design to me based on the
info below. It would also be nice to guide me to some of the HBase
classes/methods I need to use to get the results I need.

I basically have two tables, a category table and a keyword table. The
category table only contains a few hundred records, but the keyword
table could contain millions over time (hence the HBase suggestion).

The project is a bit more complex then this, but if I can get started
and understand the NoSQL concept for this example, I hope I am able to
figure out the rest by myself.

SQL approach:

id		name		parent
1		cat1		NULL
2		cat2		NULL
3		cat1-1		1
4		cat3		NULL
5		cat3-1		4
6		cat3-2		4

name	category	score
book	1			23
house	4			14
cup		5			75
shoe	2			3
phone	1			58
tablet	1			NULL

I need to be able to query HBase with the following example SQL scenarios:

1) Get the root categories

   SELECT * FROM `category` WHERE `parent` IS NULL

2) Get the child categories for a certain root category (one level)

   SELECT * FROM `category` WHERE `parent`=4

3) Get a list of root categories sorted by the total highest score
from their keywords

   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
`c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
`parent` IS NULL GROUP BY `c`.`id` ORDER BY `cat_score` DESC

4) Get a list of child categories sorted by the total highest score
from their keywords

   SELECT `c`.*, SUM(`k`.`score`) AS `cat_score` FROM `category` AS
`c` LEFT JOIN `keyword` AS `k` ON `k`.`category`=`c`.`id` WHERE
`parent`=4 GROUP BY `c`.`id` ORDER BY `cat_score` DESC

5) Get a list of keywords that do not have a score yet

   SELECT * FROM `keyword` WHERE `score` IS NULL

6) Get the total number of categories:

   SELECT COUNT(`id`) FROM `category`

7) Get the total number of root categories:

   SELECT COUNT(`id`) FROM `category` WHERE `parent` IS NULL

8) Get the total number of keywords:

   SELECT COUNT(`name`) FROM `keyword`

9) Get the total number of keywords without a score:

   SELECT COUNT(`name`) FROM `keyword` WHERE `score` IS NULL

HBase approach (what I have so far but is most certainly very wrong):

key = category name
 - info:id (id of the category)
 - relation:parent (id of the parent category)

I am able to parse a category-tree in java (using table.getScanner())
by just selecting all the rows (small table, no big deal) and creating
a tree.

Since the table is sorted by key, I get a nice category-tree
alphabetically sorted.

key = keyword name
 - info:name (name of the keyword)
 - info:score (score is available, otherwise not set)
 - relation:category (id of the category)

As I understand, you cannot sort on column values (info:score for
example), so how should I approach this? It would be possible to
process this in java, but I have a feeling this is not the correct
approach and the database design should be very different.

Any help would be much appreciated!

View raw message