hbase-user mailing list archives

From jthie...@ina.fr
Subject Table design question
Date Wed, 18 Feb 2009 17:38:24 GMT
Hi,

I set up a cluster of 4 machines running HBase.
I'm working on a web archiving application that needs random access to records, with requests
of the form:

Record record = getClosestRecord(url, requestedDate);
This method should find the record for the specified url whose date is nearest to requestedDate.
The requested dates are very unlikely to match an insertion date exactly.

Each record is made of 10 columns, and each insert is of the form:

insertRecord(url, date, record);

There are several possible designs for my record table:

1. RowKey = url, and all columns of a record are labelled with the same date.
2. RowKey = url, and we use HBase's timestamp and version support; column names are the
columnFamily names (no label).
3. RowKey = url + date, and column names are the columnFamily names (no label).

For now I use design 1. To answer getClosestRecord correctly, I have to load an entire
columnFamily for the specified row, find the closest date among its labels, and then load the
other columns labelled with that date.
I chose this design because I thought I could use HTable.getClosestRowBefore(url,
columnFamily:requestedDate) to minimize column loads, but in fact I need both the closest row
before and the closest row after to determine which one is nearer in date, so I don't use
getClosestRowBefore.
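
To make the before/after comparison concrete, here is a minimal sketch in plain Java. It makes no HBase calls: a TreeMap keyed by date stands in for the dates loaded from one row, and the class and method names (ClosestDateSketch, closestDate) are hypothetical, chosen only for illustration. Ties are resolved in favour of the earlier date, which is an arbitrary assumption.

```java
import java.util.TreeMap;

// Hypothetical helper, not part of HBase: picks the stored date nearest to a requested date.
class ClosestDateSketch {

    /** Returns the key in dates nearest to requested, or null if the map is empty. */
    static Long closestDate(TreeMap<Long, ?> dates, long requested) {
        Long before = dates.floorKey(requested);   // latest stored date <= requested
        Long after = dates.ceilingKey(requested);  // earliest stored date >= requested
        if (before == null) return after;
        if (after == null) return before;
        // Ties go to the earlier date (an arbitrary choice for this sketch).
        return (requested - before) <= (after - requested) ? before : after;
    }

    public static void main(String[] args) {
        // Dates are plain long values here (for example epoch milliseconds).
        TreeMap<Long, String> versions = new TreeMap<>();
        versions.put(1000L, "capture A");
        versions.put(2000L, "capture B");
        versions.put(5000L, "capture C");
        Long nearest = closestDate(versions, 2600L);
        System.out.println(nearest + " -> " + versions.get(nearest));
    }
}
```

Both neighbours are needed because the nearest date can lie on either side of the requested one; a "closest before" lookup alone would miss a capture taken shortly after requestedDate.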

Design 2 seems to be a good alternative: I would get the same functionality with
the same process, but the date would be stored once per row insert (as the timestamp) instead of
once per column.

Design 3 implies only one insert per row key, but dramatically increases the number of
rows.
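
As an illustration of how a url + date row key could be laid out, here is a sketch in plain Java. Again it uses no HBase API: a TreeMap with String keys stands in for a table sorted by row key, and the names CompositeKeySketch, rowKey and closestKey are hypothetical. It assumes dates are non-negative epoch milliseconds and that the separator character never occurs in a URL; the zero padding is what makes string order agree with date order within one URL.

```java
import java.util.Locale;
import java.util.TreeMap;

// Hypothetical sketch: a TreeMap of String keys stands in for a table sorted by row key.
class CompositeKeySketch {

    // Assumed separator; it must not occur in the URLs being stored.
    static final char SEP = '\u0000';

    /** Builds url + SEP + zero-padded epoch millis so that string order matches date order. */
    static String rowKey(String url, long epochMillis) {
        return url + SEP + String.format(Locale.ROOT, "%019d", epochMillis);
    }

    /** Returns the key for url nearest in time to requested, or null if url has no rows. */
    static String closestKey(TreeMap<String, ?> table, String url, long requested) {
        String prefix = url + SEP;
        String probe = rowKey(url, requested);
        String before = table.floorKey(probe);   // closest key at or before the probe
        String after = table.ceilingKey(probe);  // closest key at or after the probe
        // A neighbour may belong to a different URL, so check the prefix.
        if (before != null && !before.startsWith(prefix)) before = null;
        if (after != null && !after.startsWith(prefix)) after = null;
        if (before == null) return after;
        if (after == null) return before;
        long b = Long.parseLong(before.substring(prefix.length()));
        long a = Long.parseLong(after.substring(prefix.length()));
        // Ties go to the earlier capture (an arbitrary choice for this sketch).
        return (requested - b) <= (a - requested) ? before : after;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("http://example.org/", 1000L), "capture A");
        table.put(rowKey("http://example.org/", 5000L), "capture B");
        table.put(rowKey("http://example.org/page", 2000L), "other URL");
        String key = closestKey(table, "http://example.org/", 2600L);
        System.out.println(table.get(key));
    }
}
```

With this layout, all captures of one URL are adjacent in key order, so the nearest capture is always one of the two neighbours of the probe key.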

Which design gives the best random access time?

Jérôme Thièvre
