hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Marc Spaggiari <jean-m...@spaggiari.org>
Subject HBase first steps: Design a table
Date Tue, 12 Jun 2012 21:42:46 GMT

I have read all the documentation here
http://hbase.apache.org/book/book.html and I now have few questions.

I currently have a mysql table with millions of lines (4 for now, but
it's growing by 4 millions a month). It's running on a fast computer,
but it's still way to slow when it's time to insert new data. The
bigger the table is, the slower the inserts are, and soon, the
application will not be working anymore.

Here is what the table looks like:
| Field        | Type         | Null | Key | Default | Extra |
| IDLow        | bigint(20)   | NO   | PRI | NULL    |       |
| IDHigh       | bigint(20)   | NO   | PRI | NULL    |       |
| IDAux        | bigint(20)   | NO   | PRI | NULL    |       |
| Value        | varchar(512) | NO   |     | NULL    |       |
| lastUpdate   | bigint(20)   | NO   |     |       0 |       |
| crc          | bigint(20)   | YES  |     | NULL    |       |
| language     | int(11)      | YES  |     | NULL    |       |
| status       | int(11)      | YES  |     | NULL    |       |
| size         | int(11)      | YES  |     | NULL    |       |

Now, I'm trying to design an HBase table to do the same thing.
IDLow, IDHigh and IDAux are 64 bits numbers. So I will just convert
them to a 24 bytes array. That's fine. They are uniq, and they are the
main key for the inserts.

lastUpdate, I don't think I will need it anymore since it seems to be
already handled by HBase (timestamp)

Value, crc, language, status and size are parameters I will have to be
able to retrieve.

Based on what I read on the documentation, I need to keep as few
Column Familly as possible. And the Columns names as short as
possible. So let's try to have just one, named "a".

That will give me something like:
create 'mytable', 'a'

Then for the cells, I can insert that way:
put 'mytable', 'row1', 'a:a', 'ID'
Where value1 is my 24 bytes ID.
put 'mytable', 'row1', 'a:b', 'VALUE'
Which is my "value" parameter from my mysql table.
put 'mytable', 'row1', 'a:c', 'CRC'
Which will be my CRC

And so on.

Now, I have few questions.

1) I was not able to find any reference of the primary key or primary
index or similar in the HBase documentation. Is it automatically done
of the first cell ("a" in the example)? Or on the row name ("row1" in
the example)?
2) I will need to be able to parse the rows filtering on "status"
field. I searched for the way to add a secondary index but I was not
able to find it. Any place I can look for that?
3) Since all my entries will have unique id, can I simply use that as
the row title instead of "row1"? What's pros what's cons?
4) If I don't have an index on a field will I still be able to filter
on that field? Like select * from mytable where status = 0... With 50
millions lines, it might take some time... But is that still doable
without reaching memory limitation or something like that?
5) I would like to try my architecture on small computers before I go
for bigger. What's the minimum memory I should have on each
RegionServer (Or DataNode) to start the application and load few
hundred thousands lines?

I just order "HBase: The Definitive Guide" from amazon. I hope it will
help me, but in the meantime, if you have some responses for me, it
will be welcome.

Thanks for your help.


View raw message