hbase-user mailing list archives

From Wilm Schumacher <wilm.schumac...@cawoom.com>
Subject Re: Newbie Question about 37TB binary storage on HBase
Date Fri, 28 Nov 2014 00:20:05 GMT
On 28.11.2014 at 00:32, Aleks Laz wrote:
> What's the plan for the "MOB-extension"?
https://issues.apache.org/jira/browse/HBASE-11339

> From a development point of view I can build HBase with the
> "MOB-extension", but from a sysadmin point of view a 'package'
> (jar, zip, deb, rpm, ...) is much easier to maintain.
That's true :/

> We need to do some "accesslog" analysis, like piwik or awffull.
I see. Well, this is of course possible, too.

> Maybe elasticsearch is a better tool for that?
I used Elasticsearch for full-text search. Works veeery well :D. Loved
it. But I never used it as a primary database, and I don't see an
advantage to using ES here.

> As far as I understand, the hadoop client sees a 'Filesystem' with 37 TB
> or 120 TB, but from the server point of view, how should I plan the
> storage/server setup for the datanodes?
Now I get your question. If you have a replication factor of 3 (so every
piece of data is held three times by the cluster), then the aggregated
storage has to be at least 3 times the 120 TB (+ buffer + operating
system etc.). So you could use 360 nodes with 1 TB each, or 3 nodes with
120 TB each.
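
As a rough back-of-the-envelope (the ~25% headroom here is just an
assumption, not a rule):

120 TB payload × 3 replicas            = 360 TB raw
360 TB × 1.25 (buffer, OS, logs, ...)  ≈ 450 TB aggregated disk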

> What happens when a datanode has 20 TB but the whole hadoop/HBase 2-node
> cluster has 40?
Well, if it is in a cluster with enough 20 TB nodes, nothing. HBase
distributes the data over the nodes.

> ?! Why "40 million rows", do you mean the file tables?
> In the DB there is only some data like user accounts, an id for a
> directory and so on.
If you use HBase as primary storage, every file would be a row. Think of
a "blob" in an RDBMS. 40 million files => 40 million rows.

Assume you create an access log for the 40 million files, assume every
file is accessed 100 times, and every access is a row in another "access
log" table => 4 billion rows ;).

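To make the "one file = one row" idea concrete, here is a minimal java
sketch. The table name 'files', the column family 'f' and the row key
layout are all made up for illustration:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "files");        // hypothetical table
        byte[] blob = Files.readAllBytes(Paths.get(args[0]));
        Put put = new Put(Bytes.toBytes("file-000123")); // row key = file id
        put.add(Bytes.toBytes("f"), Bytes.toBytes("data"), blob); // the "blob"
        table.put(put);
        table.close();
    }
}

The row key is the interesting design decision: with the file id as key a
lookup is a single Get; for the access log table you would append a
timestamp to the key to get one row per access.
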
> Currently, yes, php is the main language.
> I don't know a good solution for php similar to hadoop, does anyone
> else know one?
Well, the basic stuff could be done via thrift/rest with a native php
binding. It depends on what you are trying to do. If it's just CRUD and
some scanning and filtering, thrift/rest should be enough. But as you
said ... who knows what the future brings. If you want to do the fancy
stuff, you should use java and deliver the data to your php application.
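
For the "scanning and filtering" part, the java side could look roughly
like this (a sketch only; the 'accesslog' table and the row key prefix
are assumptions matching the hypothetical layout above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanAccessLog {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "accesslog");    // hypothetical table
        Scan scan = new Scan();
        // all accesses of one file, assuming keys are prefixed with the file id
        scan.setFilter(new PrefixFilter(Bytes.toBytes("file-000123")));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        scanner.close();
        table.close();
    }
}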

Just for completeness: there is HiveQL, too. This is a kind of "SQL for
hadoop". There is a hive client for php (as it is delivered by thrift):
https://cwiki.apache.org/confluence/display/Hive/HiveClient
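
From java, the usual way is the HiveServer2 JDBC driver. A minimal sketch
(host, port and the 'accesslog' table are assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCount {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // aggregate the hypothetical access log per file
        ResultSet rs = stmt.executeQuery(
                "SELECT file_id, COUNT(*) FROM accesslog GROUP BY file_id");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}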

Another fitting option for your access log could be Cassandra. Cassandra
is good at write performance, which is why it is often used for logging.
Cassandra has an "SQL-like" language called CQL. From php it works almost
like a normal RDBMS: prepared statements and all that stuff.
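
A sketch with the DataStax java driver (the php driver looks much the
same); the keyspace, table and columns are made up for illustration:

import java.util.Date;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class LogAccess {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("logs");       // hypothetical keyspace
        // prepare once, execute per access, like with a normal RDBMS
        PreparedStatement ps = session.prepare(
                "INSERT INTO accesslog (file_id, ts, client) VALUES (?, ?, ?)");
        session.execute(ps.bind("file-000123", new Date(), "10.0.0.7"));
        cluster.close();
    }
}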

But I think this is being done the wrong way around. You should select a
technology first and then choose the language/interfaces etc. If you
choose HBase, then java is a good choice on that side; you already use
nginx, so php is a good choice there; and the only remaining task is to
deliver data from A to B and back.

Best wishes,

Wilm
