hbase-user mailing list archives

From Em <mailformailingli...@yahoo.de>
Subject Re: HBase (BigTable) many to many with students and courses
Date Tue, 29 May 2012 20:25:53 GMT
Hi Ian,

answers between the lines:

Am 29.05.2012 21:26, schrieb Ian Varley:
>> However this means that Sheldon has to do at least two requests to fetch
>> his latest tweets.
>> First: Get the latest columns (aka tweets) of his row and second do a
>> multiget to fetch their content. Okay, that's more than one request but
>> you get what I mean. Am I right?
> Yup, unless you denormalize the tweet bodies as well--then you just read the current
> user's record and you have everything you need (with the downside of massive data
> duplication).
Well, I think this would be bad practice for editable stuff like tweets:
they can be deleted, updated, etc. Furthermore, at some point the data
duplication comes at a price - big data or not.
Do you agree?
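
For what it's worth, here is a rough sketch of that two-step read with the
plain Java client. The table names ("users" with one column per tweet in a
"stream" family, "tweets" keyed by tweet id) are just made up for
illustration:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable users  = new HTable(conf, "users");   // hypothetical index table
    HTable tweets = new HTable(conf, "tweets");  // hypothetical tweet table

    // Step 1: read the user's index row; each column qualifier is a tweet id.
    Result index = users.get(new Get(Bytes.toBytes("sheldon")));
    List<Get> gets = new ArrayList<Get>();
    for (KeyValue kv : index.raw()) {
      gets.add(new Get(kv.getQualifier()));
    }

    // Step 2: one multiget to fetch the tweet bodies.
    Result[] bodies = tweets.get(gets);
    System.out.println(bodies.length + " tweets fetched in two requests");

    users.close();
    tweets.close();
  }
}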

>> Although I got this little overhead here, does read perform at low
>> latency in a large scale environment? I think that an RDBMS has to do
>> something equal, doesn't it?
> Yes, but the relational DB has it all on one node, whereas in a distributed
> database, it's as many RPC calls as you have nodes, or something on the order
> of that (see Nicolas's explanation, which is better than mine).

Nicolas..? :)

> The key difference is that read access time for HBase is still constant when
> you have PB of data (whereas with a PB of data, the relational DB has long
> since fallen over). The underlying reason for that is that HBase basically
> took the top level of the seek operation (which is a fast memory access into
> the top node of a B+ tree, if you're talking about an RDBMS on one machine)
> and made it a cross machine lookup ("which server would have this data?").
> So it's fundamentally a lot slower than an RDBMS, but still constant w/r/t
> your overall data size. If you remember your algorithms class, constant
> factors fall out when N gets larger, and you only care about the big O. :)

Yes :) I am interested in the algorithms underlying HBase's lookups. I think
it's an interesting approach.

>> Maybe it makes sense to have a design of a key so that all the relevant
>> tweets for a user are placed at the same region?
> Again, that works in a denormalized sense, where you lead each row key with
> the current user (the tweet recipient) and copy the tweets there. If you are
> saying that you'd somehow find a magic formula so that people who follow each
> other happen to be on the same region server, you can forget it. Facebook and
> Twitter have tried partitioning schemes like that for their social graph
> data, and found there's no magic boundaries, even countries. (I know,
> citation needed, and I can't find it just now, but I'm sure I read that
> somewhere. :) But it makes sense; I can follow anybody, anybody can follow
> me, so on average my followers and followees would be randomly sprayed across
> all region servers. I either denormalize stuff (so my row contains everything
> I need) or I commit to doing GETs to all other region servers when I need
> that data. And if I really need it to be low latency, it's not a winning bet
> to have every read spray across a lot of servers like that. (That's just my
> opinion, perhaps I'm wrong about that.)

I think we misunderstood each other.
If we go down the road of data duplication, why not generate keys in such a
way that all of a user's tweet-stream tweets end up on the same region server?
That way a multiget only has to call one or two region servers for all of your
tweets.
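
Just to make that concrete, something like this is the kind of key I have in
mind (a sketch only; the exact layout and the reversed-timestamp trick are of
course up for debate):

import org.apache.hadoop.hbase.util.Bytes;

public class StreamKey {
  // Sketch: row key = <recipient user id> + <Long.MAX_VALUE - timestamp> + <tweet id>.
  // Since HBase keeps rows sorted by key, all of one user's stream rows are
  // contiguous, so a scan or multiget for his stream touches only one or two
  // regions, and the reversed timestamp makes the newest tweets sort first.
  // (In practice the user id would need a fixed width or a delimiter so that
  // one user's prefix cannot run into another's.)
  public static byte[] streamKey(String userId, long timestampMs, long tweetId) {
    return Bytes.add(
        Bytes.toBytes(userId),
        Bytes.toBytes(Long.MAX_VALUE - timestampMs),
        Bytes.toBytes(tweetId));
  }
}

Of course a busy user's stream can still split across two regions over time,
which is why I said "one or two" region servers above.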

>> Maybe one better does a tall-schema, since row-locking can impact
>> performance a lot for people who follow a lot of other persons but
>> that's not the topic here.
> Yes, very true. We also haven't broached the subject of how long writes would
> take if you denormalize, and whether they should be asynchronous (for
> everyone, or just for the Wil Wheatons). If you put all of my incoming tweets
> into a single row for me, that's potentially lock-wait city. Twitter has a
> follow limit, maybe that's why. :)

Oh, do they?
However, as far as I know they are using MySQL for their tweets.

So, to generalize our results:
There are two major approaches to designing data streams that originate from
n users and are consumed by m users (with m being equal to or larger than n).

One approach is to create an index of the interesting data per user, where
either each column holds the key of the information of interest (wide schema)
or each row, associated with the user by a key prefix, is a pointer to the
data (tall schema).
To access that data one has to design an HBase-side join.
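
In the tall-schema case the "index lookup" part of that join is just a prefix
scan over the user's key range, roughly like this (table name and key layout
made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TallSchemaIndex {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table: one row per pointer, keyed "userId#reversedTimestamp".
    HTable stream = new HTable(conf, "stream");

    // Scan all rows whose key starts with "sheldon#" ('$' sorts right after '#').
    Scan scan = new Scan(Bytes.toBytes("sheldon#"), Bytes.toBytes("sheldon$"));
    scan.setCaching(50); // fetch the pointers in batches of 50 per RPC
    ResultScanner scanner = stream.getScanner(scan);
    for (Result pointer : scanner) {
      // Each result carries the row key of the actual tweet; those keys are
      // what the second step of the join (the multiget) would fetch.
      System.out.println(Bytes.toString(pointer.getRow()));
    }
    scanner.close();
    stream.close();
  }
}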

The second approach is to make use of massive data duplication. That means
writing the same data to every user so that each user can access it
immediately without the need for multigets (given a wide-schema design). This
saves requests and latency at the cost of additional writes, throughput, and
network traffic.
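
The write side of that second approach would look roughly like this (just a
sketch; where the follower list comes from is left open):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FanOutWrite {
  // Sketch of the denormalized write path: the tweet body is copied into the
  // stream row of every follower, so a follower's read is a single Get, but a
  // single tweet turns into as many Puts as the author has followers.
  public static void fanOut(HTable users, byte[] tweetId, byte[] body,
                            List<String> followers) throws Exception {
    List<Put> puts = new ArrayList<Put>();
    for (String follower : followers) {
      Put put = new Put(Bytes.toBytes(follower));       // one row per follower
      put.add(Bytes.toBytes("stream"), tweetId, body);  // body duplicated here
      puts.add(put);
    }
    users.put(puts); // batched on the client side
  }
}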

Are there other approaches for designing data streams in HBase?

Do you agree with that?

HBase seems to be kind of over-engineered if you do not need that scale.

