hbase-user mailing list archives

From Ian Varley <ivar...@salesforce.com>
Subject Re: HBase (BigTable) many to many with students and courses
Date Tue, 29 May 2012 17:49:01 GMT
A few more responses:

On May 29, 2012, at 10:54 AM, Em wrote:

> In fact, everything you model with a key-value storage like HBase,
> Cassandra etc. can be modeled as an RDBMS schema.
> Since a lot of people, like me, are coming from that side, we must
> re-learn several basic things.
> It starts with understanding that you model a K-V storage the way you
> want to access the data, not as the data relates to each other (in
> general terms), and ends with translating the connections of the data
> into a K-V schema as well as possible.

Yes, that's a good way of putting it. I did a talk at HBaseCon this year that deals with some
of these questions. The video isn't up yet, but the slides are here:

http://www.slideshare.net/cloudera/5-h-base-schemahbasecon2012

>> 3. You could also let a higher client layer worry about this. For
>> example, your data layer query just returns a student with a list of
>> their course IDs, and then another process in your client code looks
>> up each course by ID to get the name. You can then put an external
>> caching layer (like memcached) in the middle and make things a lot
>> faster (though that does put the burden on you to have the code path
>> for changing course info also flush the relevant cache entries). 
> Hm, in what way does this give me an advantage over using HBase -
> assuming that the number of courses is small enough to fit in RAM?
> I know that Memcached is optimized for this purpose and might have much
> faster response times - no doubt.
> However, from a conceptual point of view: why does Memcached handle the
> K-V distribution more efficiently than HBase with warmed caches?
> Hopefully this question isn't that hard :).

The only architectural advantage I can think of is that reads in HBase still have to check
the memstore and all relevant file blocks. So even if the data is warm in the HBase cache, it's
not quite as lightweight. That said, I guess I was also sort of assuming that the table you're
looking up into would be the smaller one. The details of which is better here would depend
on many things; YMMV. I'd personally go with #2 as a default and then optimize based on real
workloads, rather than over-engineering it.
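
For what it's worth, a minimal sketch of the cache-aside pattern from option #3 might look
like the following. All the names here (CourseStore, getCourseName, etc.) are made up, and in
real life you'd swap the in-process map for an actual memcached client; this is just the
read-through / invalidate shape of it:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical data-layer interface; imagine an HBase-backed
    // implementation behind it.
    interface CourseStore {
      String fetchCourseName(String courseId);
    }

    public class CourseNameCache {
      // Stand-in for a real memcached client.
      private final Map<String, String> cache = new ConcurrentHashMap<String, String>();
      private final CourseStore store;

      public CourseNameCache(CourseStore store) {
        this.store = store;
      }

      public String getCourseName(String courseId) {
        String name = cache.get(courseId);
        if (name == null) {                  // cache miss: fall through to HBase
          name = store.fetchCourseName(courseId);
          cache.put(courseId, name);
        }
        return name;
      }

      // This is the burden I mentioned: the code path that changes a
      // course has to remember to flush the relevant cache entry.
      public void onCourseChanged(String courseId) {
        cache.remove(courseId);
      }
    }

The important bit is onCourseChanged: forget to call it, and you serve stale course names
until the entry ages out.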

> Without ever doing it, you never get a real feeling of when to use the
> right tool.
> Using a good tool for the wrong problem can be an interesting
> experience, since you learn some of the do's and don'ts of the software
> you use.

Very good points. Definitely not trying to discourage learning! :) I just always feel compelled
to make that caveat early on, while people are making technology evaluations. You can learn
to crochet with a rail gun, but I wouldn't recommend it, unless what you really want to learn
is about rail guns, not crocheting. :) Sounds like you really want to learn about rail guns.


> Since I am a reader of the MEAP edition of HBase in Action, I am aware
> of the TwitBase example application presented in that book.
> I am very interested in seeing the author present a solution for
> efficiently accessing the tweets of the persons I follow.
> This is an n:m relation.
> You've got n users with m tweets, and each user sees his own tweets as
> well as the tweets of followed persons in descending order by timestamp.
> This must be done with a join in an RDBMS (and maybe in HBase also),
> since I cannot think of another scalable way of doing it.

Don't forget about denormalization! You can put copies of the tweets, or at least copies of the
unique ids of the tweets, into each follower's stream. Yes, that means when Wil Wheaton tweets
for the 1000th time about Comic-Con, you get a million copies of the tweet (or ID). But you're
trading time & space at write time for extremely fast reads. Whether this
makes sense depends on a zillion other factors; there's no hard & fast rule.
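
As a rough sketch, assuming the classic Java client and made-up table / column names, that
write-time fan-out might look something like:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StreamFanout {
      private static final byte[] FAM = Bytes.toBytes("s");

      // On each new tweet, copy its id into every follower's stream row.
      // Row key = followerId + (Long.MAX_VALUE - timestamp), so a plain
      // forward scan reads each stream newest-first.
      public static void fanOut(HTable streams, List<String> followerIds,
                                long tweetTimestamp, byte[] tweetId)
          throws IOException {
        List<Put> puts = new ArrayList<Put>(followerIds.size());
        byte[] reverseTs = Bytes.toBytes(Long.MAX_VALUE - tweetTimestamp);
        for (String follower : followerIds) {
          Put put = new Put(Bytes.add(Bytes.toBytes(follower), reverseTs));
          put.add(FAM, Bytes.toBytes("tweetId"), tweetId);
          puts.add(put);
        }
        streams.put(puts);  // one batched call; the million copies are the price
      }
    }

The reverse timestamp in the key is what buys you "descending order by timestamp" with
nothing but a forward scan.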

> However, if you do this with a join, it means that a person with 40,000
> followers needs a batch request consisting of 40,000 GET objects. That's
> huge, and I bet it is anything but fast or scalable. It sounds
> broken by design when designing for Big Data.

I guess you could say it's "broken by design" (though I'd argue for "unavailable by design"
;). Full join use cases (like you'd do for, say, analytics) don't work well with big data
if you take a naive approach.

But at least for the twitter app, that's not actually what you're doing. Instead, you've typically
got a pattern like *pagination*: you'd never request 40K objects; at most you'd request
20 or 40 objects (a page's worth), along some dimension like time. You're optimizing for getting
those really fast, with the knowledge that the stream itself is so big you'd never offer
the user a way to act on it in any kind of bulk way. If you want to do processing like that,
you do it asynchronously (e.g. via map/reduce).
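
Concretely (same made-up schema as the fan-out sketch above), a page fetch is just a bounded
scan:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class StreamPager {
      private static final byte[] FAM = Bytes.toBytes("s");

      // One page of a user's stream. With followerId + reverse-timestamp
      // row keys, rows come back newest-first; pass a startRow just past
      // the last row of the previous page to get the next one. (In real
      // code you'd also set a stop row at the next user's prefix so you
      // don't walk into a neighbor's stream.)
      public static Result[] page(HTable streams, byte[] startRow, int pageSize)
          throws IOException {
        Scan scan = new Scan(startRow);
        scan.addFamily(FAM);
        scan.setCaching(pageSize);        // fetch the whole page in one RPC
        ResultScanner scanner = streams.getScanner(scan);
        try {
          return scanner.next(pageSize);  // never more than a page's worth
        } finally {
          scanner.close();
        }
      }
    }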

That's actually a really interesting way to see the difference between relational DBs and
big data non-relational ones: relational databases promise you easy whole-set operations,
and big data nosql databases don't, because they assume the "whole set" will be too big for
that.

Let's say your product manager for this twitter-like thing said "You know what would be awesome?
A widget that shows the average length of all tweets in the system, in real time!". And if
this product manager were really slick, they might even say "It's easy, look, I even wrote
the SQL for it: 'SELECT avg(length(body)) FROM tweets'. I'm a genius."

In a SQL database, this query is going to do a full table scan to get that average, every
time you ask for it. It would in HBase, too; the difference is just that HBase makes you
be more explicit about it: there's no simple declarative "SELECT avg ..." syntax, so you have to whip
out your iterators and see that you're scanning a billion rows to get your average.
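
Just to make that explicitness visible, here's roughly what the product manager's one-liner
turns into (made-up table and column names again):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AvgTweetLength {
      // The HBase version of "SELECT avg(length(body)) FROM tweets":
      // a full scan, made painfully explicit.
      public static double avgBodyLength(HTable tweets) throws IOException {
        byte[] fam = Bytes.toBytes("t");
        byte[] qual = Bytes.toBytes("body");
        Scan scan = new Scan();
        scan.addColumn(fam, qual);
        scan.setCaching(1000);            // fewer RPCs, same total work
        long totalLength = 0, count = 0;
        ResultScanner scanner = tweets.getScanner(scan);
        try {
          for (Result r : scanner) {      // here's your billion-row iterator
            byte[] body = r.getValue(fam, qual);
            if (body != null) {
              totalLength += body.length;
              count++;
            }
          }
        } finally {
          scanner.close();
        }
        return count == 0 ? 0.0 : (double) totalLength / count;
      }
    }

Same work as the SQL version; HBase just refuses to let you pretend it's cheap.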

Would it be nice for HBase to ALSO offer declarative and simple ways to do things like joins,
averages, etc? Maybe; but as soon as you dip a toe into this, you kind of have to jump in
with both feet. Would you add indexes to make joins faster? Would those indexes be written
transactionally with the records in the base tables, even across region server boundaries?
Would you build a query optimizer to know how to execute the joins? Would you allow really
complex N-way joins? The list goes on. (I'm not suggesting you're asking for any of this,
I'm just giving some context for why HBase doesn't have any of it, yet.)

> Therefore I am interested in general best practices for such problems.

I find the best practice for all such designs to be a simple game of make-believe. Pretend
you have an INFINITE (not just large) number of records in various dimensions, and then think
about what kinds of data storage and access would let you still serve things with really low
latency.

That said, it's just a "good practice", not a "best practice", because this is not a simple
design space. (Any fans of the Cynefin framework out there?).

Ian

