hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mork0075 <mork0...@googlemail.com>
Subject Re: Why is scaling HBase much simpler then scaling a relational db?
Date Thu, 21 Aug 2008 17:49:24 GMT
This answer is very, very good!

You example with the friends makes perfectly sense. Can you imagine a 
scenario where storing the data in column oriented instead of row 
oriented db (so if you will an counterexample) causes such a huge 
performance mismatch, like the friends one in row/column comparison?

Can you please provide an example of "good de-normalization" in HBase 
and how its held consitent (in your friends example in a relational db, 
there would be a cascadingDelete)? As i think of the users table: if i 
delete an user with the userid='123', then if have to walk through all 
of the other users column-family "friends" to guranty consitency?! Is 
de-normalization in HBase only used to avoid joins? Our webapp doenst 
use joins at the moment anyway.

Jonathan Gray schrieb:
> A few very big differences...
> - HBase/BigTable don't have "transactions" in the same way that a relational database
does.  While it is possible (and was just recently implemented for HBase, see HBASE-669) it
is not at the core of this design.  A major bottleneck of distributed multi-master relational
databases is distributed transactions/locks.
> - There's a very big difference between storage of relational/row-oriented databases
and column-oriented databases.  For example, if I have a table of 'users' and I need to store
friendships between these users... In a relational database my design is something like:
> Table: users(pkey = userid)
> Table: friendships(userid,friendid,...) which contains one (or maybe two depending on
how it's impelemented) row for each friendship.
> In order to lookup a given users friend, SELECT * FROM friendships WHERE userid = 'myid';
> This query would use an index on the friendships table to retrieve all the necessary
rows.  Depending on the relational database you might also be fetching each and every row
(entirely) off of disk to be read.  In a sharded relational database, this would require hitting
every node to get whichever friendships were stored on that node.  There's lots of room for
optimizations here but any way you slice it, you're likely pulling non-sequential blocks off
disk.  When you add in the overhead of ACID transactions this can get slow.
> The cost of this relational query continues to increase as a user adds more friends.
 You also begin to have practical limits.  If I have millions of users, each with many thousands
of potential friends, the size of these indexes grow exponentially and things get nasty quickly.
 Rather than friendships, imagine I'm storing activity logs of actions taken by users.
> In a column-oriented database these things scale continuously with minimal difference
between 10 users and 10,000,000 users, 10 friendships and 10,000 friendships.
> Rather than a friendships table, you could just have a friendships column family in the
users table.  Each column in that family would contain the ID of a friend.  The value could
store anything else you would have stored in the friendships table in the relational model.
 As column families are stored together/sequentially on a per-row basis, reading a user with
1 friend versus a user with 10,000 friends is virtually the same.  The biggest difference
is just in the shipping of this information across the network which is unavoidable.  In this
system a user could have 10,000,000 friends.  In a relational database the size of the friendship
table would grow massively and the indexes would be out of control.
> It's certainly possible to make relational databases "scale".  What that is about is
usually massive optimizations, manual sharding, being very clever about how you query things,
and often de-normalizing.  Index bloat and table bloat can thrash a relational db.
> In HBase, de-normalizing is usually a good thing.  Storage space is often considered
free (not a large cost associated with storing something in multiple places).  In a relational
database this is not so much the case.
> In HBase, the sharding and distribution come for free out of the box.  With a relational
database you must jump through hoops and often end up implementing your own sharding/distributed
hashing algorithms so you can distribute across machines.
> Yes, you lose the relational primitives.  Sometimes you wish you could do a simple join.
 But if you get in the right mindset, you learn how to put your data into this new data model.
 And in the end the payoffs are huge.
> In addition to all this, with HBase/Hadoop you get MapReduce.  It's feasible to implement
something like that on top of a distributed relational database but again the complexity is
enormous.  With HBase/Hadoop it's a built-in part of the system, a system which is very intelligent
at keeping logic close to data, etc.
> To directly answer your question, it's "simpler" to scale HBase because the scaling comes
for free out of the box.  You get automatic sharding/distribution of storage and queries.
 There's nothing simpler than that.  Distributing a relational database is never simple.
> I hope this starts to shed some light on what the differences are.
> Jonathan Gray
> Streamy Inc.
> -----Original Message-----
> From: Mork0075 [mailto:mork0075@googlemail.com] 
> Sent: Thursday, August 21, 2008 8:48 AM
> To: core-user@hadoop.apache.org; hbase-user@hadoop.apache.org
> Subject: Re: Why is scaling HBase much simpler then scaling a relational db?
> Thank you, but i still don't got it.
> I've read tons of websites and papers, but there's no clear und founded 
> answer "why use BigTable instead of relational databases".
> MySQL Cluster seams to offer the same scalabilty and level of 
> abstraction, whithout switching to a non relational pardigm. Lots of 
> blog posts are highly emotional, without answering the core question:
> "Why RDBMS don't scale and why something like BigTable do". Often you 
> read something like this:
> "They have also built a system called BigTable, which is a Column 
> Oriented Database, which splits a table into columns rather than rows 
> making is much simpler to distribute and parallelize."
> Why?
> Really confusing ... ;)
> Stuart Sierra schrieb:
>> On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <mork0075@googlemail.com> wrote:
>>> Can you please explain, why someone should use HBase for horizontal
>>> scaling instead of a relational database? One reason for me would be,
>>> that i don't have to implement the sharding logic myself. Are there other?
>> A slight tangent -- there are various tools that implement sharding
>> over relational databases like MySQL.  Two that I know of are
>> DBSlayer,
>> http://code.nytimes.com/projects/dbslayer
>> and MySQL Proxy,
>> http://forge.mysql.com/wiki/MySQL_Proxy
>> I don't know of any formal comparisons between sharding traditional
>> database servers and distributed databases like HBase.
>> -Stuart

View raw message