hadoop-common-user mailing list archives

From Fernando Padilla <f...@alum.mit.edu>
Subject Re: Why is scaling HBase much simpler than scaling a relational db?
Date Thu, 21 Aug 2008 16:24:04 GMT
I'm no expert, but maybe I can explain it the way I see it; maybe it 
will resonate with other newbies like me :)  Sorry if it's long-winded 
or boring for those who already know all this.


BigTable and Hadoop are inherently sharded and distributed.  They are 
architected to store data in redundant shards across many machines. 
This lets you add more capacity (both processing bandwidth and storage 
capacity) simply by adding more machines to your cluster.

MySQL is built around the idea of a single database running on a 
single machine.  This isn't inherently bad, as long as you have a 
machine with enough storage and processing bandwidth (memory/CPU). 
You can implement sharding over MySQL (and other databases), but the 
application has to be hand-tailored to work in such a setup. 
You're essentially implementing a home-grown data storage system from a 
collection of regular databases.
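
To make that concrete, here is a rough sketch of what that hand-rolled 
sharding might look like in Java with plain JDBC.  The shard URLs, the 
hash-by-user-id routing, and the saveUser helper are all hypothetical; 
the point is just that the routing logic lives in your application code, 
not in the database.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ShardedUserStore {

        // Hypothetical shard locations; in practice these come from config.
        private static final String[] SHARD_URLS = {
            "jdbc:mysql://db0.example.com/users",
            "jdbc:mysql://db1.example.com/users",
            "jdbc:mysql://db2.example.com/users"
        };

        // The application, not MySQL, decides which physical database
        // owns a row (assumes non-negative user ids).
        private static String shardFor(long userId) {
            int index = (int) (userId % SHARD_URLS.length);
            return SHARD_URLS[index];
        }

        public static void saveUser(long userId, String name) throws SQLException {
            try (Connection conn = DriverManager.getConnection(shardFor(userId));
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO users (id, name) VALUES (?, ?)")) {
                ps.setLong(1, userId);
                ps.setString(2, name);
                ps.executeUpdate();
            }
        }
    }

Every read and write path has to go through routing code like shardFor, 
and moving rows between shards later means migrating data and changing 
that logic by hand.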

You mentioned the MySQL Cluster technology, which is an attempt to give 
MySQL native sharding and distribution of processing bandwidth.  But at 
least in its early architecture (I have not kept up with its evolution), 
it was not yet close to real sharding or full scalability.  The 
technology could be described as a collection of master-master in-memory 
database nodes backed by a collection of persistence nodes, so it worked 
well only as long as a single node had enough RAM to hold your whole 
database.




Every system has its pros and cons.

A single MySQL instance is simple and solid, and people have lots of 
experience with it.  If your application allows it, and with some 
application-specific architecting, you could shard MySQL to a fairly 
high level of scalability.  But this really depends on your data, on the 
queries you run over that data, and on how much extra engineering you're 
willing to do (and maintain) to make queries that span multiple shards 
run efficiently.
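
As a rough illustration of that last point, here is the kind of 
scatter-gather code you end up writing once a query spans shards.  It 
assumes the same hypothetical shard URLs as the earlier sketch; counting 
users per country now means querying every shard and merging the partial 
results in the application.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    public class CrossShardQuery {

        // Visit every shard and merge the partial counts by hand.
        public static Map<String, Long> usersPerCountry(String[] shardUrls)
                throws SQLException {
            Map<String, Long> totals = new HashMap<>();
            for (String url : shardUrls) {
                try (Connection conn = DriverManager.getConnection(url);
                     PreparedStatement ps = conn.prepareStatement(
                         "SELECT country, COUNT(*) FROM users GROUP BY country");
                     ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        totals.merge(rs.getString(1), rs.getLong(2), Long::sum);
                    }
                }
            }
            return totals;
        }
    }

Joins across shards, transactions that touch more than one shard, and 
re-sharding when one shard fills up all need similar hand-written 
plumbing; that's the "extra engineering" I mean.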

BigTable and Hadoop/HBase are built to support sharding and distributed 
queries from the get-go, so you can scale out by adding more hardware 
rather than more complex, home-grown software.
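
For contrast, here is a minimal sketch of writing a row with the HBase 
Java client (the exact client API varies between HBase versions, and the 
table and column names here are made up).  The client only names a table 
and a row key; splitting the table into regions and spreading them 
across machines is handled by the cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseWriteExample {
        public static void main(String[] args) throws Exception {
            // Picks up hbase-site.xml from the classpath; there is no
            // shard map in the application itself.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users");

            // The row key determines which region (and therefore which
            // machine) stores the row; the client looks that up transparently.
            Put put = new Put(Bytes.toBytes("user12345"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
            table.put(put);
            table.close();
        }
    }

When the table grows, regions split and get reassigned to new machines 
without the application code changing at all.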




Mork0075 wrote:
> Thank you, but I still don't get it.
> 
> I've read tons of websites and papers, but there's no clear and well-founded 
> answer to "why use BigTable instead of relational databases".
> 
> MySQL Cluster seems to offer the same scalability and level of 
> abstraction, without switching to a non-relational paradigm. Lots of 
> blog posts are highly emotional, without answering the core question:
> 
> "Why RDBMS don't scale and why something like BigTable do". Often you 
> read something like this:
> 
> "They have also built a system called BigTable, which is a Column 
> Oriented Database, which splits a table into columns rather than rows 
> making is much simpler to distribute and parallelize."
> 
> Why?
> 
> Really confusing ... ;)
> 
> Stuart Sierra wrote:
>> On Tue, Aug 19, 2008 at 9:44 AM, Mork0075 <mork0075@googlemail.com> 
>> wrote:
>>> Can you please explain why someone should use HBase for horizontal
>>> scaling instead of a relational database? One reason for me would be
>>> that I don't have to implement the sharding logic myself. Are there 
>>> others?
>>
>> A slight tangent -- there are various tools that implement sharding
>> over relational databases like MySQL.  Two that I know of are
>> DBSlayer,
>> http://code.nytimes.com/projects/dbslayer
>> and MySQL Proxy,
>> http://forge.mysql.com/wiki/MySQL_Proxy
>>
>> I don't know of any formal comparisons between sharding traditional
>> database servers and distributed databases like HBase.
>> -Stuart
>>
> 
