hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Trivial Update of "Hbase/FAQ" by Vaibhav Puranik
Date Tue, 31 Mar 2009 16:41:46 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by Vaibhav Puranik:
http://wiki.apache.org/hadoop/Hbase/FAQ

------------------------------------------------------------------------------
  
  Rather than a friendships table, you could just have a friendships column family in the
users table. Each column in that family would contain the ID of a friend. The value could
store anything else you would have stored in the friendships table in the relational model.
As column families are stored together/sequentially on a per-row basis, reading a user with
1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is
just in the shipping of this information across the network which is unavoidable. In this
system a user could have 10,000,000 friends. In a relational database the size of the friendship
table would grow massively and the indexes would be out of control.
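
A minimal sketch of the layout described above, using plain Python dictionaries as a stand-in for an HBase table (row key, then column family, then column qualifier, then value). This is not the HBase client API; the helper names `add_friend` and `get_friends` and the sample values are assumptions made up for illustration.

```python
# In-memory stand-in for an HBase "users" table:
# row key -> column family -> column qualifier -> value.
# Hypothetical model only; it does not call any HBase API.
users = {}

def add_friend(table, user_id, friend_id, info=""):
    """Store one friendship as a single column in the user's 'friendships' family."""
    table.setdefault(user_id, {}).setdefault("friendships", {})[friend_id] = info

def get_friends(table, user_id):
    """Read all friends of a user with a single row lookup."""
    return dict(table.get(user_id, {}).get("friendships", {}))

# Each friend is one column; the value holds whatever a relational
# friendships table would have stored alongside the link.
add_friend(users, "123", "456", "since=2008")
add_friend(users, "123", "789", "since=2009")
friends_of_123 = get_friends(users, "123")
```

Reading a user's friends touches one row whether that row holds one column or many, which is the property the answer relies on.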
  
- '''Q: Can you please provide an example of "good de-normalization" in HBase and how its
held consitent (in your friends example in a relational db, there would be a cascadingDelete)?
As i think of the users table: if i delete an user with the userid='123', then if have to
walk through all of the other users column-family "friends" to guranty consitency?! Is de-normalization
in HBase only used to avoid joins? Our webapp doenst use joins at the moment anyway.'''
+ '''Q: Can you please provide an example of "good de-normalization" in HBase and how its
held consistent (in your friends example in a relational db, there would be a cascadingDelete)?
As i think of the users table: if i delete an user with the userid='123', then if have to
walk through all of the other users column-family "friends" to guaranty consistency?! Is de-normalization
in HBase only used to avoid joins? Our webapp doenst use joins at the moment anyway.'''
  
  You lose any concept of foreign keys. You have a primary key, that's it. No
  secondary keys/indexes, no foreign keys.
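
Without foreign keys, the cascading delete asked about in the question becomes application code. A rough sketch, again on a dictionary stand-in rather than the HBase client API; `delete_user` and the sample rows are hypothetical:

```python
def delete_user(table, user_id):
    """Delete a user row, then remove that user from every other row's 'friendships' family."""
    table.pop(user_id, None)
    # Nothing cascades automatically, so the application walks the
    # remaining rows itself. On a real table this would be a full scan.
    for row in table.values():
        row.get("friendships", {}).pop(user_id, None)

# Hypothetical sample data: row key -> column family -> qualifier -> value.
users = {
    "123": {"friendships": {"456": ""}},
    "456": {"friendships": {"123": "", "789": ""}},
    "789": {"friendships": {"456": ""}},
}
delete_user(users, "123")
```

If the application always writes a friendship in both directions (an assumption, not something the FAQ states), the deleted user's own friendships columns already name the rows that need updating, so the full walk can be avoided.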
  
- Another example of "good denormalization" would be something like storing a users "favorite
pages". If we want to query this data in two ways: for a given user, all of his favorites.
Or, for a given favorite, all of the users who have
+ It's the responsibility of your application to handle something like deleting a friend and
cascading to the friendships. Again, typical small web apps are far simpler to write using
SQL, you become responsible for some of the things that were once handled for you.
+ 
- it as a favorite. Relational database would probably have tables for users, favorites, and
userfavorites. Each link would be stored in one row in the userfavorites table. We would have
indexes on both 'userid' and 'favoriteid' and could thus query it in both ways described above.
In HBase we'd probably put a column in both the users table and the favorites table, there
would be no link table.
+ Another example of "good denormalization" would be something like storing a users "favorite
pages". If we want to query this data in two ways: for a given user, all of his favorites.
Or, for a given favorite, all of the users who have it as a favorite. Relational database
would probably have tables for users, favorites, and userfavorites. Each link would be stored
in one row in the userfavorites table. We would have indexes on both 'userid' and 'favoriteid'
and could thus query it in both ways described above. In HBase we'd probably put a column
in both the users table and the favorites table, there would be no link table.
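
The two-table layout above could be sketched as follows, with dictionaries standing in for the users and favorites tables. The family names `favorites` and `users`, the helper names, and the sample IDs are assumptions for illustration; none of this is the HBase client API.

```python
# Two in-memory stand-ins for HBase tables, no link table.
users = {}      # row key: user id
favorites = {}  # row key: favorite (page) id

def add_favorite(user_id, favorite_id):
    """Write the link twice: once in each table, so both lookups are a single row read."""
    users.setdefault(user_id, {}).setdefault("favorites", {})[favorite_id] = ""
    favorites.setdefault(favorite_id, {}).setdefault("users", {})[user_id] = ""

def favorites_of(user_id):
    """For a given user, all of his favorites."""
    return sorted(users.get(user_id, {}).get("favorites", {}))

def users_who_favorited(favorite_id):
    """For a given favorite, all of the users who have it as a favorite."""
    return sorted(favorites.get(favorite_id, {}).get("users", {}))

add_favorite("alice", "page1")
add_favorite("alice", "page2")
add_favorite("bob", "page1")
```

The cost of this design is that every write and delete has to touch both tables, and keeping the two copies in step is the application's job.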
  
  That would be a very efficient query in both architectures, with relational performing
much better with small datasets but less so with a large dataset.
  
  Now asking for the favorites of these 10 users. That starts to get tricky in HBase and will
undoubtedly suffer worse from random reading. The flexibility of SQL allows us to just ask
the database for the answer to that question. In a
- small dataset it will come up with a decent solution, and return the results to you in a
matter of milliseconds. Now let's make that userfavorites table a few billion rows, and the
number of users you're asking for a couple thousand. The query planner will come up with something
but things will fall down and it will end up taking forever. The worst problem will be in
the index bloat. Insertions to this link table will start to take a very long time. HBase
will
+ small dataset it will come up with a decent solution, and return the results to you in a
matter of milliseconds. Now let's make that userfavorites table a few billion rows, and the
number of users you're asking for a couple thousand. The query planner will come up with something
but things will fall down and it will end up taking forever. The worst problem will be in
the index bloat. Insertions to this link table will start to take a very long time. HBase
will perform virtually the same as it did on the small table, if not better because of superior
region distribution.
- perform virtually the same as it did on the small table, if not better because of superior
region distribution.
  
  '''Q:[Michael Dagaev] How would you design an Hbase table for many-to-many association between
two entities, for example Student and Course?'''
  
