hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/FAQ" by Vaibhav Puranik
Date Mon, 30 Mar 2009 20:32:26 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by Vaibhav Puranik:

   1. [#18 Please explain HBase version numbering?]
   1. [#19 What version of Hadoop do I need to run HBase?]
   1. [#20 Any other troubleshooting pointers for me?]
+  1. [#21 Schema Design Examples]
  == Answers ==
@@ -198, +199 @@

  Please see our [http://wiki.apache.org/hadoop/Hbase/Troubleshooting Troubleshooting] page.
+ '''21 [[Anchor(21)]] HBase Schema Design examples'''
+ The following text is taken from Jonathan Gray's mailing list posts.
+ - There's a very big difference between how relational/row-oriented databases and column-oriented
databases store data. For example, suppose I have a table of 'users' and I need to store friendships
between these users. In a relational database my design is something like:
+ Table: users(pkey = userid) Table: friendships(userid,friendid,...) which contains one (or
maybe two depending on how it's implemented) row for each friendship.
+ To look up a given user's friends: SELECT * FROM friendships WHERE userid = 'myid';
+ The cost of this relational query continues to increase as a user adds more friends. You
also begin to hit practical limits. If I have millions of users, each with many thousands
of potential friends, the size of these indexes grows rapidly and things get nasty quickly.
Or, rather than friendships, imagine I'm storing activity logs of actions taken by users.
+ In a column-oriented database these things scale continuously with minimal difference between
10 users and 10,000,000 users, 10 friendships and 10,000 friendships.
+ Rather than a friendships table, you could just have a friendships column family in the
users table. Each column in that family would contain the ID of a friend. The value could
store anything else you would have stored in the friendships table in the relational model.
As column families are stored together/sequentially on a per-row basis, reading a user with
1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is
just in the shipping of this information across the network which is unavoidable. In this
system a user could have 10,000,000 friends. In a relational database the size of the friendship
table would grow massively and the indexes would be out of control.
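+ The friendships column family described above can be sketched with plain Python dictionaries.
This is only an in-memory illustration, not HBase client code: the nested-dict layout, the
"info" family, and the names get_friends and add_friend are assumptions made for the example.

```python
# Hypothetical in-memory model of one table:
# row key -> column family -> column qualifier -> value.
users = {
    "alice": {
        "info": {"name": "Alice"},
        # One column per friend; the value can hold any extra relationship data.
        "friendships": {"bob": "since=2008", "carol": "since=2009"},
    },
    "bob": {
        "info": {"name": "Bob"},
        "friendships": {"alice": "since=2008"},
    },
}


def get_friends(table, userid):
    """Return a user's friend ids using a single row lookup."""
    row = table.get(userid, {})
    return sorted(row.get("friendships", {}))


def add_friend(table, userid, friendid, value=""):
    """Add one column to the user's friendships family."""
    row = table.setdefault(userid, {})
    row.setdefault("friendships", {})[friendid] = value


add_friend(users, "alice", "dave", "since=2009")
print(get_friends(users, "alice"))  # ['bob', 'carol', 'dave']
```

+ Reading a user's friends touches one row whether that row holds one friend column or many,
which is the property the text relies on.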
+ '''Q: Can you please provide an example of "good de-normalization" in HBase and how it's
kept consistent (in your friends example in a relational db, there would be a cascading delete)?
As I think of the users table: if I delete a user with userid='123', do I have to
walk through the "friends" column family of all the other users to guarantee consistency? Is
de-normalization in HBase only used to avoid joins? Our webapp doesn't use joins at the moment anyway.'''
+ You lose any concept of foreign keys. You have a primary key, that's it. No
+ secondary keys/indexes, no foreign keys.
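+ Without foreign keys, any cleanup on delete is left to the application. The sketch below shows
one way this could look, under an assumption the posts do not state: every friendship is written
to both users' rows, so the deleted user's own friendships family lists exactly the rows that
need cleanup. The dict layout and the name delete_user are hypothetical.

```python
# Hypothetical in-memory model: row key -> column family -> qualifier -> value.
users = {
    "123": {"friendships": {"456": "", "789": ""}},
    "456": {"friendships": {"123": "", "789": ""}},
    "789": {"friendships": {"123": "", "456": ""}},
}


def delete_user(table, userid):
    """Delete a user row and remove the matching column from each friend's row.

    Assumes friendships are stored symmetrically (in both rows). Returns the
    number of friend rows that were cleaned up.
    """
    row = table.pop(userid, None)
    if row is None:
        return 0
    cleaned = 0
    for friendid in row.get("friendships", {}):
        friend_row = table.get(friendid)
        if friend_row and friend_row.get("friendships", {}).pop(userid, None) is not None:
            cleaned += 1
    return cleaned


print(delete_user(users, "123"))  # 2
```

+ Under that assumption there is no need to scan every user, only the rows named in the deleted
row. The updates still span several rows, and nothing in this sketch makes them atomic.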
+ Another example of "good denormalization" would be something like storing a user's "favorite
pages". Suppose we want to query this data in two ways: for a given user, all of his favorites;
or, for a given favorite, all of the users who have it as a favorite. A relational database
would probably have tables for users, favorites, and userfavorites. Each link would be stored
in one row in the userfavorites table. We would have indexes on both 'userid' and 'favoriteid'
and could thus query it in both ways described above. In HBase we'd probably put a column in
both the users table and the favorites table; there would be no link table.
+ That would be a very efficient query in both architectures, with relational performing
much better with small datasets but less so with a large dataset.
+ Now consider asking for the favorites of these 10 users. That starts to get tricky in HBase
and will undoubtedly suffer more from random reads. The flexibility of SQL allows us to just
ask the database for the answer to that question. On a small dataset it will come up with a
decent solution and return the results to you in a matter of milliseconds. Now let's make that
userfavorites table a few billion rows, and the number of users you're asking for a couple
thousand. The query planner will come up with something, but things will fall down and it will
end up taking forever. The worst problem will be the index bloat. Insertions to this link table
will start to take a very long time. HBase will perform virtually the same as it did on the
small table, if not better because of superior region distribution.
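+ The favorites example can be sketched the same way: two tables written together, with no link
table. This is an in-memory illustration only; the dict layout, the sample ids, and the function
names are assumptions for the example, not HBase API calls.

```python
# Hypothetical in-memory tables: row key -> column family -> qualifier -> value.
users = {}      # row key = userid, family "favorites", qualifier = favoriteid
favorites = {}  # row key = favoriteid, family "users", qualifier = userid


def add_favorite(userid, favoriteid):
    """Write the link twice, once in each table, instead of into a link table."""
    users.setdefault(userid, {}).setdefault("favorites", {})[favoriteid] = ""
    favorites.setdefault(favoriteid, {}).setdefault("users", {})[userid] = ""


def favorites_of(userid):
    """All favorites for one user: a single row read from the users table."""
    return sorted(users.get(userid, {}).get("favorites", {}))


def users_with(favoriteid):
    """All users for one favorite: a single row read from the favorites table."""
    return sorted(favorites.get(favoriteid, {}).get("users", {}))


def favorites_of_many(userids):
    """Favorites for several users: one independent row read per user."""
    return {uid: favorites_of(uid) for uid in userids}


add_favorite("u1", "apache.org")
add_favorite("u1", "hbase")
add_favorite("u2", "hbase")
```

+ Asking about several users becomes one row read per user, merged by the client. That is the
random-read cost mentioned above, and it depends on how many users are requested rather than
on the total number of links stored.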
+ '''Q:[Michael Dagaev] How would you design an HBase table for a many-to-many association between
two entities, for example Student and Course?'''
+ I would define two tables:
+ Student: student id; student data (name, address, ...); courses (use course ids as column
qualifiers here)
+ Course: course id; course data (name, syllabus, ...); students (use student ids as column
qualifiers here)
+ Does it make sense? 
+ A[Jonathan Gray] : 
+ Your design does make sense.
+ As you said, you'd probably have two column families in each of the Student and Course tables:
one for the data, another with a column per student or course.
+ For example, a student row might look like:
+ Student :
+ id/row/key = 1001 
+ data:name = Student Name 
+ data:address = 123 ABC St 
+ courses:2001 = (If you need more information about this association, for example, if they
are on the waiting list) 
+ courses:2002 = ...
+ This schema gives you fast access for both queries: show all classes for a student (Student
table, courses family), or all students for a class (Course table, students family).
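+ A minimal sketch of the Student/Course layout, again using in-memory dictionaries rather than
HBase client calls. The status values, the "Course Name" placeholder, and the names enroll,
courses_for, and students_for are hypothetical; the row keys and the data/courses/students
families follow the example row above.

```python
# Hypothetical in-memory tables: row key -> column family -> qualifier -> value.
students = {
    "1001": {
        "data": {"name": "Student Name", "address": "123 ABC St"},
        "courses": {},
    },
}
courses = {
    "2001": {"data": {"name": "Course Name"}, "students": {}},
    "2002": {"data": {"name": "Course Name"}, "students": {}},
}


def enroll(student_id, course_id, status="enrolled"):
    """Record the association in both tables; the value carries extra details."""
    students.setdefault(student_id, {}).setdefault("courses", {})[course_id] = status
    courses.setdefault(course_id, {}).setdefault("students", {})[student_id] = status


def courses_for(student_id):
    """All classes for a student: Student table, courses family."""
    return dict(students.get(student_id, {}).get("courses", {}))


def students_for(course_id):
    """All students for a class: Course table, students family."""
    return dict(courses.get(course_id, {}).get("students", {}))


enroll("1001", "2001")
enroll("1001", "2002", status="waiting list")
print(courses_for("1001"))  # {'2001': 'enrolled', '2002': 'waiting list'}
```

+ Storing the status as the column value mirrors the courses:2001 example, where the value holds
extra information about the association such as waiting-list state.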
