From: Apache Wiki
To: Apache Wiki
Date: Sat, 06 Aug 2011 18:54:01 -0000
Subject: [Hadoop Wiki] Update of "Hbase/FAQ_General" by DougMeil

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change
notification. The "Hbase/FAQ_General" page has been changed by DougMeil:
http://wiki.apache.org/hadoop/Hbase/FAQ_General?action=diff&rev1=1&rev2=2

FAQ - General Questions

== Questions ==
 1. [[#1|When would I use HBase?]]
 1. [[#2|Can someone give an example of basic API usage against HBase?]]
 1. [[#3|What other HBase-like applications are out there?]]
 1. [[#8|How do I access HBase from my Ruby/Python/Perl/PHP/etc. application?]]
 1. [[#14|Can HBase development be done on Windows?]]
 1. [[#15|Please explain HBase version numbering.]]
 1. [[#16|What version of Hadoop do I need to run HBase?]]
 1. [[#18|Are there any schema design examples?]]

== Answers ==

'''1. <<Anchor(1)>> When would I use HBase?'''

See [[http://blog.rapleaf.com/dev/?p=26|Bryan Duxbury's post]] on this topic.

'''2. <<Anchor(2)>> Can someone give an example of basic API usage against HBase?'''

See the Data Model section in the HBase Book: http://hbase.apache.org/book.html#datamodel

See the [[Hbase|wiki home page]] for sample code accessing HBase from languages other than Java.

'''3. <<Anchor(3)>> What other HBase-like applications are out there?'''

Broadly speaking, there are many. One place to start your search is here: [[http://blog.oskarsson.nu/2009/06/nosql-debrief.html|nosql]].

'''8. <<Anchor(8)>> How do I access HBase from my Ruby/Python/Perl/PHP/etc. application?'''

See the non-Java access section on the [[Hbase|HBase wiki home page]].

'''14. <<Anchor(14)>> Can HBase development be done on Windows?'''

See the Getting Started section in the HBase Book: http://hbase.apache.org/book.html#getting_started

'''15. <<Anchor(15)>> Please explain HBase version numbering.'''

See [[http://wiki.apache.org/hadoop/Hbase/HBaseVersions|HBase Versions since 0.20.x]]. The text below is left in place for the historians.

Originally HBase lived under src/contrib in Hadoop Core.
The HBase version was that of the hosting Hadoop. The last HBase version bundled under contrib was part of Hadoop 0.16.1 (March of 2008).

The first HBase Hadoop subproject release was versioned 0.1.0. Subsequent releases went at least as far as 0.2.1 (September 2008).

In August of 2008, consensus had it that since HBase depends on a particular Hadoop Core version, the HBase major+minor versions would from then on mirror those of the Hadoop Core version HBase depends on. The first HBase release to take on this new versioning regimen was HBase 0.18.0; HBase 0.18.0 depends on Hadoop 0.18.x.

Sorry for any confusion caused.

'''16. <<Anchor(16)>> What version of Hadoop do I need to run HBase?'''

Different versions of HBase require different versions of Hadoop. Consult the table below to find which version of Hadoop you will need:

||'''HBase Release Number'''||'''Hadoop Release Number'''||
||0.1.x||0.16.x||
||0.2.x||0.17.x||
||0.18.x||0.18.x||
||0.19.x||0.19.x||
||0.20.x||0.20.x||

Releases of Hadoop can be found [[http://hadoop.apache.org/core/releases.html|here]]. We recommend using the most recent version of Hadoop possible, as it will contain the most bug fixes.

Note that HBase-0.2.x can be made to work on Hadoop-0.18.x. HBase-0.2.x ships with Hadoop-0.17.x, so to use Hadoop-0.18.x you must recompile against Hadoop-0.18.x, remove the Hadoop-0.17.x jars from HBase, and replace them with the jars from Hadoop-0.18.x.

Also note that after HBase-0.2.x, the HBase release numbering scheme changed to align with the Hadoop release number on which it depends.

'''18. <<Anchor(18)>> Are there any schema design examples?'''

See [[http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies|HBase Schema Design -- Case Studies]] by Evan (Qingyan) Liu, or the following text taken from Jonathan Gray's mailing list posts.
There's a very big difference between storage in relational/row-oriented databases and column-oriented databases. For example, if I have a table of 'users' and I need to store friendships between these users... In a relational database my design is something like:

Table: users (pkey = userid)
Table: friendships (userid, friendid, ...), which contains one (or maybe two, depending on how it's implemented) rows for each friendship.

In order to look up a given user's friends:

SELECT * FROM friendships WHERE userid = 'myid';

The cost of this relational query continues to increase as a user adds more friends. You also begin to hit practical limits. If I have millions of users, each with many thousands of potential friends, these indexes grow very large and things get nasty quickly. Rather than friendships, imagine I'm storing activity logs of actions taken by users.

In a column-oriented database these things scale continuously, with minimal difference between 10 users and 10,000,000 users, 10 friendships and 10,000 friendships.

Rather than a friendships table, you could just have a friendships column family in the users table. Each column in that family would contain the ID of a friend. The value could store anything else you would have stored in the friendships table in the relational model. As column families are stored together/sequentially on a per-row basis, reading a user with 1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is just in the shipping of this information across the network, which is unavoidable. In this system a user could have 10,000,000 friends. In a relational database the size of the friendships table would grow massively and the indexes would be out of control.
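A toy model can make the row layout above concrete. The sketch below is plain Java with no HBase dependency (the class and names are made up for illustration, not HBase API): it models a row as a sorted map of family:qualifier to value, which is roughly how HBase organizes a row, with one column per friend in a `friends` family.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the column-family design described above: one "users"
// table whose rows carry a "friends" family, one column per friend id.
// Illustrative only; this is not the HBase client API.
public class FriendColumns {
    // row key -> (column name "family:qualifier" -> value)
    private final SortedMap<String, SortedMap<String, String>> users = new TreeMap<>();

    public void addFriend(String userId, String friendId, String meta) {
        users.computeIfAbsent(userId, k -> new TreeMap<>())
             .put("friends:" + friendId, meta);
    }

    // Reading a user's friends touches a single row's "friends" family,
    // so cost tracks that row's size, not the total number of users.
    public SortedMap<String, String> friendsOf(String userId) {
        SortedMap<String, String> row = users.getOrDefault(userId, new TreeMap<>());
        return row.subMap("friends:", "friends:\uFFFF");
    }

    public static void main(String[] args) {
        FriendColumns t = new FriendColumns();
        t.addFriend("myid", "friend1", "");
        t.addFriend("myid", "friend2", "since 2008");
        System.out.println(t.friendsOf("myid").keySet()); // prints [friends:friend1, friends:friend2]
    }
}
```

The point of the sketch is the shape of the data, not the storage engine: friendships live inside the user's own row, so there is no separate link table to index and no join at read time.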
'''Q: Can you please provide an example of "good de-normalization" in HBase and how it is kept consistent? (In your friends example, in a relational DB there would be a cascading delete.) As I think of the users table: if I delete a user with userid='123', do I have to walk through all of the other users' "friends" column families to guarantee consistency? Is de-normalization in HBase only used to avoid joins? Our webapp doesn't use joins at the moment anyway.'''

You lose any concept of foreign keys. You have a primary key, that's it. No secondary keys/indexes, no foreign keys.

It's the responsibility of your application to handle something like deleting a friend and cascading to the friendships. Again, typical small web apps are far simpler to write using SQL; here you become responsible for some of the things that were once handled for you.

Another example of "good denormalization" would be something like storing a user's "favorite pages". We want to query this data in two ways: for a given user, all of his favorites; or, for a given favorite, all of the users who have it as a favorite. A relational database would probably have tables for users, favorites, and userfavorites. Each link would be stored in one row in the userfavorites table. We would have indexes on both 'userid' and 'favoriteid' and could thus query it in both ways described above. In HBase we'd probably put a column in both the users table and the favorites table; there would be no link table.

That would be a very efficient query in both architectures, with relational performing much better with small datasets but less so with a large dataset.

Now, asking for the favorites of these 10 users starts to get tricky in HBase and will undoubtedly suffer worse from random reading. The flexibility of SQL allows us to just ask the database for the answer to that question.
In a small dataset it will come up with a decent plan and return the results to you in a matter of milliseconds. Now let's make that userfavorites table a few billion rows, and the number of users you're asking for a couple thousand. The query planner will come up with something, but things will fall down and it will end up taking forever. The worst problem will be the index bloat: insertions to this link table will start to take a very long time. HBase will perform virtually the same as it did on the small table, if not better, because of superior region distribution.

'''Q: [Michael Dagaev] How would you design an HBase table for a many-to-many association between two entities, for example Student and Course?'''

I would define two tables:

Student: student id, student data (name, address, ...), courses (use course ids as column qualifiers here)
Course: course id, course data (name, syllabus, ...), students (use student ids as column qualifiers here)

Does it make sense?

A: [Jonathan Gray]
Your design does make sense.

As you said, you'd probably have two column families in each of the Student and Course tables: one for the data, another with a column per course or student.

For example, a student row might look like:

Student:
 id/row/key = 1001
 data:name = Student Name
 data:address = 123 ABC St
 courses:2001 = (if you need more information about this association, for example whether they are on the waiting list)
 courses:2002 = ...

This schema gives you fast access to the queries "show all courses for a student" (student table, courses family) and "show all students for a course" (course table, students family).
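Because HBase has no foreign keys, the two-table design above relies on the application writing both sides of the association itself. A minimal sketch of that responsibility, in plain Java with no HBase client (all names are hypothetical): an enroll operation writes into both Student.courses and Course.students, and a drop cascades by hand.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Models the Student/Course design above: each "table" keeps the
// associated ids as column qualifiers. Both sides are written by the
// application, since there is no foreign-key machinery to do it for us.
public class ManyToMany {
    private final Map<String, Set<String>> studentCourses = new TreeMap<>();
    private final Map<String, Set<String>> courseStudents = new TreeMap<>();

    public void enroll(String studentId, String courseId) {
        studentCourses.computeIfAbsent(studentId, k -> new TreeSet<>()).add(courseId);
        courseStudents.computeIfAbsent(courseId, k -> new TreeSet<>()).add(studentId);
    }

    // Dropping a course cascades by hand: remove the course row, then
    // walk its student list to clean up the other side.
    public void dropCourse(String courseId) {
        Set<String> students = courseStudents.remove(courseId);
        if (students != null) {
            for (String s : students) {
                studentCourses.getOrDefault(s, new TreeSet<>()).remove(courseId);
            }
        }
    }

    public Set<String> coursesFor(String studentId) {
        return studentCourses.getOrDefault(studentId, new TreeSet<>());
    }

    public Set<String> studentsIn(String courseId) {
        return courseStudents.getOrDefault(courseId, new TreeSet<>());
    }

    public static void main(String[] args) {
        ManyToMany m = new ManyToMany();
        m.enroll("1001", "2001");
        m.enroll("1001", "2002");
        m.enroll("1002", "2001");
        m.dropCourse("2001");
        System.out.println(m.coursesFor("1001")); // prints [2002]
    }
}
```

Either direction of the query is a single-row read; the price of the denormalization is that every write (and every delete) must touch both tables, which is exactly the cascading work the question above asks about.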