hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "Hbase/FAQ" by DougMeil
Date Sat, 06 Aug 2011 19:41:24 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/FAQ" page has been changed by DougMeil:

- There is some overlap on this page with the FAQ section in the HBase book (http://hbase.apache.org/book.html#faq),
but this page will continue to exist.  However, the intent is to refer to the HBase book wherever
possible and not have this FAQ be a separate source of critical documentation.
+ There is some overlap on this page with the FAQ section in the HBase book (http://hbase.apache.org/book.html#faq).
+ == Frequently Asked Questions ==
+  * [[Hbase/FAQ_General|FAQ for General HBase Questions]]
+  * [[Hbase/FAQ_Design|FAQ for HBase Design Questions]]
+  * [[Hbase/FAQ_Operations|FAQ for HBase Operations and Troubleshooting Questions]]
- == Questions ==
-  1. [[#1|When would I use HBase?]]
-  1. [[#2|Can someone give an example of basic API-usage going against hbase?]]
-  1. [[#3|What other hbase-like applications are there out there?]]
-  1. [[#4|Can I fix OutOfMemoryExceptions in hbase?]]
-  1. [[#5|How do I enable hbase DEBUG-level logging?]]
-  1. [[#6|Why do I see "java.io.IOException...(Too many open files)" in my logs?]]
-  1. [[#7|What can I do to improve hbase performance?]]
-  1. [[#8|How do I access HBase from my Ruby/Python/Perl/PHP/etc. application?]]
-  1. [[#9|What ports does HBase use?]]
-  1. [[#10|Why is HBase ignoring HDFS client configuration such as dfs.replication?]]
-  1. [[#11|Can I change the regionserver behavior so it, for example, orders keys other than
lexicographically, etc.?]]
-  1. [[#12|Can I safely move the master from node A to node B?]]
-  1. [[#13|Can I safely move the hbase rootdir in hdfs?]]
-  1. [[#14|Can HBase development be done on windows?]]
-  1. [[#15|Please explain HBase version numbering?]]
-  1. [[#16|What version of Hadoop do I need to run HBase?]]
-  1. [[#17|Any other troubleshooting pointers for me?]]
-  1. [[#18|Are there any schema design examples?]]
-  1. [[#19|How do I add/remove a node?]]
-  1. [[#20|Why do servers have start codes?]]
-  1. [[#21|What is the maximum recommended cell size?]]
-  1. [[#22|Why can't I iterate through the rows of a table in reverse order?]]
- == Answers ==
+ == See Also ==
+  * The Apache HBase Book is the main repository of HBase documentation.
+  * [[http://hbase.apache.org/book.html|Apache HBase Book]]
+    * [[http://hbase.apache.org/book.html#architecture|HBase Architecture]]
+    * [[http://hbase.apache.org/book.html#configuration|HBase Configuration]]
+    * [[http://hbase.apache.org/book.html#performance|HBase Performance]]
+    * [[http://hbase.apache.org/book.html#trouble|HBase Troubleshooting]]
+    * [[http://hbase.apache.org/book.html#schema|HBase Schema Design]]
- '''1. <<Anchor(1)>> When would I use HBase?'''
- See [[http://blog.rapleaf.com/dev/?p=26|Bryan Duxbury's post]] on this topic.
- '''2. <<Anchor(2)>> Can someone give an example of basic API-usage going against hbase?'''
- See the Data Model section in the HBase Book:  http://hbase.apache.org/book.html#datamodel
- See the [[Hbase|wiki home page]] for sample code accessing HBase from languages other than Java.
- '''3. <<Anchor(3)>> What other hbase-like applications are there out there?'''
- Broadly speaking, there are many.  One place to start your search is here [[http://blog.oskarsson.nu/2009/06/nosql-debrief.html|nosql]].
- '''4. <<Anchor(4)>> Can I fix OutOfMemoryExceptions in hbase?'''
- Out-of-the-box, hbase uses a default of 1G heap size.  Set the ''HBASE_HEAPSIZE'' environment
variable in ''${HBASE_HOME}/conf/hbase-env.sh'' if your install needs to run with a larger
heap.  ''HBASE_HEAPSIZE'' is like ''HADOOP_HEAPSIZE'' in that its value is the desired heap
size in MB.  The surrounding '-Xmx' and 'm' needed to make up the maximum heap size java option
are added by the hbase start script (See how ''HBASE_HEAPSIZE'' is used in the ''${HBASE_HOME}/bin/hbase''
script for clarification).
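As a minimal sketch of the wrapping described above (not the actual start script), the following Python function shows how a bare megabyte value becomes a JVM maximum-heap option. The function name `heap_option` is hypothetical and exists only for illustration.

```python
def heap_option(heapsize_mb: int) -> str:
    """Build the JVM max-heap flag from a heap size given in MB."""
    if heapsize_mb <= 0:
        raise ValueError("heap size must be a positive number of MB")
    # HBASE_HEAPSIZE holds a bare number of megabytes; the '-Xmx' prefix
    # and the 'm' suffix are added around it.
    return f"-Xmx{heapsize_mb}m"


print(heap_option(4096))
```

So setting ''HBASE_HEAPSIZE'' to 4096 corresponds to asking the JVM for a maximum heap of 4096 MB.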
- '''5. <<Anchor(5)>> How do I enable hbase DEBUG-level logging?'''
- Either add the following line to your log4j.properties file -- ''log4j.logger.org.apache.hadoop.hbase=DEBUG''
-- and restart your cluster or, if running a post-0.15.x version, set DEBUG via the
UI by clicking on the 'Log Level' link (there you need to set 'org.apache.hadoop.hbase' to DEBUG
without the 'log4j.logger' prefix).
- '''6. <<Anchor(6)>> Why do I see "java.io.IOException...(Too many open files)"
in my logs?'''
- See the Troubleshooting section in the HBase Book http://hbase.apache.org/book.html#trouble
- '''7. <<Anchor(7)>> What can I do to improve hbase performance?'''
- See the Performance section in the HBase book http://hbase.apache.org/book.html#performance
- Also, see [[PerformanceTuning|Performance Tuning]] on the wiki home page
- '''8. <<Anchor(8)>> How do I access HBase from my Ruby/Python/Perl/PHP/etc. application?'''
- See the non-Java access section on the [[Hbase|HBase wiki home page]]
- '''9. <<Anchor(9)>> What ports does HBase use?'''
- Not counting the ports used by hadoop -- hdfs and mapreduce -- by default, hbase runs the
master and its informational http server at 60000 and 60010 respectively and regionservers
at 60020 and their informational http server at 60030.  ''${HBASE_HOME}/conf/hbase-default.xml''
lists the default values of all ports used.  Also check ''${HBASE_HOME}/conf/hbase-site.xml''
for site-specific overrides.
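The lookup order described above (defaults first, then site-specific overrides) can be sketched as follows. The port numbers come from the answer itself; the property names used as keys are an assumption here and should be checked against your own ''hbase-default.xml''. The helper `effective_ports` is hypothetical.

```python
# Default ports as listed in the answer above. The property names are
# assumed for illustration; verify them in hbase-default.xml.
DEFAULT_PORTS = {
    "hbase.master.port": 60000,
    "hbase.master.info.port": 60010,
    "hbase.regionserver.port": 60020,
    "hbase.regionserver.info.port": 60030,
}


def effective_ports(site_overrides):
    """Return the ports in effect after applying hbase-site.xml overrides."""
    ports = dict(DEFAULT_PORTS)   # copy so the defaults stay untouched
    ports.update(site_overrides)  # site values win over defaults
    return ports


# Example: a site that moves only the master's web UI to another port.
print(effective_ports({"hbase.master.info.port": 8080}))
```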
- '''10. <<Anchor(10)>> Why is HBase ignoring HDFS client configuration such as dfs.replication?'''
- If you have changed the HDFS client configuration on your hadoop cluster, HBase will not see this
configuration unless you do one of the following:
-  * Add a pointer to your ''HADOOP_CONF_DIR'' to ''CLASSPATH'' in ''hbase-env.sh'' or symlink
your hadoop-site.xml from the hbase conf directory, or
-  * Add a copy of ''hadoop-site.xml'' to ''${HBASE_HOME}/conf'', or
-  * If you have only a small set of HDFS client configurations, add them to ''hbase-site.xml''
- The first option is the best of the three since it avoids duplication.
- '''11. <<Anchor(11)>> Can I change the regionserver behavior so it, for example,
orders keys other than lexicographically, etc.?'''
-   No.  See [[https://issues.apache.org/jira/browse/HBASE-605|HBASE-605]]
- '''12. <<Anchor(12)>> Can I safely move the master from node A to node B?'''
-   Yes.  HBase must be shut down.  Edit your hbase-site.xml configuration across the cluster,
setting hbase.master to point at the new location.
- '''13. <<Anchor(13)>> Can I safely move the hbase rootdir in hdfs?'''
-   Yes.  HBase must be down for the move.  After the move, update the hbase-site.xml across
the cluster and restart.
- '''14. <<Anchor(14)>> Can HBase development be done on windows?'''
- See the Getting Started section in the HBase Book:  http://hbase.apache.org/book.html#getting_started
- '''15. <<Anchor(15)>> Please explain HBase version numbering?'''
- See [[http://wiki.apache.org/hadoop/Hbase/HBaseVersions|HBase Versions since 0.20.x]]. 
The below is left in place for the historians.
- Originally HBase lived under src/contrib in Hadoop Core.  The HBase version was that of
the hosting Hadoop.  The last HBase version that bundled under contrib was part of Hadoop
0.16.1 (March of 2008).
- The first HBase Hadoop subproject release was versioned 0.1.0.  Subsequent releases went
at least as far as 0.2.1 (September 2008).
- In August of 2008, consensus had it that since HBase depends on a particular Hadoop Core
version, the HBase major+minor versions would from now on mirror that of the Hadoop Core version
HBase depends on.  The first HBase release to take on this new versioning regime was HBase
0.18.0, which depends on Hadoop 0.18.x.
- Sorry for any confusion caused.
- '''16. <<Anchor(16)>> What version of Hadoop do I need to run HBase?'''
- Different versions of HBase require different versions of Hadoop.  Consult the table below
to find which version of Hadoop you will need:
- ||'''HBase Release Number'''||'''Hadoop Release Number'''||
- ||0.1.x||0.16.x||
- ||0.2.x||0.17.x||
- ||0.18.x||0.18.x||
- ||0.19.x||0.19.x||
- ||0.20.x||0.20.x||
- Releases of Hadoop can be found [[http://hadoop.apache.org/core/releases.html|here]].  We
recommend using the most recent version of Hadoop possible, as it will contain the most bug
- Note that HBase-0.2.x can be made to work on Hadoop-0.18.x.  HBase-0.2.x ships with Hadoop-0.17.x,
so to use Hadoop-0.18.x you must recompile Hadoop-0.18.x, remove the Hadoop-0.17.x jars from
HBase, and replace them with the jars from Hadoop-0.18.x.
- Also note that after HBase-0.2.x, the HBase release numbering scheme will change to align
with the Hadoop release number on which it depends.
- '''17. <<Anchor(17)>> Any other troubleshooting pointers for me?'''
- See the troubleshooting section in the HBase book  http://hbase.apache.org/book.html#trouble
- Also, see the [[http://wiki.apache.org/hadoop/Hbase/Troubleshooting|Troubleshooting]] page.
- '''18. <<Anchor(18)>> Are there any Schema Design examples?'''
- See [[http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies|HBase Schema
Design -- Case Studies]] by Evan(Qingyan) Liu or the following text taken from Jonathan Gray's
mailing list posts.
- - There's a very big difference between storage of relational/row-oriented databases and
column-oriented databases. For example, if I have a table of 'users' and I need to store friendships
between these users... In a relational database my design is something like:
- Table: users(pkey = userid) Table: friendships(userid,friendid,...) which contains one (or
maybe two depending on how it's implemented) row for each friendship.
- In order to look up a given user's friends, SELECT * FROM friendships WHERE userid = 'myid';
- The cost of this relational query continues to increase as a user adds more friends. You
also begin to have practical limits. If I have millions of users, each with many thousands
of potential friends, the size of these indexes grows very quickly and things get nasty quickly.
Rather than friendships, imagine I'm storing activity logs of actions taken by users.
- In a column-oriented database these things scale continuously with minimal difference between
10 users and 10,000,000 users, 10 friendships and 10,000 friendships.
- Rather than a friendships table, you could just have a friendships column family in the
users table. Each column in that family would contain the ID of a friend. The value could
store anything else you would have stored in the friendships table in the relational model.
As column families are stored together/sequentially on a per-row basis, reading a user with
1 friend versus a user with 10,000 friends is virtually the same. The biggest difference is
just in the shipping of this information across the network which is unavoidable. In this
system a user could have 10,000,000 friends. In a relational database the size of the friendship
table would grow massively and the indexes would be out of control.
- '''Q: Can you please provide an example of "good de-normalization" in HBase and how it's
held consistent (in your friends example in a relational db, there would be a cascadingDelete)?
As I think of the users table: if I delete a user with the userid='123', do I have to walk
through all of the other users' column-family "friends" to guarantee consistency?! Is de-normalization
in HBase only used to avoid joins? Our webapp doesn't use joins at the moment anyway.'''
- You lose any concept of foreign keys. You have a primary key, that's it. No
secondary keys/indexes, no foreign keys.
- It's the responsibility of your application to handle something like deleting a friend and
cascading to the friendships. Again, typical small web apps are far simpler to write using
SQL; with HBase you become responsible for some of the things that were once handled for you.
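To illustrate what "the responsibility of your application" can look like, here is a minimal in-memory sketch that uses plain Python dicts in place of an HBase table (row key, then column family, then qualifier). It is not HBase client code, and it assumes that each friendship is written in both directions; under that assumption the deleted row's own ''friends'' family already lists every row that needs cleaning up, so no full-table walk is required. The function name `delete_user` is hypothetical.

```python
# Toy model of the users table: row key -> column family -> qualifier -> value.
# Assumption: every friendship is stored on both users' rows.
users = {
    "123": {"friends": {"456": b"", "789": b""}},
    "456": {"friends": {"123": b""}},
    "789": {"friends": {"123": b"", "456": b""}},
}


def delete_user(table, userid):
    """Delete a user row and every friendship column that points at it."""
    row = table.pop(userid, None)
    if row is None:
        return  # nothing to delete
    # The application performs the "cascading delete" itself: visit only
    # the rows named in the deleted user's own friends family.
    for friend_id in row.get("friends", {}):
        friend_row = table.get(friend_id)
        if friend_row is not None:
            friend_row.get("friends", {}).pop(userid, None)


delete_user(users, "123")
print(sorted(users))
```

If friendships were stored in one direction only, the application would instead have to scan every row, which is the situation the question worries about.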
- Another example of "good denormalization" would be something like storing a user's "favorite
pages". Suppose we want to query this data in two ways: for a given user, all of his favorites;
or, for a given favorite, all of the users who have it as a favorite. A relational database
would probably have tables for users, favorites, and userfavorites. Each link would be stored
in one row in the userfavorites table. We would have indexes on both 'userid' and 'favoriteid'
and could thus query it in both ways described above. In HBase we'd probably put a column
in both the users table and the favorites table; there would be no link table.
- That would be a very efficient query in both architectures, with relational performing
much better with small datasets but less so with a large dataset.
- Now consider asking for the favorites of 10 users at once. That starts to get tricky in HBase and will
undoubtedly suffer worse from random reading. The flexibility of SQL allows us to just ask
the database for the answer to that question. In a
small dataset it will come up with a decent solution, and return the results to you in a
matter of milliseconds. Now let's make that userfavorites table a few billion rows, and the
number of users you're asking for a couple thousand. The query planner will come up with something
but things will fall down and it will end up taking forever. The worst problem will be in
the index bloat. Insertions to this link table will start to take a very long time. HBase
will perform virtually the same as it did on the small table, if not better because of superior
region distribution.
- '''Q:[Michael Dagaev] How would you design an Hbase table for many-to-many association between
two entities, for example Student and Course?'''
- I would define two tables:
- Student: student id student data (name, address, ...) courses (use course ids as column
qualifiers here)
- Course: course id course data (name, syllabus, ...) students (use student ids as column
qualifiers here)
- Does it make sense? 
- A[Jonathan Gray] : 
- Your design does make sense.
- As you said, you'd probably have two column-families in each of the Student and Course tables.
One for the data, another with a column per student or course.
- For example, a student row might look like:
- Student :
- id/row/key = 1001 
- data:name = Student Name 
- data:address = 123 ABC St 
- courses:2001 = (If you need more information about this association, for example, if they
are on the waiting list) 
- courses:2002 = ...
- This schema gives you fast access for both queries: all classes for a student (Student
table, courses family) and all students for a class (Course table, students family).
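The two-table design discussed above can be sketched with nested Python dicts standing in for the Student and Course tables. This is an in-memory illustration only, not HBase client code; the helper names (`enroll`, `courses_of`, `students_of`) and the "waiting list" value are hypothetical. The IDs 1001, 2001 and 2002 follow the example row above.

```python
# Each table maps row key -> column family -> qualifier -> value.
students = {}
courses = {}


def enroll(student_id, course_id, note=b""):
    """Record the association on both sides, as the schema requires."""
    s_row = students.setdefault(student_id, {"data": {}, "courses": {}})
    c_row = courses.setdefault(course_id, {"data": {}, "students": {}})
    # The qualifier is the other entity's id; the value can carry extra
    # information about the association.
    s_row["courses"][course_id] = note
    c_row["students"][student_id] = note


def courses_of(student_id):
    """All classes for a student: one row read from the Student table."""
    return sorted(students.get(student_id, {}).get("courses", {}))


def students_of(course_id):
    """All students for a class: one row read from the Course table."""
    return sorted(courses.get(course_id, {}).get("students", {}))


enroll("1001", "2001", b"waiting list")
enroll("1001", "2002")
enroll("1002", "2001")
print(courses_of("1001"), students_of("2001"))
```

Each query touches a single row, which is the point of storing the association twice; the cost is that every write has to update both tables.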
- '''19. <<Anchor(19)>> How do I add/remove a node?'''
- For removing nodes, see the section on decommissioning nodes in the HBase Book http://hbase.apache.org/book.html#decommission
- Adding and removing nodes works the same way in HBase and Hadoop. To add a new node, do
the following steps:
-  1. Edit $HBASE_HOME/conf/regionservers on the Master node and add the new address.
-  2. Set up the new node with the needed software and permissions.
-  3. On that node run $HBASE_HOME/bin/hbase-daemon.sh start regionserver
-  4. Confirm it worked by looking at the Master's web UI or in that region server's log.
- Removing a node is just as easy: first issue "stop" instead of "start", then remove the address
from the regionservers file.
- For Hadoop, use the same kind of script (its name starts with hadoop-*) with the Hadoop process names (datanode,
tasktracker), and edit the slaves file. Removing datanodes is tricky; please review the dfsadmin
command before doing it.
- '''20. <<Anchor(20)>> Why do servers have start codes?'''
- If a lease is identified only by an IP address and port number, a region server that crashes
and recovers cannot be given work, and so cannot make any progress, until its old lease
times out. A start code is added so that the restarted server
can begin doing work immediately upon recovery. For more, see https://issues.apache.org/jira/browse/HBASE-1156.
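The effect of the start code can be shown with a toy model. The class below is a hypothetical stand-in for the master's lease bookkeeping, not HBase's actual implementation; the address, port and start code values are made up for illustration.

```python
class LeaseTable:
    """Toy stand-in for the master's view of region server leases."""

    def __init__(self):
        self.active = set()  # identities whose lease has not timed out

    def report_for_duty(self, identity):
        """Return True if this identity can be given work right away."""
        if identity in self.active:
            # An unexpired lease already exists under this identity, so the
            # newcomer must wait for it to time out.
            return False
        self.active.add(identity)
        return True


# Identity = (address, port): the restarted server collides with its old lease.
without_codes = LeaseTable()
print(without_codes.report_for_duty(("10.0.0.5", 60020)))  # first start
print(without_codes.report_for_duty(("10.0.0.5", 60020)))  # after a crash

# Identity = (address, port, start code): the restart looks like a new server.
with_codes = LeaseTable()
print(with_codes.report_for_duty(("10.0.0.5", 60020, 1)))  # first start
print(with_codes.report_for_duty(("10.0.0.5", 60020, 2)))  # after a crash
```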
- '''21. <<Anchor(21)>> What is the maximum recommended cell size?'''
- A rough rule of thumb, with little empirical validation, is to keep the data in HDFS and
store pointers to the data in HBase if you expect the cell size to be consistently above 10
MB. If you do expect large cell values and you still plan to use HBase for the storage of
cell contents, you'll want to increase the block size and the maximum region size for the
table to keep the index size reasonable and the split frequency acceptable.
- '''22. <<Anchor(22)>> Why can't I iterate through the rows of a table in reverse order?'''
- Because of the way [[http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/io/hfile/HFile.html|HFile]]
works: for efficiency, column values are put on disk with the length of the value written
first and then the bytes of the actual value written second. To navigate through these values
in reverse order, these length values would need to be stored twice (at the end as well) or
in a side file. A robust secondary index implementation is the likely solution here to ensure
the primary use case remains fast.
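The argument above can be made concrete with a small sketch. It is not the real HFile layout: it assumes a simplified format in which each value is preceded by a 4-byte big-endian length. Forward iteration works with the length written first only; walking backwards becomes possible once the length is also written after the value, which is the "stored twice" cost mentioned above. All function names are hypothetical.

```python
import struct


def encode(values):
    """Length first, then the bytes of the value."""
    return b"".join(struct.pack(">I", len(v)) + v for v in values)


def iter_forward(buf):
    """Read each length, then skip exactly that many bytes."""
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        yield buf[pos:pos + length]
        pos += length


def encode_both_ends(values):
    """Store the length twice: before and after each value."""
    out = []
    for v in values:
        n = struct.pack(">I", len(v))
        out.append(n + v + n)
    return b"".join(out)


def iter_reverse(buf):
    """Walk backwards using the trailing copy of each length."""
    pos = len(buf)
    while pos > 0:
        (length,) = struct.unpack_from(">I", buf, pos - 4)
        start = pos - 4 - length      # where the value bytes begin
        yield buf[start:pos - 4]
        pos = start - 4               # step over the leading length


values = [b"a", b"bcd", b"ef"]
print(list(iter_forward(encode(values))))
print(list(iter_reverse(encode_both_ends(values))))
```

With the single-length encoding there is no reliable way to find where the last value starts from the end of the buffer, because the bytes just before the end belong to the value itself.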
