hbase-user mailing list archives

From Andrew Purtell <apurt...@yahoo.com>
Subject Re: The HBase Basics
Date Tue, 26 Aug 2008 17:22:15 GMT
Hello Charles.

> From: Charles Mason <charlie.mas@gmail.com>
> Subject: The HBase Basics
> To: hbase-user@hadoop.apache.org
> Date: Tuesday, August 26, 2008, 9:32 AM
> Firstly in an ideal environment where you deploy HBase on
> an HDFS cluster do you have two separate clusters, one for
> the HDFS nodes and one for the HBase nodes, or in reality
> does each cluster node run the HDFS software as well as
> HBase? Are there any benefits to doing that?

Running the HDFS data nodes and HBase region servers
together is good for data locality (anecdotal, not
formally proven). It also spreads load better: if you
split your cluster down the middle and run HDFS on one
half and HBase on the other, each service gets only half
the machines. Without the split you effectively have
twice as many data nodes and twice as many region
servers.
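
To make the counting concrete, here is a small sketch in
Python. The helper name and the 10-machine cluster are made
up for illustration, and it assumes one daemon of each kind
per host.

```python
def daemon_counts(machines, colocated):
    """Return (data_nodes, region_servers) for a cluster of `machines` hosts.

    Hypothetical helper: assumes one daemon of each kind per host when
    colocated, and an even split of hosts between HDFS and HBase otherwise.
    """
    if colocated:
        # Every host runs both an HDFS data node and an HBase region server.
        return machines, machines
    # Split the hosts down the middle: one half HDFS, the other half HBase.
    hdfs_hosts = machines // 2
    hbase_hosts = machines - hdfs_hosts
    return hdfs_hosts, hbase_hosts


if __name__ == "__main__":
    print(daemon_counts(10, colocated=True))   # (10, 10)
    print(daemon_counts(10, colocated=False))  # (5, 5)
```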

In my experience what brings down Hadoop/HBase clusters
is the load caused by mapreduce tasks. The first failures
seen typically occur at the DFS layer, but if you 
attempt to run too many maps and/or reduces for your
available resources, many threads are starved and 
communication on the cluster breaks down, and so failures
can occur at any layer. In this case it may make sense
to bring more resources to the table *and* partition
them so that the mapreduce subsystem (and its map and/or
reduce child processes) runs on nodes separate from HDFS
and HBase.

> Secondly, if you access HBase through the Thrift API how
> does its server relate to the HBase cluster?

The Thrift server runs as an HBase client, in effect a
gateway. I think it is common practice to run one on each
client machine. Each gateway then communicates with the
HBase cluster via HRPC on behalf of your application, so
no single gateway becomes (much of) a bottleneck. So you
should run an HBase Thrift server on each of your web
servers, and your application should interact with the
local instance.

> Thirdly just to be clear, HBase doesn't support
> secondary indexes.


Typically this is handled through denormalization/data
duplication. I explain it to my colleagues as "insert-time

* - See https://issues.apache.org/jira/browse/HBASE-270 on
the topic of building Lucene indexes from HBase tables.

See also http://markmail.org/message/e23aoyfhunsrsrda
You may wish to get in contact with Ning Li or Jun Rao at
IBM Almaden. 
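
As a rough illustration of handling this through
denormalization, the sketch below writes a second "index"
row alongside every primary row, so a lookup by a non-key
column becomes a row-key prefix scan. Plain Python dicts
stand in for HBase tables; the table names, the
"email/user id" row-key layout and the helper functions are
all assumptions made for the example, not HBase API.

```python
# Plain dicts stand in for two HBase tables; nothing here is HBase API.
users = {}            # primary table: row key = user id
users_by_email = {}   # index table: row key = email + "/" + user id


def index_key(email, user_id):
    """Build the (hypothetical) index row key: indexed value first, then id."""
    return f"{email}/{user_id}"


def put_user(user_id, email, name):
    """Write the primary row and its index row together at insert time."""
    old = users.get(user_id)
    if old is not None and old["email"] != email:
        # Drop the stale index row so lookups do not return old data.
        users_by_email.pop(index_key(old["email"], user_id), None)
    users[user_id] = {"email": email, "name": name}
    users_by_email[index_key(email, user_id)] = user_id


def find_by_email(email):
    """Emulate a row-key prefix scan over the index table.

    HBase keeps rows sorted by row key, so all index rows for one email
    are adjacent; sorting the dict items imitates that ordering.
    """
    prefix = email + "/"
    return [users[uid]
            for key, uid in sorted(users_by_email.items())
            if key.startswith(prefix)]
```

The cost is the one named above: every write touches two
tables, and the application has to keep them consistent
itself.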

> Thanks in advance.

Hope this helps,

   - Andy

