hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject RE: Few questions about map reduce in Hbase
Date Sun, 16 Nov 2008 20:34:16 GMT
> Hi,
> I am new to Hadoop and Hbase. I am trying to understand how to use map
> reduce with Hbase as source and sink and had the following questions. I
> would appreciate it if someone could answer them and maybe point me to
> some sample code:
> -- As far as I understood, the tables get stored in different regions in
> Hbase which are split across various nodes in HDFS. Is there a way to
> control the amount of replication of a particular table?

The regions are split across the different region servers.  The contents of
each region are made up of many different files/blocks, which are then
replicated across the nodes of HDFS.  Replication is set in HDFS; HBase has
no concept of replication.  Therefore it's not possible (as far as I know)
to set per-table replication levels.  If HDFS allowed per-directory
replication settings, then this might be possible, but I'm unsure whether it
does; I think it is a global setting.

> -- When we try to use a table scanner, it automatically switches between
> various regions of a table which may be present across different nodes
> and returns us the row handle. So it is a single process doing that. Am I
> correct?

The META table (which is stored in regions/on regionservers like any other
table) contains the start/end keys and node locations of the regions of all
other tables.  A scanner starts with the region which includes your
startrow (the first region of the table if no startrow is given).  Once it
reaches the end of the current region, the client uses the META information
to find the next region.  The scanner then continues in that region, which
might be on a different node.
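
As a rough illustration of the lookup described above, here is a toy model
in plain Python.  It is not HBase code: META, REGION_ROWS, locate_region and
scan are made-up names, and the sorted list of start keys only stands in for
what the real META table holds.

```python
from bisect import bisect_right

# Hypothetical stand-in for META: (start_key, server), sorted by start key.
# An empty start key marks the first region of the table.
META = [("", "server-a"), ("g", "server-b"), ("p", "server-c")]

# Hypothetical row storage, keyed by the server hosting each region.
REGION_ROWS = {
    "server-a": {"apple": 1, "cherry": 2},
    "server-b": {"grape": 3, "mango": 4},
    "server-c": {"peach": 5, "plum": 6},
}


def locate_region(row_key):
    """Return the index of the region whose key range contains row_key."""
    starts = [start for start, _ in META]
    return bisect_right(starts, row_key) - 1


def scan(start_row=""):
    """Yield (row, value, server) in key order, moving region to region."""
    index = locate_region(start_row)
    while index < len(META):
        _, server = META[index]
        rows = REGION_ROWS[server]
        for key in sorted(rows):
            if key >= start_row:
                yield key, rows[key], server
        index += 1  # end of this region: consult META for the next one


results = list(scan("h"))
```

Here scan("h") starts in the region hosted on server-b, because "h" falls
between the start keys "g" and "p", and then carries on into server-c.  It is
still a single process walking the table one region at a time.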

> -- When we use TableMap to run map reduce jobs on Hbase, it automatically
> creates several map jobs, i.e. one per region, and performs the map
> operation on the key range of that particular region. So if I use a table
> scanner inside a map job, will I still be iterating through only the row
> ranges of that particular region, or again the whole table?

If you're using HTable.getScanner within an MR job, it will have the same
behavior as anywhere else.  You will be iterating through the whole table.

> -- What is the best way to iterate through all the rows for a particular
> region in a map job? This may be required to perform a select operation
> in parallel.

That is exactly what you are doing by using TableMap as the input to the MR
job.  Each map task is a scanner through a single region.  You do not need
to create a scanner within the map().  There will be one call to map() for
each row in that region, and one task for each region in the table.
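
To make the task/row relationship concrete, here is a small sketch in plain
Python rather than the HBase API.  REGIONS, map_row and run_map_task are
hypothetical names, the "value > 4" predicate is just an example select, and
the tasks run one after another here where a real job would run them in
parallel.

```python
# Hypothetical table: each inner dict stands for one region's rows.
REGIONS = [
    {"apple": 3, "cherry": 8},
    {"grape": 5, "mango": 12},
    {"peach": 9},
]


def map_row(row_key, value, output):
    """Called once per row; keeps rows that match the select predicate."""
    if value > 4:
        output.append((row_key, value))


def run_map_task(region_rows):
    """One task: the framework scans a single region and feeds map_row."""
    output = []
    for row_key in sorted(region_rows):
        map_row(row_key, region_rows[row_key], output)
    return output


# One task per region; no scanner is opened inside map_row itself.
task_outputs = [run_map_task(region) for region in REGIONS]
selected = [pair for out in task_outputs for pair in out]
```

The point is that map_row never opens its own scanner: the per-region
iteration belongs to the task, and the map function only sees one row at a
time.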

> Sorry for the long email. Many of the questions may be basic. I would
> appreciate it if someone could answer them. Also, any suggestions for
> implementing joins using map reduce on hbase?
> Thanks

Can you be more specific?  HBase is not typically meant for joining data,
though there are certainly plenty of valid cases for doing so.  You may be
able to get around it with better structuring of your data (denormalization
is your friend).  Otherwise, it's certainly possible to do with MR,
depending on the specifics.
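
For what it's worth, one common shape for an MR join is a reduce-side join:
each mapper tags its rows with the table they came from and emits them under
the join key, and the reducer pairs them up.  The sketch below is plain
Python, not HBase or Hadoop code, and every name and row in it (USERS,
ORDERS, shuffle, reduce_join) is invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows from two tables that share a user id.
USERS = {"u1": "alice", "u2": "bob"}
ORDERS = [("o1", "u1", 30), ("o2", "u2", 15), ("o3", "u1", 7)]


def map_users():
    """Emit (join_key, tagged_value) for every user row."""
    for user_id, name in USERS.items():
        yield user_id, ("user", name)


def map_orders():
    """Emit (join_key, tagged_value) for every order row."""
    for order_id, user_id, amount in ORDERS:
        yield user_id, ("order", (order_id, amount))


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(user_id, values):
    """Pair every user record with every order record for the same key."""
    names = [v for tag, v in values if tag == "user"]
    orders = [v for tag, v in values if tag == "order"]
    return [(user_id, name, oid, amt) for name in names for oid, amt in orders]


groups = shuffle(list(map_users()) + list(map_orders()))
joined = [row for key in sorted(groups) for row in reduce_join(key, groups[key])]
```

With denormalized data the user name would simply be stored alongside each
order row, and none of this would be needed.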

Hope that helps.  Don't hesitate to ask more questions; that's what the list
is for.  But don't forget to read the HBase Architecture docs and the other
wiki pages, and to search the mailing list archives as well.

Jonathan Gray
