hbase-user mailing list archives

From "Jonathan Gray" <jl...@streamy.com>
Subject RE: Few questions about map reduce in Hbase
Date Sun, 16 Nov 2008 22:20:06 GMT
You get two sorted maps in HBase.  The first is keyed by row, the second by
column within each family.  Beyond those two "indexes" you would have to
build something separately.  Others have done things internally, but I have
no experience with that.  For our uses, part of which is extensive merging
and joining, we have opted to do that work in an external, in-memory,
custom-built application.
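
To make the "two sorted maps" picture concrete, here is a rough sketch in
plain Java (no HBase API, just the shape of the data model); anything keyed
by value, i.e. a secondary index, would be a third structure you maintain
yourself:

import java.util.SortedMap;
import java.util.TreeMap;

public class TwoLevelMapSketch {
  public static void main(String[] args) {
    // Row key -> (column -> value); both levels stay sorted.
    SortedMap<String, SortedMap<String, String>> table =
        new TreeMap<String, SortedMap<String, String>>();
    SortedMap<String, String> row = new TreeMap<String, String>();
    row.put("info:name", "foo");
    table.put("row1", row);

    // A secondary index over *values* is separate and hand-maintained,
    // e.g. value -> row key.
    SortedMap<String, String> valueIndex =
        new TreeMap<String, String>();
    valueIndex.put("foo", "row1");
  }
}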

If you only need to fetch by row and then column (not by value), you can
already do that relatively fast in HBase.  This use case, a random seek, is
a major goal for performance improvement slated for 0.20.
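
For example, a point read by row and column in the HBase client API of this
era looks roughly like the sketch below (exact class names and method
signatures shift between releases, so treat them as assumptions):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.Cell;
import org.apache.hadoop.hbase.util.Bytes;

public class PointLookupSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    // Random read: known row key, known column, no scan involved.
    Cell cell = table.get(Bytes.toBytes("row1"),
        Bytes.toBytes("family:qualifier"));
    if (cell != null) {
      System.out.println(Bytes.toString(cell.getValue()));
    }
  }
}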

TableMap will do just that, if I understand what you're asking.  You want a
map task per region?  That's what TableMap does.
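
For reference, wiring up such a job looks roughly like the sketch below.  The
helper class and method names (TableMapReduceUtil.initTableMapJob, and
IdentityTableMap as a trivial mapper) follow the org.apache.hadoop.hbase.mapred
API of this era, but exact names vary between releases, so treat them as
assumptions:

import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.mapred.IdentityTableMap;
import org.apache.hadoop.hbase.mapred.TableMapReduceUtil;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NullOutputFormat;

public class PerRegionScanJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(PerRegionScanJobSketch.class);
    job.setJobName("per-region-scan");
    // One map task is created per region of "mytable"; each task
    // scans only the rows of its own region.
    TableMapReduceUtil.initTableMapJob("mytable", "family:qualifier",
        IdentityTableMap.class, ImmutableBytesWritable.class,
        RowResult.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormat(NullOutputFormat.class);
    JobClient.runJob(job);
  }
}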

Are you saying you already know which region a column exists in, and you
only want to look within that region?  Seems like an odd use-case, but you
can of course provide a startRow to your scanner which you could set to the
startRow of the region you know to look in.  If you actually know the row,
this is exactly what you should be doing (you do not need to scan to find a
particular row).
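
A minimal sketch of that kind of bounded scan, again using the client API of
this era (names are assumptions and the region start/end keys are made up):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scanner;
import org.apache.hadoop.hbase.io.RowResult;
import org.apache.hadoop.hbase.util.Bytes;

public class BoundedScanSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "mytable");
    byte[] regionStart = Bytes.toBytes("row0500");  // known region start key
    byte[] regionEnd = Bytes.toBytes("row1000");    // known region end key
    Scanner scanner = table.getScanner(
        new byte[][] { Bytes.toBytes("family:qualifier") }, regionStart);
    try {
      RowResult row;
      while ((row = scanner.next()) != null) {
        // Stop once we pass the end of the region we care about.
        if (Bytes.compareTo(row.getRow(), regionEnd) >= 0) {
          break;
        }
        // ... process row ...
      }
    } finally {
      scanner.close();
    }
  }
}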

I still don't really understand the exact requirements for your joining, but
whatever the specifics, there are no built-in methods for joining, so I'm not
sure how else I can help you.  Do you need to do this joining in "realtime"
or in batch?

JG


> -----Original Message-----
> From: Nishant Khurana [mailto:nishant2984@gmail.com]
> Sent: Sunday, November 16, 2008 1:10 PM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Few questions about map reduce in Hbase
> 
> Hi Jonathan,
> Thanks for your reply. That made things a lot clearer to me. But there are
> more questions :)
> -- What is the best way to build an index over a field in HBase? Do I have
> to build it in a custom way and store it on HDFS? If I have a query (not in
> HQL) that selects on two fields, one of which is the row id and the other
> some other column, I can easily figure out which regions the row id belongs
> to, and if I have an index on the other column too, that can give me
> another set of regions which I can intersect to get the final one.
> -- Is there a way I can iterate through all the rows of a particular region
> only (for a particular relation)? TableMap will do it for all regions, if I
> am not wrong.
> 
> Well, I would need joins if, say, individual relations are managed
> independently and people would like to see data from all the relations for
> a query. It's like sharing of data amongst a topic-specific community. So
> what would be the best way to implement sort-merge joins? I hope that would
> be the easiest to start with.
> 
> Thanks for the response Jonathan.
> 
> On Sun, Nov 16, 2008 at 3:34 PM, Jonathan Gray <jlist@streamy.com>
> wrote:
> 
> > > Hi,
> > > I am new to Hadoop and HBase. I am trying to understand how to use
> > > MapReduce with HBase as source and sink, and had the following
> > > questions. I would appreciate it if someone could answer them and maybe
> > > point me to some sample code:
> > >
> > > -- As far as I understood, tables get stored in different regions in
> > > HBase, which are split across various nodes in HDFS. Is there a way to
> > > control the amount of replication of a particular table?
> >
> > The regions are split across the different region servers; the contents of
> > each region are made up of many different files/blocks which are then
> > replicated across the nodes of HDFS.  Replication is set in HDFS; HBase has
> > no concept of replication.  Therefore it's not possible (as far as I know)
> > to set per-table replication levels.  If it were possible to set
> > per-directory replication settings in HDFS, then this might work, but I'm
> > unsure whether that is possible; I think it is a global setting.
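> >
> > If you wanted to experiment anyway, HDFS does expose a per-file (not
> > per-directory) replication setter.  Something like the sketch below could
> > raise replication on the files currently under a table's directory, but
> > HBase will not maintain it: newly flushed or compacted files fall back to
> > the dfs.replication default.  The /hbase path and the behavior described
> > are assumptions.
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileStatus;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> >
> > public class TableReplicationSketch {
> >   public static void main(String[] args) throws Exception {
> >     FileSystem fs = FileSystem.get(new Configuration());
> >     // Assumed layout: table data lives under the HBase root dir.
> >     setReplication(fs, new Path("/hbase/mytable"), (short) 5);
> >   }
> >
> >   private static void setReplication(FileSystem fs, Path dir, short rep)
> >       throws Exception {
> >     for (FileStatus stat : fs.listStatus(dir)) {
> >       if (stat.isDir()) {
> >         setReplication(fs, stat.getPath(), rep);
> >       } else {
> >         // Per-file only; new files still use the global default.
> >         fs.setReplication(stat.getPath(), rep);
> >       }
> >     }
> >   }
> > }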
> >
> >
> > > -- When we try to use a table scanner, it automatically switches between
> > > various regions of a table, which may be present across different nodes,
> > > and returns us the row handle. So it is a single process doing that. Am I
> > > correct?
> >
> > The META table (which is stored in regions/on regionservers like any other
> > table) contains the start/end keys and node locations of all other tables
> > and their regions.  When using a scanner, it will start with the region
> > which includes your startrow (the first region of the table if no startrow
> > is given), and once you have reached the end of the current region, you
> > will use META information to find the next region.  Your scanner will then
> > continue in that region, which might be on a different node.
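> >
> > If you are curious where those boundaries are for a given table, the
> > client exposes the region start keys (the method name here is an
> > assumption; on versions around this one it is getStartKeys()/getEndKeys()
> > on HTable):
> >
> > import org.apache.hadoop.hbase.HBaseConfiguration;
> > import org.apache.hadoop.hbase.client.HTable;
> > import org.apache.hadoop.hbase.util.Bytes;
> >
> > public class RegionBoundarySketch {
> >   public static void main(String[] args) throws Exception {
> >     HTable table = new HTable(new HBaseConfiguration(), "mytable");
> >     // Each start key marks where a region begins; under the hood this
> >     // comes from the META table that the scanner also consults.
> >     for (byte[] startKey : table.getStartKeys()) {
> >       System.out.println(Bytes.toString(startKey));
> >     }
> >   }
> > }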
> >
> >
> > > -- When we use TableMap to run MapReduce jobs on HBase, it automatically
> > > creates several map tasks, i.e. one per region, and performs the map
> > > operation on the key range of that particular region. So if I use a table
> > > scanner inside a map task, will I still be iterating through only the row
> > > range of that particular region, or again the whole table?
> >
> > If you're using HTable.getScanner within an MR job, it will have the same
> > behavior as anywhere else.  You will be iterating through the whole table.
> >
> >
> > > -- What is the best way if I want to iterate through all the rows of a
> > > particular region in a map task? This may be required to perform a
> > > select operation in parallel.
> >
> > That is exactly what you are doing by using TableMap as the input to the
> > MR job.  Each map task is a scanner through a single region.  You do not
> > need to create a scanner within the map().  There will be a call to map()
> > for each row in that region, and a task for each region in the table.
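> >
> > A bare-bones mapper for that pattern might look like this (interface and
> > type names follow the org.apache.hadoop.hbase.mapred API of this era and
> > are assumptions; the counter is just a stand-in for real per-row work):
> >
> > import java.io.IOException;
> > import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> > import org.apache.hadoop.hbase.io.RowResult;
> > import org.apache.hadoop.hbase.mapred.TableMap;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapred.MapReduceBase;
> > import org.apache.hadoop.mapred.OutputCollector;
> > import org.apache.hadoop.mapred.Reporter;
> >
> > public class RowCountingMap extends MapReduceBase
> >     implements TableMap<Text, Text> {
> >   // Called once per row of the region assigned to this map task;
> >   // no explicit scanner is needed inside the mapper.
> >   public void map(ImmutableBytesWritable key, RowResult value,
> >       OutputCollector<Text, Text> output, Reporter reporter)
> >       throws IOException {
> >     reporter.incrCounter("scan", "rows", 1);
> >   }
> > }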
> >
> >
> > > Sorry for the long email. Many of the questions may be basic. I would
> > > appreciate it if someone could answer them. Also, any suggestions on
> > > implementing joins using MapReduce on HBase would be welcome.
> > > Thanks
> >
> > Can you be more specific?  HBase is not typically meant for joining data,
> > though there are certainly plenty of valid cases for doing so.  You may be
> > able to get around it with better structuring of your data (denormalization
> > is your friend); otherwise it's certainly possible to do with MR, depending
> > on the specifics.
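> >
> > As one very rough illustration of the MR route (a standard reduce-side
> > join, nothing HBase-specific, and every name below is hypothetical): have
> > one map pass per relation emit (join key, tagged row), then combine the
> > tagged groups in a reducer like this sketch:
> >
> > import java.io.IOException;
> > import java.util.ArrayList;
> > import java.util.Iterator;
> > import java.util.List;
> > import org.apache.hadoop.io.Text;
> > import org.apache.hadoop.mapred.MapReduceBase;
> > import org.apache.hadoop.mapred.OutputCollector;
> > import org.apache.hadoop.mapred.Reducer;
> > import org.apache.hadoop.mapred.Reporter;
> >
> > // Map output values arrive as "A|..." or "B|..." depending on which
> > // relation the map side tagged them with.
> > public class JoinReduceSketch extends MapReduceBase
> >     implements Reducer<Text, Text, Text, Text> {
> >   public void reduce(Text joinKey, Iterator<Text> values,
> >       OutputCollector<Text, Text> output, Reporter reporter)
> >       throws IOException {
> >     List<String> left = new ArrayList<String>();
> >     List<String> right = new ArrayList<String>();
> >     while (values.hasNext()) {
> >       String v = values.next().toString();
> >       if (v.startsWith("A|")) {
> >         left.add(v.substring(2));
> >       } else {
> >         right.add(v.substring(2));
> >       }
> >     }
> >     // Emit the cross product of matching rows from the two relations.
> >     for (String l : left) {
> >       for (String r : right) {
> >         output.collect(joinKey, new Text(l + "\t" + r));
> >       }
> >     }
> >   }
> > }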
> >
> > Hope that helps.  Don't hesitate to ask more questions, that's what the
> > list is for, but don't forget to read the HBase Architecture docs, the
> > other wiki pages, and to search the mailing list archives as well.
> >
> > Jonathan Gray
> >
> >
> 
> 
> --
> Nishant Khurana
> Candidate for Masters in Engineering (Dec 2009)
> Computer and Information Science
> School of Engineering and Applied Science
> University of Pennsylvania

