hbase-user mailing list archives

From Amandeep Khurana <ama...@gmail.com>
Subject Re: Question about MapReduce
Date Fri, 16 Oct 2009 00:42:39 GMT
On Thu, Oct 15, 2009 at 5:31 PM, Something Something <luckyguy2050@yahoo.com> wrote:

> Kevin - Interesting way to solve the problem, but I don't think this
> solution is bullet-proof.  While the MapReduce is running, someone may
> modify the "flag" and that will completely change the outcome - unless of
> course there's some way in HBase to lock this column.
>
> Amandeep - I was hoping to do this without writing to a flat file since the
> data I need is already in memory.  Also, I am not sure what you mean by "Why
> are you using HBase?"  I am using it because that's where the data I need
> for calculations is.
>

1. Why are you using HBase? HBase isn't a "one size fits all" solution. You
might be complicating your job by using HBase for certain tasks. I'm not
saying that you are in this case, but you might be. That's why I asked the
rationale behind it.

2. Writing to a flat file isn't a bad idea at all. When you need intermediate
values, I don't see any harm in writing them to a flat file and processing
them after that.
You can also look at the Cascading project. I haven't used it myself, but it
has ways in which you can define data flows and can probably do something like
what you are looking to do. (They use intermediate temporary lists too,
just that you won't see it explicitly.)
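
If it helps, chaining two jobs through a flat intermediate file is pretty
mechanical. A rough, untested sketch -- ComputeMapper, ProcessMapper and the
/tmp path are just placeholders for whatever your real classes and locations
would be:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path intermediate = new Path("/tmp/intermediate");     // placeholder location

    // Stage 1: compute the intermediate values and write them to a flat file.
    Job first = new Job(conf, "compute intermediate values");
    first.setJarByClass(TwoStageDriver.class);
    first.setMapperClass(ComputeMapper.class);              // placeholder mapper
    first.setOutputKeyClass(Text.class);
    first.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(first, new Path(args[0]));
    FileOutputFormat.setOutputPath(first, intermediate);
    if (!first.waitForCompletion(true)) System.exit(1);

    // Stage 2: read the flat file back and do the follow-on processing.
    Job second = new Job(conf, "process intermediate values");
    second.setJarByClass(TwoStageDriver.class);
    second.setMapperClass(ProcessMapper.class);             // placeholder mapper
    second.setOutputKeyClass(Text.class);
    second.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(second, intermediate);
    FileOutputFormat.setOutputPath(second, new Path(args[1]));
    System.exit(second.waitForCompletion(true) ? 0 : 1);
  }
}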



> ________________________________
> From: Kevin Peterson <kpeterson@biz360.com>
> To: general@hadoop.apache.org
> Sent: Thu, October 15, 2009 2:44:58 PM
> Subject: Re: Question about MapReduce
>
> On Thu, Oct 15, 2009 at 2:20 PM, Something Something <luckyguy2050@yahoo.com> wrote:
>
> > I have 3 HTables.... Table1, Table2 & Table3.
> > I have 3 different flat files.  One contains keys for Table1, 2nd contains
> > keys for Table2 & 3rd contains keys for Table3.
> >
> > Use case:  For every combination of these 3 keys, I need to perform some
> > complex calculation and save the result in another HTable.  In other words,
> > I need to calculate values for the following combos:
> >
> > (1,1,1) (1,1,2).......   (1,1,N) (1,2,1) (1,3,1) & so on....
> >
> > So I figured the best way to do this is to start a MapReduce Job for each
> > of these combinations.  The MapReduce will get (Key1, Key2, Key3) as input,
> > then read Table1, Table2 & Table3 with these keys and perform the
> > calculations.  Is this the correct approach?  If it is, I need to pass Key1,
> > Key2 & Key3 to the Mapper & Reducer.  What's the best way to do this?
> >
> So you need the Cartesian product of all these files. My recommendation:
>
> Run three jobs which each read one of these files and set a flag in the row
> of the appropriate table. This way, you don't need the files at all, you
> just read some "flag:active" column in the tables.
>
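Something along these lines would set the flags (untested sketch; it assumes
each table already has a "flag" column family, and it uses a plain client loop
rather than the map-only job Kevin describes, which is fine for small key
files):

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class SetFlags {
  // args[0] = flat file of row keys, args[1] = table name, e.g. "Table1"
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), args[1]);
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String key;
    while ((key = in.readLine()) != null) {
      // mark the row as part of this run: flag:active = 1
      Put put = new Put(Bytes.toBytes(key.trim()));
      put.add(Bytes.toBytes("flag"), Bytes.toBytes("active"), Bytes.toBytes("1"));
      table.put(put);
    }
    in.close();
    table.flushCommits();
  }
}
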
> Next, pick one of the tables. It doesn't really matter which one from a
> logical standpoint: you could say table1, you could pick the one with the
> most data in it, or you may pick the one with the most individual entries
> flagged. Use it as input to TableInputFormat, with a filter that only passes
> through those rows that are flagged.
>
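The job setup for that could look roughly like this (untested, against the
0.20 mapreduce API; the table names and the "flag:active" column are
assumptions carried over from above):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class CrossProductJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "cross product over flagged rows");
    job.setJarByClass(CrossProductJob.class);

    // Only hand the mapper rows of Table1 that carry flag:active = 1.
    SingleColumnValueFilter flagged = new SingleColumnValueFilter(
        Bytes.toBytes("flag"), Bytes.toBytes("active"),
        CompareOp.EQUAL, Bytes.toBytes("1"));
    flagged.setFilterIfMissing(true);     // rows without the flag are skipped

    Scan scan = new Scan();
    scan.setFilter(flagged);
    scan.setCaching(500);                 // fewer RPC round trips per map task

    TableMapReduceUtil.initTableMapperJob("Table1", scan,
        CrossProductMapper.class,         // sketched further down
        ImmutableBytesWritable.class, Put.class, job);

    // Puts emitted by the mapper go straight into the result table.
    TableMapReduceUtil.initTableReducerJob("ResultTable",
        IdentityTableReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
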
> In the mapper, create a scanner over each of the other two tables using the
> same filter. You have two nested loops inside your map. In the innermost
> loop, be sure to update a counter or call progress() to avoid the
> jobtracker timing out.
>
> Use TableOutputFormat from that job to write to your output table.
>
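And a rough shape for the CrossProductMapper referenced above (again untested;
the output key layout and the compute() stub are placeholders for the real
calculation and columns):

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class CrossProductMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private HTable table2;
  private HTable table3;

  @Override
  protected void setup(Context context) throws IOException {
    HBaseConfiguration conf = new HBaseConfiguration(context.getConfiguration());
    table2 = new HTable(conf, "Table2");
    table3 = new HTable(conf, "Table3");
  }

  // same flag:active filter as in the driver, so only flagged rows are scanned
  private Scan flaggedScan() {
    SingleColumnValueFilter flagged = new SingleColumnValueFilter(
        Bytes.toBytes("flag"), Bytes.toBytes("active"),
        CompareOp.EQUAL, Bytes.toBytes("1"));
    flagged.setFilterIfMissing(true);
    Scan scan = new Scan();
    scan.setFilter(flagged);
    return scan;
  }

  @Override
  public void map(ImmutableBytesWritable row1, Result value1, Context context)
      throws IOException, InterruptedException {
    ResultScanner scan2 = table2.getScanner(flaggedScan());
    for (Result row2 : scan2) {
      ResultScanner scan3 = table3.getScanner(flaggedScan());
      for (Result row3 : scan3) {
        // placeholder key layout and calculation for the (key1, key2, key3) combo
        byte[] outKey = Bytes.add(row1.get(),
            Bytes.add(row2.getRow(), row3.getRow()));
        Put out = new Put(outKey);
        out.add(Bytes.toBytes("result"), Bytes.toBytes("value"),
            Bytes.toBytes(compute(value1, row2, row3)));
        context.write(new ImmutableBytesWritable(outKey), out);

        // keep the task from being timed out during the long inner loops
        context.progress();
      }
      scan3.close();
    }
    scan2.close();
  }

  // stand-in for the real "complex calculation"
  private long compute(Result a, Result b, Result c) {
    return 0L;
  }
}
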
> Depending on what exactly it means when you get a row key in your original
> input files, the next time through you will likely need to go through and
> clear all the flags before starting the process again.
>
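Clearing the flags can be one more small pass per table, e.g. (untested):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ClearFlags {
  // args[0] = table name; removes flag:active from every row that still has it
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), args[0]);
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("flag"), Bytes.toBytes("active"));
    ResultScanner rows = table.getScanner(scan);
    for (Result row : rows) {
      Delete del = new Delete(row.getRow());
      del.deleteColumns(Bytes.toBytes("flag"), Bytes.toBytes("active"));
      table.delete(del);
    }
    rows.close();
    table.flushCommits();
  }
}
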
> You definitely will not be starting multiple map reduce jobs. You will have
> one map reduce job that iterates through all the possible combinations, and
> your goal needs to be to make sure that the task can be split up enough that
> it can be parallelized.
