hadoop-common-user mailing list archives

From Michael Parker <michael.g.par...@gmail.com>
Subject Re: Side-loading output from one MR into another?
Date Thu, 23 Aug 2012 06:57:00 GMT
Thanks for the prompt reply!

Unfortunately, it's not that small.

I'm using the new API; are map side joins accomplished using
Are there any examples which use this package or map side joins?

The way I was thinking of doing it was to output the user-to-cohort
mapping from the first MR as a SequenceFile, and then each mapper in
the second MR could use a SequenceFile.Reader to find the cohort for a
user. It seems reasonable, but is this actually doable? It's like a
manual map-side join, I suppose, although likely not as elegant as
what you were proposing.
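
To make the lookup idea concrete, here is a minimal sketch in plain Java, not Hadoop code. It assumes the first job's output is sorted by user ID and, purely for illustration, stores each pair as a fixed-width record (an 8-byte user ID followed by a 4-byte cohort ID) so that a binary search can run against the file on disk without loading it into memory. The class name `CohortSideFile`, the record layout, and the convention of returning -1 for a missing user are all hypothetical. In Hadoop itself, a plain SequenceFile.Reader reads records sequentially, so keyed lookups are usually served by a sorted, indexed layout such as MapFile (via MapFile.Reader.get) rather than by scanning.

```java
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch, not Hadoop code: looks up a key in a sorted side file
// of fixed-width (userId, cohortId) records, as a second job's mapper might.
class CohortSideFile implements AutoCloseable {
    // Each record is an 8-byte user ID followed by a 4-byte cohort ID.
    private static final int RECORD_BYTES = Long.BYTES + Integer.BYTES;

    private final RandomAccessFile file;
    private final long recordCount;

    CohortSideFile(File path) throws IOException {
        this.file = new RandomAccessFile(path, "r");
        this.recordCount = file.length() / RECORD_BYTES;
    }

    /** Writes the pairs in order; userIds must already be sorted ascending. */
    static void write(File path, long[] userIds, int[] cohortIds) throws IOException {
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
            for (int i = 0; i < userIds.length; i++) {
                out.writeLong(userIds[i]);
                out.writeInt(cohortIds[i]);
            }
        }
    }

    /** Binary search on disk; returns the cohort ID, or -1 if the user is absent. */
    int lookup(long userId) throws IOException {
        long lo = 0;
        long hi = recordCount - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            file.seek(mid * RECORD_BYTES);
            long candidate = file.readLong();
            if (candidate < userId) {
                lo = mid + 1;
            } else if (candidate > userId) {
                hi = mid - 1;
            } else {
                return file.readInt(); // cohort ID follows the matching user ID
            }
        }
        return -1;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        File side = File.createTempFile("user-to-cohort", ".bin");
        side.deleteOnExit();
        write(side, new long[] {101L, 205L, 309L}, new int[] {1, 2, 1});
        try (CohortSideFile cohorts = new CohortSideFile(side)) {
            System.out.println(cohorts.lookup(205L)); // user present in the file
            System.out.println(cohorts.lookup(999L)); // user absent from the file
        }
    }
}
```

In a real job, each mapper would open the side file once in its setup step and reuse the reader for every input record, since reopening it per lookup would dominate the cost.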


On Wed, Aug 22, 2012 at 10:27 PM, Harsh J <harsh@cloudera.com> wrote:
> If it is a small set, you can load it onto distributed cache and then
> onto the task's memory, or if it's pretty big, perhaps you can do a
> map-side join?
> On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
> <michael.g.parker@gmail.com> wrote:
>> Hi all,
>> Is it possible to take a collection of sorted key-value pairs,
>> generated from one MapReduce, and side-load them into another
>> MapReduce, i.e. as it runs, the second MapReduce can look up the value
>> for a given key computed by the first MapReduce?
>> I need this for a cohort study -- one MR puts users into cohorts, and
>> the second MR needs that user-to-cohort mapping to see how cohorts
>> behave over time.
>> Any help would be greatly appreciated. Thanks!
>> - Mike
> --
> Harsh J
