From: Harsh J <harsh@cloudera.com>
Date: Thu, 23 Aug 2012 10:57:52 +0530
Subject: Re: Side-loading output from one MR into another?
To: user@hadoop.apache.org
Content-Type: text/plain; charset=ISO-8859-1

If it is a small set, you can ship it via the distributed cache and
load it into each task's memory. If it's pretty big, perhaps you can do
a map-side join instead?
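
A minimal sketch of both options, in plain Java with no Hadoop
dependency so it can be run on its own. The class and method names
(SideLoadSketch, loadLookup, tagWithCohort, mergeJoin) are made up for
illustration, and the "user<TAB>cohort" layout assumes the first job
wrote its results as default tab-separated text. In a real job the
lookup would probably be loaded once per task (for example in
Mapper.setup()) from the cached file, and the merge step is only an
approximation of what a map-side join does over two inputs that are
already sorted by the same key.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SideLoadSketch {

    // Option 1, small side data: parse "key<TAB>value" lines into an
    // in-memory map. Assumed to run once per task on the cached file.
    public static Map<String, String> loadLookup(Reader in) {
        Map<String, String> lookup = new HashMap<>();
        new BufferedReader(in).lines().forEach(line -> {
            int tab = line.indexOf('\t');
            if (tab > 0) { // skip blank or malformed lines
                lookup.put(line.substring(0, tab), line.substring(tab + 1));
            }
        });
        return lookup;
    }

    // Stand-in for the body of map(): prefix one "user<TAB>event"
    // record with that user's cohort, or UNKNOWN if the user is absent.
    public static String tagWithCohort(Map<String, String> cohorts, String record) {
        int tab = record.indexOf('\t');
        String user = tab < 0 ? record : record.substring(0, tab);
        return cohorts.getOrDefault(user, "UNKNOWN") + "\t" + record;
    }

    // Option 2, large side data: inner merge join of two lists that are
    // both sorted by user. cohorts holds {user, cohort} with one row per
    // user; events holds {user, event} with any number of rows per user.
    public static List<String> mergeJoin(List<String[]> cohorts, List<String[]> events) {
        List<String> joined = new ArrayList<>();
        int c = 0;
        for (String[] event : events) {
            // Advance the cohort cursor until its key is >= the event key.
            while (c < cohorts.size() && cohorts.get(c)[0].compareTo(event[0]) < 0) {
                c++;
            }
            if (c < cohorts.size() && cohorts.get(c)[0].equals(event[0])) {
                joined.add(event[0] + "\t" + cohorts.get(c)[1] + "\t" + event[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> cohorts =
                loadLookup(new StringReader("alice\t2012-07\nbob\t2012-08\n"));
        System.out.println(tagWithCohort(cohorts, "alice\tlogin"));

        List<String[]> sortedCohorts = Arrays.asList(
                new String[] {"alice", "2012-07"},
                new String[] {"bob", "2012-08"});
        List<String[]> sortedEvents = Arrays.asList(
                new String[] {"alice", "login"},
                new String[] {"bob", "purchase"});
        mergeJoin(sortedCohorts, sortedEvents).forEach(System.out::println);
    }
}
```

The in-memory map costs one full copy of the side data per task, so it
only makes sense while the user-to-cohort mapping fits comfortably in
the task heap. The merge variant never holds more than one cursor
position, but it depends on both inputs being sorted (and, in a real
job, partitioned) the same way.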

On Thu, Aug 23, 2012 at 10:12 AM, Michael Parker
<michael.g.parker@gmail.com> wrote:
> Hi all,
>
> Is it possible to take a collection of sorted key-value pairs,
> generated from one MapReduce, and side-load them into another
> MapReduce, i.e. as it runs, the second MapReduce can look up the value
> for a given key computed by the first MapReduce?
>
> I need this for a cohort study -- one MR puts users into cohorts, and
> the second MR needs that user-to-cohort mapping to see how cohorts
> behave over time.
>
> Any help would be greatly appreciated. Thanks!
>
> - Mike


-- 
Harsh J