From: Aaron Kimball
Date: Wed, 5 May 2010 18:04:08 -0700
Subject: Re: Hadoop Data Sharing
To: general@hadoop.apache.org

Renato,

In general, if you need to perform a multi-pass MapReduce workflow, each pass materializes its
output to files. The subsequent pass then reads those same files back in as input. This allows the workflow to restart from the last "checkpoint" if it gets interrupted. There is no persistent in-memory distributed storage feature in Hadoop that would allow a MapReduce job to post results to memory for consumption by a subsequent job.

So you would just read your initial data from /input and write your interim results to /iteration0. Then the next pass reads from /iteration0 and writes to /iteration1, etc.

If your data is reasonably small and you think it could fit in memory somewhere, then you could experiment with using other distributed key-value stores (memcached, HBase, Cassandra, etc.) to hold intermediate results. But this will require some integration work on your part.

- Aaron

On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo <
renatoj.marroquin@gmail.com> wrote:

> Hi everyone, I have recently started to play around with Hadoop, but I am
> running into some "design" problems.
> I need to make a loop to execute the same job several times, and in each
> iteration get the processed values (not using a file, because then I would
> need to read it back in). I was using a static vector in my main class
> (the one that iterates and executes the job in each iteration) to retrieve
> those values, and it did work while I was using standalone mode. Now I
> have tried it on a pseudo-distributed setup, and it obviously is not
> working.
> Any suggestions, please?
>
> Thanks in advance,
>
> Renato M.
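The checkpoint pattern Aaron describes can be sketched as a small driver loop. This is a Hadoop-agnostic illustration in plain Python: local directories stand in for HDFS paths, and `run_pass` / `run_workflow` are made-up names for this sketch, not Hadoop API. The /input and /iterationN directory names follow the thread.

```python
import os
import tempfile

def run_pass(in_dir, out_dir, fn):
    """One 'job': read every line under in_dir, apply fn, write to out_dir."""
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "part-00000"), "w") as out:
        for name in sorted(os.listdir(in_dir)):
            with open(os.path.join(in_dir, name)) as f:
                for line in f:
                    out.write(fn(line.strip()) + "\n")

def run_workflow(base, passes, fn):
    """Chain passes through iteration0, iteration1, ... directories,
    resuming after the last iteration directory that already exists."""
    done = 0
    while os.path.isdir(os.path.join(base, "iteration%d" % done)):
        done += 1  # everything up to iteration{done-1} is checkpointed
    for i in range(done, passes):
        in_dir = os.path.join(base, "input" if i == 0 else "iteration%d" % (i - 1))
        run_pass(in_dir, os.path.join(base, "iteration%d" % i), fn)
    return os.path.join(base, "iteration%d" % (passes - 1))

# Demo: three passes, each doubling every value.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "input"))
with open(os.path.join(base, "input", "part-00000"), "w") as f:
    f.write("1\n2\n")
out = run_workflow(base, 3, lambda s: str(int(s) * 2))
```

In a real Hadoop driver, each loop body would configure and submit a Job with the input path set to the previous iteration's output path; the resume check would test directory existence on HDFS instead of the local filesystem.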