hadoop-general mailing list archives

From: Aaron Kimball <aaron@cloudera.com>
Subject: Re: Hadoop Data Sharing
Date: Tue, 11 May 2010 17:34:49 GMT
Perhaps this is the guidance you were hoping for: if your data is held in
objects that implement the 'Writable' interface, then you can use
SequenceFileOutputFormat and SequenceFileInputFormat to store your
intermediate data in binary form in disk-backed files called SequenceFiles.
Serialization happens through your objects' write() and readFields() methods,
which the OutputFormat/InputFormat call automatically as the objects move
through the system, so a subsequent MR pass receives the objects in the same
form in which they were emitted. This is a considerably better approach (from
both a throughput and a sanity perspective) for a chained MapReduce job than
serializing the objects by hand.
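
For reference, here is a minimal sketch of such a value class. The class
PointWritable and its fields are hypothetical stand-ins; the Writable
interface itself is the standard org.apache.hadoop.io.Writable:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical value type; any class implementing Writable works the same way.
    public class PointWritable implements Writable {
      private double x;
      private double y;

      public PointWritable() { }   // Hadoop requires a no-arg constructor

      public void write(DataOutput out) throws IOException {
        out.writeDouble(x);        // serialize the fields in a fixed order
        out.writeDouble(y);
      }

      public void readFields(DataInput in) throws IOException {
        x = in.readDouble();       // deserialize in the same order
        y = in.readDouble();
      }
    }

The SequenceFile input/output formats call write() and readFields() for you,
so no extra serialization code is needed in the job itself.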

- Aaron

On Tue, May 11, 2010 at 10:31 AM, Aaron Kimball <aaron@cloudera.com> wrote:

> What objects are you referring to? I'm not sure I understand your question.
> - Aaron
>
>
> On Tue, May 11, 2010 at 6:38 AM, Renato Marroquín Mogrovejo <
> renatoj.marroquin@gmail.com> wrote:
>
>> Thanks Aaron! I was thinking the same thing after doing some reading.
>> Man, what about serializing the objects? Do you think that would be a good idea?
>> Thanks again.
>>
>> Renato M.
>>
>>
>> 2010/5/5 Aaron Kimball <aaron@cloudera.com>
>>
>> > Renato,
>> >
>> > In general, if you need to perform a multi-pass MapReduce workflow, each
>> > pass materializes its output to files. The subsequent pass then reads
>> > those same files back in as input. This allows the workflow to restart
>> > from the last "checkpoint" if it gets interrupted. There is no persistent
>> > in-memory distributed storage feature in Hadoop that would allow a
>> > MapReduce job to post results to memory for consumption by a subsequent
>> > job.
>> >
>> > So you would just read your initial data from /input, and write your
>> > interim results to /iteration0. Then the next pass reads from /iteration0
>> > and writes to /iteration1, etc. (see the driver-loop sketch below).
>> >
>> > If your data is reasonably small and you think it could fit in memory
>> > somewhere, then you could experiment with using other distributed
>> > key-value stores (memcached[b], HBase, Cassandra, etc.) to hold
>> > intermediate results. But this will require some integration work on
>> > your part.
>> > - Aaron
>> >
>> > On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo <
>> > renatoj.marroquin@gmail.com> wrote:
>> >
>> > > Hi everyone, I have recently started to play around with Hadoop, but I
>> > > am running into some "design" problems.
>> > > I need to make a loop that executes the same job several times and, in
>> > > each iteration, retrieves the processed values (without using a file,
>> > > because I would then need to read it back). I was using a static vector
>> > > in my main class (the one that iterates and executes the job in each
>> > > iteration) to retrieve those values, and it worked while I was running
>> > > in standalone mode. Now I have tried it in pseudo-distributed mode, and
>> > > obviously it is not working.
>> > > Any suggestions, please?
>> > >
>> > > Thanks in advance,
>> > >
>> > >
>> > > Renato M.
>> > >
>> >
>>
>
>
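
To make the iteration scheme quoted above concrete, here is a minimal
driver-loop sketch. The wiring is hypothetical (MyMapper, MyReducer, and the
fixed iteration count are stand-ins for your own classes and stopping
condition; PointWritable is the sketch from earlier in this message), but the
/input and /iterationN path scheme follows the quoted advice:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String input = "/input";        // initial data
        int numIterations = 5;          // hypothetical fixed count

        for (int i = 0; i < numIterations; i++) {
          String output = "/iteration" + i;   // this pass's "checkpoint" directory
          Job job = new Job(conf, "pass-" + i);
          job.setJarByClass(IterativeDriver.class);
          job.setMapperClass(MyMapper.class);     // stand-in for your mapper
          job.setReducerClass(MyReducer.class);   // stand-in for your reducer
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(PointWritable.class);  // Writable from the sketch above

          if (i > 0) {
            // Passes after the first read the previous pass's SequenceFiles.
            job.setInputFormatClass(SequenceFileInputFormat.class);
          }
          job.setOutputFormatClass(SequenceFileOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(input));
          FileOutputFormat.setOutputPath(job, new Path(output));

          if (!job.waitForCompletion(true)) {
            System.exit(1);             // stop the chain if a pass fails
          }
          input = output;               // the next pass reads this pass's output
        }
      }
    }

Because each pass leaves its output in /iterationN on HDFS, an interrupted
workflow can be restarted from the last completed directory rather than from
the beginning.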
