hadoop-general mailing list archives

From Jay Booth <jaybo...@gmail.com>
Subject Re: Hadoop Data Sharing
Date Tue, 11 May 2010 20:34:28 GMT
Probably the most direct route to your desired result is to save the
objects to either a SequenceFile or a plain text file on DFS.  Then, in
the configure() method of your MapReduce jobs, you open the file on
DFS, stream its contents into a local variable, and refer to it as you
need to.  Either way, you'll need some sort of serialization, via
Writable or plain text.
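
Something along these lines, say (an untested sketch against the old
mapred API; the path, key/value types, and class names are all
placeholders):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SideDataMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<Text> sideData = new ArrayList<Text>();

  public void configure(JobConf conf) {
    try {
      FileSystem fs = FileSystem.get(conf);
      Path path = new Path("/shared/side-data.seq");  // placeholder path
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
      Text key = new Text();
      Text value = new Text();
      while (reader.next(key, value)) {
        // copy: the reader reuses the same 'value' instance on each call
        sideData.add(new Text(value));
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load side data", e);
    }
  }

  public void map(LongWritable key, Text line,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... consult sideData while processing each record ...
  }
}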

On Tue, May 11, 2010 at 4:19 PM, Renato Marroquín Mogrovejo
<renatoj.marroquin@gmail.com> wrote:
> Hi Aaron,
>
> The thing is that I have a data structure that is saved into a vector, and
> this vector needs to be available to my MapReduce jobs while iterating. So
> do you think serializing these objects would be a good and easy way to do
> it? It's a vector in which each node contains another user-defined data
> structure. Maybe I will first try to do it just using files, and see how
> the throughput goes. Hey, do you know where I can find some examples of
> serializing objects for Hadoop to save them into SequenceFiles?
> Thanks in advance.
>
> Renato M.
>
>
> 2010/5/11 Aaron Kimball <aaron@cloudera.com>
>
>> Perhaps this is guidance in the area you were hoping for: If your data is
>> in objects that implement the interface 'Writable', then you can use the
>> SequenceFileOutputFormat and SequenceFileInputFormat to store your
>> intermediate data in binary form in disk-backed files called
>> SequenceFiles. The serialization will happen through the write() and
>> readFields() methods of your objects, which will automatically be called
>> by the OutputFormat/InputFormat as they move through the system. So your
>> subsequent MR pass will receive objects back in the same form as they
>> were emitted. This is a considerably better idea (from both a throughput
>> and a sanity perspective) in a chained MapReduce job.
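>>
>> For illustration, a minimal Writable might look like this (a sketch only;
>> the UserRecord name and its fields are made up):
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.Writable;
>>
>> public class UserRecord implements Writable {
>>   private Text name = new Text();
>>   private int score;
>>
>>   // invoked when the framework serializes the object
>>   public void write(DataOutput out) throws IOException {
>>     name.write(out);
>>     out.writeInt(score);
>>   }
>>
>>   // invoked to repopulate a (possibly reused) instance on read
>>   public void readFields(DataInput in) throws IOException {
>>     name.readFields(in);
>>     score = in.readInt();
>>   }
>> }
>>
>> (If the object will be used as a key rather than a value, implement
>> WritableComparable instead.)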
>>
>> - Aaron
>>
>> On Tue, May 11, 2010 at 10:31 AM, Aaron Kimball <aaron@cloudera.com>
>> wrote:
>>
>> > What objects are you referring to? I'm not sure I understand your
>> > question.
>> > - Aaron
>> >
>> >
>> > On Tue, May 11, 2010 at 6:38 AM, Renato Marroquín Mogrovejo <
>> > renatoj.marroquin@gmail.com> wrote:
>> >
>> >> Thanks Aaron! I was thinking the same after doing some reading.
>> >> Man, what about serializing the objects? Do you think that is a good
>> >> idea?
>> >> Thanks again.
>> >>
>> >> Renato M.
>> >>
>> >>
>> >> 2010/5/5 Aaron Kimball <aaron@cloudera.com>
>> >>
>> >> > Renato,
>> >> >
>> >> > In general, if you need to perform a multi-pass MapReduce workflow,
>> >> > each pass materializes its output to files. The subsequent pass then
>> >> > reads those same files back in as input. This allows the workflow to
>> >> > restart at the last "checkpoint" if it gets interrupted. There is no
>> >> > persistent in-memory distributed storage feature in Hadoop that would
>> >> > allow a MapReduce job to post results to memory for consumption by a
>> >> > subsequent job.
>> >> >
>> >> > So you would just read your initial data from /input, and write your
>> >> > interim results to /iteration0. Then the next pass reads from
>> >> > /iteration0 and writes to /iteration1, etc.
>> >> >
>> >> > If your data is reasonably small and you think it could fit in memory
>> >> > somewhere, then you could experiment with using other distributed
>> >> > key-value stores (memcached, HBase, Cassandra, etc.) to hold
>> >> > intermediate results. But this will require some integration work on
>> >> > your part.
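>> >> >
>> >> > A quick sketch of that driver loop with the old mapred API (untested;
>> >> > the identity mapper/reducer and iteration count are placeholders for
>> >> > your own job):
>> >> >
>> >> > import org.apache.hadoop.fs.Path;
>> >> > import org.apache.hadoop.io.LongWritable;
>> >> > import org.apache.hadoop.io.Text;
>> >> > import org.apache.hadoop.mapred.FileInputFormat;
>> >> > import org.apache.hadoop.mapred.FileOutputFormat;
>> >> > import org.apache.hadoop.mapred.JobClient;
>> >> > import org.apache.hadoop.mapred.JobConf;
>> >> > import org.apache.hadoop.mapred.lib.IdentityMapper;
>> >> > import org.apache.hadoop.mapred.lib.IdentityReducer;
>> >> >
>> >> > public class IterativeDriver {
>> >> >   public static void main(String[] args) throws Exception {
>> >> >     String input = "/input";
>> >> >     int numIterations = 5;  // placeholder
>> >> >     for (int i = 0; i < numIterations; i++) {
>> >> >       JobConf conf = new JobConf(IterativeDriver.class);
>> >> >       conf.setJobName("iteration-" + i);
>> >> >       conf.setMapperClass(IdentityMapper.class);    // your mapper
>> >> >       conf.setReducerClass(IdentityReducer.class);  // your reducer
>> >> >       conf.setOutputKeyClass(LongWritable.class);
>> >> >       conf.setOutputValueClass(Text.class);
>> >> >       // each pass reads the previous pass's output directory
>> >> >       FileInputFormat.setInputPaths(conf, new Path(input));
>> >> >       FileOutputFormat.setOutputPath(conf, new Path("/iteration" + i));
>> >> >       JobClient.runJob(conf);  // blocks until this pass completes
>> >> >       input = "/iteration" + i;
>> >> >     }
>> >> >   }
>> >> > }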
>> >> > - Aaron
>> >> >
>> >> > On Wed, May 5, 2010 at 8:29 AM, Renato Marroquín Mogrovejo <
>> >> > renatoj.marroquin@gmail.com> wrote:
>> >> >
>> >> > > Hi everyone, I have recently started to play around with Hadoop, but
>> >> > > I am getting into some "design" problems.
>> >> > > I need to make a loop to execute the same job several times, and in
>> >> > > each iteration get the processed values (not using a file, because I
>> >> > > would need to read it back). I was using a static vector in my main
>> >> > > class (the one that iterates and executes the job in each iteration)
>> >> > > to retrieve those values, and it did work while I was using
>> >> > > standalone mode. Now I tried to test it in pseudo-distributed mode,
>> >> > > and obviously it is not working.
>> >> > > Any suggestions, please?
>> >> > >
>> >> > > Thanks in advance,
>> >> > >
>> >> > >
>> >> > > Renato M.
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
