crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Josh Wills <jwi...@cloudera.com>
Subject Re: Simulating Avro reads in MemPipeline
Date Wed, 04 Jun 2014 15:10:41 GMT
Tracking this here: https://issues.apache.org/jira/browse/CRUNCH-412

David, do you have a patch that would be easy to adapt here?


On Wed, Jun 4, 2014 at 8:05 AM, Gabriel Reid <gabriel.reid@gmail.com> wrote:

> On Wed, Jun 4, 2014 at 4:53 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > As long as we're on the subject, the fact that we don't
> > serialize/deserialize DoFns before we run MemPipelines has burned me a
> few
> > times when I would make a change and forget about the serialization
> > implications until I tried to run the job. One of the things I like about
> > the local Spark implementation is that it does this serialization check
> for
> > you. So if we were to have some mode that could be enabled to allow us to
> > simulate DoFn serialization and object re-use situations locally, even at
> > the cost of extra runtime overhead, I would be happy.
>
> Yeah, that sounds like a good plan -- it could probably be done via an
> overload of MemPipeline.getInstance() that would take a boolean flag
> to replication MR or not.
>
>
> >
> >
> > On Wed, Jun 4, 2014 at 4:27 AM, Gabriel Reid <gabriel.reid@gmail.com>
> wrote:
> >
> >> Hi David,
> >>
> >> Yeah, the whole object reuse situation in MapReduce is quite a drag.
> >> By the way, this isn't specific to Avro -- it's just as big of an
> >> issue with Writables.
> >>
> >> I'm kind of torn on the idea of building this into MemPipeline. On the
> >> one hand, the closer the behavior of different pipeline
> >> implementations are, the better, and I think MemPipeline is currently
> >> indeed largely used for fast testing of MR workflows.
> >>
> >> On the other hand, the MemPipeline can also be seen as an optimized
> >> implementation for running a pipeline faster if everything fits in
> >> memory, so adding in extra serialization/deserialization just to be
> >> similar to the MRPipeline would be unfortunately. Additionally,
> >> there's also the SparkPipeline implementation to consider -- I assume
> >> that spark doesn't have this same object reuse situation, although I'm
> >> not actually sure.
> >>
> >> I think that my general feeling is that I'd rather not add object
> >> reuse into MemPipeline, but I do think it would be great if we could
> >> bring the behavior of the two implementations in line. Any other
> >> thoughts on how we could do this?
> >>
> >> - Gabriel
> >>
> >>
> >> On Wed, Jun 4, 2014 at 11:51 AM, David Whiting <davidwhiting@gmail.com>
> >> wrote:
> >> > We are facing some problems where errors are not found with
> MemPipeline
> >> > tests which do happen on the cluster due to the way Avro reuses
> objects
> >> for
> >> > reading; meaning that if you do a parallelDo and a PGroupedTable, then
> >> the
> >> > Iterable of values you get is actually the same object returned to you
> >> with
> >> > different contents. If you then try and store certain instances of
> this
> >> to
> >> > be emitted later, then you will get unexpected results without taking
> a
> >> > copy.
> >> >
> >> > To try and catch these problems on the local side, we've written a
> >> wrapper
> >> > for Iterable<? extends SpecificRecord> which uses the
> SpecificDatumWriter
> >> > and SpecificDatumReader to write to and from ByteBuffers for each
> >> iteration
> >> > and offering the last record for reuse. This allows us to run the
> >> offending
> >> > MapFn or DoFn in isolation with a wrapped Iterable as input to
> identify
> >> and
> >> > fix the problem.
> >> >
> >> > This, however, is a little awkward and it would be much neater if this
> >> was
> >> > driven from the Crunch side. At the point in the MapShuffler where the
> >> > Iterable is wrapped in a SingleUseIterable, it could also be wrapped
> in
> >> one
> >> > of these AvroReadSimulatingIterable, meaning that people could find
> these
> >> > problems before even running them live for the first time. However,
> this
> >> is
> >> > a problem that is specific to Avro so it obviously makes no sense to
> hack
> >> > it right into HFunction.
> >> >
> >> > Anyone have any thoughts on how this could be integrated nicely, or
> >> indeed
> >> > if it should be integrated at all?
> >>
>



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message