Mailing-List: contact dev-help@crunch.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@crunch.apache.org
Received-SPF: pass (nike.apache.org: domain of davidwhiting@gmail.com
 designates 74.125.82.174 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAH29n6OaONYHX7t715BfE5a1cXS02QiWW3ESuoJRacBAUyho2A@mail.gmail.com>
References: 
 <CAGUrFMDWvNRB3SCdUdtWWJfdDgyvqwDYbA94JGNBEX=6SJiOAg@mail.gmail.com>
	<CAA5C_pudcRSBvCYYOKLKaNMYhKThGp66mzzbpHa0zTNb+Z9sFg@mail.gmail.com>
	<CANb5z2KCPHVQtMF04a_h_tw8dZEao3iK_cCbKy_xXXq_9_RSSw@mail.gmail.com>
	<CAA5C_pvTeJADn05FMkXVUorq3TSd60=Vj=-_3QOwSvCME8Nvtw@mail.gmail.com>
	<CAH29n6OaONYHX7t715BfE5a1cXS02QiWW3ESuoJRacBAUyho2A@mail.gmail.com>
Date: Wed, 4 Jun 2014 18:02:33 +0200
Message-ID: 
 <CAGUrFMBiy9JtERY3y6Q6ksM02JDhU6hcoT2y7vEL1OkHUw--fg@mail.gmail.com>
Subject: Re: Simulating Avro reads in MemPipeline
From: David Whiting <davidwhiting@gmail.com>
To: dev@crunch.apache.org
Content-Type: multipart/alternative; boundary=e89a8f8389959ae26804fb04c1de

--e89a8f8389959ae26804fb04c1de
Content-Type: text/plain; charset=UTF-8

Let me see if I can transform the thing I've got already into a more
suitable shape


On 4 June 2014 17:10, Josh Wills <jwills@cloudera.com> wrote:

> Tracking this here: https://issues.apache.org/jira/browse/CRUNCH-412
>
> David, do you have a patch that would be easy to adapt here?
>
>
> On Wed, Jun 4, 2014 at 8:05 AM, Gabriel Reid <gabriel.reid@gmail.com>
> wrote:
>
> > On Wed, Jun 4, 2014 at 4:53 PM, Josh Wills <josh.wills@gmail.com> wrote:
> > > As long as we're on the subject, the fact that we don't
> > > serialize/deserialize DoFns before we run MemPipelines has burned me a
> > few
> > > times when I would make a change and forget about the serialization
> > > implications until I tried to run the job. One of the things I like
> about
> > > the local Spark implementation is that it does this serialization check
> > for
> > > you. So if we were to have some mode that could be enabled to allow us
> to
> > > simulate DoFn serialization and object re-use situations locally, even
> at
> > > the cost of extra runtime overhead, I would be happy.
> >
> > Yeah, that sounds like a good plan -- it could probably be done via an
> > overload of MemPipeline.getInstance() that would take a boolean flag
> > to replication MR or not.
> >
> >
> > >
> > >
> > > On Wed, Jun 4, 2014 at 4:27 AM, Gabriel Reid <gabriel.reid@gmail.com>
> > wrote:
> > >
> > >> Hi David,
> > >>
> > >> Yeah, the whole object reuse situation in MapReduce is quite a drag.
> > >> By the way, this isn't specific to Avro -- it's just as big of an
> > >> issue with Writables.
> > >>
> > >> I'm kind of torn on the idea of building this into MemPipeline. On the
> > >> one hand, the closer the behavior of different pipeline
> > >> implementations are, the better, and I think MemPipeline is currently
> > >> indeed largely used for fast testing of MR workflows.
> > >>
> > >> On the other hand, the MemPipeline can also be seen as an optimized
> > >> implementation for running a pipeline faster if everything fits in
> > >> memory, so adding in extra serialization/deserialization just to be
> > >> similar to the MRPipeline would be unfortunately. Additionally,
> > >> there's also the SparkPipeline implementation to consider -- I assume
> > >> that spark doesn't have this same object reuse situation, although I'm
> > >> not actually sure.
> > >>
> > >> I think that my general feeling is that I'd rather not add object
> > >> reuse into MemPipeline, but I do think it would be great if we could
> > >> bring the behavior of the two implementations in line. Any other
> > >> thoughts on how we could do this?
> > >>
> > >> - Gabriel
> > >>
> > >>
> > >> On Wed, Jun 4, 2014 at 11:51 AM, David Whiting <
> davidwhiting@gmail.com>
> > >> wrote:
> > >> > We are facing some problems where errors are not found with
> > MemPipeline
> > >> > tests which do happen on the cluster due to the way Avro reuses
> > objects
> > >> for
> > >> > reading; meaning that if you do a parallelDo and a PGroupedTable,
> then
> > >> the
> > >> > Iterable of values you get is actually the same object returned to
> you
> > >> with
> > >> > different contents. If you then try and store certain instances of
> > this
> > >> to
> > >> > be emitted later, then you will get unexpected results without
> taking
> > a
> > >> > copy.
> > >> >
> > >> > To try and catch these problems on the local side, we've written a
> > >> wrapper
> > >> > for Iterable<? extends SpecificRecord> which uses the
> > SpecificDatumWriter
> > >> > and SpecificDatumReader to write to and from ByteBuffers for each
> > >> iteration
> > >> > and offering the last record for reuse. This allows us to run the
> > >> offending
> > >> > MapFn or DoFn in isolation with a wrapped Iterable as input to
> > identify
> > >> and
> > >> > fix the problem.
> > >> >
> > >> > This, however, is a little awkward and it would be much neater if
> this
> > >> was
> > >> > driven from the Crunch side. At the point in the MapShuffler where
> the
> > >> > Iterable is wrapped in a SingleUseIterable, it could also be wrapped
> > in
> > >> one
> > >> > of these AvroReadSimulatingIterable, meaning that people could find
> > these
> > >> > problems before even running them live for the first time. However,
> > this
> > >> is
> > >> > a problem that is specific to Avro so it obviously makes no sense to
> > hack
> > >> > it right into HFunction.
> > >> >
> > >> > Anyone have any thoughts on how this could be integrated nicely, or
> > >> indeed
> > >> > if it should be integrated at all?
> > >>
> >
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--e89a8f8389959ae26804fb04c1de--