Date: Thu, 20 Sep 2012 19:50:17 +0200
From: Gabriel Reid <gabriel.reid@gmail.com>
To: crunch-dev@incubator.apache.org
Subject: Re: Checkpointing in pipelines

On Thursday 20 September 2012 at 16:50, Josh Wills wrote:
> Hey Gabriel (and others),
>
> I think we are on the same page -- you're basically talking about
> creating a way to send hints (or perhaps, orders) to the optimizer in
> terms of how it should decide how to break a job up. I am very much on
> board with this.
>
> J

Ok, cool, I'll try to put something together for this.
- Gabriel

> On Thu, Sep 20, 2012 at 5:40 AM, Gabriel Reid wrote:
> > Hi Josh (and others),
> >
> > I'm not sure if we were on the same page about this or not -- any thoughts on it in the meantime?
> >
> > - Gabriel
> >
> > On Thursday 6 September 2012 at 16:18, Gabriel Reid wrote:
> >
> > > Hi Josh,
> > >
> > > The last thing I would be doing after completing a trans-atlantic
> > > flight is checking developer mailing lists ;-)
> > >
> > > What you're talking about (having a kind of rollback for job failures
> > > somewhere along the pipeline) could be facilitated with what I was
> > > talking about here, but it's not what I was trying to accomplish (I
> > > think you realize that, but I'm just making sure). However, it does
> > > kind of show that the name "checkpoint" isn't that descriptive for the
> > > specific use case that I was talking about (which is what I was a bit
> > > worried about).
> > >
> > > To clarify, I'm talking about making it possible to specify that
> > > a node in the execution graph of the pipeline shouldn't be merged in
> > > between two other nodes (for example, an output or a GBK). The
> > > specific use case that I'm going for is customizing the execution plan
> > > for performance, and not for failure recovery.
> > >
> > > I think we're on the same page here, but just referring to two
> > > different use cases, right?
> > >
> > > - Gabriel
> > >
> > > On Thu, Sep 6, 2012 at 4:00 PM, Josh Wills wrote:
> > > > I grok the concept and see the use case, but I was expecting that this
> > > > email was going to be about checkpointing in the sense of having Crunch
> > > > save state about the intermediate outputs of a processing pipeline and then
> > > > supporting the ability to restart a failed pipeline from a checkpointed
> > > > stage -- does that notion line up with what you had in mind here, or am I
> > > > just sleep deprived?
> > > >
> > > > Josh, who just arrived in London
> > > >
> > > > On Wed, Sep 5, 2012 at 9:16 PM, Gabriel Reid wrote:
> > > >
> > > > > Hi guys,
> > > > >
> > > > > In some instances, we want to do some kind of iterative processing in
> > > > > Crunch, and run the same (or a similar) DoFn on the same PCollection
> > > > > multiple times.
> > > > >
> > > > > For example, let's say we've got a PCollection of "grid" objects, and we
> > > > > want to iteratively divide each of these grids into four sub-grids, leading
> > > > > to exponential growth of the data. The naive way to do this would be the
> > > > > following:
> > > > >
> > > > > PCollection grids = …;
> > > > > for (…) {
> > > > >   grids = grids.parallelDo(new SubdivideFn());
> > > > > }
> > > > >
> > > > > However, the above code would be optimized into a single string of DoFns,
> > > > > without increasing the number of mappers we've got per iteration, which of
> > > > > course wouldn't work well with the exponential growth of data.
> > > > >
> > > > > The current way of getting around this is to add a call to
> > > > > materialize().iterator() on the PCollection in each iteration (this is also
> > > > > done in the PageRankIT integration test).
> > > > >
> > > > > What I propose is adding a "checkpoint" method to PCollection to signify
> > > > > that this should be an actual step in processing. This could work as
> > > > > follows:
> > > > >
> > > > > PCollection grids = …;
> > > > > for (…) {
> > > > >   grids = grids.parallelDo(new SubdivideFn()).checkpoint();
> > > > > }
> > > > >
> > > > > In the short term this could even be implemented as just a call to
> > > > > materialize().iterator(), but encapsulating it in a method like
> > > > > this would allow us to work more efficiently with it in the future,
> > > > > especially once CRUNCH-34 is merged.
> > > > >
> > > > > Any thoughts on this? The actual name of the method is my biggest concern;
> > > > > I'm not sure if "checkpoint" is the best name for it, but I can't think of
> > > > > anything better at the moment.
> > > > >
> > > > > - Gabriel
> > > >
> > > > --
> > > > Director of Data Science
> > > > Cloudera
> > > > Twitter: @josh_wills
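For readers following the thread, here is a minimal toy model of the behaviour under discussion. This is plain Java, not the real Crunch API: all names (ToyPCollection, stagesRun, subdivideFn) are invented for illustration. It sketches how parallelDo only queues a DoFn (so chained calls fuse into one stage), while the proposed checkpoint() runs the queued work and materializes the result, creating a real stage boundary per iteration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model (NOT the real Crunch API) of DoFn fusion vs. checkpointing.
public class CheckpointSketch {
    static int stagesRun = 0; // counts "jobs" that actually executed

    static class ToyPCollection<T> {
        final List<T> materialized;                       // last materialized state
        final List<Function<T, List<T>>> pending = new ArrayList<>();

        ToyPCollection(List<T> data) { this.materialized = data; }

        // Lazy: just records the DoFn; nothing runs yet, so chained
        // parallelDo calls are "fused" into one pending stage.
        ToyPCollection<T> parallelDo(Function<T, List<T>> doFn) {
            ToyPCollection<T> out = new ToyPCollection<>(materialized);
            out.pending.addAll(pending);
            out.pending.add(doFn);
            return out;
        }

        // The proposed boundary: run all fused DoFns as one stage and
        // materialize the output so the next stage starts from real data.
        ToyPCollection<T> checkpoint() {
            List<T> current = materialized;
            for (Function<T, List<T>> fn : pending) {
                List<T> next = new ArrayList<>();
                for (T t : current) next.addAll(fn.apply(t));
                current = next;
            }
            stagesRun++; // one stage, however many DoFns were fused into it
            return new ToyPCollection<>(current);
        }
    }

    public static void main(String[] args) {
        // "Subdivide" each grid (modeled as a string) into four sub-grids.
        Function<String, List<String>> subdivideFn = g ->
                List.of(g + ".0", g + ".1", g + ".2", g + ".3");

        ToyPCollection<String> grids = new ToyPCollection<>(List.of("root"));
        for (int i = 0; i < 3; i++) {
            grids = grids.parallelDo(subdivideFn).checkpoint();
        }
        System.out.println(stagesRun);                 // 3 stage boundaries
        System.out.println(grids.materialized.size()); // 1 * 4^3 = 64 grids
    }
}
```

Dropping the checkpoint() call in the loop would leave all three subdivideFn applications queued in one pending list, i.e. a single fused stage -- which is exactly the mapper-count problem the thread describes.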