From: Josh Wills <jwills@cloudera.com>
To: dev@crunch.apache.org
Date: Thu, 27 Feb 2014 12:30:40 -0800
Subject: Re: Illegal State Exception when doing a union

Oh, absolutely-- go right ahead.

J

On Thu, Feb 27, 2014 at 12:26 PM, Jinal Shah wrote:

Can we at least log a JIRA for that? That way, whoever is available and
interested can work on it.

On Thu, Feb 27, 2014 at 1:48 PM, Josh Wills wrote:

Yeah, but it will require changing code to do that; there isn't a way to do it
as currently implemented. My hypothesis is that we would need to modify
Sources to check whether they are SourceTargets that don't exist yet, figure
out which job was writing them, add the SourceTarget dependency automatically,
and then do the size planning for the job based on the estimated size of the
PCollection(s) that were populating that target. It's not obviously a trivial
change (at least, it's not obvious to me yet), and I wouldn't consider it a
priority while Pipeline.run() exists as a workaround.

J
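A minimal sketch of the Pipeline.run() workaround referenced above, reusing
the placeholder names from Jinal's example further down the thread
(someOperation(), the Avro path, and the String payload type are illustrative,
not real code from the job):

    // Uses org.apache.crunch.{PCollection, SourceTarget},
    // org.apache.crunch.io.avro.AvroFileTarget, and
    // org.apache.crunch.types.avro.Avros.

    // Write the intermediate result, force it to materialize with run(), and
    // only then read it back, so the planner sees files that actually exist.
    PCollection<String> beforeWrite = someOperation();   // placeholder step
    SourceTarget<String> target =
        new AvroFileTarget("/tmp/before-write").asSourceTarget(Avros.strings());
    pipeline.write(beforeWrite, target);
    pipeline.run();   // the workaround: finish the write job before planning the union
    PCollection<String> afterWrite = pipeline.read(target);
    // afterWrite now points at data that exists on disk, so the later
    // union/co-group can be planned normally, at the cost of an extra run.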
On Thu, Feb 27, 2014 at 11:38 AM, Jinal Shah wrote:

Hey Josh, is there no way of telling the planner, when it is doing a union or
co-group or some other operation where it needs the size of the location we
are reading from after doing a write, to run the pipeline up to that point
when the source is something that has to be generated by earlier processing,
instead of planning it all ahead of time? Maybe I'm completely wrong, but it
was just a thought.

On Wed, Feb 26, 2014 at 9:00 PM, Josh Wills wrote:

Hey Jinal,

Been thinking about it off-and-on all day, and I don't have a better solution
right now than pipeline.run()...

J

On Wed, Feb 26, 2014 at 6:46 PM, Jinal Shah wrote:

So Josh, what do you think can be done?

On Wed, Feb 26, 2014 at 10:37 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:

And it is also trying to run them in parallel, so now it is failing on that.

On Wed, Feb 26, 2014 at 10:30 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:

I did as you said, but now it is running the DoFn twice. Since I write the
output of that parallelDo to HDFS, the planner split the work in two: the DoFn
runs once in the reduce phase while storing the output, and then again in the
map phase while doing the union.

On Tue, Feb 25, 2014 at 7:41 PM, Josh Wills <jwills@cloudera.com> wrote:

So my thought would be that if the DoFn in this step:

    beforeWrite.parallelDo(DoFn, U, ParallelDoOptions.builder().sources(target).build());

signaled that it was going to write a lot of data with a large scaleFactor,
then the planner would use the output from beforeWrite as a checkpoint, and
save the DoFn processing for the map phase.

On Tue, Feb 25, 2014 at 5:08 PM, Jinal Shah <jinalshah2007@gmail.com> wrote:

Yup, this is to avoid .run() ;-). But I want the beforeWrite output to be
stored. So how do I apply the scaleFactor() method, and how will it help make
the DoFn for afterWrite run on the map side?

On Tue, Feb 25, 2014 at 6:58 PM, Josh Wills <josh.wills@gmail.com> wrote:

Okay. Out of curiosity, if you override the float scaleFactor() method on the
DoFn that you apply here:

    PCollection afterParallelDo = afterWrite.parallelDo(DoFn, U,
        ParallelDoOptions.builder().sources(target).build());

and apply it to beforeWrite, does it still insist on writing out beforeWrite
on the reduce side?

BTW, I'm assuming there is (again) some reason not to force a run() here. ;-)
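For anyone following along in the archive, a rough sketch of what overriding
scaleFactor() looks like; the class name, the String types, and the 10x value
are illustrative, not taken from Jinal's actual job:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // A DoFn that reports a large scaleFactor(), i.e. it tells the planner
    // that its output will be much bigger than its input. That is the signal
    // being discussed above for getting the planner to checkpoint the fn's
    // input and defer the fn itself to the map side of the following job.
    public class ExpandingFn extends DoFn<String, String> {
      @Override
      public void process(String input, Emitter<String> emitter) {
        emitter.emit(input);   // stand-in for the real per-record processing
      }

      @Override
      public float scaleFactor() {
        return 10.0f;          // illustrative: "output is roughly 10x the input"
      }
    }

It would then be passed in place of the bare DoFn placeholder in the snippet
above, e.g. beforeWrite.parallelDo(new ExpandingFn(), U,
ParallelDoOptions.builder().sources(target).build()).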
On Tue, Feb 25, 2014 at 4:51 PM, Jinal Shah <jinalshah2007@gmail.com> wrote:

I wanted to run that in the map phase instead of the reduce phase. If I don't
do that, it will run in the reduce phase.

On Tue, Feb 25, 2014 at 5:38 PM, Josh Wills <jwills@cloudera.com> wrote:

> On Tue, Feb 25, 2014 at 3:04 PM, Jinal Shah <jinalshah2007@gmail.com> wrote:
>
> Hi,
>
> I'm trying to do a union of 3 PTables, but I'm getting this error:
> http://pastebin.com/TkMPunJu
>
> This is where it is thrown:
> https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/dist/collect/BaseInputCollection.java#L66
>
> This is what I'm trying to do:
>
>     PCollection beforeWrite = someOperation();
>
>     SourceTarget target = new AvroFileTarget().asSourceTarget(U);
>
>     pipeline.write(beforeWrite, target);
>
>     PCollection afterWrite = pipeline.read(target);

Why are you creating afterWrite here, instead of doing the processing in the
next step (the one that yields afterParallelDo) against beforeWrite?

>     PCollection afterParallelDo = afterWrite.parallelDo(DoFn, U,
>         ParallelDoOptions.builder().sources(target).build());
>
>     PTable afterSomeOperation = someOperations();
>
>     PTable thatNeedsToBeAdded = comingFromHbase();
>
>     PTable unionNeeded = afterSomeOperation.union(thatNeedsToBeAdded);
>     // this is where it fails for some reason, since it is looking for the
>     // target, which is not generated yet
>
> Can anyone help me understand why this is happening?
>
> Thanks,
> Jinal
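To make Josh's inline suggestion concrete, here is a sketch of the
restructured flow: keep the write as a side output, but chain the next
parallelDo off beforeWrite directly instead of reading the target back.
MyDoFn, someOperation(), the path, and the String/Avros.strings() types are
hypothetical stand-ins for whatever the real job uses:

    PCollection<String> beforeWrite = someOperation();   // placeholder, as above
    pipeline.write(beforeWrite, new AvroFileTarget("/tmp/before-write"));

    // Chain directly off beforeWrite; there is no pipeline.read(target), so
    // the planner never has to size a file that has not been written yet.
    PCollection<String> afterParallelDo =
        beforeWrite.parallelDo(new MyDoFn(), Avros.strings());

    // someOperations(), comingFromHbase(), and the union then proceed as before.

As the rest of the thread discusses, the trade-off is that the DoFn may then
run on the reduce side of the job that produces beforeWrite, which is what the
scaleFactor() suggestion above is trying to address.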
--
Director of Data Science
Cloudera
Twitter: @josh_wills