crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinal Shah <jinalshah2...@gmail.com>
Subject Re: Execution Control
Date Wed, 12 Feb 2014 17:15:54 GMT
Can I get some comment on this?


On Thu, Feb 6, 2014 at 11:00 AM, Jinal Shah <jinalshah2007@gmail.com> wrote:

> Hi Josh and Micah,
>
> In both the scenerios I can easily do a Pipeline.run() and get going from
> there. But my main question would be why should I do a pipeline.run() in
> between just to make the planner run something in a sequential format
> rather than the way it would have planned otherwise. What I'm getting at is
> that there should some mechanism that will tell the Planner to do something
> in a certain way to some extend like you can take example of Apache Hive,
> till 0.7 release Hive use to provide a mechanism called HINT which would
> tell the query planner to run something as indicated in the HINT rather
> than the way it would have been otherwise. I know that you might say it
> might not create optimized plan but at this point the consumer is more
> focused on the way it should be planned rather than the optimization.
>
> May be there might be option already there in Crunch that I might have not
> explored but just wanted to put my point out there. If there is an option I
> would love to learn about it.
>
>
> On Thu, Feb 6, 2014 at 10:44 AM, Josh Wills <josh.wills@gmail.com> wrote:
>
>> Hey Jinal,
>>
>> On scenario 2, the easiest way to do this is to force a run() between the
>> write and the second read, ala:
>>
>> HBase.read()
>> doSomeChangesOnData
>> HBase.write()
>> Pipeline.run()
>> HBase.read()
>>
>> If that isn't possible for some reason, you'll need to add an output file
>> to the first phase that can be used to indicate that the HBase.write is
>> complete, and then have the second read depend on that file existing
>> before
>> it can run, which can be done via ParallelDoOptions, e.g.,
>>
>> SourceTarget marker = ...;
>> HBase.read()
>> doSomeChangesOnData
>> dummyDoFnToCreateMarker
>> HBase.write()
>> marker.write()
>> HBase.read().parallelDo(DoFn, PType,
>> ParallelDoOptions.builder().sourceTarget(marker).build());
>>
>> but that's obviously uglier and more complicated.
>>
>> J
>>
>>
>> On Wed, Feb 5, 2014 at 7:14 PM, Jinal Shah <jinalshah2007@gmail.com>
>> wrote:
>>
>> > Hi Josh,
>> >
>> > Here is a small example of what I am looking for. So here is what I'm
>> doing
>> >
>> > Scenario 1:
>> >
>> > PCollection<Something> s = FunctionDoingSomething();
>> > pipeline.write(s, path);
>> > doSomeFilteringOn(s);
>> >
>> > I want that when I do some filtering this should be done in the map
>> phase
>> > instead it is doing it in the Reduce phase due to which I have to
>> introduce
>> > a pipeline.run() and now this is what the code looks like
>> >
>> > PCollection<Something> s = FunctionDoingSomething();
>> > pipeline.write(s, path);
>> > pipeline.run()
>> > doSomeFilteringOn(s);
>> >
>> > Scenerio 2:
>> >
>> > I'm doing an operation on HBase and here is how it looks.
>> >
>> > Hbase.read()
>> > doSomeChangesOnData
>> > HBase.write()
>> > HBase.read()
>> >
>> > Now Crunch at this points considers both the reads as separate and
>> tries to
>> > run it in parallel so now before I even write my changes it reads those
>> > changes so I have to again put a pipeline.run() in order to break it
>> into 2
>> > separate flow and execute them in sequence.
>> >
>> > So I'm asking is there any way to send an HINT to the Planner that how
>> it
>> > create the Plan instead of it deciding by itself or someway to have more
>> > control how to make a planner understand in certain situations.
>> >
>> > Thanks
>> > Jinal
>> >
>> >
>> > On Thu, Jan 30, 2014 at 11:10 AM, Josh Wills <jwills@cloudera.com>
>> wrote:
>> >
>> > > On Thu, Jan 30, 2014 at 7:09 AM, Jinal Shah <jinalshah2007@gmail.com>
>> > > wrote:
>> > >
>> > > > Hi everyone,
>> > > >
>> > > > This is Jinal Shah, I'm new to the group. I had a question about
>> > > Execution
>> > > > Control in Crunch. Is there any way we can force Crunch to do
>> certain
>> > > > operations in parallel or certain operations in sequential ways. For
>> > > > example, let's say if we want the pipeline to executed a particular
>> > DoFn
>> > > > function in the Map phase instead of the Reduce phase or
>> vice-versa. Or
>> > > > Execute a particular Flow only after a particular flow is completed
>> as
>> > > > oppose to running it in parallel.
>> > > >
>> > >
>> > > Forcing a DoFn to operate in a map or reduce phase is tough for the
>> > planner
>> > > to do right now; we sort of rely on the developer to have a mental
>> model
>> > of
>> > > how the jobs will proceed. The place where you usually want to force a
>> > DoFn
>> > > to execute in the reduce vs. the map phase is when you have dependent
>> > > groupByKey operations, and you can use cache() or materialize() on the
>> > > intermediate output that you want to split on, and the planner will
>> > respect
>> > > that.
>> > >
>> > > On the latter question, the thing to look for is
>> > > org.apache.crunch.ParallelDoOptions, which isn't something I've doc'd
>> in
>> > > the user guide yet (it's on the todo list, I promise.) You can give a
>> > > parallelDo call an additional argument that specifies one or more
>> > > SourceTargets that have to exist before a particular DoFn is allowed
>> to
>> > > run. In this way, you can force aspects of the pipeline to be
>> sequential
>> > > instead of parallel. We make use of ParallelDoOptions inside of the
>> > > MapsideJoinStrategy code, to ensure that the data set that we'll be
>> > loading
>> > > in-memory actually exists in the file system before we run the code
>> that
>> > > reads it into memory.
>> > >
>> > >
>> > > >
>> > > > Maybe this might be asked before so sorry if it came again. If you
>> guys
>> > > > have further question on the details do let me know
>> > > >
>> > > >
>> > > > Thanks everyone and Have a great day.
>> > > >
>> > > > Thanks
>> > > > Jinal
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Director of Data Science
>> > > Cloudera <http://www.cloudera.com>
>> > > Twitter: @josh_wills <http://twitter.com/josh_wills>
>> > >
>> >
>>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message