crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: [jira] [Commented] (CRUNCH-294) Cost-based job planning
Date Wed, 20 Nov 2013 15:42:30 GMT
Getting on a plane. :)

Not with the way materialize works for unions, at least not currently. I could
change how materialize works and see whether that's too invasive. I have some
ideas.
On Nov 20, 2013 7:23 AM, "Gabriel Reid (JIRA)" <jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/CRUNCH-294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13827755#comment-13827755]
>
> Gabriel Reid commented on CRUNCH-294:
> -------------------------------------
>
> The logic for choosing splits sounds good to me.
>
> About the breakpoint() method: is it not possible to get the same
> functionality by calling materialize() on a PCollection? I'm also a little
> bit worried about the name -- I think it could cause some confusion in
> terms of debugger breakpoints (but maybe that's just me). The only other
> name I can think of for that is checkpoint(), which maybe also brings its
> share of confusion along with it.
>
> And one other small thing: I noticed a Cloudera file header in
> BreakpointIT.java.
>
> > Cost-based job planning
> > -----------------------
> >
> >                 Key: CRUNCH-294
> >                 URL: https://issues.apache.org/jira/browse/CRUNCH-294
> >             Project: Crunch
> >          Issue Type: Improvement
> >          Components: Core
> >            Reporter: Josh Wills
> >            Assignee: Josh Wills
> >         Attachments: CRUNCH-294.patch, CRUNCH-294b.patch,
> jobplan-default-new.png, jobplan-default-old.png, jobplan-large_s2_s3.png,
> jobplan-lopsided.png
> >
> >
> > A bug report on the user list drove me to revisit some of the core
> planning logic, particularly around how we decide where to split up DoFns
> between two dependent MapReduce jobs.
> > I found an old TODO about using the scale factor from a DoFn to decide
> where to split up the nodes between dependent GBKs, so I implemented a new
> version of the split algorithm that takes advantage of how we've propagated
> support for multiple outputs on both the map and reduce sides of a job to
> do finer-grained splits that use information from the scaleFactor
> calculations to make smarter split decisions.
> > One high-level change along with this: I changed the default
> scaleFactor() value in DoFn to 0.99f to slightly prefer writes that occur
> later in a pipeline flow by default.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.1#6144)
>
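
For readers following the quoted issue description, here is a minimal sketch of how per-DoFn scale factors could drive a split decision between two dependent jobs. It is not the code from CRUNCH-294.patch: the class name SplitSketch, the method chooseSplit, the input size, and the rule "write where the estimated data size is smallest, later stage wins ties" are all assumptions made for illustration. Only the 0.99f default comes from the description above.

```java
import java.util.List;

public class SplitSketch {

    // Hypothetical helper, not part of Crunch. Given an input size and the
    // scale factor of each DoFn in a chain, return the index of the DoFn
    // after which the estimated data size is smallest. Ties go to the later
    // DoFn. Returns -1 for an empty chain.
    static int chooseSplit(double inputSize, List<Float> scaleFactors) {
        double size = inputSize;
        double best = Double.MAX_VALUE;
        int bestIndex = -1;
        for (int i = 0; i < scaleFactors.size(); i++) {
            // Each DoFn is assumed to multiply the data size by its factor.
            size *= scaleFactors.get(i);
            if (size <= best) {
                best = size;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // Three DoFns that all use the 0.99f default from the description.
        List<Float> defaults = List.of(0.99f, 0.99f, 0.99f);
        System.out.println(chooseSplit(100.0, defaults));
    }
}
```

Under these assumptions, a chain of DoFns that all keep the 0.99f default shrinks the estimate slightly at every step, so the last DoFn is chosen; that is one way to read "slightly prefer writes that occur later in a pipeline flow". A DoFn with a much smaller factor earlier in the chain would pull the split toward itself.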
