crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-294) Cost-based job planning
Date Sun, 17 Nov 2013 20:25:22 GMT


Gabriel Reid commented on CRUNCH-294:

Yes, that all sounds right to me, and I think that sticking with the simple rules for
now is a good plan.

One thing I'm thinking is that there may also be a need for information on the size of records
in a PCollection. Many operations have a per-record CPU footprint that is roughly
constant, independent of the size of the record -- so an indicator of the mean size of
records would let the planner estimate the number of records in a PCollection.
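To make that arithmetic concrete, here is a hedged sketch; `SizeEstimator` and `estimateRecordCount` are invented names for illustration, not Crunch API:

```java
// Illustrative sketch only: estimating the record count of a PCollection
// from its total byte size plus a mean-record-size hint. SizeEstimator and
// estimateRecordCount are invented names, not part of the Crunch API.
public class SizeEstimator {

    /** Returns an estimated record count, or -1 if no usable size hint exists. */
    static long estimateRecordCount(long totalBytes, long meanRecordSizeBytes) {
        if (meanRecordSizeBytes <= 0) {
            return -1; // no hint available; a planner would fall back to bytes only
        }
        return Math.max(1, totalBytes / meanRecordSizeBytes);
    }

    public static void main(String[] args) {
        // A 1 GiB collection of ~256-byte records is roughly 4M records.
        System.out.println(estimateRecordCount(1L << 30, 256)); // 4194304
    }
}
```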

Once we get to the point where both IO and CPU are taken into account by the planner,
it could also be interesting to allow configuring some kind of thresholds or weights for a job so
that you can, for example, say "don't worry about optimizing IO because I'm running on SSDs"
or something like that.
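One way such weights could look, sketched with invented key names (nothing here is existing Crunch or Hadoop configuration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a planner that combines IO and CPU cost with
// per-job weights. "Running on SSDs" then just means setting the IO
// weight low. Both configuration keys are invented for illustration.
public class PlannerWeights {
    static final String IO_WEIGHT = "crunch.planner.io.weight";
    static final String CPU_WEIGHT = "crunch.planner.cpu.weight";

    /** Weighted total cost; unset weights default to 1.0. */
    static double combinedCost(Map<String, Double> conf, double ioCost, double cpuCost) {
        return conf.getOrDefault(IO_WEIGHT, 1.0) * ioCost
             + conf.getOrDefault(CPU_WEIGHT, 1.0) * cpuCost;
    }

    public static void main(String[] args) {
        Map<String, Double> conf = new HashMap<>();
        conf.put(IO_WEIGHT, 0.25); // SSD-backed cluster: weight IO cost down
        System.out.println(combinedCost(conf, 100.0, 50.0)); // 75.0
    }
}
```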

BTW, maybe memoryFootprint() and cpuFootprint() would be better method names on DoFn.
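For illustration, those hooks might sit next to the existing scaleFactor() roughly like this; the footprint methods and their defaults are only the suggestion from this thread, not real Crunch API, and the base class is simplified so the sketch stands alone:

```java
// Simplified stand-in for Crunch's DoFn, showing where the suggested
// cost hooks could live. memoryFootprint() and cpuFootprint() are the
// proposed names from this thread, not existing API.
public abstract class CostAwareDoFn<S, T> {

    /** Existing hook: ratio of output size to input size. */
    public float scaleFactor() {
        return 0.99f;
    }

    /** Proposed: relative memory cost per record; 1.0f means "typical". */
    public float memoryFootprint() {
        return 1.0f;
    }

    /** Proposed: relative CPU cost per record; 1.0f means "typical". */
    public float cpuFootprint() {
        return 1.0f;
    }
}
```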

I also find the idea of a "learning" planner really interesting, although I worry a bit about
the implications of a pipeline using a different plan in development than in production.
That being said, something like this could just be disabled if needed.

> Cost-based job planning
> -----------------------
>                 Key: CRUNCH-294
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Josh Wills
>            Assignee: Josh Wills
>         Attachments: CRUNCH-294.patch, jobplan-default-new.png, jobplan-default-old.png,
> jobplan-large_s2_s3.png, jobplan-lopsided.png
> A bug report on the user list drove me to revisit some of the core planning logic, particularly
> around how we decide where to split up DoFns between two dependent MapReduce jobs.
> I found an old TODO about using the scale factor from a DoFn to decide where to split
> up the nodes between dependent GBKs, so I implemented a new version of the split algorithm
> that takes advantage of how we've propagated support for multiple outputs on both the map
> and reduce sides of a job to do finer-grained splits that use information from the scaleFactor
> calculations to make smarter split decisions.
> One high-level change along with this: I changed the default scaleFactor() value in DoFn
> to 0.99f to slightly prefer writes that occur later in a pipeline flow by default.
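As a rough illustration of the scaleFactor() hook the split algorithm consumes (scaleFactor() is a real method on Crunch's DoFn; the minimal base class below is inlined only so the sketch compiles on its own):

```java
// Minimal stand-in for Crunch's DoFn so the example is self-contained;
// the real class lives in org.apache.crunch.DoFn.
abstract class DoFn<S, T> {
    // Default described in this issue: 0.99f, slightly favoring later writes.
    public float scaleFactor() {
        return 0.99f;
    }
}

// A heavy filter can tell the planner its output is much smaller than its
// input, which makes it a good candidate for the map side of a split.
class SamplingFilterFn extends DoFn<String, String> {
    @Override
    public float scaleFactor() {
        return 0.1f; // output is roughly 10% the size of the input
    }
}
```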

This message was sent by Atlassian JIRA
