crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-320) Materialize several PObject & PCollection objects in parallel (deferred materialization)
Date Tue, 07 Jan 2014 18:51:50 GMT


Josh Wills updated CRUNCH-320:

    Attachment: CRUNCH-320.patch

Here's a patch for this-- thanks for digging this up, and sorry for the trouble.

As a workaround for your example, you can call materialize() directly on rawInput and on
Sample.sample(rawInput, 0.5), and then call the PObject methods to get their lengths. We'll only
materialize the collection once, and that should signal the outputs to the planner. (If you're
using Crunch 0.9.0 or 0.8.2, we added a cache() method to PCollection that makes this process
more literate, so that you could do:

Sample.sample(rawInput, 0.5).cache().length();

to make the workaround a little bit cleaner.)

> Materialize several PObject & PCollection objects in parallel (deferred materialization)
> ----------------------------------------------------------------------------------------
>                 Key: CRUNCH-320
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jason Gauci
>            Assignee: Josh Wills
>         Attachments: CRUNCH-320.patch
> Currently, Crunch blocks and materializes PCollections (through foo.materialize()) and
> PObjects (through foo.getValue()) on demand, but it would be a significant performance improvement
> if we could mark several of these objects as to be materialized, and then materialize all
> of them in parallel as part of a call.
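
For readers skimming the archive, the "mark now, materialize together" idea in the description above can be sketched as a toy model in plain Java. This is only an illustration under stated assumptions: DeferredBatch, Deferred, mark() and run() are hypothetical names made up for the sketch, none of them are Crunch APIs, and it says nothing about how the attached patch or the Crunch planner actually works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

// Toy model only: none of these names are part of the Crunch API.
final class DeferredBatch {

    // A handle whose value becomes available once the batch has been run.
    static final class Deferred<T> {
        private final Supplier<T> work;
        private CompletableFuture<T> future; // stays null until run() starts the work

        Deferred(Supplier<T> work) {
            this.work = work;
        }

        T getValue() {
            if (future == null) {
                throw new IllegalStateException("call run() on the batch first");
            }
            return future.join();
        }
    }

    private final List<Deferred<?>> marked = new ArrayList<>();

    // Register a computation without executing it.
    <T> Deferred<T> mark(Supplier<T> work) {
        Deferred<T> deferred = new Deferred<>(work);
        marked.add(deferred);
        return deferred;
    }

    // Start every marked computation, then block until all of them have finished.
    void run() {
        List<CompletableFuture<?>> started = new ArrayList<>();
        for (Deferred<?> deferred : marked) {
            started.add(start(deferred));
        }
        CompletableFuture.allOf(started.toArray(new CompletableFuture<?>[0])).join();
        marked.clear();
    }

    private static <T> CompletableFuture<T> start(Deferred<T> deferred) {
        deferred.future = CompletableFuture.supplyAsync(deferred.work);
        return deferred.future;
    }

    public static void main(String[] args) {
        DeferredBatch batch = new DeferredBatch();
        List<Integer> rawInput = List.of(1, 2, 3, 4);
        // Nothing executes here; both results are only marked for later.
        Deferred<Long> total = batch.mark(() -> (long) rawInput.size());
        Deferred<Long> evens = batch.mark(() -> rawInput.stream().filter(x -> x % 2 == 0).count());
        batch.run(); // both computations are submitted before either result is awaited
        System.out.println(total.getValue() + " " + evens.getValue());
    }
}
```

The point of the sketch is the ordering: every result is requested up front, a single run() call does the work, and getValue() afterwards just reads the finished value instead of triggering a separate blocking run per object.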

This message was sent by Atlassian JIRA
