crunch-dev mailing list archives

From "Josh Wills (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-601) Short PCollections in SparkPipeline get length null.
Date Wed, 24 Aug 2016 18:02:21 GMT


Josh Wills updated CRUNCH-601:
    Attachment: CRUNCH-601-jw.patch

My take on this, which is marginally different from Mikael's. I now see why the
parentSize check is necessary: without it, materializing empty PCollections breaks
in a backwards-incompatible way, which is not good. I'm good with the overall approach and will
defer to [~mkwhitacre] on which version to commit.

> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>                 Key: CRUNCH-601
>                 URL:
>             Project: Crunch
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 0.13.0
>         Environment: Running in local mode on Mac as well as in a ubuntu 14.04 docker
>            Reporter: Mikael Goldmann
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-601-jw.patch, CRUNCH-601.patch, CRUNCH-601b.patch, CRUNCH-601c.patch,
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4, gets their lengths,
> runs the pipeline, and prints the lengths. Finally, it asserts that all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is the use of getSize() on an unmaterialized object and
> the assumption that when the estimate getSize() returns is 0, the PCollection is guaranteed
> to be empty, which is false in some cases.
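
The suspected failure mode described in the report can be illustrated with a toy model in plain Java. This is only a sketch and does not use Crunch or Spark: the class SizeEstimateSketch, the methods estimateSize, naiveLength, and countedLength, and the divide-by-3 estimate are all hypothetical, chosen only so that the output mirrors the lengths quoted in the report. It is not the code from any of the attached patches.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SizeEstimateSketch {

    // Hypothetical stand-in for a size estimate on an unmaterialized collection.
    // Integer division truncates, so short collections get an estimate of 0.
    // The divisor 3 is an assumption made to mirror the reported output;
    // it is not Crunch's actual estimation formula.
    static long estimateSize(List<String> data) {
        return data.size() / 3;
    }

    // Flawed logic: treats an estimate of 0 as proof that the collection is
    // empty and skips the count, so no length value is ever produced.
    static Long naiveLength(List<String> data) {
        if (estimateSize(data) == 0) {
            return null;
        }
        return (long) data.size();
    }

    // Safer logic: an estimate is only a hint, so always count the elements.
    static Long countedLength(List<String> data) {
        return (long) data.size();
    }

    // Builds a list of n identical strings as test input.
    static List<String> strings(int n) {
        return Collections.nCopies(n, "x");
    }

    public static void main(String[] args) {
        List<Long> naive = new ArrayList<>();
        List<Long> counted = new ArrayList<>();
        for (int n = 0; n <= 4; n++) {
            naive.add(naiveLength(strings(n)));
            counted.add(countedLength(strings(n)));
        }
        System.out.println("naive:   " + naive);   // [null, null, null, 3, 4]
        System.out.println("counted: " + counted); // [0, 1, 2, 3, 4]
    }
}
```

The point of the sketch is the design choice, not the numbers: a zero estimate can justify an optimization only if it is backed by a check that the collection really is empty.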

This message was sent by Atlassian JIRA
