crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-601) Short PCollections in SparkPipeline get length null.
Date Fri, 19 Aug 2016 01:31:20 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427462#comment-15427462 ]

Micah Whitacre commented on CRUNCH-601:
---------------------------------------

[~migoldmann], it's been a while since I played with Spark directly, but is that test scenario
even valid for Spark?  For MapReduce, an equivalent would be an input directory that doesn't
exist, and that wouldn't produce the value of 1.  I suppose if you had an empty file in that
directory it might kick off a job to actually process it, but that's not as likely.

I can't recall the behavior of Spark when it comes to calling an action on an empty RDD.
Does it kick off a stage?  If it does not, then I'm not sure that behavior is valid.

> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>
>                 Key: CRUNCH-601
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-601
>             Project: Crunch
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 0.13.0
>         Environment: Running in local mode on Mac as well as in an Ubuntu 14.04 Docker container
>            Reporter: Mikael Goldmann
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-601.patch, CRUNCH-601b.patch, SmallCollectionLengthTest.java
>
>
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4, gets the lengths, runs the pipeline, and prints the lengths. Finally it asserts that all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is the use of getSize() on an unmaterialized object and the assumption that when the estimate that getSize() returns is 0, the PCollection is guaranteed to be empty, which is false in some cases.
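
The failure mode described in the report can be illustrated with a small, self-contained model. This is only a sketch and does not use Crunch or Spark code: the class `SizeEstimateSketch`, the `GRANULARITY` constant, and every method below are hypothetical names invented for illustration, and the rounding rule is an assumption chosen only so that short inputs get an estimate of 0. The point is that a length computation which trusts a zero estimate as proof of emptiness produces no value for short, non-empty inputs.

```java
import java.util.ArrayList;
import java.util.List;

public class SizeEstimateSketch {
    // Hypothetical granularity: the estimate only counts whole groups of 3 elements.
    static final int GRANULARITY = 3;

    /** Hypothetical cheap estimate that rounds down, so short inputs report 0. */
    static long estimatedSize(List<String> data) {
        return (long) (data.size() / GRANULARITY) * GRANULARITY;
    }

    /** Flawed approach: treats an estimate of 0 as proof of emptiness and never counts. */
    static Long lengthTrustingEstimate(List<String> data) {
        if (estimatedSize(data) == 0) {
            return null; // counting is skipped, so no value is ever produced
        }
        return (long) data.size();
    }

    /** Safer approach: always counts; the estimate is treated as a hint at most. */
    static long lengthByCounting(List<String> data) {
        return data.size();
    }

    /** Builds a list of n placeholder strings. */
    static List<String> listOfLength(int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add("s" + i);
        }
        return out;
    }

    public static void main(String[] args) {
        for (int n = 0; n <= 4; n++) {
            List<String> data = listOfLength(n);
            System.out.println(n + ": trusting=" + lengthTrustingEstimate(data)
                    + ", counting=" + lengthByCounting(data));
        }
    }
}
```

Under these assumptions, the trusting variant yields null for lengths 0, 1 and 2 and the real count for 3 and 4, which mirrors the pattern in the report; the counting variant returns 0 through 4. Whether the actual Crunch code path fails for this reason is the reporter's hypothesis, not something this model confirms.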



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
