crunch-dev mailing list archives

From "Mikael Goldmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-601) Short PCollections in SparkPipeline get length null.
Date Sat, 20 Aug 2016 23:12:20 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429533#comment-15429533 ]

Mikael Goldmann commented on CRUNCH-601:
----------------------------------------

If I understand correctly:
* It is important that p.getSize() > 0 whenever p is not empty; otherwise processing might be skipped incorrectly.
* Unless p.getSize() == 0 at least some of the time, the branches that skip computation are never taken and could be removed.

So assume that p is empty and p.getSize() == 0.
Form q = p.parallelDo(dofn), where dofn's process(x, emitter) simply calls emitter.emit(x) and its cleanup(emitter) calls emitter.emit(something).

Now q is not empty, since it consists of 'something'.

It seems like it would be a bug if q.getSize() == 0. However, it seems that the current implementation, even with this patch applied, would give q.getSize() == 0.
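
To make the scenario concrete, here is a minimal self-contained sketch in plain Java. It does not use Crunch itself: ToyDoFn, run, and estimatedSize are hypothetical stand-ins, and the rule "estimated size = parent estimate times a scale factor" is my assumption about how the estimate is derived, not something I have confirmed in the code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class CleanupEmitSketch {

    // Hypothetical stand-in for a DoFn: process() is called once per element,
    // cleanup() once after the last element.
    interface ToyDoFn<S, T> {
        void process(S input, Consumer<T> emitter);
        void cleanup(Consumer<T> emitter);
        // Assumed multiplier used to estimate output size from input size.
        default float scaleFactor() { return 1.0f; }
    }

    // Identity function whose cleanup emits one extra element.
    static ToyDoFn<String, String> identityWithCleanup(final String something) {
        return new ToyDoFn<String, String>() {
            @Override public void process(String x, Consumer<String> emitter) { emitter.accept(x); }
            @Override public void cleanup(Consumer<String> emitter) { emitter.accept(something); }
        };
    }

    // Actually runs the function over the input, including the cleanup step.
    static <S, T> List<T> run(List<S> input, ToyDoFn<S, T> fn) {
        List<T> out = new ArrayList<>();
        for (S x : input) {
            fn.process(x, out::add);
        }
        fn.cleanup(out::add);
        return out;
    }

    // Assumed estimate: derived from the parent's estimate only, without running fn.
    static long estimatedSize(long parentEstimate, ToyDoFn<?, ?> fn) {
        return (long) (parentEstimate * fn.scaleFactor());
    }

    public static void main(String[] args) {
        List<String> p = new ArrayList<>();            // p is empty, estimate 0
        ToyDoFn<String, String> dofn = identityWithCleanup("something");
        long estimate = estimatedSize(0L, dofn);       // what the estimate would report for q
        List<String> q = run(p, dofn);                 // what q really contains
        System.out.println("estimated size of q = " + estimate);
        System.out.println("actual elements in q = " + q);
    }
}
```

Under these assumptions the estimate for q is 0 while q actually holds one element, so any branch that skips work when the estimate is 0 would drop 'something'.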

Am I missing something in my assumptions?

> Short PCollections in SparkPipeline get length null.
> ----------------------------------------------------
>
>                 Key: CRUNCH-601
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-601
>             Project: Crunch
>          Issue Type: Bug
>          Components: Spark
>    Affects Versions: 0.13.0
>         Environment: Running in local mode on Mac as well as in a ubuntu 14.04 docker container
>            Reporter: Mikael Goldmann
>            Assignee: Micah Whitacre
>            Priority: Minor
>         Attachments: CRUNCH-601.patch, CRUNCH-601b.patch, CRUNCH-601c.patch, SmallCollectionLengthTest.java
>
>
> I'll attach a file with a test that I would expect to pass but which fails.
> It creates five PCollection<String> of lengths 0, 1, 2, 3, 4, gets the lengths, runs the pipeline, and prints the lengths. Finally, it asserts that all lengths are non-null.
> I would expect it to print lengths 0, 1, 2, 3, 4 and pass.
> What it does is print lengths null, null, null, 3, 4 and fail.
> I think the underlying reason is the use of getSize() on an unmaterialized object, together with the assumption that when the estimate getSize() returns is 0, the PCollection is guaranteed to be empty, which is false in some cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
