crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-494) Unable to union large number of PCollections
Date Fri, 30 Jan 2015 07:24:35 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Wills updated CRUNCH-494:
------------------------------
    Attachment: CRUNCH-494.patch

I'm guessing that what you're doing is iterating through some data set of unknown length,
and you're continually unioning each new PCollection you create with the result of the
previous union, so something like:

PCollection unioned = ...;
while (someCondition) {
  unioned = unioned.union(newPCollection);
}

...and if you do that enough times, the chain of nested unions just gets really deep, hence
the stack overflow (the trace shows getTargetDependencies calling itself over and over).
I'm not sure I can easily change the way the union chaining works w/o altering other behavior,
but it's pretty easy to add Pipeline.union methods (one for PCollection, one for PTable), as
I did in the attached patch. They let you create a List<PCollection<S>> and pass
it to Pipeline.union in order to get a single, unioned PCollection<S> that won't have
the stack overflow problem.
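
To illustrate why the shape of the union matters, here is a rough, self-contained sketch. It
does not use Crunch at all: Node is a hypothetical stand-in for a PCollection that only knows
its parents, and depth() is an assumed recursive walk loosely modeled on the repeated
getTargetDependencies frames in the trace, not the real implementation. It compares pairwise
chaining against a single flat union over a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class UnionDepthSketch {

    /** Hypothetical stand-in for a PCollection: a node that only knows its parents. */
    public static final class Node {
        final List<Node> parents;

        Node(List<Node> parents) {
            this.parents = parents;
        }

        /** Recursive walk over parents; uses one stack frame per level of nesting. */
        public int depth() {
            int max = 0;
            for (Node parent : parents) {
                max = Math.max(max, parent.depth());
            }
            return max + 1;
        }
    }

    static Node leaf() {
        return new Node(Collections.<Node>emptyList());
    }

    /** Pairwise chaining: every union wraps the previous result, so depth grows with n. */
    public static Node chainedUnion(int n) {
        Node unioned = leaf();
        for (int i = 1; i < n; i++) {
            unioned = new Node(Arrays.asList(unioned, leaf()));
        }
        return unioned;
    }

    /** Flat union: one node whose direct parents are all n inputs, so depth stays at 2. */
    public static Node flatUnion(int n) {
        List<Node> inputs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            inputs.add(leaf());
        }
        return new Node(inputs);
    }

    public static void main(String[] args) {
        System.out.println("chained depth: " + chainedUnion(1000).depth()); // 1000
        System.out.println("flat depth:    " + flatUnion(1000).depth());    // 2
    }
}
```

In this model the chained form needs one level of recursion per input, so a large enough
input count would exhaust the thread stack, while the flat form needs the same small number of
frames no matter how many inputs are in the list.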

> Unable to union large number of PCollections 
> ---------------------------------------------
>
>                 Key: CRUNCH-494
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-494
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: Surbhi Mungre
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH-494.patch
>
>
> If you try to union a large number of PCollections (~5K), Crunch throws a StackOverflowError.
> {noformat}
> java.lang.StackOverflowError
> 	at com.google.common.collect.AbstractIndexedListIterator.<init>(AbstractIndexedListIterator.java:68)
> 	at com.google.common.collect.AbstractIndexedListIterator.<init>(AbstractIndexedListIterator.java:54)
> 	at com.google.common.collect.Iterators$12.<init>(Iterators.java:1072)
> 	at com.google.common.collect.Iterators.forArray(Iterators.java:1072)
> 	at com.google.common.collect.RegularImmutableList.iterator(RegularImmutableList.java:68)
> 	at com.google.common.collect.RegularImmutableList.iterator(RegularImmutableList.java:31)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:291)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> 	at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> {noformat}
> Here is a simple test that reproduces the issue:
> https://gist.github.com/anonymous/22f08511604341d0ffda



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
