crunch-dev mailing list archives

From "David Whiting (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CRUNCH-483) Scrunch .map does not allow mapping to a PCollection[(A,B)]
Date Thu, 18 Dec 2014 12:57:13 GMT

     [ https://issues.apache.org/jira/browse/CRUNCH-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Whiting updated CRUNCH-483:
---------------------------------
    Attachment: 0001-Add-asPCollection-method-to-PTable-and-corresponding.patch

Attached patch for the "second best" option, as making PTable a PCollection is indeed problematic.

> Scrunch .map does not allow mapping to a PCollection[(A,B)]
> -----------------------------------------------------------
>
>                 Key: CRUNCH-483
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-483
>             Project: Crunch
>          Issue Type: Bug
>          Components: Scrunch
>    Affects Versions: 0.11.0
>            Reporter: David Whiting
>            Priority: Minor
>         Attachments: 0001-Add-asPCollection-method-to-PTable-and-corresponding.patch
>
>
> When using Scrunch PCollections and attempting to map to a pair of values, the keyvalue implicit function in CanParallelDo will "upgrade" the result to a PTable[K, V]. This is often the desired behaviour, but because Scrunch PTable does not extend Scrunch PCollection, there are cases where this is not what is wanted.
> Concrete example from music land: I am trying to count the number of plays for each track in each country. I want to do this:
> trackPlayedMessage(tpm => (tpm.track, tpm.country)).count()
> However, because of the implicit CanParallelTransform that is substituted, I cannot call .count(): what I get is a PTable and not a PCollection.
> There are a number of possible remedies that I'm happy to have a go at, but I'd like some input as to which would be best:
> - Make PTable[K,V] a real extension of PCollection[(K, V)] (analogous to how it works in Crunch)
> - Add an "asPCollection" method to PTable which "downgrades" the PTable[K, V] to a PCollection[(K, V)].
> - Make mapToTable and flatMapToTable distinct from map and flatMap to make the choice explicit (warning: breaks existing API).
> - Expose an equivalent to LowPriorityParallelTransforms.single to be invoked explicitly to get a collection instead of a table using .map(fn)(implicitly, single)
> - Something else
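
The behaviour described above, and the second and fourth remedies, can be illustrated with a small toy model in plain Scala. This is only a sketch: Coll, Table, CanMap, LowPriorityCanMap and asColl are hypothetical stand-ins invented for illustration, not Scrunch's actual PCollection, PTable or CanParallelTransform, and the attached patch may well differ. It assumes the usual low-priority-implicit pattern, in which an implicit defined in the companion object is preferred over one inherited from a parent trait.

```scala
// Toy model only: these types are hypothetical stand-ins, not Scrunch classes.
final case class Coll[A](items: Seq[A]) {
  // The result type To is chosen by whichever CanMap instance the compiler finds.
  def map[B, To](f: A => B)(implicit cm: CanMap[B, To]): To = cm.build(items.map(f))

  // count() exists only on Coll, mirroring the situation in the report.
  def count(): Map[A, Long] =
    items.groupBy(identity).map { case (k, v) => k -> v.size.toLong }
}

final case class Table[K, V](pairs: Seq[(K, V)]) {
  // Sketch of the "asPCollection" idea: downgrade a table to a collection of pairs.
  def asColl: Coll[(K, V)] = Coll(pairs)
}

trait CanMap[B, To] { def build(items: Seq[B]): To }

trait LowPriorityCanMap {
  // Fallback: any element type maps to a plain collection.
  implicit def single[B]: CanMap[B, Coll[B]] = new CanMap[B, Coll[B]] {
    def build(items: Seq[B]): Coll[B] = Coll(items)
  }
}

object CanMap extends LowPriorityCanMap {
  // Preferred whenever the element type is a pair: the result becomes a Table.
  implicit def keyvalue[K, V]: CanMap[(K, V), Table[K, V]] =
    new CanMap[(K, V), Table[K, V]] {
      def build(items: Seq[(K, V)]): Table[K, V] = Table(items)
    }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val plays = Coll(Seq(("song-a", "SE"), ("song-a", "SE"), ("song-b", "US")))

    // Mapping to a pair selects keyvalue, so the result is a Table with no count().
    val table: Table[String, String] = plays.map(p => (p._1, p._2))

    // Remedy 2: downgrade explicitly, then count.
    println(table.asColl.count())

    // Remedy 4: pass the low-priority instance by hand to stay a collection.
    println(plays.map(p => (p._1, p._2))(CanMap.single[(String, String)]).count())
  }
}
```

In this model the downgrade costs one extra method call at the call site and leaves existing code untouched, whereas passing the instance explicitly requires the caller to know about the implicit machinery.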



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
