crunch-dev mailing list archives

From "Josh Wills (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-483) Scrunch .map does not allow mapping to a PCollection[(A,B)]
Date Wed, 17 Dec 2014 03:29:13 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249408#comment-14249408 ]

Josh Wills commented on CRUNCH-483:
-----------------------------------

[~davw] IIRC, making PTable a real extension of PCollection had some complexity associated
with it that seemed more hassle than it was worth, but I may be remembering that wrong. If
it doesn't look like an actual problem, that might be the best bet.

Second best in my mind would be an explicit asPCollection method in conjunction with an implicit
PTable[K, V] -> PCollection[(K, V)] method that was defined inside of the Conversions object.
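
To make that second option concrete, here is a minimal sketch. It uses toy stand-ins
rather than the real Scrunch classes: the simplified PCollection and PTable below, the
in-memory count(), and the name tableToCollection are all assumptions made for
illustration. Only asPCollection and the Conversions object are names taken from the
discussion.

```scala
import scala.language.implicitConversions

// Toy stand-in for a Scrunch PCollection (NOT the real class): an in-memory Seq.
final case class PCollection[A](values: Seq[A]) {
  // Simplified count(): number of occurrences of each distinct element.
  def count(): Map[A, Long] =
    values.groupBy(identity).map { case (k, vs) => k -> vs.size.toLong }
}

// Toy stand-in for a Scrunch PTable (NOT the real class). It deliberately does
// not extend PCollection, mirroring the situation described in the issue.
final case class PTable[K, V](entries: Seq[(K, V)]) {
  // Explicit "downgrade" from a table to a collection of pairs.
  def asPCollection: PCollection[(K, V)] = PCollection(entries)
}

object Conversions {
  // Hypothetical implicit view PTable[K, V] -> PCollection[(K, V)].
  implicit def tableToCollection[K, V](t: PTable[K, V]): PCollection[(K, V)] =
    t.asPCollection
}

object Demo {
  import Conversions._

  val plays = PTable(Seq(("track1", "SE"), ("track1", "SE"), ("track2", "US")))

  // Explicit route: always available, no implicits involved.
  val explicitCounts = plays.asPCollection.count()

  // Implicit route: PTable has no count(), so the compiler applies the view.
  val implicitCounts = plays.count()
}
```

In this sketch the explicit method stays usable even when the implicit is not imported,
which is one reason to offer both.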

> Scrunch .map does not allow mapping to a PCollection[(A,B)]
> -----------------------------------------------------------
>
>                 Key: CRUNCH-483
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-483
>             Project: Crunch
>          Issue Type: Bug
>          Components: Scrunch
>    Affects Versions: 0.11.0
>            Reporter: David Whiting
>            Priority: Minor
>
> When using Scrunch PCollections and attempting to map to a pair of values, the keyvalue
> implicit function in CanParallelDo will "upgrade" the result to a PTable[K, V]. This is often
> the desired behaviour, but as Scrunch PTable is not an extension of Scrunch PCollection,
> there are cases where this is not what is wanted.
> Concrete example from music land: I am trying to count the number of plays for each track
> in each country. I want to do this:
> trackPlayedMessage(tpm => (tpm.track, tpm.country)).count()
> However, because of the implicit CanParallelTransform that is substituted, I cannot call
> .count() because what I get is a PTable and not a PCollection.
> There are a number of possible remedies that I'm happy to have a go at, but I'd like
> some input as to which would be best:
> - Make PTable[K,V] a real extension of PCollection[(K, V)] (analogous to how it works
>   in Crunch)
> - Add an "asPCollection" method to PTable which "downgrades" the PTable[K, V] to a
>   PCollection[(K, V)].
> - Make mapToTable and flatMapToTable distinct from map and flatMap to make the choice
>   explicit (warning: breaks existing API).
> - Expose an equivalent to LowPriorityParallelTransforms.single to be invoked explicitly
>   to get a collection instead of a table using .map(fn)(implicitly, single)
> - Something else
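
For readers unfamiliar with the mechanism behind the report, the following is a rough
sketch of how a low-priority implicit pattern can "upgrade" pair results to a table, and
how passing the low-priority instance explicitly opts out. Everything here is a toy model
and an assumption: CanBuild, Coll, Table and Source are hypothetical names, not Scrunch
types, and the real CanParallelTransform signatures may differ. Only the names single and
keyvalue echo the issue text.

```scala
// Toy result types (NOT Scrunch classes).
final case class Coll[A](values: Seq[A])
final case class Table[K, V](entries: Seq[(K, V)])

// Hypothetical type class that decides what a map produces.
trait CanBuild[T, To] { def build(values: Seq[T]): To }

trait LowPriorityBuilders {
  // Fallback: any element type becomes a plain collection.
  implicit def single[T]: CanBuild[T, Coll[T]] = new CanBuild[T, Coll[T]] {
    def build(values: Seq[T]): Coll[T] = Coll(values)
  }
}

object CanBuild extends LowPriorityBuilders {
  // Defined in the subclass, so it wins over single when the element is a pair.
  implicit def keyvalue[K, V]: CanBuild[(K, V), Table[K, V]] =
    new CanBuild[(K, V), Table[K, V]] {
      def build(values: Seq[(K, V)]): Table[K, V] = Table(values)
    }
}

// Toy source whose map delegates the result type to the implicit builder.
final case class Source[A](values: Seq[A]) {
  def map[T, To](f: A => T)(implicit cb: CanBuild[T, To]): To =
    cb.build(values.map(f))
}

object PriorityDemo {
  val src = Source(Seq(1, 2))

  // Pair result: keyvalue is selected, so this is a Table.
  val upgraded = src.map(x => (x, "a"))

  // Passing the low-priority builder explicitly keeps a collection of pairs.
  val kept = src.map(x => (x, "a"))(CanBuild.single[(Int, String)])

  // Non-pair result: only single applies, so this is a Coll.
  val plain = src.map(_ + 1)
}
```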



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
