incubator-crunch-dev mailing list archives

From "Gabriel Reid (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-57) Add a length function to PCollection
Date Sat, 08 Sep 2012 05:53:07 GMT


Gabriel Reid commented on CRUNCH-57:

@Rahul, I think you're confusing sorting with min/max calculations. Min and max
aren't calculated using sort; they're calculated by doing an initial filter in the mappers
and then a final filter in a second reducer.

This approach will always be more efficient than sorting the PCollection first. Outside of
the MR context, calculating min/max is O(n), while sorting is O(n log n) in the average
case. This translates into resource usage in MR, although of course things are done in parallel.
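
As a rough sketch of that two-step filter, here is the idea in plain Java. This is not Crunch
code: the class and method names are made up for illustration, and the nested lists only stand in
for the input splits that separate mappers would see. Each "mapper" keeps just its local maximum
in one linear pass, and a final step filters those few candidates down to the global maximum, so
nothing is ever sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; none of these names come from the Crunch API.
final class MinMaxSketch {

    // "Mapper" step: one O(n) pass that keeps only the largest element of a partition.
    static Optional<Integer> localMax(List<Integer> partition) {
        Integer best = null;
        for (Integer value : partition) {
            if (best == null || value > best) {
                best = value;
            }
        }
        return Optional.ofNullable(best);
    }

    // "Reducer" step: filter the per-partition candidates down to one global maximum.
    static Optional<Integer> globalMax(List<List<Integer>> partitions) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            // Empty partitions contribute no candidate.
            localMax(partition).ifPresent(candidates::add);
        }
        return localMax(candidates);
    }

    public static void main(String[] args) {
        // Three simulated input splits, one of them empty.
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(3, 17, 5),
                Arrays.asList(42, 8),
                Collections.<Integer>emptyList());
        System.out.println(globalMax(partitions));
    }
}
```

The final step only ever sees one candidate per partition, which is why the total work stays
linear in the input size. The same shape works for min by flipping the comparison.
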
> Add a length function to PCollection
> ------------------------------------
>                 Key: CRUNCH-57
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.3.0
>            Reporter: Kiyan Ahmadizadeh
>            Assignee: Josh Wills
>         Attachments: CRUNCH-57.patch
> Sometimes it's useful and interesting to compute the number of elements in a PCollection.
> For example, suppose there was an initial PCollection that was then filtered into another.
> If I'm interested in how many elements of the original PCollection matched the filter, I'll
> have to write extra code to compute this.
> PCollections should have a length method that, when called, computes the number of elements
> in the PCollection and returns the result.
