crunch-dev mailing list archives

From "Rahul Sharma (JIRA)" <>
Subject [jira] [Updated] (CRUNCH-57) Add a length function to PCollection
Date Sun, 09 Sep 2012 03:30:07 GMT


Rahul Sharma updated CRUNCH-57:

    Attachment: minver2.patch

@Gabriel, yes, you are right that the approach outside the MR context will be faster. But in
MR a few things come into play. For example, when we use a reducer after groupByKey, MR puts
sorting in place. Whether we use that sorting or not is secondary; on the reducer the output
will always be sorted.

I have created a version of the min function that tries to use what MR already provides,
following the same principle. I tested it against the Avro data in the aggregate test. It is
a bit faster than the current min function: the best result clocked 9% faster, and the worst
result was the same. Another important aspect is that it doesn't rely on user classes being
comparable.
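
To illustrate the idea only (this is not the content of minver2.patch), here is a minimal plain-Java sketch of taking a minimum from sort order instead of calling compareTo() on user objects. Everything in it is an assumption made for illustration: the class name SortedMinSketch, the helper names shuffleSort, min and encode, the hypothetical Score class, and the choice to stand in for the MR shuffle with an in-memory sort over big-endian key bytes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class SortedMinSketch {

    // Stand-in for the shuffle (assumption): order records by their serialized
    // key bytes, compared as unsigned bytes, without calling compareTo().
    public static <T> List<T> shuffleSort(List<T> records, Function<T, byte[]> keyBytes) {
        List<T> sorted = new ArrayList<>(records);
        sorted.sort((a, b) -> Arrays.compareUnsigned(keyBytes.apply(a), keyBytes.apply(b)));
        return sorted;
    }

    // Stand-in for the reducer: its input is already sorted, so the minimum
    // is simply the first record it sees.
    public static <T> T min(List<T> records, Function<T, byte[]> keyBytes) {
        if (records.isEmpty()) {
            throw new IllegalArgumentException("cannot take the min of an empty collection");
        }
        return shuffleSort(records, keyBytes).get(0);
    }

    // Big-endian encoding of a non-negative int keeps byte order equal to
    // numeric order. Negative values would need a different encoding.
    public static byte[] encode(int nonNegative) {
        return ByteBuffer.allocate(4).putInt(nonNegative).array();
    }

    // Hypothetical user class; note that it does not implement Comparable.
    static final class Score {
        final String name;
        final int points;

        Score(String name, int points) {
            this.name = name;
            this.points = points;
        }
    }

    public static void main(String[] args) {
        List<Score> scores = Arrays.asList(
                new Score("a", 42), new Score("b", 7), new Score("c", 19));
        Score lowest = min(scores, s -> encode(s.points));
        System.out.println(lowest.name + " " + lowest.points); // prints "b 7"
    }
}
```

The sketch sorts the whole list for clarity; in an actual MR job the sort would already have happened by the time the reducer runs, which is the point made above.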
> Add a length function to PCollection
> ------------------------------------
>                 Key: CRUNCH-57
>                 URL:
>             Project: Crunch
>          Issue Type: New Feature
>          Components: Core
>    Affects Versions: 0.3.0
>            Reporter: Kiyan Ahmadizadeh
>            Assignee: Josh Wills
>         Attachments: CRUNCH-57.patch, minver2.patch
> Sometimes it's useful and interesting to compute the number of elements in a PCollection.
> For example, suppose there was an initial PCollection that was then filtered into another.
> If I'm interested in how many elements of the original PCollection matched the filter, I'll
> have to write extra code to compute this.
> PCollections should have a length method that, when called, computes the number of elements
> in the PCollection and returns the result.
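
As a rough illustration of what such a length computation involves, the following plain-Java sketch counts elements the way a distributed job typically would: each split produces a partial count, and the partial counts are summed. It does not use the Crunch API; the class name LengthSketch, the method length, and the modelling of a collection as a list of in-memory splits are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LengthSketch {

    // Each inner list stands in for one input split (assumption).
    public static <T> long length(List<List<T>> splits) {
        long total = 0L;
        for (List<T> split : splits) {
            long partial = 0L;
            for (T ignored : split) {
                partial += 1L; // "map" step: every element contributes 1
            }
            total += partial;  // "reduce" step: sum the partial counts
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> original = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Filter the original collection, as in the example from the issue text.
        List<Integer> evens = original.stream()
                .filter(n -> n % 2 == 0)
                .collect(Collectors.toList());
        // Treat the filtered result as two splits and count it.
        long matched = length(Arrays.asList(evens.subList(0, 1), evens.subList(1, evens.size())));
        System.out.println(matched + " of " + original.size() + " matched"); // prints "3 of 6 matched"
    }
}
```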
