crunch-user mailing list archives

From "Patel,Stephen" <Stephen.Pa...@Cerner.com>
Subject ExtractKeyFn scaleFactor seems to be incorrect
Date Thu, 21 May 2015 17:48:11 GMT
I was looking at the PCollectionImpl.by method[0] today, and I think the ExtractKeyFn[1]
it uses may not be calculating scaleFactor correctly. ExtractKeyFn uses the default
scaleFactor for a MapFn (1.0), but shouldn't it have a scaleFactor of 1 + the input MapFn's
scaleFactor, since each output pair holds both the extracted key and the original value?

As an example, if you have a PCollection<T> and call by with the IdentityFn, the
returned table should have a size of 2 * the original collection's size, but as it stands now,
it will have the same size as the original.
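
To make the proposal concrete, here is a minimal sketch of the suggested scaleFactor. It
deliberately uses small stand-in classes (SketchMapFn, SketchPair, SketchIdentityFn,
SketchExtractKeyFn) rather than the real Crunch types, so every name below is hypothetical
and the actual ExtractKeyFn may differ in detail. The only assumption carried over from the
message is that a MapFn reports a default scaleFactor of 1.0.

```java
// Stand-in for a Crunch-style MapFn; NOT the real org.apache.crunch.MapFn.
abstract class SketchMapFn<S, T> {
    abstract T map(S input);

    // Default described in the message: a MapFn does not change the data size.
    float scaleFactor() {
        return 1.0f;
    }
}

// Stand-in for a key/value pair; NOT the real org.apache.crunch.Pair.
final class SketchPair<K, V> {
    final K first;
    final V second;

    SketchPair(K first, V second) {
        this.first = first;
        this.second = second;
    }
}

// Returns its input unchanged, so its scaleFactor stays at the 1.0 default.
final class SketchIdentityFn<T> extends SketchMapFn<T, T> {
    @Override
    T map(T input) {
        return input;
    }
}

// Emits (key, value) for each value, where the key comes from keyFn.
final class SketchExtractKeyFn<K, V> extends SketchMapFn<V, SketchPair<K, V>> {
    private final SketchMapFn<V, K> keyFn;

    SketchExtractKeyFn(SketchMapFn<V, K> keyFn) {
        this.keyFn = keyFn;
    }

    @Override
    SketchPair<K, V> map(V input) {
        return new SketchPair<>(keyFn.map(input), input);
    }

    // Proposed: the value is passed through (1.0) and the key adds keyFn's share.
    @Override
    float scaleFactor() {
        return 1.0f + keyFn.scaleFactor();
    }
}

final class ScaleFactorSketch {
    public static void main(String[] args) {
        SketchExtractKeyFn<String, String> fn =
            new SketchExtractKeyFn<>(new SketchIdentityFn<String>());
        System.out.println(fn.scaleFactor()); // prints 2.0
    }
}
```

With the identity function as the key extractor, the sketch reports 2.0, matching the
"2 * the original collection's size" expectation above.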

If we later group a table that we constructed with by, won't we (potentially) use far
fewer reducers than we should?
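
The effect on reducer counts can be illustrated with a rough sketch. It assumes, purely for
illustration, that the reducer count is derived as 1 + (estimated bytes / bytes per reducer);
that formula, the 1 GiB per-reducer figure, and the names ReducerEstimateSketch and
estimateReducers are assumptions made here, not verified Crunch behavior.

```java
final class ReducerEstimateSketch {
    // Hypothetical estimate: scale the input size, then divide by a per-reducer budget.
    static int estimateReducers(long inputBytes, float scaleFactor, long bytesPerReducer) {
        long estimatedBytes = (long) (inputBytes * (double) scaleFactor);
        return 1 + (int) (estimatedBytes / bytesPerReducer);
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;       // assumed per-reducer budget
        long input = 10 * oneGiB;     // assumed size of the original collection

        // scaleFactor 1.0: the table is estimated at the original size.
        System.out.println(estimateReducers(input, 1.0f, oneGiB)); // prints 11

        // scaleFactor 2.0: the table is estimated at twice the original size.
        System.out.println(estimateReducers(input, 2.0f, oneGiB)); // prints 21
    }
}
```

Under these assumptions, underestimating the table by a factor of two roughly halves the
number of reducers chosen for the subsequent group.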

[0]: https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/impl/dist/collect/PCollectionImpl.java#L270
[1]: https://github.com/apache/crunch/blob/master/crunch-core/src/main/java/org/apache/crunch/fn/ExtractKeyFn.java

