crunch-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gabriel Reid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-351) Improve performance of Shard#shard on large records
Date Sat, 22 Feb 2014 21:19:21 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13909543#comment-13909543
] 

Gabriel Reid commented on CRUNCH-351:
-------------------------------------

Looks to me like this will indeed work a lot better when there are a lot of unique elements
with a costly compare. I guess it will also be slower if the input PCollection contains a
large number of duplicate items with a low-cost compare, as the shuffle would then become
extremely lightweight in terms of IO in the current version, while this patch would mean that
all the duplicate elements would still go through the full shuffle.

I'm guessing that Shard is going to be much more commonly used for heavier-weight and mostly-unique
values, so it makes sense to optimize for that use case, so I'd say this one sounds like a
good idea to me.

About the use of the random long as the key, I was thinking it might be even better to use
a limited-range int as the generated key, limiting the range of the generated keys to be just
large enough that we're sure that they'll be decently distributed over the partitions. That
way the shuffle will (I think) become even lighter weight because there would be fewer unique
keys to sort. Does that sound right?

Also on the subject of the random key generation, is there any drawback to using a constant
value as the seed for the Random? If not, it might be better to just use a constant to avoid
the dependency on the MapReduce framework.

> Improve performance of Shard#shard on large records
> ---------------------------------------------------
>
>                 Key: CRUNCH-351
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-351
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Chao Shi
>            Assignee: Chao Shi
>         Attachments: crunch-351.patch
>
>
>     This avoids sorting on the input data, which may be long and make
>     shuffle phase slow. The improvement is to sort on pseudo-random numbers.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Mime
View raw message