crunch-user mailing list archives

From Leen Toelen <toe...@gmail.com>
Subject Re: performance impact of batching emit(...)
Date Fri, 10 Jan 2014 12:01:54 GMT
OK, thanks.


On Fri, Jan 10, 2014 at 12:37 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Leen,
>
> I don't have a better idea than trial and error at this point, since the
> best choice of flushEvery would depend on a combination of how much memory
> is available to the tasks, how large the cached objects are, and a rough
> estimate of how many unique elements there are in the data set. It's the
> sort of thing that our much-discussed-but-not-implemented-yet framework for
> collecting runtime metrics to optimize pipelines should track.
>
> J
>
>
> On Thu, Jan 9, 2014 at 1:30 PM, Leen Toelen <toelen@gmail.com> wrote:
>
>> Hi,
>>
>> When looking at PreDistinct, I notice that calls to emitter.emit(...) are
>> stored in memory until more than 'flushEvery' records are found. How does
>> this batching impact performance, since the calls to emit(...) are not
>> batched in the cleanup method but called in a loop?
>>
>> Is there an easy way to find the best size for 'flushEvery' other than
>> trial and error?
>>
>> Best regards,
>> Leen
>>
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
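
For anyone reading along in the archive, below is a minimal, self-contained sketch of the flushEvery caching pattern the question describes. It is an illustration of the pattern, not Crunch's actual PreDistinct implementation, and the Emitter interface here is a hypothetical local stand-in for Crunch's own: values are deduplicated in an in-memory set, flushed once the set grows past flushEvery, and whatever remains is emitted element by element in cleanup().

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the flushEvery caching pattern; not Crunch's actual PreDistinct code.
    // The Emitter interface below is a hypothetical stand-in for Crunch's Emitter.
    public class FlushEveryDistinctSketch<T> {

      interface Emitter<S> {
        void emit(S value);
      }

      private final int flushEvery;          // tuning knob discussed in the thread
      private final Set<T> cache = new HashSet<>();

      public FlushEveryDistinctSketch(int flushEvery) {
        this.flushEvery = flushEvery;
      }

      // Called once per input record: deduplicate in memory instead of emitting
      // immediately, so fewer duplicate records reach the shuffle.
      public void process(T input, Emitter<T> emitter) {
        cache.add(input);
        if (cache.size() > flushEvery) {
          flush(emitter);
        }
      }

      // Called at the end of the task: whatever is still cached is emitted
      // one element at a time in a loop, not as a single batched write.
      public void cleanup(Emitter<T> emitter) {
        flush(emitter);
      }

      private void flush(Emitter<T> emitter) {
        for (T value : cache) {
          emitter.emit(value);
        }
        cache.clear();
      }
    }

The trade-off Josh describes falls out of this directly: a larger flushEvery keeps more unique values in task memory but pushes fewer duplicate records into the shuffle, while a smaller value does the opposite.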
