samza-dev mailing list archives

From Chris Riccomini <criccom...@linkedin.com.INVALID>
Subject Re: Samza threads issues
Date Wed, 29 Oct 2014 17:12:55 GMT
Hey Dotan,

> should we increase the Kafka topic sizes to accommodate incoming data
> during these time gaps as opposed to the parallel GC?

You'll have to experiment and see. I doubt it, though.

> Or on a broader aspect - What are the best practices to measure and set
> the right size for the Kafka topics? Can anyone share his experience on
> that?

There's a lot that goes into this. Some things to consider:

1. Peak bytes/sec throughput.
2. Retention policy for the topic.
3. Parallelism requirements for consumers.

At LinkedIn, we start with a default of 8 partitions and size up as needed. The need
could be that partitions are running too hot (either on reads or writes),
that the partitions are too large on disk (retention policy), or that the
downstream consumers can't keep up because their processing is slower than
the messages/sec on the partition.
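
As a rough illustration of how those three pressures could be combined, here is a minimal Python sketch. It is only a back-of-the-envelope heuristic, not a LinkedIn tool or a Kafka API: the function name, every parameter, and the per-partition limits are hypothetical values you would have to measure for your own cluster. Only the starting default of 8 comes from the message above.

```python
import math


def estimate_partitions(peak_bytes_per_sec,
                        max_bytes_per_sec_per_partition,
                        retention_seconds,
                        max_bytes_on_disk_per_partition,
                        peak_msgs_per_sec,
                        consumer_msgs_per_sec,
                        default_partitions=8):
    """Return a rough partition count for a topic (hypothetical heuristic)."""
    # Partitions needed so that no single partition runs too hot on I/O.
    by_throughput = math.ceil(
        peak_bytes_per_sec / max_bytes_per_sec_per_partition)

    # Partitions needed so that retained data per partition stays under a
    # chosen on-disk limit. Using the peak rate makes this an upper bound.
    by_disk = math.ceil(
        peak_bytes_per_sec * retention_seconds
        / max_bytes_on_disk_per_partition)

    # Partitions needed so that one consumer per partition can keep up.
    by_consumers = math.ceil(peak_msgs_per_sec / consumer_msgs_per_sec)

    # Never go below the starting default; otherwise take the tightest need.
    return max(default_partitions, by_throughput, by_disk, by_consumers)


if __name__ == "__main__":
    # Made-up numbers: 50 MB/s peak, 10 MB/s per partition, 1 day retention,
    # 500 GB per partition on disk, 100k msgs/s peak, 5k msgs/s per consumer.
    print(estimate_partitions(50_000_000, 10_000_000, 86_400,
                              500_000_000_000, 100_000, 5_000))
```

With the made-up numbers in the example, the slow consumers are the limiting factor, which matches the third case described above. Treat the result as a starting point for experiments rather than an answer.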

Cheers,
Chris

On 10/28/14 11:59 PM, "Dotan Patrich" <dotanp@fortscale.com> wrote:

>Thanks Chris,
>We will test our product using SerialGC to see how it behaves.
>
>One concern that I have is regarding the Kafka topic sizes - Assuming
>"stop-the-world" GC pauses will be more noticeable using SerialGC, should
>we increase the Kafka topic sizes to accommodate incoming data during
>these time gaps as opposed to the parallel GC?
>Or on a broader aspect - What are the best practices to measure and set
>the right size for the Kafka topics? Can anyone share his experience on
>that?
>
>Thanks,
>Dotan
>
>On Tue, Oct 28, 2014 at 5:53 PM, Chris Riccomini <
>criccomini@linkedin.com.invalid> wrote:
>
>> Hey Dotan,
>>
>> We run all of our jobs using SerialGC by default. For a few of our
>> higher-throughput jobs, we've had better luck with parallel GC or G1,
>> but in general, serial works fine.
>>
>> Cheers,
>> Chris
>>
>> On 10/28/14 8:34 AM, "Dotan Patrich" <dotanp@fortscale.com> wrote:
>>
>> >Hi All,
>> >
>> >I encountered some issues caused by having too many threads for a user
>> >on Linux CentOS. Investigating this further, it turned out that the JVM
>> >spawns over 31 threads per process for GC. With about 18 Samza processes
>> >running on the machine, we soon got near the 1000-thread limit per user.
>> >I was thinking of running the Samza JVM with SerialGC instead of parallel
>> >GC to avoid having so many threads in the environment. In addition,
>> >theoretically this seems to be a better fit for situations where we
>> >prefer throughput over latency in a single-core environment (this is
>> >roughly what our Samza tasks are assigned).
>> >
>> >Before doing so, I would really appreciate your insights - has anyone
>> >encountered this issue before? Is changing the GC to serial a good
>> >solution?
>> >
>> >Thanks,
>> >Dotan
>>
>>

