flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From neo21 zerro <neo21_ze...@yahoo.com>
Subject Re: Flink Kafka more consumers than partitions
Date Wed, 03 Aug 2016 09:57:08 GMT
Hi guys, 

Thanks for getting back to me.

So to clarify: 
    Topology wise flink kafka source (does avro deserialization and small map) -> window
operator which does batching for 3 seconds -> hbase sink 


1. flink source: parallelism 40 (20 idle tasks) -> window operator: parallelism 160 ->
hbase sink: parallelism 160
    - roughly 10.000 requests/sec on hbase
2. flink source: parallelism 20 -> window operator: parallelism 160 -> hbase sink: parallelism
    - roughly 100.000 requests/sec on hbase (10x more)

@Stephan as described below the parallelism of the sink was kept the same. I agree with you
that there is nothing to backpressue on the source ;) However, my understanding right now
is that only backpressure can be the explanation for this situation. Since we just change
the source parallelism, other things like hbase parallelism  are kept the same. 
@Sameer all of those things are valid points. We make sure that we reduce the row locking
by partitioning the data on the hbase sinks. We are just after why this limitation is happening.
And since the same setup is used but just the source parallelism is changed I don't expect
this to be a hbase issue. 

Thanks guys!

On Wednesday, August 3, 2016 11:38 AM, Sameer Wadkar <sameer@axiomine.com> wrote:
What is the parallelism of the sink or the operator which writes to the sinks in the first
case. HBase puts are constrained by the following:
1. How your regions are distributed. Are you pre-splitting your regions for the table. Do
you know the number of regions your Hbase tables are split into. 
2. Are all the sinks writing to all the regions. Meaning are you getting records in the sink
operator which could potentially go to any region. This can become a big bottleneck if you
have 40 sinks talking to all regions. I pre-split my regions based on key salting and use
custom partitioning to ensure each sink operator writes to only a few regions and my performance
shot up by several orders. 
3. You are probably doing this but adding puts in batches helps in general but having each
batch contain puts for too many regions hurts. 

If the source parallelism is the same as the parallelism of other operators the 40 sinks communicating
to all regions might be a problem. When you go down to 20 sinks you actually might be getting
better performance due to lesser resource contention on HBase. 

Sent from my iPhone

> On Aug 3, 2016, at 4:14 AM, neo21 zerro <neo21_zerro@yahoo.com> wrote:
> Hello everybody, 
> I'm using Flink Kafka consumer 0.8.x with kafka 0.8.2 and flink 1.0.3 on YARN.
> In kafka I have a topic which have 20 partitions and my flink topology reads from kafka
(source) and writes to hbase (sink).
> when: 
>     1. flink source has parallelism set to 40 (20 of the tasks are idle) I see 10.000
requests/sec on hbase
>     2. flink source has parallelism set to 20 (exact number of partitions) I see 100.0000
requests/sec on hbase (so a 10x improvement)
> It's clear that hbase is not the limiting factor in my topology. 
> Assumption: Flink backpressure mechanism kicks in in the 1. case more aggressively and
it's limiting the ingestion of tuples in the topology. 
> The question: In the first case, why are those 20 sources which are sitting idle contributing
so much to the backpressure? 
> Thanks guys!

View raw message