flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: Flink performance tuning
Date Fri, 13 May 2016 13:25:40 GMT
Hi,

Can you try running the job with 8 slots, 7 GB (maybe you need to go down
to 6 GB) and only three TaskManagers (-n 3) ?

I'm suggesting this, because you have many small JVMs running on your
machines. On such small machines you can probably get much more use out of
your available memory by running a few big task managers (which can share
all the common management infra).
Another plus of running only a few JVMs is that you are reducing network
overhead: more communication can happen within a process, so less network
transfer is required.
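
To make the arithmetic behind this suggestion concrete, here is a rough sketch comparing the two session layouts. The machine count and the -n/-s/-tm values are taken from this thread; the `layout` helper is hypothetical, and it assumes YARN spreads the TaskManagers evenly across the three machines, which is not guaranteed.

```python
# Compare the current session layout (-n 21 -s 1 -tm 1024) with the
# suggested one (-n 3 -s 8, 7 GB per TaskManager).
# Assumption: TaskManagers are spread evenly over the 3 machines.

def layout(task_managers, slots_per_tm, tm_memory_mb, machines=3):
    """Summarise a session layout (hypothetical helper, illustration only)."""
    return {
        "total_slots": task_managers * slots_per_tm,
        "jvms_per_machine": task_managers // machines,
        "tm_memory_per_machine_mb": task_managers * tm_memory_mb // machines,
    }

current = layout(task_managers=21, slots_per_tm=1, tm_memory_mb=1024)
proposed = layout(task_managers=3, slots_per_tm=8, tm_memory_mb=7 * 1024)

for name, result in (("current", current), ("proposed", proposed)):
    print(name, result)
```

Under that assumption both layouts request the same amount of TaskManager memory per machine, but the suggested one uses one JVM per machine instead of seven and offers 24 slots instead of 21.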

Another big factor for performance is the data types used. How do you
represent your data in Flink? (Are you using the TupleX types? or POJOs?)
How do you select the key for the grouping?

Regards,
Robert


On Fri, May 13, 2016 at 11:25 AM, Serhiy Boychenko <serhiy.boychenko@cern.ch> wrote:

> Hey,
>
>
>
> I have successfully integrated Flink into our very small test cluster (3
> machines with 8 cores, 8 GB of memory and 2x1TB disks). Basically, I
> started the session to use YARN as RM, and the data is being read from HDFS.
>
> /yarn-session.sh -n 21 -s 1 -jm 1024 -tm 1024
>
>
>
> My code is very simple: a flatMap is applied to the CSV data to
> extract the signal name and value, then I group by signal name and perform
> a group reduce on the data in order to calculate the max, min and average
> of the collected values.
>
>
>
> I have observed that on 3 nodes, the average processing rate is around
> 11 Mbytes/second. I have compared the results with MR execution (without any
> kind of tuning) and I am quite surprised, since the performance of Hadoop
> is 85 Mbytes/second when executing the same query on the same data. I have
> read a few reports claiming that Flink performs better than MR and
> other tools. I am wondering what is wrong. Any clue?
>
>
>
> The processing rate is calculated according to the following formula:
>
> Overall processing rate = sum of total amount of data read per job/sum of
> total time the job was running (including staging periods)
>
>
>
> Best regards,
>
> Serhiy.
>
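
For reference, the processing rate formula in the quoted message can be sketched as below. This is only an illustration: the function name and the two job measurements are made up and are not figures from this thread.

```python
def overall_processing_rate(jobs):
    """Return bytes per second over all jobs.

    `jobs` is an iterable of (bytes_read, seconds_running) pairs, where the
    running time includes staging periods, as in the quoted formula.
    """
    jobs = list(jobs)
    total_bytes = sum(bytes_read for bytes_read, _ in jobs)
    total_seconds = sum(seconds for _, seconds in jobs)
    if total_seconds <= 0:
        raise ValueError("total running time must be positive")
    return total_bytes / total_seconds

# Hypothetical measurements, for illustration only.
jobs = [(1_000_000_000, 80.0), (500_000_000, 45.0)]
rate = overall_processing_rate(jobs)
print(f"{rate / 1e6:.1f} MB/s")
```

Note that this is a ratio of sums, so long-running jobs weigh more than they would in a plain average of per-job rates.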
