flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chesnay Schepler <ches...@apache.org>
Subject Re: Flink performance tuning on operators
Date Fri, 15 May 2020 08:17:26 GMT
Generally there should be no difference.
Can you check whether the maps are running as a chain (as a single task)?
If they are running in a chain, then I would suspect that /something/ 
else is skewing your results.
If not, then the added network/serialization pressure would explain it.

I will assume that the mismatch in variable names inĀ  your second 
example (JsonData vs rawDataAsJson) is just a typo.

On 15/05/2020 04:29, Ivan Yang wrote:
> Hi,
>
> We have a Flink job that reads data from an input stream, then converts each event from
JSON string Avro object, finally writes to parquet files using StreamingFileSink with OnCheckPointRollingPolicy
of 5 mins. Basically a stateless job. Initially, we use one map operator to convert Json string
to Avro object, Inside the map function, it goes form String -> JsonObject -> Avro object.
>
> 	DataStream<AvroSchema> avroData = data.map(new JsonToAVRO());
>
> When we try to break the map operator to two, one for String to JsonObject, another for
JsonObject to Avro.
>
>          DataStream<JsonObject> JsonData = data.map(new StringToJson());
>          DataStream<AvroSchema> avroData = rawDataAsJson.map(new JsonToAvroSchema())
>
> The benchmark shows significant performance hit when breaking down to two operators.
We try to understand the Flink internal on why such a big difference. The setup is using state
backend = filesystem. Checkpoint = s3 bucket. Our event object has 300+ attributes.
>
>
> Thanks
> Ivan



Mime
View raw message