flink-user mailing list archives

From Fabian Hueske <fhue...@gmail.com>
Subject Re: Dataset column statistics
Date Thu, 29 Nov 2018 08:44:56 GMT
Hi,

You could try to enable object reuse.
Alternatively you can give more heap memory or fine tune the GC parameters.
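
For reference, object reuse is switched on through the ExecutionConfig, i.e. env.getConfig().enableObjectReuse() on the ExecutionEnvironment. The sketch below is plain Java rather than Flink code, and only illustrates the idea behind it: one mutable accumulator is updated in place for every row, so no new object is allocated per record or per aggregate. The class name ColumnStats, the nested Stats type and the sample rows are made up for illustration.

```java
import java.util.Arrays;

public class ColumnStats {

    /** Hypothetical accumulator: holds min, max and sum for every column. */
    public static final class Stats {
        public final double[] min;
        public final double[] max;
        public final double[] sum;
        public long count;

        public Stats(int columns) {
            min = new double[columns];
            max = new double[columns];
            sum = new double[columns];
            Arrays.fill(min, Double.POSITIVE_INFINITY);
            Arrays.fill(max, Double.NEGATIVE_INFINITY);
        }

        /** Updates the accumulator in place; nothing is allocated per row. */
        public void add(double[] row) {
            for (int i = 0; i < row.length; i++) {
                min[i] = Math.min(min[i], row[i]);
                max[i] = Math.max(max[i], row[i]);
                sum[i] += row[i];
            }
            count++;
        }

        public double avg(int column) {
            return sum[column] / count;
        }
    }

    public static void main(String[] args) {
        // Made-up sample data: three rows, two columns.
        double[][] rows = { {1.0, 10.0}, {3.0, 20.0}, {2.0, 60.0} };

        Stats stats = new Stats(2);
        for (double[] row : rows) {
            stats.add(row);
        }

        for (int c = 0; c < 2; c++) {
            System.out.println("col" + (c + 1)
                    + " min=" + stats.min[c]
                    + " max=" + stats.max[c]
                    + " avg=" + stats.avg(c));
        }
    }
}
```

Keep in mind that with object reuse enabled in Flink, user functions must not hold on to input objects across calls, since the runtime may overwrite them.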

I would not consider it a bug in Flink, but it might be something that could
be improved.

Fabian


On Wed, 28 Nov 2018 at 18:19, Flavio Pompermaier <pompermaier@okkam.it> wrote:

> Hi to all,
> I have a batch dataset and I want to get some standard info about its
> columns (like min, max, avg etc).
> In order to achieve this I wrote a simple program that uses SQL on the
> Table API, like the following:
>
> SELECT
> MAX(col1), MIN(col1), AVG(col1),
> MAX(col2), MIN(col2), AVG(col2),
> MAX(col3), MIN(col3), AVG(col3)
> FROM MYTABLE
>
> In my dataset I have about 50 fields and the query becomes quite big (and
> the job plan too).
> It seems that this kind of job causes the cluster to crash (too much
> garbage collection).
> Is there any smarter way to achieve this goal (apart from running a job
> per column)?
> Is this "normal" or is this a bug in Flink?
>
> Best,
> Flavio
>
