flink-user mailing list archives

From Flavio Pompermaier <pomperma...@okkam.it>
Subject Re: Dataset column statistics
Date Tue, 18 Dec 2018 08:45:37 GMT
Great, thanks!

On Tue, Dec 18, 2018 at 3:26 AM Kurt Young <ykt836@gmail.com> wrote:

> Hi,
>
> We have implemented ANALYZE TABLE in our internal version of Flink, and we
> will try to contribute it back to the community.
>
> Best,
> Kurt
>
>
> On Thu, Nov 29, 2018 at 9:23 PM Fabian Hueske <fhueske@gmail.com> wrote:
>
>> I'd try to tune it in a single query.
>> If that does not work, go for as few queries as possible, splitting by
>> column for better projection push-down.
>>
>> This is the first time I hear somebody requesting ANALYZE TABLE.
>> I don't see a reason why it shouldn't be added in the future.
>>
>>
>>
>> On Thu, Nov 29, 2018 at 12:08 PM Flavio Pompermaier <
>> pompermaier@okkam.it> wrote:
>>
>>> What do you advise for computing column stats?
>>> Should I run multiple jobs (one per column), or try to compute them all
>>> at once?
>>>
>>> Also, are you ever going to consider supporting ANALYZE TABLE (like Hive
>>> or Spark do) in the Flink Table API?
>>>
>>> Best,
>>> Flavio
>>>
>>> On Thu, Nov 29, 2018 at 9:45 AM Fabian Hueske <fhueske@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> You could try to enable object reuse.
>>>> Alternatively, you can give the job more heap memory or fine-tune the GC
>>>> parameters.
>>>>
>>>> I would not consider this a bug in Flink, but it might be something that
>>>> could be improved.
>>>>
>>>> Fabian
>>>>
>>>>
>>>> On Wed, Nov 28, 2018 at 6:19 PM Flavio Pompermaier <
>>>> pompermaier@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>> I have a batch dataset and I want to get some standard info about its
>>>>> columns (like min, max, avg, etc.).
>>>>> To achieve this, I wrote a simple program that uses SQL on the Table
>>>>> API, like the following:
>>>>>
>>>>> SELECT
>>>>> MAX(col1), MIN(col1), AVG(col1),
>>>>> MAX(col2), MIN(col2), AVG(col2),
>>>>> MAX(col3), MIN(col3), AVG(col3)
>>>>> FROM MYTABLE
>>>>>
>>>>> My dataset has about 50 fields, so the query becomes quite big (and the
>>>>> job plan too).
>>>>> It seems that this kind of job causes the cluster to crash (too much
>>>>> garbage collection).
>>>>> Is there any smarter way to achieve this goal (apart from running one
>>>>> job per column)?
>>>>> Is this "normal", or is it a bug in Flink?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>>
>>>
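
The single-query approach recommended above amounts to computing every column's aggregates in one pass over the data, rather than one job per column. Stripped of Flink entirely, the idea can be sketched in plain Java (the class names and sample data below are made up for illustration; this is not the Flink operator code):

```java
import java.util.List;

// Accumulator holding one column's min/max/sum/count, updated in a single pass.
class ColumnStats {
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double sum = 0.0;
    long count = 0;

    void add(double v) {
        min = Math.min(min, v);
        max = Math.max(max, v);
        sum += v;
        count++;
    }

    double avg() {
        return sum / count;
    }
}

class SinglePassStats {
    // One pass over all rows updates every column's accumulator, mirroring
    // the single SELECT with MAX/MIN/AVG per column from the thread above.
    static ColumnStats[] compute(List<double[]> rows, int numColumns) {
        ColumnStats[] stats = new ColumnStats[numColumns];
        for (int c = 0; c < numColumns; c++) {
            stats[c] = new ColumnStats();
        }
        for (double[] row : rows) {
            for (int c = 0; c < numColumns; c++) {
                stats[c].add(row[c]);
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        List<double[]> rows = List.of(
            new double[]{1.0, 10.0},
            new double[]{3.0, 20.0},
            new double[]{2.0, 30.0});
        ColumnStats[] s = compute(rows, 2);
        System.out.println(s[0].min + " " + s[0].max + " " + s[0].avg());
        System.out.println(s[1].min + " " + s[1].max + " " + s[1].avg());
    }
}
```

Inside Flink itself, the object-reuse suggestion from the thread maps to `env.getConfig().enableObjectReuse()` on the ExecutionEnvironment's ExecutionConfig; note that with reuse enabled, user functions must not hold on to input objects across invocations.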
