impala-user mailing list archives

From Alexander Behm <alex.b...@cloudera.com>
Subject Re: Computing stats on big partitioned parquet tables
Date Fri, 19 Jan 2018 06:30:44 GMT
The documentation has a good overview of the limitations and caveats:
https://impala.apache.org/docs/build/html/topics/impala_perf_stats.html#perf_stats_incremental
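
As a rough sense of scale from the sizing guidance there (if I recall
correctly, incremental stats metadata is on the order of 400 bytes per
column per partition), a wide, heavily partitioned table adds up quickly:

    400 bytes × 100 columns × 5,000 partitions ≈ 200 MB

which is right around the default limit in the error below.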

On Thu, Jan 18, 2018 at 7:29 PM, Fawze Abujaber <fawzeaj@gmail.com> wrote:

> Hi,
>
> I didn't find any limitations mentioned in the documentation for
> incremental compute stats.
>
> Is it a size limit or a memory limit (200 MB)?
>
> Why should compute stats succeed while incremental compute stats does not?
>
> I'm upgrading my cluster on Sunday, and incremental compute stats was
> one of the incentives :(
>
> On Fri, 19 Jan 2018 at 4:13 Mostafa Mokhtar <mmokhtar@cloudera.com> wrote:
>
>> Hi,
>>
>> Do you mind sharing the query profile for the query that failed with OOM?
>> There should be some clues there as to why the OOM is happening.
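>> (If it helps, the profile can be pulled from the same impala-shell session
>> right after the failed statement, e.g. by typing:
>>
>>     profile;
>>
>> or copied from the query's page in the impalad debug web UI.)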
>>
>> Thanks
>> Mostafa
>>
>>
>> On Thu, Jan 18, 2018 at 5:54 PM, Thoralf Gutierrez <
>> thoralfgutierrez@gmail.com> wrote:
>>
>>> Hello everybody!
>>>
>>> (I am using Impala 2.8.0, out of Cloudera Express 5.11.1)
>>>
>>> I now understand that computing stats for our tables is _highly_
>>> recommended, so I have decided to make sure we do.
>>>
>>> On my quest to do so, I started with a first `COMPUTE INCREMENTAL STATS
>>> my_big_partitioned_parquet_table` and ran into:
>>>
>>> > HiveServer2Error: AnalysisException: Incremental stats size estimate
>>> exceeds 200.00MB. Please try COMPUTE STATS instead.
>>>
>>> I found out that we could increase this limit, so I set
>>> inc_stats_size_limit_bytes to 1073741824 (1GB)
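>>>
>>> (As far as I know this is a daemon startup flag rather than a query
>>> option, so it goes on the impalad/catalogd command line at startup,
>>> something like:
>>>
>>>     --inc_stats_size_limit_bytes=1073741824
>>>
>>> rather than a SET in impala-shell.)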
>>>
>>> > HiveServer2Error: AnalysisException: Incremental stats size estimate
>>> exceeds 1.00GB. Please try COMPUTE STATS instead.
>>>
>>> So I ended up trying to COMPUTE STATS for the whole table instead of
>>> incrementally, but I still hit memory limits when computing counts with my
>>> mem_limit at 34359738368 (32GB)
>>>
>>> > Process: memory limit exceeded. Limit=32.00 GB Total=48.87 GB
>>> Peak=51.97 GB
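>>>
>>> (In other words, the statement was simply
>>>
>>>     COMPUTE STATS my_big_partitioned_parquet_table;
>>>
>>> and the "Process:" prefix in the error suggests the 32 GB is the impalad
>>> process memory limit set at daemon startup via --mem_limit rather than a
>>> per-query option.)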
>>>
>>> 1. Am I correct to assume that even if I did not have enough memory, the
>>> query should spill to disk and just be slower instead of OOMing?
>>> 2. Any other recommendation on how else I could go about computing some
>>> stats on my big partitioned parquet table?
>>>
>>> Thanks a lot!
>>> Thoralf
>>>
>>>
>>
