impala-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alexander Behm <alex.b...@cloudera.com>
Subject Re: Incremental stats and catalogd data serialization
Date Thu, 08 Mar 2018 16:58:06 GMT
Hi Miguel,

the memory requirement stems from the incremental stats needed to compute
the number of distinct values for each column incrementally.

If you set the stats manually via ALTER TABLE, then no such incremental
stats exist, so there's no memory issue.

For manually setting stats, I'd recommend setting the per-partition row
counts as well as the table-level row counts.

If you know of primary keys in your data you should be careful to set the
NDV column statistics for the primary keys as accurately as possible.

Impala uses the table-level row count statistic and the column NDVs to
heuristically detect primary keys for better join ordering, so it's
generally good to keep the table-level row count and column NDVs "in sync"
for primary key detection to work.

Hth

Alex

On Thu, Mar 8, 2018 at 7:54 AM, Miguel Figueiredo <ollliegator@gmail.com>
wrote:

> Hi Alex,
>
> Thanks for the feedback.
>
> I will the new version and the new way of computing stats when possible.
> In the meantime we are thinking of computing the stats manually. If we
> compute stats per partition and for the whole table, will we encounter the
> same memory limit?
> Should we compute stats for the whole table and disregard partitions stats?
>
> Best regards,
> Miguel
>
> On Wed, Mar 7, 2018 at 6:14 PM, Alexander Behm <alex.behm@cloudera.com>
> wrote:
>
>> Using incremental stats in your scenario is extremely dangerous and I
>> highly recommend against it. That limitation was put in place to guard
>> clusters against service downtime due to serializing huge tables and
>> hitting JVM limits like the 2GB max array size.
>>
>> Even if the catalogd and impalads stay up, having such huge metadata will
>> negatively impact the health and performance of your cluster.
>>
>> There's a blurb about this in the Impala docs, CTRL+F for "For a table
>> with a huge number of partitions"
>> https://impala.apache.org/docs/build/html/topics/impala_comp
>> ute_stats.html
>>
>> Impala caches and disseminates metadata at the table granularity. If
>> anything in a table changes, the whole updated table is sent out to all
>> coordinating impalads. Each such impalad caches the entire metadata of a
>> table including the incremental stats for all columns and all partitions.
>>
>>
>> The below is copied from a different discussion thread discussing
>> alternatives to incremental stats:
>>
>> Btw, you should also know that the following improvements in the upcoming
>> 2.12 release might make "compute stats" more palatable on your huge tables.
>> We'd love your feedback on COMPUTE STATS with TABLESAMPLE, in particular.
>>
>> COMPUTE STATS with TABLESAMPLE
>> https://issues.apache.org/jira/browse/IMPALA-5310
>>
>> COMPUTE STATS on a subset of columns
>> https://issues.apache.org/jira/browse/IMPALA-3562
>>
>> The following improvement should allow you to COMPUTE STATS less
>> frequently by extrapolating the row count of partitions that were added or
>> modified since the last COMPUTE STATS.
>> https://issues.apache.org/jira/browse/IMPALA-2373
>> https://issues.apache.org/jira/browse/IMPALA-6228
>>
>> In general, would be great to get your feedback/ideas on how to make
>> computing stats more practical for you.
>>
>>
>>
>> On Wed, Mar 7, 2018 at 8:48 AM, Miguel Figueiredo <ollliegator@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have bumped into the 200MB limit when calculating incremental stats (
>>> https://issues.apache.org/jira/browse/IMPALA-3552).
>>>
>>> I don't understand which data catalogd sends to the impalad each time
>>> the incremental stats are calculated. Does it send only the new information
>>> calculated for new partitions or all the statistics data?
>>>
>>> In my case I have 387 tables with 2550 columns. I am creating a new
>>> partition for each table every hour and calculating incremental stats for
>>> these new partitions. If catalogd is sending serialized data for the new
>>> partitions and columns, it shouldn't amount to 200MB.
>>>
>>> I would appreciate if someone can help me understand this concept or
>>> point me to some documentation.
>>>
>>> Best regards,
>>> Miguel
>>>
>>> --
>>> Miguel Figueiredo
>>> Software Developer
>>>
>>> "I'm a pretty lazy person and am prepared to work quite hard in order to
>>> avoid work."
>>> -- Martin Fowler
>>>
>>
>>
>
>
> --
> Miguel Figueiredo
> Software Developer
>
> "I'm a pretty lazy person and am prepared to work quite hard in order to
> avoid work."
> -- Martin Fowler
>

Mime
View raw message