hawq-dev mailing list archives

From Ruilong Huo <r...@pivotal.io>
Subject Re: HAWQ runaway detector cancels all queries when set runaway_detector_activation_percent 100
Date Sat, 11 Mar 2017 10:38:16 GMT
Hi Stanley,

I checked the runaway implementation. I think your node has 128G of memory
or more, and the runaway detector does not initialize redZoneChunks correctly.

The basic idea of the red zone is: if the memory used on a segment is larger
than the memory quota for that segment multiplied by
runaway_detector_activation_percent (taken as a percentage), the runaway
terminator is triggered. It selects the query that uses the most memory and
cancels it. The memory quota for a segment is the physical memory on the node
if the resource manager is in None mode (i.e., hawq_global_rm_type = None);
otherwise it is dynamically calculated by the resource manager based on the
running workload.
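
As a minimal sketch of the idea above (not HAWQ code: the QueryUsage struct and both function names are hypothetical, and real accounting is more involved), the check and the victim selection could look like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-query accounting record; not a HAWQ structure. */
typedef struct
{
    int     query_id;
    int32_t used_chunks;
} QueryUsage;

/* Red-zone threshold: the segment quota scaled by the activation percent. */
int32_t
red_zone_chunks(int32_t quota_chunks, int percent)
{
    return (int32_t) (((int64_t) quota_chunks * percent) / 100);
}

/*
 * Returns the index of the query using the most memory if the segment's
 * total usage exceeds the red-zone threshold, or -1 if no query should
 * be cancelled.
 */
int
pick_runaway_query(const QueryUsage *queries, size_t n,
                   int32_t quota_chunks, int percent)
{
    int64_t total = 0;
    int     victim = -1;

    for (size_t i = 0; i < n; i++)
    {
        total += queries[i].used_chunks;
        if (victim < 0 || queries[i].used_chunks > queries[victim].used_chunks)
            victim = (int) i;
    }

    if (total <= red_zone_chunks(quota_chunks, percent))
        return -1;              /* below the red zone: nothing to cancel */

    return victim;
}
```

With a quota of 100 chunks and a percent of 80, three queries using 10, 50 and 30 chunks total 90, which is above the threshold of 80, so the second query (the largest consumer) would be picked.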

Clarification on runaway_detector_activation_percent: for any value in
[0, 99], runaway detection is activated. 0 is a special case which means
every query should run out of memory, since the red zone is 0. If the value
is 100, runaway detection is not enforced.
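
The intended semantics can be sketched as below. This is only an illustration under the assumptions stated in this mail; red_zone_limit and in_red_zone are hypothetical helper names, not functions in the HAWQ source:

```c
#include <stdint.h>

/*
 * Hypothetical helper: maps the setting to a red-zone limit in chunks.
 * 100 means "not enforced", modelled here as an unreachable limit.
 * 0..99 scale the quota, so 0 yields a limit of 0.
 */
int32_t
red_zone_limit(int32_t quota_chunks, int percent)
{
    if (percent == 100)
        return INT32_MAX;

    return (int32_t) (((int64_t) quota_chunks * percent) / 100);
}

/* A segment is in the red zone when its usage is larger than the limit. */
int
in_red_zone(int32_t used_chunks, int32_t limit_chunks)
{
    return used_chunks > limit_chunks;
}
```

With a percent of 0 the limit is 0, so any query that uses memory at all is in the red zone; with 100 no amount of usage reaches the limit.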

At first glance, a quick fix is as below, though more time is needed to
make sure the issue is properly addressed in HAWQ-1384
<https://issues.apache.org/jira/browse/HAWQ-1384>.

    /*
     * red zone chunk less than 0 means disable red-zone completely
     * Note that 0 is a special case for runaway with which all
     * queries should be out of memory
     */
    if (redZoneChunks < 0)
    {
        redZoneChunks = INT32_MAX;
    }

Refer to the snippet below for details:

# cat src/backend/utils/mmgr/redzone_handler.c

/*
 * Returns the red-zone cut-off in "chunks" unit
 */
int32
RedZoneHandler_GetRedZoneLimitChunks()
{
    /*
     * runaway_detector_activation_percent = 100% is reserved for not enforcing runaway
     * detection by setting the redZoneChunks to an artificially high value. Also, during
     * gpinitsystem we may start a QD without initializing the hawq_re_memory_overcommit_max.
     * This may result in 0 vmem protect limit. In such case, we ensure that the
     * redZoneChunks is set to a large value.
     */
    if (runaway_detector_activation_percent != 100)
    {
        /*
         * Calculate red zone threshold in MB, and then convert MB to "chunks"
         * using chunk size for efficient comparison to detect red zone
         */
        if (lastSegmentVmemQuotaChunks != *segmentVmemQuotaChunks)
        {
            lastSegmentVmemQuotaChunks = *segmentVmemQuotaChunks;
            redZoneChunks = (int)(lastSegmentVmemQuotaChunks *
                                  ((float) runaway_detector_activation_percent) /
                                  ((float) 100));
        }
    }

    /* 0 means disable red-zone completely */
    if (redZoneChunks == 0)
    {
        redZoneChunks = INT32_MAX;
    }

    return redZoneChunks;
}
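
One way a value such as "-16MB" could show up, purely as an illustration and not something confirmed in this mail: if a limit of INT32_MAX chunks were converted to MB with a left shift and the result kept in 32 bits, it would wrap around. The helper names and the shift of 4 (i.e. 16MB chunks) below are assumptions, not taken from the HAWQ source:

```c
#include <stdint.h>

/*
 * Hypothetical conversion from chunks to MB using a left shift, with the
 * result truncated to 32 bits. The shift is done in unsigned arithmetic
 * so that the wrap-around is well defined; the cast back to int32_t is
 * implementation-defined in C but wraps on common compilers.
 */
int32_t
chunks_to_mb_32bit(int32_t chunks, unsigned shift)
{
    return (int32_t) ((uint32_t) chunks << shift);
}

/* The same conversion carried out in 64 bits does not wrap. */
int64_t
chunks_to_mb_64bit(int32_t chunks, unsigned shift)
{
    return (int64_t) chunks * ((int64_t) 1 << shift);
}
```

Under these assumptions, chunks_to_mb_32bit(INT32_MAX, 4) wraps to -16, while the 64-bit version returns a large positive value. If that were the mechanism, any "used > red zone" comparison done in MB would always be true, which would be consistent with every query being cancelled.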


Best regards,
Ruilong Huo

On Wed, Mar 8, 2017 at 12:40 PM, Stanley Sung <ysung@pivotal.io> wrote:

> After setting runaway_detector_activation_percent to 100 (to disable the
> runaway detector), all queries are cancelled.
>
> Note that the query used 64MB, 81856MB was available, and the red zone is
> "-16MB".
>
> Is this a red zone calculation bug?
>
>
>
> ERROR:  Canceling query because of high VMEM usage. Used: 64MB, available
> 81856MB, red zone: -16MB (runaway_cleaner.c:152)  (seg9
> sclmsdn02apd-hdp.sdc.vzwcorp.com:40000 pid=646507) (dispatcher.c:1801)
> CONTEXT:  SQL statement "
>                                              create table bp_full.nodes as
>                                              select i
>                                              from (
>                                              select i from bp_full.g
>                                                             union all
>                                              select j as i from bp_full.g
>                                              ) foo
>                                              group by 1
>                                              distributed by (i);
>
>
> --
> Regards,
> --
> Stanley Sung | Pivotal Data Engineering
> +1-443-515-0205
> http://www.pivotal.io
> http://bcert.me/sfqygvsl
>
