hawq-dev mailing list archives

From Jon Roberts <jrobe...@pivotal.io>
Subject Re: hawq_rm_stmt_vseg_memory and hawq_rm_nvseg_perquery_perseg_limit
Date Mon, 23 Jan 2017 16:21:46 GMT
It is common in Greenplum to use "explain analyze" to look at the query plan,
and if the query spills to disk, increasing the statement memory will typically
speed up the query.  With hash-distributed tables in HAWQ this will be true as
well, and it is the only way, short of rewriting the query, to make it perform
better.
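
For example (a rough sketch; the table name and values are placeholders, and as
Yi notes in the quoted mail below, the statement-level memory quota only takes
effect when hawq_rm_stmt_nvseg is also set, here to the table's bucket number):

explain analyze select count(*) from my_hash_table;  -- look for spills to disk
set hawq_rm_stmt_nvseg = 6;            -- match the table's bucket number (toy value)
set hawq_rm_stmt_vseg_memory = '4GB';  -- raise per-vseg memory
explain analyze select count(*) from my_hash_table;  -- confirm the spill is gone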

For the really large Greenplum clusters out there, we don't see benefits from
running 8 segments per host.  Instead, we dial this back to 4 segments because
network traffic on the interconnect becomes the bottleneck.  For HAWQ, where
clusters will likely be much bigger, I think we will hit the same network
limitations.  So with random distribution, we will probably reduce the number of
vsegs to get better performance on the network side and then use more memory per
vseg to prevent spilling to disk and make use of the available resources.  That
is why we need the ability to allocate more than 16GB of RAM per vseg.
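
A rough sketch of that combination (illustrative values, assuming a 16-node
cluster; note that '32GB' would be rejected by the current 16GB cap, which is
exactly the limit I'd like to see raised):

set hawq_rm_stmt_nvseg = 64;            -- roughly 4 vsegs per node on 16 nodes
set hawq_rm_stmt_vseg_memory = '32GB';  -- more memory per vseg to avoid spilling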

On a different but related note, I would like to see hawq_rm_nvseg_perquery_limit
changed so that it caps the number of vsegs per query at the datanode level
rather than across the entire cluster.  That would make it much easier to expand
and shrink a cluster without having to adjust GUCs each time.
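
To illustrate with made-up numbers: with hawq_rm_nvseg_perquery_perseg_limit at
its default of 6, a 50-node cluster can ask for up to 300 vsegs for one query,
well under the default hawq_rm_nvseg_perquery_limit of 512, but a 100-node
cluster can ask for 600, so the cluster-wide GUC has to be raised by hand after
an expansion.  A per-datanode limit would scale on its own.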



Jon Roberts
Principal Engineer | jroberts@pivotal.io | 615-426-8661

On Sun, Jan 22, 2017 at 10:05 PM, Yi Jin <yjin@pivotal.io> wrote:

> Hi Jon,
>
> For scenario 1, it is true.  However, do you think it is necessary to give one
> vseg such a high memory quota?  Of course, if we force hawq_rm_stmt_nvseg to a
> value different from the bucket number, the optimizer will give a different
> plan.  This is a way to force high concurrency and high resource occupation.
>
> For scenario 2, if we want more concurrency in executing one query, we should
> advise the customer to increase hawq_rm_nvseg_perquery_limit (default 512) and
> hawq_rm_nvseg_perquery_perseg_limit (default 6), and then choose an appropriate
> vseg memory quota.  As a result, HAWQ will decide to use more concurrency and
> more resources for big queries automatically.  I mean HAWQ will choose the
> number of virtual segments automatically, not always a high number of virtual
> segments.
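>
> For example (a sketch only; the values are illustrative and this assumes the
> two limit GUCs can be raised at the session level in your deployment):
>
> set hawq_rm_nvseg_perquery_limit = 1024;       -- raise from the default 512
> set hawq_rm_nvseg_perquery_perseg_limit = 12;  -- raise from the default 6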
>
> For scenario 3, it is recommended not to assign all system memory to HAWQ on
> the segment side.  It is a good idea to leave some for the OS and for other
> applications.
>
> I think it is fine to support a higher memory limit for one virtual segment,
> but I still don't see why it is necessary.
>
> As HAWQ uses a different memory allocation strategy than Greenplum, its
> external behavior is indeed different.  The virtual segment (vseg) is a
> critical concept in HAWQ for managing resource consumption and query QE
> concurrency.  In Greenplum the QE concurrency is fixed, i.e. the number of QEs
> for one query execution is fixed, which is why a Greenplum user only needs to
> think about memory consumption in one segment: all segments consume the same
> amount of memory for the same query.  In HAWQ this number is dynamic; you can
> easily change the resource queue definition or the statement resource
> definition to change it, and you don't have to install and configure more
> segments on those nodes to force the QE concurrency.
>
> Consider a scenario where a lot of small-scale queries arrive concurrently:
> HAWQ allocates only a small number of vsegs for each query without wasting
> resources, which lets HAWQ support high concurrency and achieve high throughput
> much better than a traditional MPP-architecture database.  Conversely, for
> large-scale queries, HAWQ can automatically increase the number of vsegs to a
> very high number if you set those limit GUCs high as well.
>
> Best,
> Yi
>
> On Mon, Jan 23, 2017 at 12:30 PM, Jon Roberts <jroberts@pivotal.io> wrote:
>
> > I've been thinking about these scenarios:
> >
> > 1.  Hash distributed tables with fixed number of buckets.
> > If the tables were built using the defaults, buckets = 6 * number of nodes,
> > so you basically have 6 vsegs per host.  Multiply that by 16GB and you can
> > only use 96GB of the 256GB of RAM per node.
> >
> > 2.  A user has randomly distributed tables but doesn't understand they can
> > increase the number of vsegs.
> > This will be common for users coming from Greenplum.  They again can only
> > set statement memory to 16GB, so they are stuck with a max of 96GB of RAM
> > usage.
> >
> > 3.  User increases vsegs and statement memory.
> > They could run out of memory if they are too aggressive with the settings.
> >
> > - I think we should be able to specify statement memory higher than 16GB.
> > Maybe the limit should be something much higher, such as 1TB.
> >
> > - The optimizer should limit the number of vsegs based on the statement
> > memory setting to prevent OOM.  You could do the opposite too (limit memory
> > and use the vseg setting provided).  Greenplum can limit the amount of
> > memory, but we have two dials to adjust: vsegs and memory.
> >
> >
> > Jon Roberts
> >
> > On Sun, Jan 22, 2017 at 5:20 PM, Yi Jin <yjin@pivotal.io> wrote:
> >
> > > Hi Jon,
> > >
> > > That GUC limit means that for one virtual segment, the maximum consumable
> > > memory is 16GB.  For one segment/node there may be multiple vsegs allocated
> > > to run queries, so if a node has 256GB expected to be consumed by HAWQ, it
> > > will have at most 16 vsegs running concurrently.
> > >
> > > hawq_rm_stmt_vseg_memory sets the statement-level vseg memory consumption,
> > > and it is required to specify hawq_rm_stmt_nvseg as well: only when
> > > hawq_rm_stmt_nvseg is greater than 0 is hawq_rm_stmt_vseg_memory activated,
> > > regardless of the target resource queue's vseg resource quota definition.
> > > For example, you can set hawq_rm_stmt_vseg_memory to 16GB and
> > > hawq_rm_stmt_nvseg to 256; if you have a cluster of 16 nodes with 256GB
> > > each and your target resource queue can use 100% of the cluster resources,
> > > you will have 16 vsegs running per node, consuming all memory resources for
> > > this query.
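> > >
> > > In SQL, that example would read (a sketch of the scenario just described):
> > >
> > > set hawq_rm_stmt_vseg_memory = '16GB';
> > > set hawq_rm_stmt_nvseg = 256;  -- 16 nodes x 16 vsegs per node, 16GB each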
> > >
> > > Best,
> > > Yi
> > >
> > > On Sat, Jan 21, 2017 at 3:40 PM, Lei Chang <lei_chang@apache.org> wrote:
> > >
> > > > hawq_rm_stmt_vseg_memory and hawq_rm_stmt_nvseg need to be used together
> > > > to set the specific number of segments and the vseg memory.  And
> > > > hawq_rm_stmt_nvseg should be less than hawq_rm_nvseg_perquery_perseg_limit.
> > > >
> > > > set hawq_rm_stmt_vseg_memory = '2GB'; set hawq_rm_stmt_nvseg = 6;
> > > >
> > > > 16GB looks somewhat small for big dedicated machines: if 16GB is the
> > > > per-virtual-segment memory and 8 segments are used, a node only uses 128GB.
> > > >
> > > > Cheers
> > > > Lei
> > > >
> > > >
> > > > On Fri, Jan 20, 2017 at 9:11 PM, Jon Roberts <jroberts@pivotal.io> wrote:
> > > >
> > > > > Why is there a limit of 16GB for hawq_rm_stmt_vseg_memory?  A cluster
> > > > > with 256GB per node and dedicated for HAWQ may certainly want to
> > > > > utilize more memory per segment.  Is there something I'm missing
> > > > > regarding statement memory?
> > > > >
> > > > > Secondly, does the number of vsegs for a query get influenced by the
> > > > > statement memory or does it just look at the plan and
> > > > > hawq_rm_nvseg_perquery_perseg_limit?
> > > > >
> > > > >
> > > > > Jon Roberts
> > > > >
> > > >
> > >
> >
>
