drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-5593) Modernize Drill's memory allocator to reflect current usage
Date Sun, 18 Jun 2017 19:13:00 GMT

     [ https://issues.apache.org/jira/browse/DRILL-5593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5593:
-------------------------------
    Description: 
Drill's memory allocator is quite sophisticated. But, as Drill moves toward improved resource
management, the design of the current allocator no longer aligns well with the overall resource
management design.

The current allocator:

* Provides a separate allocator and accountant for each operator.
* Enforces a hard memory limit for each operator, causing an OOM error when the operator exceeds
the per-operator limit.
* Provides a complex transfer mechanism that moves memory ownership from one operator to another
as batches move downstream.
* Allows a buffer to be shared by multiple allocators, with one allocator being the "owning"
allocator.
* Allows a memory block to be shared by multiple buffers (as occurs when deserializing a record
batch from the wire).
* Provides a tree of allocators in which child allocators can ask parents for more memory
and parents provide that memory out of their own allocation.
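
The accounting behavior in the list above can be illustrated with a toy model. This is a minimal sketch, not Drill's Java allocator: {{ToyAllocator}}, {{OutOfMemory}}, {{transfer_to}} and all byte counts are hypothetical names and numbers chosen for illustration, and the sketch assumes that a transfer between two operators with the same parent changes only the per-operator accounts.

```python
class OutOfMemory(Exception):
    """Raised when an allocation would exceed an allocator's hard limit."""


class ToyAllocator:
    """Hypothetical model of one node in an allocator tree (not Drill's API)."""

    def __init__(self, name, limit, parent=None):
        self.name = name
        self.limit = limit      # hard cap, in bytes, for this node
        self.allocated = 0      # bytes currently charged to this node
        self.parent = parent

    def allocate(self, size):
        # Local hard limit first: a pure accounting check.
        if self.allocated + size > self.limit:
            raise OutOfMemory(f"{self.name}: {self.allocated} + {size} > {self.limit}")
        # The parent provides the memory out of its own allocation.
        if self.parent is not None:
            self.parent.allocate(size)
        self.allocated += size

    def transfer_to(self, other, size):
        # Ownership moves to a sibling as a batch moves downstream.
        # Assumption: both nodes share a parent, so the parent's total is unchanged.
        if other.allocated + size > other.limit:
            raise OutOfMemory(f"{other.name}: cannot accept {size} bytes")
        self.allocated -= size
        other.allocated += size


fragment = ToyAllocator("fragment", limit=100)
scan = ToyAllocator("scan", limit=60, parent=fragment)
sort = ToyAllocator("sort", limit=60, parent=fragment)

scan.allocate(50)            # charged to scan and to the fragment
scan.transfer_to(sort, 50)   # the batch, and its memory charge, moves to the sort

try:
    sort.allocate(20)        # 50 + 20 exceeds the sort's limit of 60
except OutOfMemory as e:
    limit_error = str(e)
```

In this sketch the sort fails at its own limit even though the fragment still has 50 bytes of headroom, which is the per-operator hard-limit behavior described above.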

The current design appears to have been an attempt to allow operators to negotiate among themselves
for memory usage. The idea seems to be that any given operator uses its assigned memory. If
it needs more, it asks the parent allocator for more. If the parent can't provide more, the
child operator sends an {{OUT_OF_MEMORY}} signal downstream and some downstream operator must
give up some of its memory (perhaps by spilling) so that the upstream operator can proceed.

The difficulty is that only the framework was implemented, not the intended negotiation mechanisms.
As a result, the current allocator presents challenges:

* Drill is moving toward a planned memory allocation system: the planner assigns memory limits
to each fragment (for the in-flight batch overhead) and to each buffering operator.
* Memory is then managed at the fragment level, and per-operator, but only for buffering operators.
* Memory for other operators (scan, select, project, etc.) is completely determined by batch
size; the operators have no way to deal with OOM conditions.
* The {{OUT_OF_MEMORY}} iterator status never worked. (It is hard to imagine how, say, a scan
operator would run out of memory on column d within (a, b, c, d, e, f), remember its state,
hold onto the d value, send the signal downstream, then resume where it left off. The code
would become even more complex than it already is.)
* Code now must rediscover the memory used by each batch just to ensure that it never exceeds
the per-operator memory limits. The sort, in particular, is infamous for OOM on SV2 allocation
because a batch is so large that it fills up the allocator, causing the next allocation (the
SV2) to fail -- but only for accounting reasons.
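
The SV2 failure in the last bullet comes down to simple arithmetic. The sketch below is illustrative only: the limit, batch size, record count and free-memory figure are hypothetical, {{can_allocate}} is a made-up helper rather than a Drill method, and the SV2 is assumed to hold one 2-byte index per record.

```python
MIB = 1024 * 1024


def can_allocate(limit, in_use, request):
    """Accounting check only: would this request stay within the operator's limit?"""
    return in_use + request <= limit


operator_limit = 16 * MIB           # hypothetical per-operator limit
batch_bytes = 16 * MIB - 1024       # hypothetical incoming batch that nearly fills it
record_count = 4096                 # hypothetical batch row count
sv2_bytes = 2 * record_count        # assumed: one 2-byte index per record
free_system_memory = 4096 * MIB     # hypothetical: the machine has plenty left

sv2_fits_in_accounting = can_allocate(operator_limit, batch_bytes, sv2_bytes)
sv2_fits_in_memory = sv2_bytes <= free_system_memory

# The allocation is rejected by the allocator's bookkeeping, not by a real shortage.
fails_for_accounting_only = (not sv2_fits_in_accounting) and sv2_fits_in_memory
```

With these numbers the batch leaves only 1024 bytes under the limit, so an 8192-byte SV2 is refused even though physical memory is available.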

One very important part of the current allocator to be retained is the "fresh" (one buffer
per vector) and deserialized (shared buffer for all vectors) modes. Also to be retained is the
ability for a single deserialized buffer to be shared by multiple fragments.
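
The difference between the two modes can be pictured with plain byte buffers. This is a rough analogy rather than Drill code: {{fresh_batch}}, {{deserialized_batch}}, the column names and the sizes are all hypothetical, and Python's {{memoryview}} slices stand in for vectors that share one underlying block.

```python
def fresh_batch(column_sizes):
    """'Fresh' mode: each vector gets its own independently allocated buffer."""
    return {name: memoryview(bytearray(size)) for name, size in column_sizes.items()}


def deserialized_batch(column_sizes, wire_bytes):
    """'Deserialized' mode: every vector is a slice of one shared buffer."""
    shared = memoryview(wire_bytes)
    vectors, offset = {}, 0
    for name, size in column_sizes.items():
        vectors[name] = shared[offset:offset + size]   # a view, no copy
        offset += size
    return vectors


sizes = {"a": 4, "b": 8, "c": 4}        # hypothetical vector sizes in bytes

fresh = fresh_batch(sizes)              # three separate buffers
wire = bytearray(range(16))             # stands in for a batch read off the wire
shared = deserialized_batch(sizes, wire)

# Every deserialized vector points back at the same underlying block.
all_share_one_block = all(v.obj is wire for v in shared.values())
```

In the shared case the block can only be freed once every vector that slices it has been released, which is why an allocator that supports this mode has to track ownership at the level of the block rather than the individual vector.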

As a result, this is a complex design task, not a simple bug fix.


> Modernize Drill's memory allocator to reflect current usage
> -----------------------------------------------------------
>
>                 Key: DRILL-5593
>                 URL: https://issues.apache.org/jira/browse/DRILL-5593
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
