impala-reviews mailing list archives

From "Tim Armstrong (Code Review)" <>
Subject [Impala-ASF-CR] IMPALA-3200: [DOCS] Document user-facing aspects of new buffer pool
Date Thu, 05 Oct 2017 21:22:45 GMT
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3200: [DOCS] Document user-facing aspects of new buffer pool

Patch Set 5:

File docs/topics/impala_buffer_pool_limit.xml:
PS5, Line 53: (This default is represented by an empty value,
            :       sometimes incorrectly reported as 0 in the output of the <codeph>SET</codeph>
            :       command.)
This area of behaviour in the code is a bit of a mess and subject to change. The output will
be fixed in 5.14 and you can't "set buffer_pool_limit=" until 5.14, so I think this creates
potential for further confusion.

 Maybe best to leave it undocumented, or say something like "(This default takes effect if
the option is unset)".
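
Something like this sketch could illustrate the unset/explicit distinction (values and
output illustrative only, and version-dependent as noted above):

```sql
-- Hypothetical impala-shell session:
SET;                         -- may display BUFFER_POOL_LIMIT as empty (or, incorrectly, 0)
SET buffer_pool_limit=2gb;   -- explicit per-query cap on buffer pool memory
SET buffer_pool_limit="";    -- reverting to the unset default only works in 5.14+
```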
File docs/topics/impala_default_spillable_buffer_size.xml:
PS5, Line 72:       spill-to-disk operations. If spill-to-disk operations result in memory
The last sentence I think creates a lot of potential for confusion. The conditions where this
can make a difference do not necessarily involve spilling to disk and don't necessarily involve
a memory constraint on the node itself - it could prevent the query hitting a mem_limit even
if there is plenty of free memory.

It seems sufficient to me to say that reducing the value may reduce memory consumption - spelling
out the exact circumstances where it will be helpful would require a lot of words and be subject
to change.
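
A sketch of the suggested framing (option value and table name are illustrative only):

```sql
-- Reducing the spillable buffer size may reduce the query's memory
-- consumption; this can help even when nothing actually spills to disk.
SET default_spillable_buffer_size=256kb;
SELECT c1, COUNT(*) FROM t1 GROUP BY c1;   -- hypothetical query
```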
PS5, Line 73: DataNodes
Impala daemons?
File docs/topics/impala_max_row_size.xml:
PS5, Line 40: Applies when unpacking column values from
            :       a row read from disk, or when constructing intermediate or final rows
in the
            :       result set. Setting an upper limit prevents out-of-control memory use
            :       accessing columns containing huge strings.
This summary is wrong (the behaviour is subtle). The query option has zero effect on scans
and the maximum row size isn't necessarily enforced. The point is to force Impala to reserve
enough memory to process rows of this size. If the rows are larger, Impala may still be able
to process them if enough memory is available.

It's a lower bound on the upper bound, rather than an upper bound.

If it's set to a high value then Impala may reserve more memory than is necessary to execute
the query.
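
A sketch of the intended semantics (value illustrative only):

```sql
-- MAX_ROW_SIZE is the row size Impala reserves enough memory to process,
-- not a hard cap on row size: larger rows may still be processed if
-- memory is available, and a high value may cause Impala to reserve
-- more memory than the query actually needs.
SET max_row_size=1mb;
```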
PS5, Line 69:  so it is possible
This is correct but I'm concerned some users might take it the wrong way. To me this sounds
like it's "possible but unlikely", whereas I think in many cases queries will succeed.
My concern is that some users might see this and start worrying about their queries. Maybe
something like "in many cases".
File docs/topics/impala_min_spillable_buffer_size.xml:
PS5, Line 49: 
Same comments as default_spillable_buffer_size.
File docs/topics/impala_scalability.xml:
PS4, Line 701:               If your queries experience substantial performance overhead due
to spilling, enable the
> OK, I'll skip over the output piece and see if users ask for more details t
I'm ok with this - it seems like we need more of a systematic effort to pin down the semantics
of the profile and document it.
File docs/topics/impala_scalability.xml:
PS5, Line 570:         <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>,
and joins, use memory.
I think this background is helpful for users to understand the implications of spill to disk.
I had a go at conveying similar information with the new infrastructure:

        <codeph>GROUP BY</codeph>, <codeph>DISTINCT</codeph>, and
joins, use memory.
        On each host that participates in the query, each such operator in a query requires memory
        to store rows of data and other data structures. Impala reserves a certain amount of memory
        up-front for each operator that supports spill-to-disk that is sufficient to execute the
        operator. If an operator accumulates more data than can fit in the reserved memory, it
        can either reserve more memory to continue processing data in memory or start spilling
        data to temporary scratch files on disk. Thus operators with spill-to-disk support
        can adapt to different memory constraints by using however much memory is available
        to speed up execution yet tolerate low memory conditions by spilling data to disk.

        The amount of data depends on the portion of the data being handled by that host, and
        the operator may end up consuming different amounts of memory on different hosts.
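
The behaviour above could be illustrated with something like this (mem_limit value and
table name are illustrative only):

```sql
-- A memory-intensive aggregation under a per-query memory limit.
-- With spill-to-disk support, the GROUP BY can fall back to temporary
-- scratch files on disk instead of failing when the limit is reached.
SET mem_limit=1gb;
SELECT c1, COUNT(*) FROM big_table GROUP BY c1;   -- hypothetical query
```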
PS5, Line 572:         while building the data structure to process the aggregation or join
operation. The amount
Could we keep this around but make clear that it describes the behaviour in 2.9 and earlier?
It seems useful for people looking at the latest docs or for contrasting behaviour.
PS5, Line 627:               <codeph>WriteIoBytes</codeph> counter reports how
much data was written to disk for each operator
Should we mention the old names parenthetically (it was ScratchBytesWritten in 2.9 and BytesWritten
earlier) so that the docs are still usable for older Impala versions?
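
For example, a user checking spill volume might do something like (query illustrative only;
counter names per version as above):

```sql
-- After running a query in impala-shell, the PROFILE command shows
-- per-operator counters; WriteIoBytes (ScratchBytesWritten in 2.9,
-- BytesWritten earlier) reports data written to scratch files on disk.
SELECT c1, COUNT(*) FROM big_table GROUP BY c1;
PROFILE;
```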


Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I49323f8ffbff3e195058e88762eedbb1fcb1bc0e
Gerrit-Change-Number: 8003
Gerrit-PatchSet: 5
Gerrit-Owner: John Russell <>
Gerrit-Reviewer: Dan Hecht <>
Gerrit-Reviewer: John Russell <>
Gerrit-Reviewer: Mostafa Mokhtar <>
Gerrit-Reviewer: Tim Armstrong <>
Gerrit-Comment-Date: Thu, 05 Oct 2017 21:22:45 +0000
Gerrit-HasComments: Yes
