impala-reviews mailing list archives

From "Thomas Tauber-Marshall (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Date Mon, 17 Jul 2017 17:56:47 GMT
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7267

to look at the new patch set (#7).

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
......................................................................

IMPALA-5498: Support for partial sorts in Kudu INSERTs

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it reads input until it
reaches the query memory limit, sorts that input with the existing
Sorter class, and outputs it. This is faster than a total sort with
SortNode because it never needs to spill when the input is larger
than the memory limit.
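
The behaviour described above can be sketched roughly as follows. This
is a simplified illustration, not the code from the patch: the function
name PartialSort, the max_run_rows parameter (standing in for the query
memory limit), and the use of std::sort on plain integers in place of
Impala's Sorter class are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch: sorts 'input' in independent runs of at most
// 'max_run_rows' rows. 'max_run_rows' stands in for the memory limit;
// the real node operates on row batches and uses the Sorter class.
std::vector<int> PartialSort(const std::vector<int>& input,
                             std::size_t max_run_rows) {
  // Treat a zero limit as "one row at a time" so the loop always advances.
  const std::size_t cap = std::max<std::size_t>(max_run_rows, 1);
  std::vector<int> output;
  output.reserve(input.size());
  for (std::size_t start = 0; start < input.size(); start += cap) {
    const std::size_t end = std::min(start + cap, input.size());
    // Buffer one run, sort it in memory, and emit it before reading more.
    std::vector<int> run(input.begin() + start, input.begin() + end);
    std::sort(run.begin(), run.end());
    output.insert(output.end(), run.begin(), run.end());
  }
  return output;
}
```

For example, PartialSort({5, 3, 1, 4, 2, 0}, 3) yields 1 3 5 0 2 4:
each run of three rows is sorted, but the output as a whole is not,
and no run ever has to be written out and merged.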

Future work will look into setting a more restrictive memory limit
on the PartialSortNode. (IMPALA-5669)

In the planner, the SortNode plan node is used, with an enum value
indicating whether it is a total or partial sort.

This also adds a new counter 'RunSize' to the runtime profile which
tracks the min, max, and avg size of the generated runs, in tuples.
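
As an illustration of what such a counter tracks, here is a minimal
sketch of a min/max/avg accumulator. The class name RunSizeStats and
its methods are hypothetical and are not taken from
runtime-profile-counters.h; the sketch only assumes that each
generated run reports its size in tuples once.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Hypothetical accumulator for the sizes (in tuples) of generated runs.
class RunSizeStats {
 public:
  // Records the size of one completed run.
  void AddRun(std::int64_t num_tuples) {
    min_ = std::min(min_, num_tuples);
    max_ = std::max(max_, num_tuples);
    total_ += num_tuples;
    ++num_runs_;
  }

  // All accessors return 0 until at least one run has been recorded.
  std::int64_t MinSize() const { return num_runs_ == 0 ? 0 : min_; }
  std::int64_t MaxSize() const { return num_runs_ == 0 ? 0 : max_; }
  double AvgSize() const {
    return num_runs_ == 0 ? 0.0
                          : static_cast<double>(total_) / num_runs_;
  }

 private:
  std::int64_t min_ = std::numeric_limits<std::int64_t>::max();
  std::int64_t max_ = std::numeric_limits<std::int64_t>::min();
  std::int64_t total_ = 0;
  std::int64_t num_runs_ = 0;
};
```

With runs of 100, 300, and 200 tuples, this reports a min of 100, a
max of 300, and an avg of 200.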

As a first use case, partial sort replaces the total sort that was
previously used for inserts/upserts, for Kudu tables only. Future
work can extend this to other table sinks. (IMPALA-5649)

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10-node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs were spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 459 insertions(+), 70 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dhecht@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokhtar@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tmarshall@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstrong@cloudera.com>
Gerrit-Reviewer: Zach Amsden <zamsden@cloudera.com>
