impala-reviews mailing list archives

From "Impala Public Jenkins (Code Review)" <ger...@cloudera.org>
Subject [Impala-ASF-CR] IMPALA-6070: Parallel data load.
Date Wed, 25 Oct 2017 00:00:25 GMT
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
......................................................................

IMPALA-6070: Parallel data load.

This commit loads the functional-query, TPC-H, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, they take about 30 minutes more, namely the
13 minutes of tpch plus the 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)

To do this, I added to run-step.sh the notion of a backgroundable task,
along with support for waiting for all such tasks to finish.
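
As a rough illustration of that idea (not the code in this change), a
backgroundable step plus a wait-for-all helper could be sketched in bash
as below. The function names run_step_in_background and
wait_for_background_steps, the BACKGROUND_PIDS array, and the argument
order are all hypothetical and are not taken from run-step.sh.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only: these names are illustrative and are not the
# actual contents of testdata/bin/run-step.sh.

# PIDs of the steps started in the background so far.
BACKGROUND_PIDS=()

# Usage: run_step_in_background "description" logfile command [args...]
# Starts the command in the background, sends its output to logfile,
# and remembers its pid so it can be waited on later.
run_step_in_background() {
  local msg="$1" log="$2"
  shift 2
  "$@" > "$log" 2>&1 &
  local pid=$!
  BACKGROUND_PIDS+=("$pid")
  echo "Started ${msg} in background; pid ${pid}."
}

# Waits for every recorded background step.
# Returns non-zero if at least one of them exited with a failure.
wait_for_background_steps() {
  local failed=0 pid
  for pid in "${BACKGROUND_PIDS[@]}"; do
    if ! wait "$pid"; then
      echo "Background step with pid ${pid} failed." >&2
      failed=1
    fi
  done
  BACKGROUND_PIDS=()
  return "$failed"
}
```

With these helpers, a caller would start each load with
run_step_in_background and then call wait_for_background_steps once, so
that a failure in any one load is still reported after all of them have
finished.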

I also increased the heap size of our HiveServer2. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04  Started Loading functional-query data in background; pid 8105.
15:58:04  Started Loading TPC-H data in background; pid 8106.
15:58:04  Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04  Started Loading TPC-DS data in background; pid 8107.
15:58:04  Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04  Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31    Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08    Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)
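
The "about 30 minutes" figure can be checked against the timings in the
log above. The sketch below assumes that each load would take the same
time if run serially as it did in this parallel run, and that the
parallel wall time is simply the slowest of the three loads. The helper
names to_seconds and format_duration are made up for this example.

```bash
#!/usr/bin/env bash
# Durations copied from the log above, converted to seconds.
to_seconds() { echo $(( $1 * 60 + $2 )); }
format_duration() { echo "$(( $1 / 60 )) min $(( $1 % 60 )) sec"; }

tpch=$(to_seconds 13 27)
tpcds=$(to_seconds 16 29)
functional=$(to_seconds 37 4)

# Serial estimate: run the three loads one after another.
serial=$(( tpch + tpcds + functional ))

# Parallel estimate: wall time equals the slowest load (an assumption).
parallel=$functional
if (( tpch > parallel )); then parallel=$tpch; fi
if (( tpcds > parallel )); then parallel=$tpcds; fi

saved=$(( serial - parallel ))

echo "serial:   $(format_duration "$serial")"    # 67 min 0 sec
echo "parallel: $(format_duration "$parallel")"  # 37 min 4 sec
echo "saved:    $(format_duration "$saved")"     # 29 min 56 sec
```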

I tested data loading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <jbapple-impala@apache.org>
Reviewed-by: Michael Brown <mikeb@cloudera.com>
Reviewed-by: Alex Behm <alex.behm@cloudera.com>
Tested-by: Impala Public Jenkins
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 44 insertions(+), 5 deletions(-)

Approvals:
  Jim Apple: Looks good to me, but someone else must approve
  Michael Brown: Looks good to me, but someone else must approve
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8320
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger <philip@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.behm@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jbapple-impala@apache.org>
Gerrit-Reviewer: Joe McDonnell <joemcdonnell@cloudera.com>
Gerrit-Reviewer: Michael Brown <mikeb@cloudera.com>
Gerrit-Reviewer: Philip Zeyliger <philip@cloudera.com>
Gerrit-Reviewer: Zach Amsden <zamsden@cloudera.com>
