From: blue@apache.org
To: commits@parquet.apache.org
Reply-To: dev@parquet.apache.org
Subject: parquet-mr git commit: PARQUET-580: Switch int[] initialization in IntList to be lazy
Date: Sun, 17 Apr 2016 00:25:36 +0000 (UTC)

Repository: parquet-mr
Updated Branches:
  refs/heads/master d40214875 -> ac62c1c29

PARQUET-580: Switch int[] initialization in IntList to be lazy

Noticed that for a dataset that we were trying to import that had a lot of columns (few thousand) that weren't being used, we ended up allocating a lot of unnecessary int arrays (each 64K in size). Heap footprint for all those int[]s turned out to be around 2GB or so (and results in some jobs OOMing). This seems unnecessary for columns that might not be used.
The changes in this PR switch over to initialize the int[] only when it is being used for the first time.

Also wondering if 64K is the right size to start off with. Wondering if a potential improvement is if we could allocate these int[]s in IntList in a way that slowly ramps up their size. So rather than create arrays of size 64K at a time (which is potentially wasteful if there are only a few hundred bytes), we could create say a 4K int[], then when it fills up an 8K int[] and so on till we reach 64K (at which point the behavior is the same as the current implementation). If this sounds like a reasonable idea, I can update this PR to do that as well. Wasn't sure if there was some historical context around that.

Author: Piyush Narang

Closes #339 from piyushnarang/master and squashes the following commits:

3ecc577 [Piyush Narang] Remove redundant IntList ctor
f7dfd5f [Piyush Narang] Switch int[] initialization in IntList to be lazy

Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/ac62c1c2
Tree: http://git-wip-us.apache.org/repos/asf/parquet-mr/tree/ac62c1c2
Diff: http://git-wip-us.apache.org/repos/asf/parquet-mr/diff/ac62c1c2

Branch: refs/heads/master
Commit: ac62c1c29f319a97a2552c39f32c8e6acd70c9e1
Parents: d402148
Author: Piyush Narang
Authored: Sat Apr 16 17:25:31 2016 -0700
Committer: Ryan Blue
Committed: Sat Apr 16 17:25:31 2016 -0700

----------------------------------------------------------------------
 .../column/values/dictionary/IntList.java       | 21 +++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/ac62c1c2/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
----------------------------------------------------------------------
diff --git a/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java b/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
index 3201072..8e6228a 100644
--- a/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
+++ b/parquet-column/src/main/java/org/apache/parquet/column/values/dictionary/IntList.java
@@ -58,7 +58,7 @@ public class IntList {
     }
 
     /**
-     * @return wether there is a next value
+     * @return whether there is a next value
      */
     public boolean hasNext() {
       return current < count;
@@ -76,16 +76,12 @@ public class IntList {
   }
 
   private List<int[]> slabs = new ArrayList<int[]>();
+
+  // Lazy initialize currentSlab only when needed to save on memory in cases where items might
+  // not be added
   private int[] currentSlab;
   private int currentSlabPos;
 
-  /**
-   * construct an empty list
-   */
-  public IntList() {
-    initSlab();
-  }
-
   private void initSlab() {
     currentSlab = new int[SLAB_SIZE];
     currentSlabPos = 0;
@@ -95,10 +91,13 @@ public class IntList {
    * @param i value to append to the end of the list
    */
   public void add(int i) {
-    if (currentSlabPos == currentSlab.length) {
+    if (currentSlab == null) {
+      initSlab();
+    } else if (currentSlabPos == currentSlab.length) {
       slabs.add(currentSlab);
       initSlab();
     }
+
     currentSlab[currentSlabPos] = i;
     ++ currentSlabPos;
   }
@@ -108,6 +107,10 @@ public class IntList {
    * @return an IntIterator on the content
    */
   public IntIterator iterator() {
+    if (currentSlab == null) {
+      initSlab();
+    }
+
     int[][] itSlabs = slabs.toArray(new int[slabs.size() + 1][]);
     itSlabs[slabs.size()] = currentSlab;
     return new IntIterator(itSlabs, SLAB_SIZE * slabs.size() + currentSlabPos);
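
The ramp-up allocation floated in the commit message is not part of this commit. For illustration only, a minimal sketch of how it might look, combined with the lazy initialization introduced above. The class name RampingIntList, the constants INITIAL_SLAB_SIZE and MAX_SLAB_SIZE, and the helpers size(), allocatedInts() and get() are all hypothetical and do not exist in parquet-mr; the 4K starting size and the doubling policy are taken from the proposal text, and sizes are counted in ints rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed ramp-up; not the parquet-mr IntList. */
class RampingIntList {
  // Assumed values: start at 4K ints and double until the 64K cap is reached.
  static final int INITIAL_SLAB_SIZE = 4 * 1024;
  static final int MAX_SLAB_SIZE = 64 * 1024;

  private final List<int[]> slabs = new ArrayList<int[]>();
  private int[] currentSlab; // stays null until the first add()
  private int currentSlabPos;
  private int size;

  /** Appends a value, allocating the first slab lazily and doubling later slabs. */
  public void add(int i) {
    if (currentSlab == null) {
      currentSlab = new int[INITIAL_SLAB_SIZE];
    } else if (currentSlabPos == currentSlab.length) {
      slabs.add(currentSlab);
      int nextSize = Math.min(currentSlab.length * 2, MAX_SLAB_SIZE);
      currentSlab = new int[nextSize];
      currentSlabPos = 0;
    }
    currentSlab[currentSlabPos++] = i;
    size++;
  }

  /** Number of values added so far. */
  public int size() {
    return size;
  }

  /** Total number of ints allocated across all slabs (0 for an unused list). */
  public int allocatedInts() {
    int total = (currentSlab == null) ? 0 : currentSlab.length;
    for (int[] slab : slabs) {
      total += slab.length;
    }
    return total;
  }

  /** Returns the value at the given position; slabs have varying lengths, so walk them. */
  public int get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    int remaining = index;
    for (int[] slab : slabs) {
      if (remaining < slab.length) {
        return slab[remaining];
      }
      remaining -= slab.length;
    }
    return currentSlab[remaining];
  }
}
```

With these assumed constants, an unused list allocates nothing and a list holding a single value allocates 4K ints instead of 64K. One trade-off of this sketch is that slabs no longer share a fixed length, so an iterator could not compute positions from SLAB_SIZE alone as the current IntIterator does.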