Return-Path: X-Original-To: apmail-hive-dev-archive@www.apache.org Delivered-To: apmail-hive-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 19467F634 for ; Wed, 1 May 2013 18:16:17 +0000 (UTC) Received: (qmail 10144 invoked by uid 500); 1 May 2013 18:16:16 -0000 Delivered-To: apmail-hive-dev-archive@hive.apache.org Received: (qmail 10091 invoked by uid 500); 1 May 2013 18:16:16 -0000 Mailing-List: contact dev-help@hive.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@hive.apache.org Delivered-To: mailing list dev@hive.apache.org Received: (qmail 10067 invoked by uid 500); 1 May 2013 18:16:16 -0000 Delivered-To: apmail-hadoop-hive-dev@hadoop.apache.org Received: (qmail 10041 invoked by uid 99); 1 May 2013 18:16:16 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 01 May 2013 18:16:16 +0000 Date: Wed, 1 May 2013 18:16:16 +0000 (UTC) From: "Phabricator (JIRA)" To: hive-dev@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (HIVE-4421) Improve memory usage by ORC dictionaries MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/HIVE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646783#comment-13646783 ] Phabricator commented on HIVE-4421: ----------------------------------- ashutoshc has requested changes to the revision "HIVE-4421 [jira] Improve memory usage by ORC dictionaries". Logic in patch mostly looks good. Just requesting for more comments, though ORC is already have pretty good comments. Also, I didn't understand changes in RedBlackTree. I assume you have improved memory accounting for it. But it will be great if you can spell out what was the problem earlier which you are fixing in this patch. INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:44 "added to.." is repeated. ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:47 I think its better to define this in HiveConf as well, so that we can up or down this value without needing to recompile Hive. Specially, since size of row is unbounded. Size of 5K rows are very much data dependent. e.g., I recently saw a table which had more than 100 string columns. ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:141 I think it will be good to add a note in comment about usage of synchronized keyword, ie the scenario where this method might be invoked from different threads. ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:94 It will be good to add a comment on when oldVal could possibly be null. On the first reading of code, it wasn't obvious to me. ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicIntArray.java:138 Better name : getSizeInBytes ? ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:142 It will be good to add a comment saying that every 5000 rows added across all writers we request each writer to flush their content to disk if they are using memory beyond their quota. ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:233 I didnt get when current ByteBuffer could be null. It will always be non-null when this method is invoked. Isnt it? Will be good to add a comment if the case is otherwise. ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:236 Just for my own clarity, this will be null when compression is off, right ? ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:37 It will be good to add a comment for all these 3 ByteBuffers for what kind of data are they holding. ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java:687 Pardon my ignorance. I didn't get what countOutput was meant for earlier and why it is no longer required. REVISION DETAIL https://reviews.facebook.net/D10545 BRANCH h-4421 ARCANIST PROJECT hive To: JIRA, ashutoshc, omalley > Improve memory usage by ORC dictionaries > ---------------------------------------- > > Key: HIVE-4421 > URL: https://issues.apache.org/jira/browse/HIVE-4421 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers > Reporter: Owen O'Malley > Assignee: Owen O'Malley > Fix For: 0.11.0 > > Attachments: HIVE-4421.D10545.1.patch, HIVE-4421.D10545.2.patch, HIVE-4421.D10545.3.patch > > > Currently, for tables with many string columns, it is possible to significantly underestimate the memory used by the ORC dictionaries and cause the query to run out of memory in the task. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira