Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Wed, 1 May 2013 18:16:16 +0000 (UTC)
From: "Phabricator (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <JIRA.12644521.1366861058936.255870.1367432176328@arcas>
In-Reply-To: <JIRA.12644521.1366861058936@arcas>
References: <JIRA.12644521.1366861058936@arcas>
Subject: [jira] [Commented] (HIVE-4421) Improve memory usage by ORC
 dictionaries
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646783#comment-13646783 ] 

Phabricator commented on HIVE-4421:
-----------------------------------

ashutoshc has requested changes to the revision "HIVE-4421 [jira] Improve memory usage by ORC dictionaries".

  Logic in patch mostly looks good. Just requesting for more comments, though ORC is already have pretty good comments. Also, I didn't understand changes in RedBlackTree. I assume you have improved memory accounting for it. But it will be great if you can spell out what was the problem earlier which you are fixing in this patch.

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:44 "added to.." is repeated.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:47 I think its better to define this in HiveConf as well, so that we can up or down this value without needing to recompile Hive. Specially, since size of row is unbounded. Size of 5K rows are very much data dependent. e.g., I recently saw a table which had more than 100 string columns.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:141 I think it will be good to add a note in comment about usage of synchronized keyword, ie the scenario where this method might be invoked from different threads.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:94 It will be good to add a comment on when oldVal could possibly be null. On the first reading of code, it wasn't obvious to me.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicIntArray.java:138 Better name : getSizeInBytes ?
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:142 It will be good to add a comment saying that every 5000 rows added across all writers we request each writer to flush their content to disk if they are using memory beyond their quota.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:233 I didnt get when current ByteBuffer could be null. It will always be non-null when this method is invoked. Isnt it? Will be good to add a comment if the case is otherwise.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:236 Just for my own clarity, this will be null when compression is off, right ?
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:37 It will be good to add a comment for all these 3 ByteBuffers for what kind of data are they holding.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java:687 Pardon my ignorance. I didn't get what countOutput was meant for earlier and why it is no longer required.

REVISION DETAIL
  https://reviews.facebook.net/D10545

BRANCH
  h-4421

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, omalley

                
> Improve memory usage by ORC dictionaries
> ----------------------------------------
>
>                 Key: HIVE-4421
>                 URL: https://issues.apache.org/jira/browse/HIVE-4421
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.11.0
>
>         Attachments: HIVE-4421.D10545.1.patch, HIVE-4421.D10545.2.patch, HIVE-4421.D10545.3.patch
>
>
> Currently, for tables with many string columns, it is possible to significantly underestimate the memory used by the ORC dictionaries and cause the query to run out of memory in the task. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira