hive-dev mailing list archives

From "Phabricator (JIRA)" <>
Subject [jira] [Commented] (HIVE-4421) Improve memory usage by ORC dictionaries
Date Wed, 01 May 2013 18:16:16 GMT


Phabricator commented on HIVE-4421:

ashutoshc has requested changes to the revision "HIVE-4421 [jira] Improve memory usage by
ORC dictionaries".

  The logic in the patch mostly looks good. I am just requesting more comments, even though ORC
already has pretty good comments. Also, I didn't understand the changes in RedBlackTree. I assume
you have improved its memory accounting, but it would be great if you could spell out what the
earlier problem was that this patch fixes.

  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ "added to.." is repeated.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ I think it's better to
define this in HiveConf as well, so that we can raise or lower this value without recompiling
Hive, especially since the size of a row is unbounded. The size of 5K rows is very much data
dependent; e.g., I recently saw a table which had more than 100 string columns.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ I think it would be good
to add a note in a comment about the use of the synchronized keyword, i.e., the scenario where
this method might be invoked from different threads.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ It would be good to add
a comment on when oldVal could possibly be null. On a first reading of the code, it wasn't
obvious to me.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ Better name: getSizeInBytes
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ It would be good to add
a comment saying that for every 5000 rows added across all writers, we ask each writer to flush
its contents to disk if it is using more memory than its quota.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ I didn't get when the current
ByteBuffer could be null. It will always be non-null when this method is invoked, won't it?
It would be good to add a comment if that is not the case.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ Just for my own clarity,
this will be null when compression is off, right?
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ It would be good to add a
comment for each of these 3 ByteBuffers describing what kind of data it holds.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ Pardon my ignorance. I
didn't get what countOutput was meant for earlier and why it is no longer required.
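
  To make the suggestions above concrete, here is a rough sketch of the periodic memory check
with a configurable row interval. It is not the code from the patch: the names FlushCallback,
SketchMemoryManager, rowsBetweenChecks and quotaPerWriterBytes are hypothetical, the fixed
per-writer quota is a simplification, and the idea that several writers on different threads
share one manager is an assumption about why synchronized would be needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback a writer registers with the manager (not the patch's interface). */
interface FlushCallback {
  /** Estimated memory currently held by this writer, in bytes. */
  long getSizeInBytes();

  /** Write buffered content to disk and release the memory. */
  void flush();
}

/** Illustrative sketch only: checks writers' memory use every N rows. */
class SketchMemoryManager {
  // Assumed to be read from configuration instead of being a compile-time constant.
  private final int rowsBetweenChecks;
  // Simplification: every writer gets the same fixed quota.
  private final long quotaPerWriterBytes;
  private final List<FlushCallback> writers = new ArrayList<>();
  private int rowsSinceCheck = 0;

  SketchMemoryManager(int rowsBetweenChecks, long quotaPerWriterBytes) {
    if (rowsBetweenChecks <= 0) {
      throw new IllegalArgumentException("rowsBetweenChecks must be positive");
    }
    this.rowsBetweenChecks = rowsBetweenChecks;
    this.quotaPerWriterBytes = quotaPerWriterBytes;
  }

  synchronized void addWriter(FlushCallback writer) {
    writers.add(writer);
  }

  /**
   * Called once per row added by any writer. Synchronized on the assumption that
   * writers running on different threads may share a single manager.
   */
  synchronized void addedRow() {
    rowsSinceCheck++;
    if (rowsSinceCheck >= rowsBetweenChecks) {
      rowsSinceCheck = 0;
      // Ask each writer that is over its quota to flush its content to disk.
      for (FlushCallback writer : writers) {
        if (writer.getSizeInBytes() > quotaPerWriterBytes) {
          writer.flush();
        }
      }
    }
  }
}
```

  With the interval passed in through the constructor, a table with very wide rows could use a
smaller value than 5000 without a rebuild.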




To: JIRA, ashutoshc, omalley

> Improve memory usage by ORC dictionaries
> ----------------------------------------
>                 Key: HIVE-4421
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.11.0
>         Attachments: HIVE-4421.D10545.1.patch, HIVE-4421.D10545.2.patch, HIVE-4421.D10545.3.patch
> Currently, for tables with many string columns, it is possible to significantly underestimate
the memory used by the ORC dictionaries and cause the query to run out of memory in the task.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
