hive-issues mailing list archives

From "Misha Dmitriev (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-19937) Intern JobConf objects in Spark tasks
Date Fri, 29 Jun 2018 02:24:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527045#comment-16527045
] 

Misha Dmitriev edited comment on HIVE-19937 at 6/29/18 2:23 AM:
----------------------------------------------------------------

I took a quick look, and I am not sure this is done correctly. The code below
{code:java}
jobConf.forEach(entry -> {
  StringInternUtils.internIfNotNull(entry.getKey());
  StringInternUtils.internIfNotNull(entry.getValue());
}){code}
goes over each table entry and invokes {{intern()}} on each key and value. {{intern()}}
returns an existing, "canonical" string for each duplicate string, but the code doesn't store
the returned strings back into the table, so nothing is actually deduplicated. To intern both
keys and values in a hashtable, you typically need to create a new table and "intern and
transfer" the contents from the old table into it. Sometimes it is possible to be more creative
and create a table with interned contents right away. Here that could probably be done by
adding custom Kryo deserialization code for such tables, but maybe that's too big an effort.
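To make the point concrete, here is a minimal sketch of the "intern and transfer" approach (illustrative names only, not the actual Hive patch; {{internIfNotNull}} below is a local stand-in for {{StringInternUtils.internIfNotNull}}). The key difference from the code above is that the interned strings are written into a new table instead of being discarded:

```java
import java.util.HashMap;
import java.util.Map;

public class InternDemo {
    // Local stand-in for StringInternUtils.internIfNotNull.
    static String internIfNotNull(String s) {
        return s == null ? null : s.intern();
    }

    // "Intern and transfer": build a new map whose keys and values
    // are the canonical (interned) instances.
    static Map<String, String> internMap(Map<String, String> src) {
        Map<String, String> dst = new HashMap<>(src.size());
        for (Map.Entry<String, String> e : src.entrySet()) {
            dst.put(internIfNotNull(e.getKey()), internIfNotNull(e.getValue()));
        }
        return dst;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        // new String(...) forces non-canonical instances,
        // as you would get from deserialized data.
        conf.put(new String("mapreduce.job.name"), new String("q1"));
        Map<String, String> interned = internMap(conf);
        // After the transfer, the stored value is the canonical instance.
        System.out.println(interned.get("mapreduce.job.name") == "q1".intern());
    }
}
```

Merely calling {{intern()}} without storing the result only creates a short-lived reference to the canonical string; the duplicate in the table stays reachable and no memory is saved.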

As always, it would be good to measure how much memory was wasted before this change and saved
after it. This helps catch errors and shows how much was actually achieved.

If {{jobConf}} is an instance of {{java.util.Properties}}, and there are many duplicates of
such tables, then memory is wasted both by the string contents of these tables and by the tables
themselves (each table uses many extra Java objects internally). So you may consider checking the {{org.apache.hadoop.hive.common.CopyOnFirstWriteProperties}}
class that I once added for a somewhat similar use case.
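For reference, the core copy-on-first-write idea behind that class can be sketched roughly as follows (an illustrative simplification, not the actual Hive implementation; a complete version must also delegate every other read method to the shared table). Many readers share one canonical {{Properties}} instance, and a private copy is made only on the first mutation:

```java
import java.util.Properties;

public class CowProperties extends Properties {
    private Properties shared;  // shared, read-only backing table

    public CowProperties(Properties shared) {
        this.shared = shared;
    }

    // Copy the shared contents into this instance on the first mutation.
    // shared is cleared before putAll, because Hashtable.putAll calls the
    // overridden put and would otherwise recurse forever.
    private synchronized void copyOnWrite() {
        if (shared != null) {
            Properties src = shared;
            shared = null;
            super.putAll(src);
        }
    }

    @Override
    public synchronized Object put(Object key, Object value) {
        copyOnWrite();
        return super.put(key, value);
    }

    @Override
    public String getProperty(String key) {
        Properties s = shared;
        return s != null ? s.getProperty(key) : super.getProperty(key);
    }
}
```

Until the first write, every {{CowProperties}} wrapper costs only a couple of object headers instead of a full duplicated hashtable, which is what saves memory when many identical configurations coexist in one JVM.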


was (Author: misha@cloudera.com):
I took a quick look, and I am not sure this is done correctly. The code below
{code:java}
jobConf.forEach(entry -> {
  StringInternUtils.internIfNotNull(entry.getKey());
  StringInternUtils.internIfNotNull(entry.getValue());
}){code}

> Intern JobConf objects in Spark tasks
> -------------------------------------
>
>                 Key: HIVE-19937
>                 URL: https://issues.apache.org/jira/browse/HIVE-19937
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-19937.1.patch
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the {{JobConf}}
object to prevent any {{ConcurrentModificationException}} from being thrown. However, setting
this variable comes at a cost of storing a duplicate {{JobConf}} object for each Spark task.
These objects can take up a significant amount of memory; we should intern them so that Spark
tasks running in the same JVM don't store duplicate copies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
