From issues-return-126414-archive-asf-public=cust-asf.ponee.io@hive.apache.org  Fri Jun 29 04:24:04 2018
Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:issues-help@hive.apache.org>
List-Unsubscribe: <mailto:issues-unsubscribe@hive.apache.org>
List-Post: <mailto:issues@hive.apache.org>
List-Id: <issues.hive.apache.org>
Reply-To: dev@hive.apache.org
Delivered-To: mailing list issues@hive.apache.org
Date: Fri, 29 Jun 2018 02:24:00 +0000 (UTC)
From: "Misha Dmitriev (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13166791.1529354206000.37144.1530239040115@Atlassian.JIRA>
In-Reply-To: <JIRA.13166791.1529354206000@Atlassian.JIRA>
References: <JIRA.13166791.1529354206000@Atlassian.JIRA> <JIRA.13166791.1529354206497@jira-lw-us.apache.org>
Subject: [jira] [Comment Edited] (HIVE-19937) Intern JobConf objects in
 Spark tasks
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394


    [ https://issues.apache.org/jira/browse/HIVE-19937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16527045#comment-16527045 ] 

Misha Dmitriev edited comment on HIVE-19937 at 6/29/18 2:23 AM:
----------------------------------------------------------------

I took a quick look, and I am not sure this is done correctly. The code below
{code:java}
jobConf.forEach(entry -> {
  StringInternUtils.internIfNotNull(entry.getKey());
  StringInternUtils.internIfNotNull(entry.getValue());
});
{code}
iterates over each table entry and just invokes {{intern()}} on each key and value. {{intern()}} returns the existing, "canonical" instance for any string that is a duplicate, but the code doesn't store the returned strings back into the table, so the table keeps referencing the original, non-interned strings. To intern both keys and values in a hashtable, you typically need to create a new table and effectively "intern and transfer" the contents from the old table to the new one. Sometimes it is possible to be more creative and create a table with interned contents right away. Here that could probably be done by adding custom Kryo deserialization code for such tables, but maybe that's too big an effort.
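
A minimal sketch of the "intern and transfer" approach, assuming the table can be treated as a plain {{java.util.Properties}} and using {{String.intern()}} directly rather than Hive's {{StringInternUtils}}. The names {{InternedCopy}} and {{internAndTransfer}} are hypothetical and not part of the patch:

```java
import java.util.Properties;

// Hypothetical helper, for illustration only.
final class InternedCopy {

    /**
     * Builds a NEW table that holds the canonical (interned) instance of
     * every key and value found in the source table.
     * Note: stringPropertyNames() also includes defaults, so this flattens them.
     */
    static Properties internAndTransfer(Properties source) {
        Properties target = new Properties();
        for (String name : source.stringPropertyNames()) {
            String value = source.getProperty(name);
            // The important part: store the strings RETURNED by intern().
            // Calling intern() and dropping the result leaves the table unchanged.
            target.setProperty(name.intern(), value.intern());
        }
        return target;
    }

    public static void main(String[] args) {
        // Two tables with equal contents but distinct String objects,
        // similar to what deserialization would produce.
        Properties a = new Properties();
        Properties b = new Properties();
        a.setProperty(new String("some.key"), new String("some-value"));
        b.setProperty(new String("some.key"), new String("some-value"));

        Properties ia = internAndTransfer(a);
        Properties ib = internAndTransfer(b);

        // Before: each table references its own copy of the value.
        System.out.println(a.getProperty("some.key") == b.getProperty("some.key"));   // false
        // After: both tables reference the same canonical String.
        System.out.println(ia.getProperty("some.key") == ib.getProperty("some.key")); // true
    }
}
```

This only deduplicates the strings. Each task would still hold its own table object, so the per-table overhead remains.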

As always, it would be good to measure how much memory was wasted before this change and how much is saved after it. This helps to catch errors and shows how much was actually achieved.

If {{jobConf}} is an instance of {{java.lang.Properties}}, and there are many duplicates of such tables, then memory is wasted both by the string contents of these tables and by the tables themselves (each table uses many extra Java objects internally). So you may want to look at the {{org.apache.hadoop.hive.common.CopyOnFirstWriteProperties}} class that I once added for a somewhat similar use case.
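
To illustrate the point about duplicate tables, here is a rough sketch of deduplicating whole tables rather than just their strings. It is not the {{CopyOnFirstWriteProperties}} implementation. {{TableInterner}} is a hypothetical name, and the sketch assumes the shared tables are never modified after being interned:

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper, for illustration only.
final class TableInterner {

    // Maps each distinct table content to the one canonical instance.
    // Assumption: interned tables are treated as read-only; mutating a table
    // that is used as a map key would break lookups.
    private static final ConcurrentMap<Properties, Properties> CANONICAL =
            new ConcurrentHashMap<>();

    /** Returns the canonical table equal to the argument, registering it if new. */
    static Properties intern(Properties table) {
        Properties existing = CANONICAL.putIfAbsent(table, table);
        return existing != null ? existing : table;
    }

    public static void main(String[] args) {
        Properties first = new Properties();
        Properties second = new Properties();
        first.setProperty("some.key", "some-value");
        second.setProperty("some.key", "some-value");

        Properties a = intern(first);
        Properties b = intern(second);

        // Both callers now share one table object; 'second' can be garbage collected.
        System.out.println(a == b); // true
    }
}
```

The map above holds strong references, so canonical tables are never released. A real implementation would likely need a weak interner (for example Guava's {{Interners.newWeakInterner()}}) and a way to copy the table before the first write.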



> Intern JobConf objects in Spark tasks
> -------------------------------------
>
>                 Key: HIVE-19937
>                 URL: https://issues.apache.org/jira/browse/HIVE-19937
>             Project: Hive
>          Issue Type: Improvement
>          Components: Spark
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HIVE-19937.1.patch
>
>
> When fixing HIVE-16395, we decided that each new Spark task should clone the {{JobConf}} object to prevent any {{ConcurrentModificationException}} from being thrown. However, setting this variable comes at the cost of storing a duplicate {{JobConf}} object for each Spark task. These objects can take up a significant amount of memory; we should intern them so that Spark tasks running in the same JVM don't store duplicate copies.

