Date: Mon, 15 May 2017 19:56:04 +0000 (UTC)
From: "Misha Dmitriev (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Updated] (HDFS-11383) String duplication in org.apache.hadoop.fs.BlockLocation

     [ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Misha Dmitriev updated HDFS-11383:
----------------------------------
    Attachment: HDFS-11383.01.patch

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HDFS-11383.01.patch
>
>
> I am working on Hive performance, investigating the problem of high memory pressure when (a) a table consists of a large number (thousands) of partitions and (b) multiple queries run against it concurrently. It turns out that a lot of memory is wasted due to data duplication. One source of duplicate strings is the class org.apache.hadoop.fs.BlockLocation. Its fields, such as storageIds, topologyPaths, hosts, and names, may collectively use up to 6% of memory in my benchmark, causing (together with other problematic classes) a huge memory spike.
> Of the 6% of memory taken by BlockLocation strings, more than 5% is wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
> // Interns each element in place and returns the array so the result can be assigned.
> private String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }
> {code}
> String.intern() performs very well starting from JDK 7. I've found some articles explaining the progress made by the HotSpot JVM developers in this area, verified it with my own benchmarks, and finally added quite a bit of interning to one of the Cloudera products without any issues.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
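
The interning idea described in the issue can be tried out in a small standalone program. The following is only a sketch and not the attached patch: the class name InternDemo, the null checks, and the sample host name are assumptions made for illustration, and the real BlockLocation constructor may handle its arrays differently.

```java
// Hypothetical demo class; it is not part of Hadoop or of HDFS-11383.01.patch.
class InternDemo {

    // Replaces each non-null element with its canonical interned instance
    // and returns the same array so the call can be used in an assignment.
    // The null checks are an assumption; the snippet in the issue has none.
    static String[] internStringsInArray(String[] sar) {
        if (sar == null) {
            return null;
        }
        for (int i = 0; i < sar.length; i++) {
            if (sar[i] != null) {
                sar[i] = sar[i].intern();
            }
        }
        return sar;
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects with equal contents,
        // standing in for host names created separately for each block.
        String[] first = { new String("host1.example.com") };
        String[] second = { new String("host1.example.com") };
        System.out.println("same object before interning: " + (first[0] == second[0]));

        internStringsInArray(first);
        internStringsInArray(second);
        System.out.println("same object after interning: " + (first[0] == second[0]));
    }
}
```

After interning, equal strings in different arrays refer to one shared object, so the duplicate copies become eligible for garbage collection. How much memory that saves depends on how many distinct values actually occur in the workload.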