Date: Wed, 1 Feb 2017 17:02:51 +0000 (UTC)
From: "Manoj Govindassamy (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Assigned] (HDFS-11383) String duplication in org.apache.hadoop.fs.BlockLocation

[ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Manoj Govindassamy reassigned HDFS-11383:
-----------------------------------------

    Assignee: Manoj Govindassamy

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Manoj Govindassamy
>
> I am working on Hive performance, investigating the problem of high memory pressure when (a) a table consists of a high number (thousands) of partitions and (b) multiple queries run against it concurrently. It turns out that a lot of memory is wasted due to data duplication. One source of duplicate strings is class org.apache.hadoop.fs.BlockLocation. Its fields such as storageIds, topologyPaths, hosts, names, may collectively use up to 6% of memory in my benchmark, causing (together with other problematic classes) a huge memory spike.
> Of the 6% of memory taken by BlockLocation strings, more than 5% is wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
> private String[] internStringsInArray(String[] sar) {
>   // Replace each element with its canonical (interned) instance.
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }
> {code}
> String.intern() performs very well starting from JDK 7. I've found some articles explaining the progress that was made by the HotSpot JVM developers in this area, verified that with benchmarks myself, and finally added quite a bit of interning to one of the Cloudera products without any issues.
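
For illustration only, here is a minimal standalone sketch of the interning idea described in the issue. It is not the actual BlockLocation change: the class name InternDemo, the null checks, and the example host names are assumptions added for this sketch. It shows that two arrays holding equal but distinct String objects end up sharing the same instances after interning.

```java
class InternDemo {
    // Hypothetical helper modeled on the snippet in the issue.
    // Replaces each non-null element with its canonical (interned) instance
    // and returns the same array so the result can be assigned to a field.
    // The null checks are an assumption; the issue does not mention them.
    public static String[] internStringsInArray(String[] sar) {
        if (sar == null) {
            return null;
        }
        for (int i = 0; i < sar.length; i++) {
            if (sar[i] != null) {
                sar[i] = sar[i].intern();
            }
        }
        return sar;
    }

    public static void main(String[] args) {
        // new String(...) forces distinct objects with equal contents,
        // imitating host names deserialized separately for each block.
        String[] hostsA = { new String("dn1.example.com"), new String("dn2.example.com") };
        String[] hostsB = { new String("dn1.example.com"), new String("dn2.example.com") };

        // Before interning the two arrays reference different objects.
        System.out.println(hostsA[0] == hostsB[0]); // prints "false"

        internStringsInArray(hostsA);
        internStringsInArray(hostsB);

        // After interning both arrays reference the same canonical object,
        // so the duplicate copies become eligible for garbage collection.
        System.out.println(hostsA[0] == hostsB[0]); // prints "true"
    }
}
```

The helper mutates the array in place and also returns it, which matches the assignment form `this.hosts = internStringsInArray(hosts)` used in the issue. Whether mutating the caller's array is acceptable in BlockLocation itself is a design question this sketch does not settle.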