hadoop-hdfs-issues mailing list archives

From "Misha Dmitriev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-11383) String duplication in org.apache.hadoop.fs.BlockLocation
Date Wed, 24 May 2017 02:05:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022204#comment-16022204
] 

Misha Dmitriev commented on HDFS-11383:
---------------------------------------

Hi Andrew,

I understand your concerns. Unit tests could be a good solution, but the problem is that, to
quantify the effect of a change like this, one would in principle need to first run some code
that uses the unchanged BlockLocation and measure how much memory it consumes, then run the
same code with a BlockLocation that interns its strings and measure memory again. There is
also the question of how representative such a "pseudo-benchmark" would be: for example, I can
easily populate some data structure with very big strings and then demonstrate that interning
them saves a lot of memory. But would that resemble real-life usage patterns?
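
A minimal sketch of such a before/after comparison, with the caveat raised above that it is a synthetic pseudo-benchmark and not a measurement of real usage. It counts distinct String objects by identity instead of measuring heap bytes, because object counts are deterministic. The class name, the method names and the example.com host names are hypothetical; the figures 5088 and 4 are borrowed from the report excerpt only to make the scale familiar.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

public class InternComparison {

    // Counts distinct String objects by identity (==), not by equals().
    static int countDistinctObjects(String[] strings) {
        Set<String> seen =
            Collections.newSetFromMap(new IdentityHashMap<String, Boolean>());
        Collections.addAll(seen, strings);
        return seen.size();
    }

    // Builds an array whose values repeat but whose objects do not:
    // run-time concatenation creates a new String object on every iteration.
    static String[] simulateHosts(int count, int uniqueHosts) {
        String[] hosts = new String[count];
        for (int i = 0; i < count; i++) {
            hosts[i] = "host-" + (i % uniqueHosts) + ".example.com";
        }
        return hosts;
    }

    // Returns a new array holding the canonical (interned) instance of each element.
    static String[] internAll(String[] strings) {
        String[] result = new String[strings.length];
        for (int i = 0; i < strings.length; i++) {
            result[i] = strings[i] == null ? null : strings[i].intern();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] raw = simulateHosts(5088, 4);
        String[] interned = internAll(raw);
        System.out.println("distinct objects before: " + countDistinctObjects(raw));
        System.out.println("distinct objects after:  " + countDistinctObjects(interned));
    }
}
```

This only shows that interning collapses equal values into one object each. How much heap that saves in practice still depends on string lengths and on how long the holders stay reachable, which is what a real benchmark would have to establish.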

So I suspect that a real benchmark would be best, but it is indeed hard to revive my test
cluster right now. Maybe I can still convince you by:
- noting that String.intern() is proven to work well (I've already optimized several projects
at Cloudera with its help, and there I could definitely quantify the effect of the changes;
we can discuss all this offline if you would like)
- attaching the results from my old benchmark showing how much memory is wasted due to duplicate
strings in BlockLocation. I am attaching the full jxray report for one of the heap dumps that
I obtained in this benchmark, and here are the most relevant excerpts:

{code}
6. DUPLICATE STRINGS

Total strings: 172,451  Unique strings: 52,360  Duplicate values: 16,158  Overhead: 14,291K (29.8%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value

  1,398K (2.9%)    12791       12791      "host-10-17-101-14.coe.cloudera.com"
  1,163K (2.4%)     9926        9926      "host-10-17-101-14.coe.cloudera.com:8020"
    809K (1.7%)        6           6      "hdfs://host-10-17-101-14.coe.cloudera.com:8020/tmp/misha/misha-table-partition-1,hdf...[length 82892]"
    465K (1.0%)     9923        9923      "hdfs"
    ....

7. REFERENCE CHAINS FOR DUPLICATE STRINGS

  595K (1.2%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "DS-aab6ab0b-0b11-489f-b209-ab2c6412934c", 1149 of "DS-d47bdaca-50c5-4475-ac08-7f07e10cd0b6",
1132 of "DS-bf6046e6-d5e9-4ac2-a1af-ff8a88ab9d85", 1111 of "DS-d2c5088c-bd69-4500-b981-502819c1307a"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.storageIds <-- org.apache.hadoop.fs.BlockLocation[]
<-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java
Local@fd414328 (j.u.ArrayList)
 
 556K (1.2%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "host-10-17-101-14.coe.cloudera.com", 1149 of "host-10-17-101-15.coe.cloudera.com",
1132 of "host-10-17-101-17.coe.cloudera.com", 1111 of "host-10-17-101-16.coe.cloudera.com"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.hosts <-- org.apache.hadoop.fs.BlockLocation[]
<-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java
Local@fd414328 (j.u.ArrayList)

  476K (1.0%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "/default/10.17.101.14:50010", 1149 of "/default/10.17.101.15:50010", 1132 of "/default/10.17.101.17:50010",
1111 of "/default/10.17.101.16:50010"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.topologyPaths <-- org.apache.hadoop.fs.BlockLocation[]
<-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java
Local@fd414328 (j.u.ArrayList)

  409K (0.9%), 3492 dup strings (4 unique), 3492 dup backing arrays:
1164 of "DS-aab6ab0b-0b11-489f-b209-ab2c6412934c", 788 of "DS-d47bdaca-50c5-4475-ac08-7f07e10cd0b6",
770 of "DS-bf6046e6-d5e9-4ac2-a1af-ff8a88ab9d85", 770 of "DS-d2c5088c-bd69-4500-b981-502819c1307a"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.storageIds <-- org.apache.hadoop.fs.BlockLocation[]
<-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java
Local@fd67ae70 (j.u.ArrayList)

  397K (0.8%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "10.17.101.14:50010", 1149 of "10.17.101.15:50010", 1132 of "10.17.101.17:50010",
1111 of "10.17.101.16:50010"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.names <-- org.apache.hadoop.fs.BlockLocation[]
<-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java
Local@fd414328 (j.u.ArrayList)

  381K (0.8%), 3492 dup strings (4 unique), 3492 dup backing arrays:
1164 of "host-10-17-101-14.coe.cloudera.com", 788 of "host-10-17-101-15.coe.cloudera.com",
770 of "host-10-17-101-17.coe.cloudera.com", 770 of "host-10-17-101-16.coe.cloudera.com"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.hosts <-- org.apache.hadoop.fs.BlockLocation[]
<-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java
Local@fd67ae70 (j.u.ArrayList)

....
{code}
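
The overhead figures in the excerpt can be sanity-checked with simple arithmetic. The sketch below assumes a sizing model that the report itself does not state: a 64-bit HotSpot JVM with compressed references, where a String object takes 24 bytes and its backing char[] takes a 16-byte header plus 2 bytes per character, rounded up to a multiple of 8. Overhead is then the per-copy size times the number of redundant copies. The class and method names are hypothetical.

```java
public class DuplicateOverheadEstimate {

    // Assumed object sizes (64-bit HotSpot, compressed references, char[]-backed strings).
    static final int STRING_OBJECT_BYTES = 24;
    static final int ARRAY_HEADER_BYTES = 16;
    static final int ALIGNMENT = 8;

    // Rounds a size up to the next multiple of the object alignment.
    static long alignUp(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // Estimated footprint of one String of the given length plus its char[].
    static long bytesPerString(int length) {
        return STRING_OBJECT_BYTES + alignUp(ARRAY_HEADER_BYTES + 2L * length);
    }

    // Bytes spent on redundant copies, in whole kilobytes (1K = 1024 bytes).
    static long overheadKb(int length, long totalCopies, long uniqueValues) {
        return (totalCopies - uniqueValues) * bytesPerString(length) / 1024;
    }

    public static void main(String[] args) {
        // BlockLocation.hosts chain: 5088 strings, 4 unique, 34 characters each.
        int hostLength = "host-10-17-101-14.coe.cloudera.com".length();
        System.out.println(overheadKb(hostLength, 5088, 4) + "K");   // 556K

        // Top duplicate "hdfs": 9923 copies of a single value.
        System.out.println(overheadKb("hdfs".length(), 9923, 1) + "K"); // 465K
    }
}
```

Under these assumptions the estimate reproduces the 556K reported for the first BlockLocation.hosts chain and the 465K reported for "hdfs". On JVMs with a different String layout (for example compact strings in JDK 9 and later) the per-copy size would differ.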

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HDFS-11383.01.patch
>
>
> I am working on Hive performance, investigating the problem of high memory pressure when
> (a) a table consists of a high number (thousands) of partitions and (b) multiple queries run
> against it concurrently. It turns out that a lot of memory is wasted due to data duplication.
> One source of duplicate strings is class org.apache.hadoop.fs.BlockLocation. Its fields such
> as storageIds, topologyPaths, hosts, names, may collectively use up to 6% of memory in my
> benchmark, causing (together with other problematic classes) a huge memory spike. Of these
> 6% of memory taken by BlockLocation strings, more than 5% are wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation constructor, like:
> {code}
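> this.hosts = internStringsInArray(hosts);
> ...
> // Interns each element in place and returns the same array,
> // so the result can be assigned to a field as above.
> private static String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }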
> {code}
> String.intern() performs very well starting from JDK 7. I've found some articles explaining
> the progress that was made by the HotSpot JVM developers in this area, verified that with
> benchmarks myself, and finally added quite a bit of interning to one of the Cloudera products
> without any issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

