hadoop-hdfs-issues mailing list archives

From "Barnabas Maidics (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-13752) fs.Path stores file path in java.net.URI causes big memory waste
Date Sun, 12 Aug 2018 07:58:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-13752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16577447#comment-16577447 ]

Barnabas Maidics commented on HDFS-13752:

[~misha@cloudera.com] yes, you're correct. 

But with my solution we can read the path, scheme, authority, and fragment directly from the
Path, without building a URI first. With on-demand URI creation those lookups would be much
faster (35x faster than the original). A plain toUri call, where we need the URI itself,
would be 10x slower.
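
To illustrate the trade-off, here is a minimal sketch of on-demand URI creation. It is not the actual patch and not the real org.apache.hadoop.fs.Path: the class name LazyPath and its constructor are hypothetical, the fragment is left out for brevity, and the sketch assumes nothing is cached, so every toUri call pays for building a new java.net.URI.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical illustration only; not Hadoop's Path class or the proposed patch. */
final class LazyPath {
    private final String scheme;     // e.g. "hdfs"; may be null
    private final String authority;  // e.g. "namenode:8020"; may be null
    private final String path;       // e.g. "/warehouse/t/part=1"

    LazyPath(String scheme, String authority, String path) {
        this.scheme = scheme;
        this.authority = authority;
        this.path = path;
    }

    // Component lookups read the stored strings directly; no URI is created.
    String getScheme() { return scheme; }
    String getAuthority() { return authority; }
    String getPath() { return path; }

    /** Builds a new URI on every call; assumed uncached to keep memory low. */
    URI toUri() {
        try {
            // Query and fragment are omitted (null) in this sketch.
            return new URI(scheme, authority, path, null, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

In this shape the getters are cheap field reads, while toUri does the construction work each time it is called, which matches the direction of the numbers above.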

I'll try to get access to an HDFS cluster and run some operations to measure the performance
difference. That would obviously be closer to a real use case. I'll upload the results once I have them.

> fs.Path stores file path in java.net.URI causes big memory waste
> ----------------------------------------------------------------
>                 Key: HDFS-13752
>                 URL: https://issues.apache.org/jira/browse/HDFS-13752
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 2.7.6
>         Environment: Hive 2.1.1 and hadoop 2.7.6 
>            Reporter: Barnabas Maidics
>            Priority: Major
>         Attachments: Screen Shot 2018-07-20 at 11.12.38.png, heapdump-100000partitions.html,
> I was looking at HiveServer2 memory usage, and a large share of it came from
org.apache.hadoop.fs.Path, which stores file paths in a java.net.URI object. The URI
implementation stores the same string in 3 different objects (see the attached image). In
Hive, when there are many partitions, this causes high memory usage. In my particular case 42%
of memory was used by java.net.URI, so storing the string only once could reduce that to 14%.
> I wonder if the community is open to replacing it with a more memory-efficient implementation,
and what other things should be considered here. It could be a huge memory improvement for
Hadoop and for Hive as well.
