hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-4461) DirectoryScanner: volume path prefix takes up memory for every block that is scanned
Date Tue, 11 Mar 2014 19:11:47 GMT

     [ https://issues.apache.org/jira/browse/HDFS-4461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-4461:

    Attachment: HDFS-4461.branch-0.23.patch

I thought we could wait until 2.x, but some 0.23 users are creating a lot of small files (i.e.
small blocks), and DNs are running out of memory when DirectoryScanner runs. The peak heap
usage can be almost 2x or even 3x the base usage if the garbage from one directory scan
survives until the next scan.

The patch is a straight back-port of the trunk version. The only difference is that a source
file was split into multiple files in branch-2/trunk. Other than that, the core change is
exactly the same.

> DirectoryScanner: volume path prefix takes up memory for every block that is scanned

> -------------------------------------------------------------------------------------
>                 Key: HDFS-4461
>                 URL: https://issues.apache.org/jira/browse/HDFS-4461
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 2.0.3-alpha
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>             Fix For: 2.1.0-beta
>         Attachments: HDFS-4461.002.patch, HDFS-4461.003.patch, HDFS-4461.004.patch, HDFS-4461.branch-0.23.patch,
HDFS-4661.006.patch, memory-analysis.png
> In the {{DirectoryScanner}}, we create a {{ScanInfo}} object for every block.  This object
contains two File objects, one for the metadata file and one for the block file.  Since
those File objects contain full paths, users who pick a lengthy path for their volume roots
will end up using on the order of N_blocks * path_prefix bytes of extra memory.  We also don't
really need to store File objects; storing strings and then creating File objects as needed
would be cheaper.  This would be a nice efficiency improvement.
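
A minimal sketch of the idea in the description above, and not the code from the attached patches: each per-block record keeps a reference to one shared volume root plus short path suffixes, and builds File objects only when asked. The class name CompactScanInfo, its fields, and its methods are hypothetical names chosen for illustration; the real ScanInfo class may differ in every detail.

```java
import java.io.File;

/**
 * Hypothetical, simplified per-block scan record (an assumption for
 * illustration; not the HDFS ScanInfo class or the attached patch).
 */
final class CompactScanInfo {
    private final long blockId;
    private final File volumeRoot;     // one shared instance per volume
    private final String blockSuffix;  // path relative to volumeRoot, or null
    private final String metaSuffix;   // path relative to volumeRoot, or null

    CompactScanInfo(long blockId, File volumeRoot, File blockFile, File metaFile) {
        this.blockId = blockId;
        this.volumeRoot = volumeRoot;
        this.blockSuffix = suffixOf(volumeRoot, blockFile);
        this.metaSuffix = suffixOf(volumeRoot, metaFile);
    }

    /** Strips the volume root prefix so only the relative part is stored. */
    private static String suffixOf(File root, File file) {
        if (file == null) {
            return null;
        }
        String rootPath = root.getAbsolutePath();
        String prefix = rootPath.endsWith(File.separator)
                ? rootPath : rootPath + File.separator;
        String fullPath = file.getAbsolutePath();
        if (!fullPath.startsWith(prefix)) {
            throw new IllegalArgumentException(file + " is not under " + root);
        }
        return fullPath.substring(prefix.length());
    }

    long getBlockId() {
        return blockId;
    }

    /** Rebuilds the block File on demand instead of holding it. */
    File getBlockFile() {
        return blockSuffix == null ? null : new File(volumeRoot, blockSuffix);
    }

    /** Rebuilds the metadata File on demand instead of holding it. */
    File getMetaFile() {
        return metaSuffix == null ? null : new File(volumeRoot, metaSuffix);
    }

    public static void main(String[] args) {
        // Example paths are made up for the demonstration.
        File root = new File("/data/1/dfs/dn");
        File blk = new File(root, "current/finalized/blk_1001");
        File meta = new File(root, "current/finalized/blk_1001_1.meta");
        CompactScanInfo info = new CompactScanInfo(1001L, root, blk, meta);
        System.out.println(info.getBlockFile());
        System.out.println(info.getMetaFile());
    }
}
```

With this layout the long volume prefix is held once per volume rather than once per File object, at the cost of allocating a short-lived File each time a caller needs one.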
