hadoop-hdfs-issues mailing list archives

From "Kihwal Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-7174) Support for more efficient large directories
Date Tue, 30 Sep 2014 22:41:33 GMT

    [ https://issues.apache.org/jira/browse/HDFS-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153913#comment-14153913
] 

Kihwal Lee commented on HDFS-7174:
----------------------------------

I propose that the children list automatically switch to a more efficient implementation
when the number of children exceeds a given threshold.  The implementation I came up with
uses smaller, hash-partitioned ArrayLists. The list as a whole is no longer fully sorted,
but applications that create a large number of entries (e.g. over 50K) will likely care
more about performance than about the list being sorted.  By setting the threshold high,
normal apps won't be affected.
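
To make the idea concrete, here is a rough sketch in Java. This is not the actual patch;
the class and method names are hypothetical, and the real INodeDirectory children list
holds INode objects rather than Strings. It starts as a single sorted ArrayList and, once
the child count crosses the threshold, does a one-time redistribution into hash-partitioned
sorted buckets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: a single sorted ArrayList until the child count
// crosses SWITCH_THRESHOLD, then a one-time conversion into
// hash-partitioned sorted buckets so each insert only shifts elements
// within one small bucket instead of the whole list.
class AutoSwitchingChildren {
    static final int SWITCH_THRESHOLD = 8; // 32768 in the measurements
    static final int NUM_BUCKETS = 4;      // 1024 in the measurements

    private List<String> flat = new ArrayList<>();
    private List<String>[] buckets; // non-null only after switching
    private int size;

    public int size() { return size; }

    public boolean isPartitioned() { return buckets != null; }

    public boolean add(String name) {
        List<String> target = (buckets == null) ? flat : bucketFor(name);
        int pos = Collections.binarySearch(target, name);
        if (pos >= 0) {
            return false; // already present
        }
        target.add(-pos - 1, name); // keep the (bucket) list sorted
        size++;
        if (buckets == null && size > SWITCH_THRESHOLD) {
            convert(); // one-time cost, paid only once per directory
        }
        return true;
    }

    public boolean contains(String name) {
        List<String> target = (buckets == null) ? flat : bucketFor(name);
        return Collections.binarySearch(target, name) >= 0;
    }

    @SuppressWarnings("unchecked")
    private void convert() {
        buckets = new ArrayList[NUM_BUCKETS];
        for (int i = 0; i < NUM_BUCKETS; i++) {
            buckets[i] = new ArrayList<>();
        }
        // flat is already sorted, so appending preserves order per bucket.
        for (String name : flat) {
            bucketFor(name).add(name);
        }
        flat = null;
    }

    private List<String> bucketFor(String name) {
        // Any stable hash works; mask keeps the index non-negative.
        return buckets[(name.hashCode() & 0x7fffffff) % NUM_BUCKETS];
    }
}
```

Within a bucket, lookups still use binary search; only global sorted iteration is lost.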

Here are the results of some measurements.

{noformat}
Unit is in milliseconds

Single ArrayList
==================================================
      random add   random get    iterating
==================================================
32K        221            45          5
128K       977           178         10
512K    12,843           967         13
1M      49,551         2,214         20
3M     472,408         8,254         43
5M   1,442,971        15,091         66
==================================================

Hashed 1024 ArrayLists (sorted within each list)
==================================================
      random add   random get    iterating
==================================================
32K        166            51          8
128K       296           149         13
512K       970           809         19
1M       2,013         1,974         34
3M       7,698         6,997         93
5M      14,126        12,614        133
==================================================

Auto Switching at 32768
==================================================
      random add   random get    iterating
==================================================
32K        239            41          7
128K       369           158         13
512K     1,020           822         18
1M       2,162         1,947         34
3M       8,013         7,036         94
5M      15,224        13,509        130
The one-time conversion took 28 ms.
{noformat}


> Support for more efficient large directories
> --------------------------------------------
>
>                 Key: HDFS-7174
>                 URL: https://issues.apache.org/jira/browse/HDFS-7174
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>
> When the number of children under a directory grows very large, insertion becomes very
> costly.  E.g. creating 1M entries takes tens of minutes.  This is because the complexity
> of a single insertion is O\(n\), so the total overhead of building the list grows as n^2
> (the integral of a linear function).  It also causes allocations and copies of big arrays.
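
The quadratic cost described in the issue can be illustrated with a toy example (this is
illustrative only, not HDFS code; the class name is made up). Binary search finds the
insertion point in O(log n), but ArrayList.add(index, e) must still shift every later
element, so n in-order inserts perform about 0 + 1 + ... + (n-1) = n(n-1)/2 element moves.

```java
import java.util.Collections;
import java.util.List;

// Toy illustration of why a single sorted ArrayList insert is O(n):
// the binary search is cheap, but the positional add shifts the tail.
class SortedInsertCost {
    // Closed form of 0 + 1 + ... + (n-1): total shifts in the worst
    // case, which is where the "integral of a linear function" comes from.
    static long worstCaseMoves(long n) {
        return n * (n - 1) / 2;
    }

    // Insert a key into an already-sorted list, keeping it sorted.
    static void sortedInsert(List<String> list, String key) {
        int pos = Collections.binarySearch(list, key);
        if (pos < 0) {
            list.add(-pos - 1, key); // O(list.size()) element shift
        }
    }
}
```

For n = 1M that is roughly 5 x 10^11 element moves, which matches the "tens of minutes"
observed above.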



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
