hadoop-mapreduce-issues mailing list archives

From "Koji Noguchi (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-865) harchive: Reduce the number of open calls to _index and _masterindex
Date Tue, 18 Aug 2009 02:11:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744331#action_12744331 ]

Koji Noguchi commented on MAPREDUCE-865:

Simple testing.
Created a har archive called myarchive.har containing
/a/b/2000files/xaaaaa to xaadnj
and /a/b/2000files/2000files/xaaaaa to xaadnj

About 4500 files in total.

Without the patch,
/usr/bin/time hadoop dfs -lsr har:///user/knoguchi/myarchive.har > /dev/null          
31.72user 5.23system *1:13.19* elapsed 50%CPU (0avgtext+0avgdata 0maxresident)

with 9000 open calls to the Namenode (two per file: _masterindex and _index), plus about
4500 filestatus calls to _index (I think).

With the patch, 
23.59user 0.58system *0:22.97* elapsed 105%CPU (0avgtext+0avgdata 0maxresident)

with one _masterindex open call and five _index open calls.
Setting -Dfs.har.indexcache.num=1 increased the number of _index open calls to 10, but the
elapsed time didn't change much.

The goal of the patch is more to reduce the load/calls on the namenode than to speed up
the 'ls' commands.

Note that since the client caches the entire _masterindex and also caches each STORE (cache
range) it reads, the initial call will be slower.
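The caching described above can be sketched roughly as follows. This is a hypothetical, in-memory simplification and not the actual patch: the class names (HarIndexCache, Store), the masterindex line format, and the use of strings in place of HDFS files are all illustrative. The point it demonstrates is that once the _masterindex is parsed and each _index store range is memoized, a listing touches each store once instead of reopening the index files per archived file.

```java
import java.util.*;

// Hypothetical sketch (not the MAPREDUCE-865 patch itself): parse the
// _masterindex once, then memoize each _index "store" range as it is read.
class HarIndexCache {
    // One simulated _masterindex line: "startHash endHash startOffset endOffset".
    static class Store {
        final int startHash, endHash;
        final long begin, end;
        Store(int sh, int eh, long b, long e) {
            startHash = sh; endHash = eh; begin = b; end = e;
        }
    }

    private final List<Store> stores = new ArrayList<>();           // parsed _masterindex
    private final Map<Store, List<String>> storeCache = new HashMap<>(); // memoized _index sections
    private final String indexContents;                             // stands in for the _index file
    int indexReads = 0;                                             // counts simulated _index opens

    HarIndexCache(String masterIndex, String index) {
        this.indexContents = index;
        for (String line : masterIndex.split("\n")) {
            String[] f = line.trim().split(" ");
            if (f.length == 4) {
                stores.add(new Store(Integer.parseInt(f[0]), Integer.parseInt(f[1]),
                                     Long.parseLong(f[2]), Long.parseLong(f[3])));
            }
        }
    }

    // Look up the store covering a path's hash; only the first hit on a
    // given store actually "opens" and reads the _index section.
    List<String> lookupStore(String path) {
        int h = path.hashCode() & Integer.MAX_VALUE;
        for (Store s : stores) {
            if (h >= s.startHash && h <= s.endHash) {
                return storeCache.computeIfAbsent(s, st -> {
                    indexReads++;  // one simulated open per store, not per file
                    return Arrays.asList(
                        indexContents.substring((int) st.begin, (int) st.end).split("\n"));
                });
            }
        }
        return Collections.emptyList();
    }
}
```

With a single store covering the whole hash range, looking up every file in the archive triggers exactly one _index read, which mirrors the drop from ~9000 namenode calls to a handful reported above.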

> harchive: Reduce the number of open calls  to _index and _masterindex 
> ----------------------------------------------------------------------
>                 Key: MAPREDUCE-865
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-865
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: harchive
>            Reporter: Koji Noguchi
>            Priority: Minor
>         Attachments: mapreduce-865-0.patch
> When I have har file with 1000 files in it, 
>    % hadoop dfs -lsr har:///user/knoguchi/myhar.har/
> would open/read/close the _index/_masterindex files 1000 times.
> This makes the client slow and adds some load to the namenode as well.
> Any way to reduce this number?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
