hadoop-mapreduce-dev mailing list archives

From "Ramkumar Vadali (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1892) RaidNode can identify processed files with lesser memory usage
Date Wed, 23 Jun 2010 22:16:51 GMT
RaidNode can identify processed files with lesser memory usage
--------------------------------------------------------------

                 Key: MAPREDUCE-1892
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1892
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
          Components: contrib/raid
            Reporter: Ramkumar Vadali


The RaidNode policy file can contain overlapping policies, so a single file may be covered by
more than one policy. To avoid RAIDing a file more than once, RaidNode maintains a list of the
files it has already processed.

This is problematic because the list grows with the number of processed files, and a large
number of processed files could cause the RaidNode to run out of memory.

This task proposes a better method of detecting processed files. The method is based on the
observation that a more selective policy will have a better match with a file name than a
less selective one. Specifically, the more selective policy will have a longer common prefix
with the file name.

So to detect if a file has already been processed, the RaidNode only needs to maintain a list
of processed policies and compare the lengths of the common prefixes. If the file has a longer
common prefix with one of the processed policies than with the current policy, it can be assumed
to be processed already.
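
The comparison described above can be sketched as follows. This is only an illustration, not
the RaidNode code: the class and method names are hypothetical, each policy is assumed to be
represented by its source path prefix as a plain string, and prefixes are compared by whole
path components rather than by characters (an assumption of this sketch, so that /user/data
and /user/data2 are not treated as sharing "data").

```java
import java.util.List;

// Hypothetical helper for illustration; not part of the actual RaidNode code.
final class ProcessedPolicyCheck {

    // Splits a path into its non-empty components, e.g. "/user/data" -> ["user", "data"].
    static String[] components(String path) {
        String trimmed = path.replaceAll("^/+", "");
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/+");
    }

    // Number of leading path components that the two paths have in common.
    static int commonPrefixLength(String a, String b) {
        String[] x = components(a);
        String[] y = components(b);
        int n = Math.min(x.length, y.length);
        int i = 0;
        while (i < n && x[i].equals(y[i])) {
            i++;
        }
        return i;
    }

    // Returns true if some already-processed policy shares a longer common prefix
    // with the file than the current policy does. Assumption: the caller records
    // a policy's prefix in processedPolicyPrefixes once that policy has been handled.
    static boolean isAlreadyProcessed(String file,
                                      String currentPolicyPrefix,
                                      List<String> processedPolicyPrefixes) {
        int current = commonPrefixLength(file, currentPolicyPrefix);
        for (String processed : processedPolicyPrefixes) {
            if (commonPrefixLength(file, processed) > current) {
                return true;
            }
        }
        return false;
    }
}
```

With a processed policy on /user/data and a current policy on /user, the file /user/data/logs/a
would be treated as already processed, while /user/other/b would not. The memory held is
proportional to the number of policies rather than the number of files. Like the description
above, the sketch only compares prefix lengths; it does not separately verify that the
processed policy actually covered the file.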

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

