hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3638) Cache the iFile index files in memory to reduce seeks during map output serving
Date Fri, 12 Sep 2008 02:21:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630461#action_12630461
] 

Chris Douglas commented on HADOOP-3638:
---------------------------------------

This looks good. A few suggestions:
* In IndexCache::getIndexInformation, the write to the timestamp isn't guranteed to be [atomic|http://java.sun.com/docs/books/jls/third_edition/html/memory.html#17.7].
Moving it into the synchronized block above it, using an AtomicLong, or setting it to be volatile
should all work.
* synchronizing on the intern'd value of a string isn't safe. String maintains its pool of
intern'd Strings until the JVM shuts down, so using it here will cause each mapId to add an
entry to this pool without ever removing it. Further, depending on the semantics of intern()
for thread safety seems risky. It makes more sense to use a more conventional synchronization
strategy.
* Why is DataOutputStream imported in IFileInputStream?
* The semantics of the comparator sorting the LRU cache aren't consistent. If o1 is null and
o2 is not, o1 < o2 and o2 < o1.
* Is it necessary to copy readLong and writeLong from java.io classes? Since both are Filter\*Stream
classes, isn't it sufficient to wrap each in the appropriate Data\*Stream?
* readIndexFile doesn't need to set the timeStamp; it's set in getIndexInformation after the
call to readIndexFileToCache, which is more appropriate. If the LRU cache were replaced with
some other metric, it would make sense to pass in an IndexInformation object from cache to
readIndexFile, and fill in the timestamp in a subclass associated with the cache. IndexInformation::readIndexFile
would fill in the index table, and mapId but setting the timestamp field could be (and effectively
is) left to the cache. We're unlikely to add new caches, but this keeps the timestamp associated
with the LRU cache and not the index reader.
* In IndexCache::readIndexFileToCache, the "excess" memory is calculated behind a block synchronized
on the IndexCache instance, then that amount is freed in a block synchronized on a member
field. It's possible, even likely, that multiple threads will free more memory than is necessary.
For example, let the cache have a large, cached index in memory M0 with the lowest timestamp.
If T1..Tn call readIndexFileToCache, say T1 is the first to grab the lock on cache in IndexCache::freeIndexInformation.
It's legal for T2..Tn to get blocked in freeIndexInformation before T1 frees M0. T2..Tn will
have calculated the number of bytes to free assuming M0 was still present, which is overly
aggressive.

> Cache the iFile index files in memory to reduce seeks during map output serving
> -------------------------------------------------------------------------------
>
>                 Key: HADOOP-3638
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3638
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Devaraj Das
>            Assignee: Jothi Padmanabhan
>             Fix For: 0.19.0
>
>         Attachments: hadoop-3638-v1.patch, hadoop-3638-v2.patch
>
>
> The iFile index files can be cached in memory to reduce seeks during map output serving.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message