hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-5810) Unify mmap cache and short-circuit file descriptor cache
Date Thu, 06 Feb 2014 19:22:12 GMT

    [ https://issues.apache.org/jira/browse/HDFS-5810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893685#comment-13893685 ]

Colin Patrick McCabe commented on HDFS-5810:
--------------------------------------------

bq. I don't ask this lightly, but I really would like some patch splitting here. I see a lot
of nice refactors, but it's hard to separate out the intent of the changes. One potential
easy split is making BlockReader into a builder, since it's pretty mechanical and touches
a lot of files.

I think splitting that off would be a lot harder than you think.  Previously, the {{BlockReader}}
creation logic was all in {{DFSInputStream}}.  Pulling it out while still using the old cache APIs
would mean rewriting large parts of this patch.

Similarly, splitting off the ClientContext stuff would be difficult, since the old FD cache
was not meant to be shared between threads.  I guess we could split it off and just put PeerCache
in there, but I don't think that would reduce the size of this patch that much.

bq. ClientContext#confAsString needs to be updated to match the ClientContext members, one
dupe too.

Updated, thanks.

bq. hdfs-default.xml, the default value of stale.threshold.ms of 3000000 does not match the
DFSConfigKeys default value of 1800000.

Fixed, thanks.

bq. SCReplica#Key is almost identical to FsDatasetCache#Key, but I guess that's okay unless
you want to make a shared BlockKey class.

It's a pretty small class, and we might want to add something to the key in the future, so
I think we should probably keep it as-is.
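
For the record, a shared key would probably just be (block ID, block pool ID).  Here is a minimal
sketch of what such a class might look like, assuming those are the only two fields we care about;
this is illustrative only and not part of the patch:

{code:java}
// Hypothetical shared BlockKey -- not in this patch; shown only to
// illustrate how small the class is.  Assumes a replica is identified
// by (blockId, block pool id).
public final class BlockKey {
  private final long blockId;
  private final String bpid;

  public BlockKey(long blockId, String bpid) {
    this.blockId = blockId;
    this.bpid = bpid;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof BlockKey)) {
      return false;
    }
    BlockKey other = (BlockKey)o;
    return (blockId == other.blockId) && bpid.equals(other.bpid);
  }

  @Override
  public int hashCode() {
    return 31 * Long.valueOf(blockId).hashCode() + bpid.hashCode();
  }
}
{code}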

bq. How did you decide on the default mmap cache size of 1024? I feel like since we default
to 4096 max datanode transfer threads, that'd be a more appropriate number. If the default
FD ulimit of 1024 is your concern, seems like we should lower it below that for some headroom.

That's a very good question.  I guess the answer is that when we did HDFS-4953, {{dfs.client.mmap.cache.size}}
got set to 1024 as a nice round number.  I think you are right that, with the default file descriptor
ulimit of 1024, it will cause problems with clients running out of file descriptors.  It is a
shame to lower the default, but I think that's what we have to do to avoid a poor user experience.

bq. These caches right now expire purely on a time basis, not actually based on pressure like,
say, the OS page cache. For the mmap cache at least, how about a hybrid: have a really long
timeout, but make eviction also pressure based? The danger of caching forever seems to be
holding on to a "bad fd" (e.g. DN deletes the file), but with the forthcoming shm communication
segment this is mitigated.

I guess the argument in favor of a long value for {{dfs.client.read.shortcircuit.streams.cache.expiry.ms}}
and {{dfs.client.mmap.cache.timeout.ms}} is that:
* applications should plan for max fd usage anyway, so keeping the cache closer to full for
a while should not be a problem;
* now that the cache is shared between input streams, a longer timeout will help more threads;
* now that we don't open multiple fds for the same replica, there will be less fd pressure
anyway;
* the kernel memory consumed by open fds is minimal, and the page cache evicts mostly based on
usage, not on whether a file is open.

The argument against is:
* an application may stop using short-circuit reads for a while (or indeed stop using HDFS entirely)
and want those fds back.  We have no way for the application to request them back right now.

Staleness I see as a separate issue.  We have two strategies for handling staleness: timeout-based
(for talking to pre-HDFS-5182 DataNodes) and shm-based (for talking to newer DataNodes that
have HDFS-5182).

On balance, I agree with your argument.  Let's lengthen these timeouts, and bump up {{dfs.client.read.shortcircuit.streams.cache.size}}
a little bit.  On the other hand, the default for {{dfs.client.mmap.cache.size}} needs to
be reduced a bit to avoid overflowing the default fd ulimit.
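
To make the effect concrete, here is a sketch of a client overriding these keys through the
{{Configuration}} API.  The numbers are placeholders illustrating the direction (longer expiry,
smaller mmap cache), not the defaults we will ship:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfSketch {
  public static Configuration exampleConf() {
    Configuration conf = new Configuration();
    // Keep short-circuit fds around longer, and cache a few more of them.
    // (Placeholder values, not the patch's defaults.)
    conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms",
        30L * 60 * 1000);
    conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 512);
    // Keep the mmap cache comfortably below the common fd ulimit of 1024.
    conf.setInt("dfs.client.mmap.cache.size", 256);
    conf.setLong("dfs.client.mmap.cache.timeout.ms", 60L * 60 * 1000);
    return conf;
  }
}
{code}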

bq. s/mmaped/mmapped/g, good to be consistent. munmapped too, and case insensitive.

Fixed

bq. Can use java's built-in TimeUnit to convert rather than the new nanosecondsToMilliseconds
function

OK
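
For reference, the built-in conversion is a one-liner (trivial sketch; {{nanosecondsToMilliseconds}}
is the helper it replaces):

{code:java}
import java.util.concurrent.TimeUnit;

public class NanosToMillisSketch {
  // Same result as the hand-rolled nanosecondsToMilliseconds helper.
  public static long nanosToMillis(long nanos) {
    return TimeUnit.NANOSECONDS.toMillis(nanos);
  }
}
{code}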

bq. In CacheCleaner, rather than a while loop, why not a for each? It'll use the iterator
order, which should be ascending insertion time.

Iterators could get tricky here, since we're calling functions that modify the eviction maps.
I'm not sure what the semantics are when a function you call removes the element the iterator
is currently positioned on.  It seems safer just to call {{firstEntry}}.  The current construction
also lets us refactor this function later to release the lock between evictions, if need be.

bq. Could probably split out the two eviction loops into a single parameterized function.

I feel like there would be too many parameters.  The mmap loop does slightly different stuff...

bq. ref, unref, fetchOrCreate javadoc say you need to hold the lock, but they themselves take
the cache lock.

Oops-- that's outdated javadoc.  Fixed.

bq. close, the try/finally lock idiom takes the lock inside the try, normally this is outside?

Fixed.
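
For reference, the standard idiom (trivial sketch):

{code:java}
import java.util.concurrent.locks.ReentrantLock;

public class LockIdiomSketch {
  private final ReentrantLock lock = new ReentrantLock();

  void doWork() {
    lock.lock();   // acquire outside the try ...
    try {
      // ... so that if lock() itself throws, we never reach the
      // finally block and try to unlock a lock we don't hold.
    } finally {
      lock.unlock();
    }
  }
}
{code}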

bq. Propagating a string tag for trace logging is ugly. Can we walk the stack instead to get
the caller, e.g. via ExceptionUtils#getFullStackTrace or other? I'm okay paying the stack
construction cost if it's only for trace logging.

This was something I needed during debugging.  For example, for {{unref}}, I needed to know
where the ultimate source was -- {{BlockReaderLocal#close}} or {{ClientMmap#unref}}.  You
can't get that just by looking at the top of the stack -- that will only tell you that the immediate
caller was {{ShortCircuitCache#purge}} or {{ShortCircuitReplica#unref}}.  Maybe there's a better
way to do this... but I can't think of it right now.  I guess we could dump out the whole stack,
but that would make the traces a lot harder to read.

bq. allocMmap, no longer a toClose param mentioned in the javadoc

Fixed.

bq. unref, we could get rid of newRefCount, it doesn't really improve readability

OK

bq. ShortCircuitReplica#unref calls ShortCircuitCache#unref, but there's no purge. Will this
precondition abort?

The refCount can never be at 0 unless the replica has been purged.  Purging means that the
cache no longer contains the replica (and hence does not hold a reference to it).

On the other hand, purged replicas may have a refCount > 0 if someone is still using them.
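
In other words, the precondition is checking roughly this invariant (a sketch of the rule, not
the patch's exact code):

{code:java}
import com.google.common.base.Preconditions;

public class RefCountInvariantSketch {
  private boolean purged;
  private int refCount;

  /**
   * While the replica is still in the cache (not purged), the cache
   * itself holds one reference, so unref can never take the count to 0.
   * Once purged, the count may drop to 0, and the final unref is what
   * actually closes the replica.
   */
  void unref() {
    Preconditions.checkState(refCount > 0, "double unref");
    Preconditions.checkState(purged || (refCount > 1),
        "unpurged replica would drop to refCount 0");
    refCount--;
  }
}
{code}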

bq. Some references to eviction "lists" but they look to be maps now

Hmm.  Quite a few, actually.  Fixed.

> Unify mmap cache and short-circuit file descriptor cache
> --------------------------------------------------------
>
>                 Key: HDFS-5810
>                 URL: https://issues.apache.org/jira/browse/HDFS-5810
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>    Affects Versions: 2.3.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-5810.001.patch, HDFS-5810.004.patch, HDFS-5810.006.patch, HDFS-5810.008.patch,
HDFS-5810.015.patch
>
>
> We should unify the client mmap cache and the client file descriptor cache.  Since mmaps
are created from file descriptors held in the cache (currently FileInputStreamCache),
the two have to be tracked together to do "smarter" things like HDFS-5182.



