hadoop-user mailing list archives

From "stu24mail@yahoo.com" <stu24m...@yahoo.com>
Subject Re: fs cache giving me headaches
Date Mon, 06 Aug 2012 18:18:58 GMT
Oops! I replied too early in the morning for me!
You're not confused about close vs closeAll.

You're confused about the fact that FileSystem is a sort of hybrid singleton: by default,
with performance and memory in mind, it hands back a shared, cached instance, but it lets you
force a new instance for, say, multithreaded programs. Notice the "newInstance" in your second
snippet, which is not in your first.

This is a trade-off between performance and conceptual clarity, which is often the hardest
part of API design. I think HDFS did pretty well here - I/O will always be the bottleneck,
especially with rotational media.

Take care,

----- Reply message -----
From: "Koert Kuipers" <koert@tresata.com>
To: <user@hadoop.apache.org>
Subject: fs cache giving me headaches
Date: Mon, Aug 6, 2012 10:32 am
---------- Forwarded message ----------
From: "Koert Kuipers" <koert@tresata.com>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches

To:  <common-user@hadoop.apache.org>
nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable writes code like this:

final FileSystem fs = FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all sorts of weird errors
where FileSystems in unrelated code (sometimes not even my code) started misbehaving and streams
were unexpectedly shut. Then i realized that FileSystem uses a cache and close() closes it
for everyone! Not pretty in my opinion, but i can live with it. So i checked other code and
found that basically nobody closes FileSystems. Apparently the expected way of using FileSystems
is to simply never close them. So i adopted this approach (which i think is really contrary
to java conventions for a Closeable).
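
To make the shared-cache behavior described above concrete, here is a minimal sketch in plain Java. It is a toy model, not Hadoop's implementation: ToyFs, its get/newInstance methods and the cache key are all hypothetical stand-ins, chosen only to mirror the shape of FileSystem.get(conf) versus FileSystem.newInstance(conf).

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a cached, closeable handle. NOT Hadoop's FileSystem;
// every name in this sketch is hypothetical.
final class ToyFs implements Closeable {
    private static final Map<String, ToyFs> CACHE = new HashMap<>();

    private final String key;
    private volatile boolean open = true;

    private ToyFs(String key) {
        this.key = key;
    }

    // Cached factory: the same key hands back the same shared instance.
    static synchronized ToyFs get(String key) {
        return CACHE.computeIfAbsent(key, ToyFs::new);
    }

    // Uncached factory: always builds a private instance.
    static ToyFs newInstance(String key) {
        return new ToyFs(key);
    }

    boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        open = false;
        synchronized (ToyFs.class) {
            // Evict only if the cache still maps the key to this very object.
            CACHE.remove(key, this);
        }
    }
}

final class SharedCacheDemo {
    public static void main(String[] args) {
        ToyFs mine = ToyFs.get("hdfs://nn");
        ToyFs theirs = ToyFs.get("hdfs://nn"); // unrelated code, same key
        mine.close();
        System.out.println(theirs.isOpen()); // false: same object as "mine"

        ToyFs priv = ToyFs.newInstance("hdfs://nn");
        ToyFs shared = ToyFs.get("hdfs://nn");
        priv.close();
        System.out.println(shared.isOpen()); // true: private close is local
    }
}
```

In this model, closing a handle obtained through the cached factory shuts it for every other holder of the same key, while closing a private instance affects nobody else.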

Lately i started working on some code for a daemon/server where many FileSystem objects are
created for different users (UGIs) that use the service. As it turns out other projects have
run into trouble with the FileSystem cache in situations like this (for example, Scribe and
Hoop). I imagine the cache can get very large and cause problems (i have not tested this myself).

Looking at the code for Hoop i noticed they simply turned off the FileSystem cache and made
sure to close every FileSystem. So here the suggested approach to deal with FileSystems seems
to be:
final FileSystem fs = FileSystem.newInstance(conf); // or FileSystem.get(conf) with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any cache size limitations.
However if i adopt this approach i basically can not re-use any existing code or libraries
that do not close FileSystems, splitting the codebase into two which is pretty ugly. And this
code is not efficient in situations where there are very few used FileSystem objects and a
cache would improve performance, so the split works both ways.

In short, there is no single way to code with FileSystem that works in both situations! Ideally
i would have liked fs.close() to do the right thing depending on the settings: if the cache is
turned off it closes the FileSystem, and if it is turned on it's a no-op. That way i could always
use FileSystem.get(conf) and always close my filesystems, and the code would be usable irrespective
of whether the cache is turned on or off.
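
The close semantics wished for above could be approximated with a small wrapper. This is only a sketch under assumptions: CacheAwareHandle is a hypothetical class, not a Hadoop API, and it leaves the decision about whether the wrapped resource is shared to the caller (in real Hadoop code that decision would presumably follow the per-scheme cache setting in the conf).

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical wrapper, not part of Hadoop: close() only reaches the wrapped
// resource when the caller has said the resource is not shared via a cache.
final class CacheAwareHandle<T extends Closeable> implements Closeable {
    private final T resource;
    private final boolean shared;

    CacheAwareHandle(T resource, boolean shared) {
        this.resource = resource;
        this.shared = shared;
    }

    T resource() {
        return resource;
    }

    @Override
    public void close() throws IOException {
        if (!shared) {
            resource.close(); // private instance: really close it
        }
        // shared instance: deliberately a no-op
    }
}

final class CacheAwareDemo {
    public static void main(String[] args) throws IOException {
        final int[] closes = {0};
        Closeable counted = () -> closes[0]++; // counts real close() calls

        try (CacheAwareHandle<Closeable> h = new CacheAwareHandle<>(counted, true)) {
            h.resource(); // use the shared resource here
        }
        System.out.println(closes[0]); // 0: shared, so close() was a no-op

        try (CacheAwareHandle<Closeable> h = new CacheAwareHandle<>(counted, false)) {
            h.resource(); // use the private resource here
        }
        System.out.println(closes[0]); // 1: private, so it was really closed
    }
}
```

With something like this, calling code can always use try/finally (or try-with-resources) and always call close(), and only the place that obtains the resource needs to know whether caching is on.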

Any insights or suggestions? Thanks!