hadoop-user mailing list archives

From Koert Kuipers <ko...@tresata.com>
Subject Re: fs cache giving me headaches
Date Tue, 07 Aug 2012 15:15:33 GMT
The problem with FileSystem.closeAllForUGI(ugi) for me is that a server can
be multi-threaded, and a user could be doing multiple requests at the same
time, so if i used closeAllForUGI, isn't there a risk of shutting down the
other requests for the same user?
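
One possible way around that risk, sketched here without any Hadoop
dependency: count the in-flight requests per user and run the cleanup only
when the last one finishes. UserRequestTracker, its method names and the
String user key are all hypothetical stand-ins; in a real server the cleanup
callback is where FileSystem.closeAllForUGI(ugi) would presumably go.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical helper, not part of Hadoop: counts in-flight requests per
// user and runs a cleanup callback only when the last one finishes.
final class UserRequestTracker {
    private final Map<String, Integer> inFlight = new HashMap<>();
    private final Consumer<String> cleanup;

    UserRequestTracker(Consumer<String> cleanup) {
        this.cleanup = cleanup;
    }

    // Call when a request for this user starts.
    synchronized void begin(String user) {
        inFlight.merge(user, 1, Integer::sum);
    }

    // Call in a finally block when the request ends.
    synchronized void end(String user) {
        Integer count = inFlight.get(user);
        if (count == null) {
            throw new IllegalStateException("end() without begin() for " + user);
        }
        if (count == 1) {
            inFlight.remove(user);
            // Still holding the lock, so no request for any user can
            // begin while the cleanup runs.
            cleanup.accept(user);
        } else {
            inFlight.put(user, count - 1);
        }
    }

    // Number of requests currently in flight for this user.
    synchronized int inFlightCount(String user) {
        return inFlight.getOrDefault(user, 0);
    }
}
```

Running the cleanup under the lock keeps a new request from picking up an
instance that is about to be closed, at the cost of briefly blocking every
other user; a per-user lock would be finer-grained but is left out of this
sketch.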

On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <daryn@yahoo-inc.com> wrote:

> Yes, the implementation of fs.close() leaves something to be desired.
>  There's actually been debate in the past about close being a no-op for a
> cached fs, but the idea was rejected by the majority of people.
> In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end
> of a request to flush all the fs cache entries for the ugi.  You'll get the
> benefit of the cache during execution of the request, and be able to close
> the cached fs instances to prevent memory leaks. I hope this helps!
> Daryn
> On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:
> ---------- Forwarded message ----------
> From: "Koert Kuipers" <koert@tresata.com>
> Date: Aug 4, 2012 1:54 PM
> Subject: fs cache giving me headaches
> To: <common-user@hadoop.apache.org>
> nothing has confused me as much in hadoop as FileSystem.close().
> any decent java programmer that sees that an object implements Closeable
> writes code like this:
> final FileSystem fs = FileSystem.get(conf);
> try {
>     // do something with fs
> } finally {
>     fs.close();
> }
> so i started out using hadoop FileSystem like this, and i ran into all
> sorts of weird errors where FileSystems in unrelated code (sometimes not
> even my code) started misbehaving and streams were unexpectedly shut. Then
> i realized that FileSystem uses a cache and close() closes it for everyone!
> Not pretty in my opinion, but i can live with it. So i checked other code
> and found that basically nobody closes FileSystems. Apparently the expected
> way of using FileSystems is to simply never close them. So i adopted this
> approach (which i think is really contrary to java conventions for a
> Closeable).
> Lately i started working on some code for a daemon/server where many
> FileSystems objects are created for different users (UGIs) that use the
> service. As it turns out other projects have run into trouble with the
> FileSystem cache in situations like this (for example, Scribe and Hoop). I
> imagine the cache can get very large and cause problems (i have not tested
> this myself).
> Looking at the code for Hoop i noticed they simply turned off the
> FileSystem cache and made sure to close every FileSystem. So here the
> suggested approach to deal with FileSystems seems to be:
> // or FileSystem.get(conf) but with caching turned off in the conf
> final FileSystem fs = FileSystem.newInstance(conf);
> try {
>     // do something with fs
> } finally {
>     fs.close();
> }
> This code bypasses the cache if i understand it correctly, avoiding any
> cache size limitations. However if i adopt this approach i basically can
> not re-use any existing code or libraries that do not close FileSystems,
> splitting the codebase into two which is pretty ugly. And this code is not
> efficient in situations where there are very few used FileSystem objects
> and a cache would improve performance, so the split works both ways.
> In short, there is no single way to code with FileSystem that works in
> both situations! Ideally i would have liked fs.close() to do the right
> thing depending on the settings: if cache is turned off it closes the
> FileSystem, and if it is turned on it's a no-op. That way i could always use
> FileSystem.get(conf) and always close my filesystems, and the code would be
> usable irrespective of whether the cache is turned on or off.
> Any insights or suggestions? Thanks!
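
The sharing problem described in the original message can be reproduced with
a toy model that needs no Hadoop at all. ToyFs below is a hypothetical
stand-in, not Hadoop's FileSystem; it only mimics the get()/newInstance()/
close() pattern as the message describes it, so the details of the real
cache may differ.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a cached, closeable resource. It is NOT Hadoop's
// FileSystem; it only mimics the get()/newInstance()/close() pattern.
final class ToyFs implements AutoCloseable {
    private static final Map<String, ToyFs> CACHE = new HashMap<>();
    private final String key;
    private boolean open = true;

    private ToyFs(String key) {
        this.key = key;
    }

    // Cached lookup: every caller with the same key shares one instance.
    static synchronized ToyFs get(String key) {
        return CACHE.computeIfAbsent(key, ToyFs::new);
    }

    // Uncached: the caller owns the instance and must close it.
    static ToyFs newInstance(String key) {
        return new ToyFs(key);
    }

    synchronized boolean isOpen() {
        return open;
    }

    @Override
    public void close() {
        synchronized (ToyFs.class) {
            // Evict only if the cache still maps the key to this instance.
            CACHE.remove(key, this);
        }
        synchronized (this) {
            open = false;
        }
    }
}
```

Closing an instance obtained through get() closes it for every other holder
of the same key, while an instance from newInstance() can be closed without
affecting anyone else.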
