Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: neutral (nike.apache.org: 216.145.54.173 is neither permitted
 nor denied by domain of daryn@yahoo-inc.com)
From: Daryn Sharp <daryn@yahoo-inc.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Tue, 7 Aug 2012 08:25:29 -0700
Subject: Re: fs cache giving me headaches
Thread-Topic: fs cache giving me headaches
Thread-Index: Ac10sODo4hqbxzFVS3a+tZLAf37siw==
Message-ID: <FFD58989-E2CE-4063-9942-41EB57A619F2@yahoo-inc.com>
References: 
 <CANx3uAiK5BuPFZpPKchJBz68N3yyWsMQe89bD+bfTpS7Qy=qUA@mail.gmail.com>
 <CANx3uAhWyD3mi4wyNFyyUmMTibA2jdJJ6aS8BF8MGaEMYbFcBQ@mail.gmail.com>
 <1DA607DA-E9D9-43E5-B93F-654C1AA090BE@yahoo-inc.com>
 <CANx3uAj1jLQmAdZ9uX1wTBtTL-SQmk6qg1gFdD7Tqvu=tenNcg@mail.gmail.com>
In-Reply-To: 
 <CANx3uAj1jLQmAdZ9uX1wTBtTL-SQmk6qg1gFdD7Tqvu=tenNcg@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
acceptlanguage: en-US
Content-Type: multipart/alternative;
	boundary="_000_FFD58989E2CE4063994241EB57A619F2yahooinccom_"
MIME-Version: 1.0

--_000_FFD58989E2CE4063994241EB57A619F2yahooinccom_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

There is no UGI caching, so each request will receive a unique UGI even for=
 the same user.  Thus you can safely call FileSystem.closeAllForUGI(ugi) wh=
en the request is complete.  If however you spin off threads that continue =
to use the UGI even after the request is completed, then you'll have to det=
ermine for yourself when it's safe to close the filesystems.
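
To make the reasoning concrete, here is a toy model in plain Java, not
Hadoop's actual code. It assumes the cache is keyed on the filesystem URI
plus the identity of the UGI object, which is the behaviour described
above. Ugi, Fs, Key, closeAllForUgi and the hdfs://nn URI are all made-up
stand-ins for illustration.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class UgiCacheSketch {
    /** Hypothetical stand-in for a UGI; compared by object identity. */
    static final class Ugi {
        final String user;
        Ugi(String user) { this.user = user; }
    }

    /** Hypothetical stand-in for a FileSystem; tracks open/closed only. */
    static final class Fs {
        private volatile boolean closed;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    /** Cache key (assumption): URI plus the identity of the UGI object. */
    static final class Key {
        final String uri;
        final Ugi ugi;
        Key(String uri, Ugi ugi) { this.uri = uri; this.ugi = ugi; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return uri.equals(k.uri) && ugi == k.ugi;
        }
        @Override public int hashCode() {
            return 31 * uri.hashCode() + System.identityHashCode(ugi);
        }
    }

    private final Map<Key, Fs> cache = new ConcurrentHashMap<>();

    /** Returns the cached instance for (uri, ugi), creating it if absent. */
    Fs get(String uri, Ugi ugi) {
        return cache.computeIfAbsent(new Key(uri, ugi), k -> new Fs());
    }

    /** Closes and evicts only the entries created for this exact UGI. */
    void closeAllForUgi(Ugi ugi) {
        Iterator<Map.Entry<Key, Fs>> it = cache.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Key, Fs> e = it.next();
            if (e.getKey().ugi == ugi) {
                e.getValue().close();
                it.remove();
            }
        }
    }

    int size() { return cache.size(); }

    public static void main(String[] args) {
        UgiCacheSketch cache = new UgiCacheSketch();
        // Two concurrent requests from the same user get distinct UGIs.
        Ugi requestA = new Ugi("koert");
        Ugi requestB = new Ugi("koert");
        Fs fsA = cache.get("hdfs://nn", requestA);
        Fs fsB = cache.get("hdfs://nn", requestB);
        cache.closeAllForUgi(requestA);   // request A has finished
        // prints: true false 1
        System.out.println(
            fsA.isClosed() + " " + fsB.isClosed() + " " + cache.size());
    }
}
```

In this model, finishing request A closes only A's filesystem; request B,
although it belongs to the same user, keeps its own open instance.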

I've been kicking around a few ways to transparently close cached filesyste=
ms for a ugi when that ugi goes out of scope.  I should probably file a jir=
a (if it stops going down) for discussion.

Daryn


On Aug 7, 2012, at 10:15 AM, Koert Kuipers wrote:

Daryn,
The problem with FileSystem.closeAllForUGI(ugi) for me is that a server =
can be multi-threaded, and a user could be making multiple requests at =
the same time, so if i used closeAllForUGI, isn't there a risk of =
shutting down the other requests for the same user?

On Mon, Aug 6, 2012 at 2:52 PM, Daryn Sharp <daryn@yahoo-inc.com<mailto:dar=
yn@yahoo-inc.com>> wrote:
Yes, the implementation of fs.close() leaves something to be desired.  Ther=
e's actually been debate in the past about close being a no-op for a cached=
 fs, but the idea was rejected by the majority of people.

In the server case, you can use FileSystem.closeAllForUGI(ugi) at the end o=
f a request to flush all the fs cache entries for the ugi.  You'll get the =
benefit of the cache during execution of the request, and be able to close =
the cached fs instances to prevent memory leaks. I hope this helps!

Daryn


On Aug 6, 2012, at 12:32 PM, Koert Kuipers wrote:

---------- Forwarded message ----------
From: "Koert Kuipers" <koert@tresata.com<mailto:koert@tresata.com>>
Date: Aug 4, 2012 1:54 PM
Subject: fs cache giving me headaches
To: <common-user@hadoop.apache.org<mailto:common-user@hadoop.apache.org>>

nothing has confused me as much in hadoop as FileSystem.close().
any decent java programmer that sees that an object implements Closeable =
writes code like this:
final FileSystem fs =3D FileSystem.get(conf);
try {
    // do something with fs
} finally {
    fs.close();
}

so i started out using hadoop FileSystem like this, and i ran into all =
sorts of weird errors where FileSystems in unrelated code (sometimes not =
even my code) started misbehaving and streams were unexpectedly shut. =
Then i realized that FileSystem uses a cache and close() closes it for =
everyone! Not pretty in my opinion, but i can live with it. So i checked =
other code and found that basically nobody closes FileSystems. =
Apparently the expected way of using FileSystems is to simply never =
close them. So i adopted this approach (which i think is really contrary =
to java conventions for a Closeable).

Lately i started working on some code for a daemon/server where many =
FileSystem objects are created for different users (UGIs) that use the =
service. As it turns out, other projects have run into trouble with the =
FileSystem cache in situations like this (for example, Scribe and Hoop). =
I imagine the cache can get very large and cause problems (i have not =
tested this myself).

Looking at the code for Hoop i noticed they simply turned off the FileSyste=
m cache and made sure to close every FileSystem. So here the suggested appr=
oach to deal with FileSystems seems to be:
final FileSystem fs =3D FileSystem.newInstance(conf); // or FileSystem.get(=
conf) but with caching turned off in the conf
try {
    // do something with fs
} finally {
    fs.close();
}

This code bypasses the cache if i understand it correctly, avoiding any =
cache size limitations. However, if i adopt this approach i basically =
cannot re-use any existing code or libraries that do not close =
FileSystems, splitting the codebase into two, which is pretty ugly. And =
this code is not efficient in situations where there are very few used =
FileSystem objects and a cache would improve performance, so the split =
works both ways.
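
The difference between the two patterns above can be sketched with a toy
model in plain Java. This is not Hadoop code: Fs, get and newInstance are
made-up stand-ins that only mimic "one shared instance per key" versus
"a fresh instance per call", and hdfs://nn is a placeholder URI.

```java
import java.util.HashMap;
import java.util.Map;

final class SharedFsSketch {
    /** Hypothetical stand-in for a FileSystem; tracks open/closed only. */
    static final class Fs {
        private boolean closed;
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    private static final Map<String, Fs> CACHE = new HashMap<>();

    /** Models the cached lookup: every caller shares one instance. */
    static synchronized Fs get(String uri) {
        return CACHE.computeIfAbsent(uri, k -> new Fs());
    }

    /** Models the uncached lookup: every caller gets its own instance. */
    static Fs newInstance(String uri) {
        return new Fs();
    }

    public static void main(String[] args) {
        Fs mine = get("hdfs://nn");
        Fs theirs = get("hdfs://nn");   // unrelated code, same cache key
        mine.close();
        // prints: shared handle closed: true
        System.out.println("shared handle closed: " + theirs.isClosed());

        Fs a = newInstance("hdfs://nn");
        Fs b = newInstance("hdfs://nn");
        a.close();
        // prints: private handle closed: false
        System.out.println("private handle closed: " + b.isClosed());
    }
}
```

With the shared instance, one caller's close() is visible to everyone
holding that instance; with private instances, each caller has to close
its own, which is why code written for one style does not mix well with
code written for the other.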

In short, there is no single way to code with FileSystem that works in =
both situations! Ideally i would have liked fs.close() to do the right =
thing depending on the settings: if the cache is turned off it closes the =
FileSystem, and if it is turned on it's a no-op. That way i could always =
use FileSystem.get(conf) and always close my filesystems, and the code =
would be usable irrespective of whether the cache is turned on or off.

Any insights or suggestions? Thanks!


--_000_FFD58989E2CE4063994241EB57A619F2yahooinccom_--