From: Jean-Marc Spaggiari
Date: Thu, 28 Mar 2013 12:02:26 -0400
Subject: Re: Auto clean DistCache?
To: user@hadoop.apache.org

Thanks Harsh.

My issue was not related to the number of files/folders but to the
total size of the DistributedCache. The directory where it's stored
only has 7GB available... So I will set the limit to 5GB with
local.cache.size, or move it to the drives where I have the dfs files
stored.

Thanks,

JM
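For reference, the 5GB cap would look something like the snippet below in
mapred-site.xml on each TaskTracker node. This is only a sketch based on the
properties named in this thread (Hadoop 1.x): local.cache.size is measured in
bytes, and the mapred.local.dir entry just illustrates the "move it to the
bigger drives" alternative, with placeholder paths.

<configuration>
  <!-- Cap the local distributed cache at 5 GB (value in bytes; default is 10 GB). -->
  <property>
    <name>local.cache.size</name>
    <value>5368709120</value>
  </property>

  <!-- Alternative: put the MapReduce local dirs (which hold the distcache
       subdirectories) on the larger drives. Paths are hypothetical examples. -->
  <property>
    <name>mapred.local.dir</name>
    <value>/data/1/mapred/local,/data/2/mapred/local</value>
  </property>
</configuration>

The TaskTracker would need a restart to pick this up, and as Koji points out
below, an external cleanup script is risky because the TaskTracker keeps its
dist cache entries in memory, so the size cap is the safer route.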
2013/3/28 Harsh J :
> The DistributedCache is cleaned automatically and no user intervention
> (aside from size limitation changes, which may be an administrative
> requirement) is generally required to delete the older distributed
> cache files.
>
> This is observable in code and is also noted in TDG, 2ed.:
>
> Tom White:
> """
> The tasktracker also maintains a reference count for the number of
> tasks using each file in the cache. Before the task has run, the
> file's reference count is incremented by one; then after the task has
> run, the count is decreased by one. Only when the count reaches zero
> is it eligible for deletion, since no tasks are using it. Files are
> deleted to make room for a new file when the cache exceeds a certain
> size (10 GB by default). The cache size may be changed by setting the
> configuration property local.cache.size, which is measured in bytes.
> """
>
> Also, the maximum number of allowed dirs is checked automatically
> today, so as not to violate the OS's limits.
>
> On Wed, Mar 27, 2013 at 7:07 PM, Jean-Marc Spaggiari wrote:
>> Oh! Good to know! It keeps track even of month-old entries??? There is no TTL?
>>
>> I was not able to find the documentation for local.cache.size or
>> mapreduce.tasktracker.cache.local.size in the 1.0.x branch. Do you know
>> where I can find that?
>>
>> Thanks,
>>
>> JM
>>
>> 2013/3/27 Koji Noguchi :
>>>> Else, I will go for a custom script to delete all directories (and
>>>> content) older than 2 or 3 days...
>>>>
>>> TaskTracker (or NodeManager in 2.*) keeps the list of dist cache entries
>>> in memory. So if an external process (like your script) starts deleting
>>> dist cache files, there will be an inconsistency and you'll start seeing
>>> task initialization failures due to "no file found" errors.
>>>
>>> Koji
>>>
>>>
>>> On Mar 26, 2013, at 9:00 PM, Jean-Marc Spaggiari wrote:
>>>
>>>> For the situation I faced, it was really a disk space issue, not related
>>>> to the number of files. It was writing on a small partition.
>>>>
>>>> I will try with local.cache.size or
>>>> mapreduce.tasktracker.cache.local.size to see if I can keep the final
>>>> total size under 5GB... Else, I will go for a custom script to
>>>> delete all directories (and content) older than 2 or 3 days...
>>>>
>>>> Thanks,
>>>>
>>>> JM
>>>>
>>>> 2013/3/26 Abdelrahman Shettia :
>>>>> Let me clarify: if there are lots of files or directories (up to 32K,
>>>>> depending on the OS's per-user file limits) in those distributed cache
>>>>> dirs, the OS will not be able to create any more files/dirs, so M-R jobs
>>>>> won't get initiated on those tasktracker machines. Hope this helps.
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>> On Tue, Mar 26, 2013 at 1:44 PM, Vinod Kumar Vavilapalli wrote:
>>>>>>
>>>>>> All the files are not opened at the same time ever, so you shouldn't
>>>>>> see any "# of open files exceeds" error.
>>>>>>
>>>>>> Thanks,
>>>>>> +Vinod Kumar Vavilapalli
>>>>>> Hortonworks Inc.
>>>>>> http://hortonworks.com/
>>>>>>
>>>>>> On Mar 26, 2013, at 12:53 PM, Abdelrhman Shettia wrote:
>>>>>>
>>>>>> Hi JM,
>>>>>>
>>>>>> Actually these dirs need to be purged by a script that keeps the last
>>>>>> 2 days' worth of files, otherwise you may run into a "# of open files
>>>>>> exceeds" error.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Mar 25, 2013, at 5:16 PM, Jean-Marc Spaggiari wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Each time my MR job is run, a directory is created on the TaskTracker
>>>>>> under mapred/local/taskTracker/hadoop/distcache (based on my
>>>>>> configuration).
>>>>>>
>>>>>> I looked at the directory today, and it's hosting thousands of
>>>>>> directories and more than 8GB of data there.
>>>>>>
>>>>>> Is there a way to automatically delete this directory when the job is
>>>>>> done?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> JM
>>>>>>
>>>>>
>>>
>
>
> --
> Harsh J