Subject: Re: reuse cached files
From: Hemanth Yamijala
To: common-user@hadoop.apache.org
Date: Tue, 3 Aug 2010 10:03:55 +0530

Hi,

> I am actually doing some tests to see the performance. I want to eliminate the
> interference of the distributed cache. I find there is a method in the API to purge
> the cache. That might be what I want.

So, you want to run multiple versions of a job (possibly with different job
parameters) and measure them relative to one another. Is that correct?

I can think of some options:

- Is it possible not to use the distributed cache at all? You could bundle
the files along with the job jar instead.
- You could run the job on fresh cluster instances (a costlier option,
nevertheless).
- You could change the timestamps of the distributed cache files on DFS
before each invocation of the job. This makes Hadoop believe the files have
changed, which causes the distributed cache to fetch them again.

The purgeCache API you are seeing is specific to the MapReduce framework's
internals. It is *not* meant to be used by client code and is not guaranteed
to work.
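To sketch the timestamp option above: one way to refresh a file's
modification time on DFS without re-uploading it is the FileSystem.setTimes
API. This is only a sketch, and the cache path below is a hypothetical
example; adjust it to wherever your job registers its cache files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RefreshCacheTimestamp {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path to a file registered with the distributed cache.
    Path cacheFile = new Path("/user/me/cache/lookup.dat");

    // Bump the modification time to "now"; -1 for the access time
    // leaves it unchanged. On the next job submission, the framework
    // sees a newer timestamp and re-fetches the file to the task nodes.
    fs.setTimes(cacheFile, System.currentTimeMillis(), -1);
  }
}
```

Run this between job invocations; it defeats the cache reuse without the
cost of copying the file back into DFS.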
In later versions of Hadoop (0.21 and trunk), the purgeCache methods have
been deprecated in the public API and will be removed altogether.

Thanks
Hemanth

>
> Thanks,
> -Gang
>
>
> ----- Original Message ----
> From: Hemanth Yamijala
> To: common-user@hadoop.apache.org
> Sent: 2010/8/2 (Mon) 12:56:25 AM
> Subject: Re: reuse cached files
>
> Hi,
>
>> Thanks Hemanth. Is there any way to invalidate the reuse and ask Hadoop to
>> resend exactly the same files to the cache for every job?
>
> I may be able to answer this better if I understand the use case. If
> you need the same files for every job, why would you need to send them
> afresh each time? If something is cached, it can be reused, no? I am
> sure I must be missing something in your requirement ...
>
> Thanks
> Hemanth