Subject: Re: MapReduce processing with extra (possibly non-serializable) configuration
From: Public Network Services <publicnetworkservices@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 21 Feb 2013 20:09:32 -0800

I have considered the DistributedCache and will probably be using it, but in
order to have a file to cache I need to serialize the configuration object
first. :-)

On Thu, Feb 21, 2013 at 5:55 PM, feng lu wrote:
> Hi
>
> Maybe you can look at the usage of DistributedCache [0]. It's a facility
> provided by the MR framework to cache files (text, archives, jars, etc.)
> needed by applications.
>
> [0] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
>
> On Fri, Feb 22, 2013 at 5:10 AM, Public Network Services <
> publicnetworkservices@gmail.com> wrote:
>
>> Hi...
>>
>> I am trying to port an existing file-processing application to Hadoop and
>> need to find the best way of propagating some extra configuration per
>> split, in the form of complex, proprietary custom Java objects.
>>
>> The general idea is:
>>
>> 1. A custom InputFormat splits the input data
>> 2. The same InputFormat prepares the appropriate configuration for each split
>> 3. Hadoop processes each split in MapReduce, using the split itself
>> and the corresponding configuration
>>
>> The problem is that these configuration objects contain a lot of properties
>> and references to other complex objects, and so on, so it will take a lot
>> of work to cover all the possible combinations and make the whole thing
>> serializable (if it can be done at all).
>>
>> Most probably this is the only way forward, but if anyone has ever dealt
>> with this problem, please suggest the best approach to follow.
>>
>> Thanks!
>
> --
> Don't Grow Old, Grow Up... :-)
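The serialization step discussed above can be sketched with plain java.io
object serialization. This is a minimal, hypothetical example (the class and
field names SplitConfig, encoding, and maxRecords are invented, not from the
thread) of the round trip the configuration object would have to survive
before its bytes could be written to a file and registered with the
DistributedCache:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ConfigRoundTrip {

    // Hypothetical stand-in for the proprietary configuration object.
    // Every object it references must itself be Serializable (or be marked
    // transient and rebuilt after deserialization) -- this is exactly the
    // "cover all the possible combinations" work described in the thread.
    public static class SplitConfig implements Serializable {
        private static final long serialVersionUID = 1L;
        public String encoding = "UTF-8";
        public int maxRecords = 1000;
    }

    // Serialize to bytes; in the real job these bytes would be written to
    // an HDFS file that is then registered via DistributedCache.addCacheFile().
    public static byte[] toBytes(SplitConfig cfg) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(cfg);
        }
        return buf.toByteArray();
    }

    // Deserialize; in the real job this would run once per task, e.g. in
    // Mapper.setup(), reading the locally cached copy of the file.
    public static SplitConfig fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SplitConfig) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SplitConfig cfg = new SplitConfig();
        cfg.maxRecords = 42;
        SplitConfig copy = fromBytes(toBytes(cfg));
        System.out.println(copy.encoding + " " + copy.maxRecords);
    }
}
```

If the round trip works for the real object graph, the resulting file is a
natural candidate for the DistributedCache approach suggested in the reply.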
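An alternative worth noting for steps 1-3 of the original question: rather
than making the entire proprietary object graph Serializable, the custom
InputSplit can carry only the fields the tasks actually need, written out one
by one. Hadoop ships splits to tasks through the Writable contract (a
write()/readFields() pair), so only those fields need hand-rolled
serialization. The sketch below uses plain java.io streams so it runs
standalone; in a real job the same two methods would sit on a FileSplit
subclass, and all names here (ConfiguredSplit, extraConfig) are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical split that carries its extra configuration flattened to a
// string (e.g. a JSON snippet), serialized field by field instead of as a
// full Java object graph. write()/readFields() mirror Hadoop's Writable
// contract; plain java.io stands in for the framework here.
public class ConfiguredSplit {
    public long start;
    public long length;
    public String extraConfig;

    public void write(DataOutput out) throws IOException {
        out.writeLong(start);
        out.writeLong(length);
        out.writeUTF(extraConfig);
    }

    public void readFields(DataInput in) throws IOException {
        start = in.readLong();
        length = in.readLong();
        extraConfig = in.readUTF();
    }

    public static void main(String[] args) throws Exception {
        ConfiguredSplit split = new ConfiguredSplit();
        split.start = 0L;
        split.length = 64L * 1024 * 1024;
        split.extraConfig = "{\"encoding\":\"UTF-8\",\"maxRecords\":42}";

        // Round-trip through bytes, as the framework would do when
        // shipping the split from the InputFormat to a task.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        split.write(new DataOutputStream(buf));
        ConfiguredSplit copy = new ConfiguredSplit();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy.extraConfig);
    }
}
```

The trade-off versus java.io serialization: more code per field, but no
requirement that the rest of the proprietary object graph be serializable.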