Subject: Re: MapReduce processing with extra (possibly non-serializable) configuration
From: feng lu
Date: Fri, 22 Feb 2013 14:24:11 +0800
To: user@hadoop.apache.org

Yes, you are right. First upload the serialized configuration file to HDFS, then retrieve that file in the Mapper#configure method of each Mapper and deserialize it back into a configuration object.

It seems that serializing the configuration is required. There are many data serialization systems you could use for this, such as Avro and Protocol Buffers.
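A minimal sketch of that approach, under some assumptions: ProcessingConfig is a hypothetical stand-in for the proprietary configuration object (it must implement java.io.Serializable for plain Java serialization to work), the property name "myapp.config.path" is made up, and the new-API Mapper#setup is used here as the counterpart of the old-API Mapper#configure:

import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ConfigAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private ProcessingConfig config; // hypothetical proprietary config class

    // Driver-side helper: serialize the object to an HDFS file once,
    // before submitting the job, and record its path in the job config.
    public static void writeConfig(Configuration conf, Path path,
                                   ProcessingConfig cfg) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        try (ObjectOutputStream out = new ObjectOutputStream(fs.create(path))) {
            out.writeObject(cfg);
        }
        conf.set("myapp.config.path", path.toString());
    }

    // Runs once per map task, so each mapper deserializes the
    // configuration object exactly once before processing records.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path path = new Path(conf.get("myapp.config.path"));
        FileSystem fs = FileSystem.get(conf);
        try (ObjectInputStream in = new ObjectInputStream(fs.open(path))) {
            config = (ProcessingConfig) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use 'config' while processing each record ...
    }
}

Swapping plain Java serialization for Avro or protobuf only changes the read/write calls; the HDFS round trip stays the same.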
On Fri, Feb 22, 2013 at 12:11 PM, Public Network Services <publicnetworkservices@gmail.com> wrote:

> You mean save the serialized configuration object in the custom split
> file, retrieve that in the Mapper, reconstruct the configuration and use
> the rest of the split file (i.e., the actual data) as input to the map
> function?
>
> On Thu, Feb 21, 2013 at 5:57 PM, Azuryy Yu <azuryyyu@gmail.com> wrote:
>
>> I just have one simple suggestion for you: write a custom split to
>> replace FileSplit and include all your special configuration in that
>> split, then write a custom InputFormat. During the map phase you can
>> get this split, and from it all the special configuration.
>>
>> On Fri, Feb 22, 2013 at 5:10 AM, Public Network Services <publicnetworkservices@gmail.com> wrote:
>>
>>> Hi...
>>>
>>> I am trying to put an existing file processing application into Hadoop
>>> and need to find the best way of propagating some extra configuration per
>>> split, in the form of complex and proprietary custom Java objects.
>>>
>>> The general idea is:
>>>
>>> 1. A custom InputFormat splits the input data
>>> 2. The same InputFormat prepares the appropriate configuration for
>>> each split
>>> 3. Hadoop processes each split in MapReduce, using the split itself
>>> and the corresponding configuration
>>>
>>> The problem is that these configuration objects contain a lot of
>>> properties and references to other complex objects, and so on, therefore it
>>> will take a lot of work to cover all the possible combinations and make the
>>> whole thing serializable (if it can be done in the first place).
>>>
>>> Most probably this is the only way forward, but if anyone has ever dealt
>>> with this problem, please suggest the best approach to follow.
>>>
>>> Thanks!

--
Don't Grow Old, Grow Up... :-)
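For the custom-split suggestion quoted above, a minimal sketch: a FileSplit subclass that carries the extra per-split configuration as a serialized payload. ConfiguredFileSplit and the Text-encoded payload are illustrative choices, not anything Hadoop ships:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// A FileSplit that also carries per-split configuration, e.g. JSON or an
// Avro/protobuf payload encoded as text.
public class ConfiguredFileSplit extends FileSplit {

    private Text extraConfig = new Text();

    // Required: the framework instantiates splits reflectively
    // before calling readFields().
    public ConfiguredFileSplit() {}

    public ConfiguredFileSplit(Path file, long start, long length,
                               String[] hosts, String extraConfig) {
        super(file, start, length, hosts);
        this.extraConfig = new Text(extraConfig);
    }

    public String getExtraConfig() {
        return extraConfig.toString();
    }

    // Writable contract: serialize the parent split, then the extra payload.
    @Override
    public void write(DataOutput out) throws IOException {
        super.write(out);
        extraConfig.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        super.readFields(in);
        extraConfig.readFields(in);
    }
}

A custom InputFormat would build these splits in getSplits(), and each mapper can recover the payload in setup() with a cast:

ConfiguredFileSplit split = (ConfiguredFileSplit) context.getInputSplit();

The payload string could itself be an Avro or protobuf blob, which ties this back to the serialization point at the top of the thread.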