From: Hassen Riahi
Reply-To: mapreduce-user@hadoop.apache.org
Subject: Re: mapreduce and python
Date: Wed, 22 Jun 2011 15:18:06 +0200
Message-ID: <310CEAD0-C1A7-4192-AF8A-572755FB2AF1@cern.ch>
References: <1308633233.11153.13.camel@massimo>

I'm trying these solutions. Thanks for the suggestions.

> I'd like to +1 using Dumbo for all things Python and Hadoop
> MapReduce. It's one of the better ways to do things.
>
> Do look at the initial conversation here as well:
> http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
>
> The features/bug fixes specified in the post are present in Apache
> Hadoop 0.21 (which isn't deemed suited for production use yet)
> and are also available in other (in-production-use) Hadoop
> distributions such as Cloudera's, which is based on 0.20.2:
> https://ccp.cloudera.com/display/SUPPORT/Downloads
>
> On Tue, Jun 21, 2011 at 10:43 AM, Jeremy Lewi wrote:
>> Hassen,
>>
>> I've been very successful using Hadoop Streaming, Dumbo, and TypedBytes
>> as a solution for using python to implement mappers and reducers.
>>
>> TypedBytes is a hadoop encoding format that allows binary data
>> (including lists and maps) to be encoded in a format that permits the
>> serialized data to safely be passed to mappers/reducers via the command
>> line through hadoop streaming.
>>
>> Dumbo is a python library which makes it easy to implement your mappers
>> and reducers in python. In particular, it handles decoding the data
>> encoded as typedbytes to native python types.
>>
>> J
>> On Mon, 2011-06-20 at 21:05 -0400, Joe Stein wrote:
>>> Hassen,
>>>
>>> I have lots of binary data that I parse using Python streaming.
>>>
>>> The way I do this is stream the binary data into sequence files (the
>>> binary data object I save in the key and (null) as the value).
>>>
>>> Each key then gets written back to me line by line, key by key for an
>>> entire block when streaming.
>>>
>>> To have this work in streaming on the command line you need to
>>> use -inputformat SequenceFileAsTextInputFormat
>>>
>>> To create the sequence files I have a jar file that goes from
>>> BufferedReader and writes to
>>> org.apache.hadoop.io.SequenceFile.Writer
>>>
>>> I am not sure if you can do this for your data but if not then make
>>> your own InputFormat.
>>>
>>> good luck!
>>>
>>> /*
>>> Joe Stein
>>> http://www.linkedin.com/in/charmalloc
>>> Twitter: @allthingshadoop
>>> */
>>>
>>> On Mon, Jun 20, 2011 at 4:13 PM, Hassen Riahi wrote:
>>>         Dear all,
>>>
>>>         Is it possible to have a binary input to a map code written in
>>>         python?
>>>
>>>         Thank you
>>>         Hassen
>
> --
> Harsh J
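
For reference, here is a minimal sketch of the kind of Python streaming mapper Joe describes above. It assumes the job is run with -inputformat SequenceFileAsTextInputFormat, so that each record reaches the mapper as one line of key text, a tab, and value text. It further assumes the key is a BytesWritable, whose text form is space-separated hex pairs; the thread does not say which key type was actually used. The function names (decode_key, run) and the sample records are made up for illustration, and the mapper only emits the byte length of each decoded record.

```python
import io
import sys


def decode_key(key_text):
    """Turn the text form of a key back into raw bytes.

    Assumption: the key is a BytesWritable, rendered as hex pairs
    separated by spaces (for example "4a 6f 65").
    """
    return bytes.fromhex(key_text)


def run(instream, outstream):
    """Read 'key<TAB>value' lines and emit '<payload length><TAB>1' per record."""
    for line in instream:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only; the value is "(null)" in Joe's layout
        # and is ignored here.
        key_text, _, _value = line.partition("\t")
        payload = decode_key(key_text)
        # Replace this with the real parsing of the binary payload.
        outstream.write("%d\t1\n" % len(payload))


if __name__ == "__main__":
    # Local dry run on two made-up records. In a real streaming job,
    # call run(sys.stdin, sys.stdout) instead.
    sample = io.StringIO("4a 6f 65\t(null)\n48 61 64 6f 6f 70\t(null)\n")
    run(sample, sys.stdout)
```

If the sequence files use a different key type, only decode_key should need to change. With Dumbo and TypedBytes, as Jeremy and Harsh suggest, the decoding step is handled by the library instead.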