From: Sandy Ryza
To: user@hadoop.apache.org
Date: Tue, 5 Mar 2013 09:27:35 -0800
Subject: Re: Transpose

Hi,

Essentially what you want to do is group your data points by their position in the column, and have each reduce call assemble the data points for one row into an output row. To have each record that the mapper processes be one of the columns, you can use TextInputFormat with conf.set("textinputformat.record.delimiter", ";"). Your mapper will receive keys as LongWritables specifying the byte index into the input file, and Text as values. The mapper will tokenize the input string.
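
As a rough illustration of what the mapper would see with ";" as the record delimiter, here is a small plain-Java sketch. It does not use any Hadoop classes; it only mimics the (byte index, text) pairs and the tokenizing step. The class and method names are made up for this example, it assumes ASCII input so that character positions equal byte positions, and it assumes the third column is written as "y,n,y".

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation only; no Hadoop classes are used here.
final class RecordSplitSketch {
    /** One simulated map input: the record's starting byte index (key) and its text (value). */
    static final class MapInput {
        final long byteOffset;
        final String text;

        MapInput(long byteOffset, String text) {
            this.byteOffset = byteOffset;
            this.text = text;
        }
    }

    /** Splits the file contents on the delimiter, remembering where each record starts. */
    static List<MapInput> splitRecords(String file, String delimiter) {
        List<MapInput> records = new ArrayList<>();
        int start = 0;
        while (start < file.length()) {
            int end = file.indexOf(delimiter, start);
            if (end < 0) {
                end = file.length();
            }
            String text = file.substring(start, end).trim();
            // Skip blank records, e.g. a trailing newline after the last ';'.
            if (!text.isEmpty()) {
                records.add(new MapInput(start, text));
            }
            start = end + delimiter.length();
        }
        return records;
    }

    /** The tokenizing step: one column record becomes its list of values. */
    static List<String> tokenize(String column) {
        List<String> values = new ArrayList<>();
        for (String token : column.split(",")) {
            values.add(token.trim());
        }
        return values;
    }
}
```

For the sample line, splitRecords would yield three records starting at byte indexes 0, 9 and 21. A real mapper would probably also want to skip blank records in the same way.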

If the mapper emits one map output for each data point in each column, you can then use secondary sort to send the data to the right place in the right order (see http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/). Your composite key would look like (index of the data point within its column, which is the row index; the LongWritable passed in as the map input key). You would partition and group by row index, so that each reduce call gets all the points in a single row, and within a reduce call's values you would sort by byte index so that entries from earlier columns come before later ones.
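
To make the composite key idea concrete, here is a minimal plain-Java sketch of the whole data flow. It is a simulation, not Hadoop code: the "shuffle" is an in-memory sort on (row index, byte index), and a change of row index stands in for the start of a new reduce call. All names (TransposeSketch, Point, map, shuffleAndReduce) are hypothetical, and it assumes ASCII input with the third column written as "y,n,y".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation only; no Hadoop classes are used here.
final class TransposeSketch {
    /** One simulated map output: composite key (rowIndex, byteOffset) plus the value. */
    static final class Point {
        final int rowIndex;     // position of the value within its column
        final long byteOffset;  // map input key: where the column starts in the file
        final String value;

        Point(int rowIndex, long byteOffset, String value) {
            this.rowIndex = rowIndex;
            this.byteOffset = byteOffset;
            this.value = value;
        }
    }

    /** Map step: emit one Point per data point in a single column record. */
    static List<Point> map(long byteOffset, String column) {
        List<Point> out = new ArrayList<>();
        String[] tokens = column.split(",");
        for (int row = 0; row < tokens.length; row++) {
            out.add(new Point(row, byteOffset, tokens[row].trim()));
        }
        return out;
    }

    /** Sort by (rowIndex, byteOffset), then join each run of equal rowIndex into one row. */
    static List<String> shuffleAndReduce(List<Point> points) {
        List<Point> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparingInt((Point p) -> p.rowIndex)
                .thenComparingLong(p -> p.byteOffset));
        List<String> rows = new ArrayList<>();
        StringBuilder row = new StringBuilder();
        int currentRow = -1;
        for (Point p : sorted) {
            if (p.rowIndex != currentRow) {
                // A new row index plays the role of a new reduce call.
                if (currentRow != -1) {
                    rows.add(row.toString());
                }
                row.setLength(0);
                currentRow = p.rowIndex;
            } else {
                row.append(' ');
            }
            row.append(p.value);
        }
        if (currentRow != -1) {
            rows.add(row.toString());
        }
        return rows;
    }

    /** Runs the simulated job over file contents that use ';' between columns. */
    static List<String> transpose(String file) {
        List<Point> points = new ArrayList<>();
        int start = 0;
        while (start < file.length()) {
            int end = file.indexOf(';', start);
            if (end < 0) {
                end = file.length();
            }
            String column = file.substring(start, end).trim();
            if (!column.isEmpty()) {
                points.addAll(map(start, column));
            }
            start = end + 1;
        }
        return shuffleAndReduce(points);
    }
}
```

In a real job the ordering and grouping would not be done by hand like this; they would come from a partitioner, a sort comparator and a grouping comparator configured on the job, with the composite key implemented as a WritableComparable.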

Does that make sense?

Sandy

On Tue, Mar 5, 2013 at 7:11 AM, Mix Nin <pig.mixed@gmail.com> wrote:
> Hi
>
> I have data in a file as follows. There are 3 columns separated by semicolons (;). Each column has multiple values separated by commas (,).
>
> 11,22,33;144,244,344;yny;
>
> I need the output data in the format below. It is like transposing the values of each column.
>
> 11 144 y
> 22 244 n
> 33 344 y
>
> Can we write a MapReduce program to achieve this? Could you help with how to write the code?
>
> Thanks
