Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of sandy.ryza@cloudera.com
 designates 74.125.83.47 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAD_3QJKMDx5tLTqKBEVuoLUf3vZPUVR5uTeL6z03TP2W3Tnq6A@mail.gmail.com>
References: 
 <CAD_3QJKMDx5tLTqKBEVuoLUf3vZPUVR5uTeL6z03TP2W3Tnq6A@mail.gmail.com>
Date: Tue, 5 Mar 2013 09:27:35 -0800
Message-ID: 
 <CACBYxKJ-8fzQYGdVdnuqLEnEX19YkddeZ3CtfPz+pbM46pSFBw@mail.gmail.com>
Subject: Re: Transpose
From: Sandy Ryza <sandy.ryza@cloudera.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=047d7b66f61911df9e04d730ca73

--047d7b66f61911df9e04d730ca73
Content-Type: text/plain; charset=ISO-8859-1

Hi,

Essentially, you want to group your data points by their position within
their column, so that each reduce call assembles one output row.  To make
each record that the mapper processes be one of the columns, you can use
TextInputFormat with conf.set("textinputformat.record.delimiter", ";").
Your mapper will receive LongWritable keys giving the byte offset of the
record in the input file, and Text values holding the column's contents.
The mapper then tokenizes that text on commas.
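
To make the record splitting concrete, here is a minimal sketch in plain
Java with no Hadoop dependency.  It only imitates what the mapper would be
handed when ";" is the record delimiter.  RecordSplitSketch, Rec and
splitRecords are hypothetical names, the third column is written as
"y,n,y" on the assumption that it is comma-separated like the others, and
character offsets stand in for byte offsets (the same thing for ASCII
input).

```java
import java.util.ArrayList;
import java.util.List;

class RecordSplitSketch {
    /** Hypothetical stand-in for one (LongWritable key, Text value) map input. */
    static final class Rec {
        final long offset;   // where the record starts in the input
        final String text;   // the column's contents, e.g. "11,22,33"
        Rec(long offset, String text) { this.offset = offset; this.text = text; }
    }

    /** Splits the input on ';' and remembers the offset at which each piece starts. */
    static List<Rec> splitRecords(String input) {
        List<Rec> out = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int end = input.indexOf(';', start);
            if (end < 0) end = input.length();
            String text = input.substring(start, end).trim();
            // Skip the empty remainder after the final ';' (e.g. a trailing newline).
            if (!text.isEmpty()) out.add(new Rec(start, text));
            start = end + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (Rec r : splitRecords("11,22,33;144,244,344;y,n,y;\n")) {
            System.out.println(r.offset + " -> " + r.text);
        }
    }
}
```

For the sample line this prints offsets 0, 9 and 21, so the offsets
increase from left to right and can later be used to order the columns.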

Emit one map output for each data point in each column; you can then use
secondary sort to send the data to the right place in the right order (see
http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/).
Your composite key would look like (index of the data point within its
column, which is the row index; the LongWritable passed in as the map input
key).  Partition and group by row index so that each reduce call gets all
the points in a single row, and sort by row index and then by byte offset
so that, within a reduce call's values, entries from earlier columns come
before later ones.
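
The composite-key idea can be checked locally before writing any Hadoop
code.  The following is only a sketch in plain Java that simulates the map,
sort and reduce steps in memory; SecondarySortSketch, Cell, map and
shuffleAndReduce are hypothetical names, not Hadoop classes.  In a real job
the same ordering and grouping would be supplied through
Job.setPartitionerClass, Job.setSortComparatorClass and
Job.setGroupingComparatorClass.

```java
import java.util.ArrayList;
import java.util.List;

class SecondarySortSketch {
    /** Hypothetical composite key (row index, column byte offset) plus its value. */
    static final class Cell {
        final int row;
        final long offset;
        final String value;
        Cell(int row, long offset, String value) {
            this.row = row; this.offset = offset; this.value = value;
        }
    }

    /** Map step: one Cell per comma-separated token; the token's position is the row index. */
    static List<Cell> map(long offset, String column) {
        List<Cell> out = new ArrayList<>();
        String[] tokens = column.split(",");
        for (int row = 0; row < tokens.length; row++) {
            out.add(new Cell(row, offset, tokens[row].trim()));
        }
        return out;
    }

    /** Sort by (row, offset), then join each run of equal row index into one output line. */
    static List<String> shuffleAndReduce(List<Cell> cells) {
        List<Cell> sorted = new ArrayList<>(cells);
        sorted.sort((a, b) -> a.row != b.row
                ? Integer.compare(a.row, b.row)
                : Long.compare(a.offset, b.offset));
        List<String> rows = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int currentRow = -1;
        for (Cell c : sorted) {
            if (c.row != currentRow && !current.isEmpty()) {
                rows.add(String.join(" ", current));  // previous row is complete
                current.clear();
            }
            current.add(c.value);
            currentRow = c.row;
        }
        if (!current.isEmpty()) rows.add(String.join(" ", current));
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        // Columns deliberately arrive out of order, as they might from different mappers.
        cells.addAll(map(21, "y,n,y"));
        cells.addAll(map(0, "11,22,33"));
        cells.addAll(map(9, "144,244,344"));
        shuffleAndReduce(cells).forEach(System.out::println);
    }
}
```

With the columns assumed to be "11,22,33", "144,244,344" and "y,n,y" at
offsets 0, 9 and 21, this prints "11 144 y", "22 244 n" and "33 344 y", one
per line.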

Does that make sense?

Sandy

On Tue, Mar 5, 2013 at 7:11 AM, Mix Nin <pig.mixed@gmail.com> wrote:

> Hi
>
> I have data in a file as follows . There are 3 columns separated by
> semicolon(;). Each column would have multiple values separated by comma
> (,).
>
> 11,22,33;144,244,344;yny;
>
> I need output data in below format. It is like transposing  values of each
> column.
>
> 11 144 y
> 22 244 n
> 33 344 y
>
> Can we write map reduce program to achieve this. Could you help on the
> code on how to write.
>
>
> Thanks
>

--047d7b66f61911df9e04d730ca73--