From: Michael Segel <michael_segel@hotmail.com>
Subject: Re: Reading json format input
Date: Wed, 29 May 2013 18:30:24 -0500
To: user@hadoop.apache.org

Yeah,

I have to agree w Russell. Pig is definitely the way to go on this.

If you want to do it as a Java program you will have to do some work on the input string, but it too should be trivial (a rough sketch follows below the quoted thread).

How formal do you want to go?

Do you want to strip it down or just find the quote after the text part?

On May 29, 2013, at 5:13 PM, Russell Jurney <russell.jurney@gmail.com> wrote:

> Seriously consider Pig (free answer, 4 LOC):
>
> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
>
> It will be faster than the Java you'll likely write.
>
>
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalshasha@gmail.com> wrote:
> Hi,
>    I am stuck again. :(
> My input data is in hdfs. I am again trying to do wordcount, but there is a slight difference.
> The data is in json format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for the text part.
> I understand that in the mapper I just have to parse this data as json and extract "text", and the rest of the code is just the same, but I am trying to switch from python to java hadoop.
> How do I do this?
> Thanks
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
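Here is a minimal sketch of what the plain MapReduce version could look like. It assumes the newer org.apache.hadoop.mapreduce API and the org.json JSONObject class for the parsing (any JSON library would do); the class and field names (JsonWordCount, TextFieldMapper, SumReducer) are only illustrative, not anything from the thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONObject;  // assumes the org.json jar is on the job classpath

public class JsonWordCount {

    // Mapper: each input line is one JSON record; pull out the "text"
    // field, split it on whitespace, and emit (word, 1) pairs.
    public static class TextFieldMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                return;
            }
            JSONObject record = new JSONObject(line);
            String text = record.optString("text", "");
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: standard word-count sum, also usable as a combiner.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "json word count");
        job.setJarByClass(JsonWordCount.class);
        job.setMapperClass(TextFieldMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it into a jar (with the JSON library bundled or passed via -libjars) and run it with something like: hadoop jar json-wordcount.jar JsonWordCount /path/to/input /path/to/output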