From: Rahul Bhattacharjee <rahul.rec.dgp@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 30 May 2013 08:42:20 +0530
Subject: Re: Reading json format input

Whatever you have mentioned, Jamal, should work. You can debug this.

Thanks,
Rahul

On Thu, May 30, 2013 at 5:14 AM, jamal sasha wrote:

> Hi,
> For some reason, this has to be in Java :(
> I am trying to use the org.json library, something like (in the mapper):
>
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But it's not working :(
> It would be better to get this thing working properly, but I wouldn't mind
> using a hack as well :)
>
>
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel wrote:
>
>> Yeah,
>> I have to agree with Russell. Pig is definitely the way to go on this.
>>
>> If you want to do it as a Java program you will have to do some work on
>> the input string, but that too should be trivial.
>> How formal do you want to go?
>> Do you want to strip it down or just find the quote after the text part?
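[Editor's note: Michael's "just find the quote after the text part" suggestion can be sketched in plain Java with no JSON library at all. This is only a hack for well-behaved one-line records like the samples in this thread (a field literally named "text", no escaped quotes inside values); the class and method names here are made up for illustration, and a real JSON parser is safer for anything messier.]

```java
// A sketch of the string hack: scan one JSON record for the "text"
// field, pull out the quoted value, then tokenize it the way the
// word-count mapper would. Assumes one record per line and no escaped
// quotes inside values -- use org.json or Jackson otherwise.
import java.util.StringTokenizer;

public class TextFieldHack {

    // Returns the value of the "text" field, or null if the line has none.
    public static String extractText(String jsonLine) {
        int k = jsonLine.indexOf("\"text\"");
        if (k < 0) {
            return null;
        }
        int open = jsonLine.indexOf('"', k + 6);   // value's opening quote (skips the colon)
        int close = jsonLine.indexOf('"', open + 1); // value's closing quote
        if (open < 0 || close < 0) {
            return null;
        }
        return jsonLine.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String line = "{\"author\":\"foo234\", \"text\": \"hello this world\"}";
        String text = extractText(line);
        System.out.println(text);  // hello this world
        // Tokenize exactly as in the snippet above.
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());
        }
    }
}
```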
>>
>>
>> On May 29, 2013, at 5:13 PM, Russell Jurney wrote:
>>
>> Seriously consider Pig (free answer, 4 LOC):
>>
>> my_data = LOAD 'my_data.json' USING
>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' AS author,
>> FLATTEN(TOKENIZE($0#'text')) AS word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>> COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>>
>> It will be faster than the Java you'll likely write.
>>
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha wrote:
>>
>>> Hi,
>>> I am stuck again. :(
>>> My input data is in HDFS. I am again trying to do word count, but there
>>> is a slight difference: the data is in JSON format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do word count for the text part.
>>> I understand that in the mapper I just have to parse this data as JSON
>>> and extract "text", and the rest of the code is just the same, but I am
>>> trying to switch from Python to Java Hadoop.
>>> How do I do this?
>>> Thanks
>>
>>
>> --
>> Russell Jurney  twitter.com/rjurney  russell.jurney@gmail.com  datasyndrome.com
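[Editor's note: jamal's point that "the rest of the code is just the same" can be checked locally before any Hadoop wiring. The sketch below runs the whole extract-tokenize-count logic over the three sample records from the thread, using only the JDK: a string-scan extraction of the "text" value (an assumption that no value contains escaped quotes) plus a map-based tally standing in for what the shuffle and reducer would do in a real job. Class and method names are made up for illustration.]

```java
// Local, Hadoop-free check of the JSON word-count logic over the
// sample records: extract each "text" value by scanning for the quoted
// field, split on whitespace, and tally counts as a reducer would.
import java.util.Map;
import java.util.TreeMap;

public class JsonWordCountLocal {

    // Counts words across the "text" fields of the given JSON lines.
    public static Map<String, Integer> countWords(String[] jsonLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : jsonLines) {
            int k = line.indexOf("\"text\"");
            if (k < 0) continue;                       // no text field: skip record
            int open = line.indexOf('"', k + 6);       // value's opening quote
            int close = line.indexOf('"', open + 1);   // value's closing quote
            if (open < 0 || close < 0) continue;       // malformed record: skip
            for (String word : line.substring(open + 1, close).split("\\s+")) {
                counts.merge(word, 1, Integer::sum);   // tally, like the reducer
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}",
        };
        System.out.println(countWords(lines));  // {hello=3, this=1, world=2}
    }
}
```

Once this logic checks out, the body of countWords's loop is what moves into a Mapper's map() method, with the tally replaced by emitting (word, 1) pairs.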