From: Rahul Bhattacharjee <rahul.rec.dgp@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 30 May 2013 08:42:20 +0530
Subject: Re: Reading json format input

Whatever you have mentioned, Jamal, should work. You can debug this.

Thanks,
Rahul

On Thu, May 30, 2013 at 5:14 AM, jamal sasha wrote:

> Hi,
> For some reason, this has to be in Java :(
> I am trying to use the org.json library, something like (in the mapper):
>
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But it's not working :(
> It would be better to get this thing working properly, but I wouldn't mind
> using a hack as well :)
>
>
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel wrote:
>
>> Yeah,
>> I have to agree with Russell. Pig is definitely the way to go on this.
>>
>> If you want to do it as a Java program you will have to do some work on
>> the input string, but that too should be trivial.
>> How formal do you want to go?
>> Do you want to strip it down or just find the quote after the text part?
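[Editor's note: Michael's "just find the quote after the text part" suggestion can be sketched in plain Java with no JSON library at all. This is only a hack for well-behaved one-line records like the samples in this thread (a field literally named "text", no escaped quotes inside values); the class and method names here are made up for illustration, and a real JSON parser is safer for anything messier.]

```java
// A sketch of the string hack: scan one JSON record for the "text"
// field, pull out the quoted value, then tokenize it the way the
// word-count mapper would. Assumes one record per line and no escaped
// quotes inside values -- use org.json or Jackson otherwise.
import java.util.StringTokenizer;

public class TextFieldHack {

    // Returns the value of the "text" field, or null if the line has none.
    public static String extractText(String jsonLine) {
        int k = jsonLine.indexOf("\"text\"");
        if (k < 0) {
            return null;
        }
        int open = jsonLine.indexOf('"', k + 6);   // value's opening quote (skips the colon)
        int close = jsonLine.indexOf('"', open + 1); // value's closing quote
        if (open < 0 || close < 0) {
            return null;
        }
        return jsonLine.substring(open + 1, close);
    }

    public static void main(String[] args) {
        String line = "{\"author\":\"foo234\", \"text\": \"hello this world\"}";
        String text = extractText(line);
        System.out.println(text);  // hello this world
        // Tokenize exactly as in the snippet above.
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());
        }
    }
}
```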
>>
>>
>> On May 29, 2013, at 5:13 PM, Russell Jurney wrote:
>>
>> Seriously consider Pig (free answer, 4 LOC):
>>
>> my_data = LOAD 'my_data.json' USING
>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' AS author,
>> FLATTEN(TOKENIZE($0#'text')) AS word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>> COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>>
>> It will be faster than the Java you'll likely write.
>>
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha wrote:
>>
>>> Hi,
>>> I am stuck again. :(
>>> My input data is in HDFS. I am again trying to do word count, but there
>>> is a slight difference: the data is in JSON format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do word count for the text part.
>>> I understand that in the mapper I just have to parse this data as JSON
>>> and extract "text", and the rest of the code is just the same, but I am
>>> trying to switch from Python to Java Hadoop.
>>> How do I do this?
>>> Thanks
>>
>>
>> --
>> Russell Jurney  twitter.com/rjurney  russell.jurney@gmail.com  datasyndrome.com
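[Editor's note: jamal's point that "the rest of the code is just the same" can be checked locally before any Hadoop wiring. The sketch below runs the whole extract-tokenize-count logic over the three sample records from the thread, using only the JDK: a string-scan extraction of the "text" value (an assumption that no value contains escaped quotes) plus a map-based tally standing in for what the shuffle and reducer would do in a real job. Class and method names are made up for illustration.]

```java
// Local, Hadoop-free check of the JSON word-count logic over the
// sample records: extract each "text" value by scanning for the quoted
// field, split on whitespace, and tally counts as a reducer would.
import java.util.Map;
import java.util.TreeMap;

public class JsonWordCountLocal {

    // Counts words across the "text" fields of the given JSON lines.
    public static Map<String, Integer> countWords(String[] jsonLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : jsonLines) {
            int k = line.indexOf("\"text\"");
            if (k < 0) continue;                       // no text field: skip record
            int open = line.indexOf('"', k + 6);       // value's opening quote
            int close = line.indexOf('"', open + 1);   // value's closing quote
            if (open < 0 || close < 0) continue;       // malformed record: skip
            for (String word : line.substring(open + 1, close).split("\\s+")) {
                counts.merge(word, 1, Integer::sum);   // tally, like the reducer
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}",
        };
        System.out.println(countWords(lines));  // {hello=3, this=1, world=2}
    }
}
```

Once this logic checks out, the body of countWords's loop is what moves into a Mapper's map() method, with the tally replaced by emitting (word, 1) pairs.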