From: Michael Segel <michael_segel@hotmail.com>
Subject: Re: Reading json format input
Date: Wed, 29 May 2013 18:30:24 -0500
To: user@hadoop.apache.org

Yeah,

I have to agree w Russell. Pig is definitely the way to go on this.

If you want to do it as a Java program you will have to do some work on the input string, but it too should be trivial (a rough sketch follows below the quoted thread).

How formal do you want to go?

Do you want to strip it down or just find the quote after the text part?

On May 29, 2013, at 5:13 PM, Russell Jurney <russell.jurney@gmail.com> wrote:

> Seriously consider Pig (free answer, 4 LOC):
>
> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
>
> It will be faster than the Java you'll likely write.
>
>
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalshasha@gmail.com> wrote:
> Hi,
>    I am stuck again. :(
> My input data is in hdfs. I am again trying to do wordcount, but there is a slight difference.
> The data is in json format.
> So each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> So I want to do wordcount for the text part.
> I understand that in the mapper I just have to parse this data as json and extract "text", and the rest of the code is just the same, but I am trying to switch from python to java hadoop.
> How do I do this?
> Thanks
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
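Here is a minimal sketch of what the plain MapReduce version could look like. It assumes the newer org.apache.hadoop.mapreduce API and the org.json JSONObject class for the parsing (any JSON library would do); the class and field names (JsonWordCount, TextFieldMapper, SumReducer) are only illustrative, not anything from the thread:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.JSONObject;  // assumes the org.json jar is on the job classpath

public class JsonWordCount {

    // Mapper: each input line is one JSON record; pull out the "text"
    // field, split it on whitespace, and emit (word, 1) pairs.
    public static class TextFieldMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                return;
            }
            JSONObject record = new JSONObject(line);
            String text = record.optString("text", "");
            for (String token : text.split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: standard word-count sum, also usable as a combiner.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "json word count");
        job.setJarByClass(JsonWordCount.class);
        job.setMapperClass(TextFieldMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Package it into a jar (with the JSON library bundled or passed via -libjars) and run it with something like: hadoop jar json-wordcount.jar JsonWordCount /path/to/input /path/to/output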