Subject: Re: Reading json format input
From: Rishi Yadav
To: user@hadoop.apache.org
Date: Wed, 29 May 2013 16:43:24 -0700

Hi Jamal,

I took your input, ran it through the sample WordCount program, and it works fine, giving this output:

author 3
foo234 1
text 3
foo 1
foo123 1
hello 3
this 1
world 2

When we split using

String[] words = input.split("\\W+");

the split takes care of all the non-alphanumeric characters, which is why the JSON punctuation disappears and the keys like "author" and "text" show up as plain tokens in the counts above.
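If you want to count only the words inside the "text" field rather than the whole line, the mapper just needs to parse each line as JSON before splitting. Here is a minimal sketch, assuming the org.json library (JSONObject) is on the job's classpath; the class name JsonTextMapper is illustrative, not from any existing example:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

// Mapper for lines like {"author":"foo", "text": "hello world"}:
// parses each line as JSON, extracts the "text" field, and emits
// (word, 1) pairs for the stock WordCount sum reducer.
public class JsonTextMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String text;
        try {
            // Each input line is one JSON object; pull out "text".
            text = new JSONObject(value.toString()).getString("text");
        } catch (JSONException e) {
            return; // skip malformed lines rather than failing the task
        }
        for (String w : text.split("\\W+")) {
            if (!w.isEmpty()) {
                word.set(w);
                context.write(word, ONE);
            }
        }
    }
}

The reducer and driver stay exactly the same as in the standard WordCount example; only the mapper changes.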
Thanks and Regards,

Rishi Yadav


On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalshasha@gmail.com> wrote:

> Hi,
> I am stuck again. :(
> My input data is in HDFS. I am again trying to do wordcount, but there is a
> slight difference: the data is in JSON format, so each line of data is:
>
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
>
> I want to do the wordcount for the "text" part. I understand that in the
> mapper I just have to parse this data as JSON and extract "text", and the
> rest of the code is just the same, but I am trying to switch from Python
> to Java Hadoop.
> How do I do this?
> Thanks