hadoop-hdfs-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject Re: Reading json format input
Date Wed, 29 May 2013 23:30:24 GMT
I have to agree with Russell. Pig is definitely the way to go on this.

If you want to do it as a Java program, you will have to do some work on the input string,
but that too should be trivial.
How formal do you want to go? 
Do you want to strip it down or just find the quote after the text part? 
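
For what it's worth, here is a minimal sketch of the "find the quote after the text part" route in plain Java. The class and method names (JsonTextWordCount, extractText, countWords) are made up for illustration and are not from any library. It assumes one flat JSON object per line, a plain string value for "text", and that the literal "text" (with quotes) does not show up earlier in the line as a value. A real JSON parser would be the safer choice for anything messier.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper; names are illustrative and not part of Hadoop or any library.
final class JsonTextWordCount {

    // Returns the string value of the "text" field, or null if it cannot be found.
    // Assumption: one flat JSON object per line, "text" holds a plain string.
    static String extractText(String line) {
        int key = line.indexOf("\"text\"");
        if (key < 0) return null;
        int colon = line.indexOf(':', key + 6);   // 6 = length of "text" including quotes
        if (colon < 0) return null;
        int open = line.indexOf('"', colon + 1);  // opening quote of the value
        if (open < 0) return null;
        StringBuilder sb = new StringBuilder();
        for (int i = open + 1; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '\\' && i + 1 < line.length()) {
                // Copies the character after a backslash as-is. That is right for
                // \" and \\ but does not decode \n, \uXXXX and similar escapes.
                sb.append(line.charAt(++i));
            } else if (c == '"') {
                return sb.toString();             // closing quote of the value
            } else {
                sb.append(c);
            }
        }
        return null; // value was never closed
    }

    // Counts whitespace-separated words across the "text" fields of all lines.
    static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String text = extractText(line);
            if (text == null) continue;           // skip lines without a usable "text"
            for (String word : text.trim().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three sample lines from the original question.
        Map<String, Integer> counts = countWords(Arrays.asList(
                "{\"author\":\"foo\", \"text\": \"hello\"}",
                "{\"author\":\"foo123\", \"text\": \"hello world\"}",
                "{\"author\":\"foo234\", \"text\": \"hello this world\"}"));
        System.out.println(counts); // prints {hello=3, this=1, world=2}
    }
}
```

In an actual Hadoop job the extraction would sit inside the mapper, applied to each input line before emitting (word, 1) pairs, and the usual wordcount reducer stays unchanged.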

On May 29, 2013, at 5:13 PM, Russell Jurney <russell.jurney@gmail.com> wrote:

> Seriously consider Pig (free answer, 4 LOC):
> my_data = LOAD 'my_data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS
> words = FOREACH my_data GENERATE $0#'author' as author, FLATTEN(TOKENIZE($0#'text')) as word;
> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word, COUNT_STAR(words) AS word_count;
> STORE word_counts INTO '/tmp/word_counts.txt';
> It will be faster than the Java you'll likely write.
> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalshasha@gmail.com> wrote:
> Hi,
>    I am stuck again. :(
> My input data is in HDFS. I am again trying to do wordcount, but there is a slight difference.
> The data is in json format.
> So each line of data is:
> {"author":"foo", "text": "hello"}
> {"author":"foo123", "text": "hello world"}
> {"author":"foo234", "text": "hello this world"}
> So I want to do wordcount for the text part.
> I understand that in the mapper I just have to parse each line as JSON and extract "text",
> and the rest of the code is just the same, but I am trying to switch from Python to Java Hadoop.
> How do I do this?
> Thanks
> -- 
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
