hadoop-mapreduce-user mailing list archives

From Rahul Bhattacharjee <rahul.rec....@gmail.com>
Subject Re: Reading json format input
Date Thu, 30 May 2013 03:12:20 GMT
Whatever you have mentioned should work, Jamal; you can debug this.

Thanks,
Rahul
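
As a debugging aid, the "just find the quote after the text part" hack Michael suggests further down the thread can be sketched without any JSON library at all, using only a regex over each input line. This is a minimal sketch (class and helper names are hypothetical, not from the thread), and it is fragile by design: it breaks on escaped quotes or nested objects, so org.json remains the robust route.

```java
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextFieldExtractor {

    // Matches  "text" : "<anything except a quote>"  in a flat one-line record.
    private static final Pattern TEXT_FIELD =
            Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the value of the "text" field, or "" if the line has none.
    static String extractText(String jsonLine) {
        Matcher m = TEXT_FIELD.matcher(jsonLine);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String line = "{\"author\":\"foo234\", \"text\": \"hello this world\"}";
        // Same tokenization step the mapper would perform on the extracted field.
        StringTokenizer itr = new StringTokenizer(extractText(line));
        while (itr.hasMoreTokens()) {
            System.out.println(itr.nextToken());  // prints hello, this, world on separate lines
        }
    }
}
```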


On Thu, May 30, 2013 at 5:14 AM, jamal sasha <jamalshasha@gmail.com> wrote:

> Hi,
>   For some reason, this has to be in Java :(
> I am trying to use org.json library, something like (in mapper)
> JSONObject jsn = new JSONObject(value.toString());
>
> String text = (String) jsn.get("text");
> StringTokenizer itr = new StringTokenizer(text);
>
> But it's not working :(
> It would be better to get this working properly, but I wouldn't mind using a
> hack as well :)
>
>
> On Wed, May 29, 2013 at 4:30 PM, Michael Segel <michael_segel@hotmail.com> wrote:
>
>> Yeah,
>> I have to agree with Russell. Pig is definitely the way to go on this.
>>
>> If you want to do it as a Java program, you will have to do some work on
>> the input string, but that too should be trivial.
>> How formal do you want to go?
>> Do you want to strip it down or just find the quote after the text part?
>>
>>
>> On May 29, 2013, at 5:13 PM, Russell Jurney <russell.jurney@gmail.com>
>> wrote:
>>
>> Seriously consider Pig (free answer, 4 LOC):
>>
>> my_data = LOAD 'my_data.json' USING
>> com.twitter.elephantbird.pig.load.JsonLoader() AS json:map[];
>> words = FOREACH my_data GENERATE $0#'author' as author,
>> FLATTEN(TOKENIZE($0#'text')) as word;
>> word_counts = FOREACH (GROUP words BY word) GENERATE group AS word,
>> COUNT_STAR(words) AS word_count;
>> STORE word_counts INTO '/tmp/word_counts.txt';
>>
>> It will be faster than the Java you'll likely write.
>>
>>
>> On Wed, May 29, 2013 at 2:54 PM, jamal sasha <jamalshasha@gmail.com> wrote:
>>
>>> Hi,
>>>    I am stuck again. :(
>>> My input data is in HDFS. I am again trying to do word count, but there
>>> is a slight difference.
>>> The data is in json format.
>>> So each line of data is:
>>>
>>> {"author":"foo", "text": "hello"}
>>> {"author":"foo123", "text": "hello world"}
>>> {"author":"foo234", "text": "hello this world"}
>>>
>>> So I want to do a word count on the text part.
>>> I understand that in the mapper I just have to parse this data as JSON
>>> and extract "text", and the rest of the code is just the same, but I am
>>> trying to switch from Python to Java Hadoop.
>>> How do I do this?
>>> Thanks
>>>
>>
>>
>>
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
>>
>>
>>
>
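
For completeness, the word count Jamal describes can be exercised without a Hadoop cluster at all: the sketch below (class and method names are hypothetical) runs the mapper-side extract-and-tokenize step over the three sample records and sums the counts in memory, producing the same result the Pig script's FLATTEN(TOKENIZE(...)) plus GROUP/COUNT_STAR pipeline would. It uses the same fragile regex shortcut rather than org.json, so treat it as a test harness for the logic, not production code.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JsonWordCount {

    private static final Pattern TEXT_FIELD =
            Pattern.compile("\"text\"\\s*:\\s*\"([^\"]*)\"");

    // "Map" phase: extract "text" and tokenize; "reduce" phase: sum per word.
    static Map<String, Integer> count(String[] records) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String record : records) {
            Matcher m = TEXT_FIELD.matcher(record);
            if (!m.find()) continue;
            StringTokenizer itr = new StringTokenizer(m.group(1));
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] records = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        System.out.println(count(records));  // prints {hello=3, this=1, world=2}
    }
}
```

In a real job, the body of `count` would be split across a `Mapper` (extract and emit `(word, 1)`) and a `Reducer` (sum the ones), exactly as in the stock WordCount example.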
