hadoop-common-user mailing list archives

From Mischa Tuffield <mis...@mmt.me.uk>
Subject Re: Mapreduce using JSONObjects
Date Tue, 04 Jun 2013 23:39:13 GMT
Hello, 

On 4 Jun 2013, at 23:49, Max Lebedev <max.l@actionx.com> wrote:

> Hi. I've been trying to use JSONObjects to identify duplicate JSON strings.
> The duplicate strings contain the same data, but not necessarily in the same order. For
example, the following two lines should be identified as duplicates (and filtered):
> 
> {"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}
> {"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false} 
> 
> 
Can you not use the timestamp as the key? Have your mapper emit the following key/value pair:

output.collect(ts, value);

Then a straightforward reducer can dedup based on the timestamps.
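For illustration only, the timestamp-keyed dedup can be sketched outside Hadoop in plain Java (the class name TsDedup and the regex-based ts extraction are my own invention for the sketch; real code would parse the JSON properly rather than regex it):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TsDedup {
    // Extracts the "ts" value from a flat JSON line. A regex keeps this
    // sketch dependency-free; real code should use a JSON parser.
    private static final Pattern TS = Pattern.compile("\"ts\"\\s*:\\s*([0-9.]+)");

    static String tsKey(String line) {
        Matcher m = TS.matcher(line);
        return m.find() ? m.group(1) : line; // fall back to the whole line
    }

    // Groups lines by their ts key and keeps the first of each group,
    // mirroring what the ts-keyed mapper/reducer pair would do.
    static List<String> dedup(List<String> lines) {
        Map<String, String> byTs = new LinkedHashMap<>();
        for (String line : lines) {
            byTs.putIfAbsent(tsKey(line), line);
        }
        return List.copyOf(byTs.values());
    }

    public static void main(String[] args) {
        List<String> in = List.of(
            "{\"ts\":1368758947.291035,\"isSecure\":true,\"version\":2}",
            "{\"ts\":1368758947.291035,\"version\":2,\"isSecure\":true}");
        System.out.println(dedup(in).size()); // prints 1: both lines share one ts
    }
}
```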

If the above doesn't work for you, I would look at the Jackson library for handling JSON in Java.
Its approach of binding JSON to Java beans is clean from a code point of view and comes with
lots of nice features. 
http://stackoverflow.com/a/2255893
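Just to show why order-independent comparison treats the two example lines as duplicates, here is a toy, dependency-free canonicaliser for flat JSON (my own sketch: it assumes no nested objects and no commas inside string values; with Jackson you would instead bind each line to a Map or a bean and compare those):

```java
import java.util.Map;
import java.util.TreeMap;

public class FlatJsonCompare {
    // Toy canonical form for *flat* JSON objects: strip the braces, split on
    // commas, and sort the key:value pairs. Two objects with the same pairs
    // in different orders then produce the same canonical string.
    static String canonical(String json) {
        String body = json.trim().replaceAll("^\\{|\\}$", "");
        Map<String, String> pairs = new TreeMap<>();
        for (String pair : body.split(",")) {
            int colon = pair.indexOf(':');
            pairs.put(pair.substring(0, colon).trim(),
                      pair.substring(colon + 1).trim());
        }
        return pairs.toString();
    }

    public static void main(String[] args) {
        String a = "{\"ts\":1368758947.291035,\"isSecure\":true,\"version\":2,\"source\":\"sdk\",\"debug\":false}";
        String b = "{\"ts\":1368758947.291035,\"version\":2,\"source\":\"sdk\",\"isSecure\":true,\"debug\":false}";
        System.out.println(canonical(a).equals(canonical(b))); // prints true
    }
}
```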

P.S. In your code you are using the older MapReduce API; I would look at using the newer
API in the org.apache.hadoop.mapreduce package.

Mischa
> This is the code: 
> 
> class DupFilter {
> 
>     public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, JSONObject, Text> {
> 
>         public void map(LongWritable key, Text value, OutputCollector<JSONObject, Text> output, Reporter reporter) throws IOException {
>             JSONObject jo = null;
>             try {
>                 jo = new JSONObject(value.toString());
>             } catch (JSONException e) {
>                 e.printStackTrace();
>             }
>             output.collect(jo, value);
>         }
>     }
> 
>     public static class Reduce extends MapReduceBase implements Reducer<JSONObject, Text, NullWritable, Text> {
> 
>         public void reduce(JSONObject jo, Iterator<Text> lines, OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
>             output.collect(null, lines.next());
>         }
>     }
> 
>     public static void main(String[] args) throws Exception {
>         JobConf conf = new JobConf(DupFilter.class);
>         conf.setOutputKeyClass(JSONObject.class);
>         conf.setOutputValueClass(Text.class);
>         conf.setMapperClass(Map.class);
>         conf.setReducerClass(Reduce.class);
>         conf.setInputFormat(TextInputFormat.class);
>         conf.setOutputFormat(TextOutputFormat.class);
>         FileInputFormat.setInputPaths(conf, new Path(args[0]));
>         FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>         JobClient.runJob(conf);
>     }
> }
> 
> 
> I get the following error:
> 
> java.lang.ClassCastException: class org.json.JSONObject
>         at java.lang.Class.asSubclass(Class.java:3027)
>         at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:817)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
> 
> It looks like it has something to do with conf.setOutputKeyClass(). Am I doing something
wrong here? 
> 
> 
> Thanks, 
> 
> Max Lebedev
> 

_______________________________
Mischa Tuffield PhD
http://mmt.me.uk/
@mischat
