From: Lance Norskog <goksron@gmail.com>
Date: Fri, 07 Jun 2013 10:16:41 -0700
To: user@hadoop.apache.org
Subject: Re: Mapreduce using JSONObjects

A side point for Hadoop experts: a comparator is used for sorting in the shuffle. If a comparator always returns -1 for unequal objects, then sorting will take longer than it should because a certain number of items will be compared more than once.

Is this true?
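One way to see the effect outside Hadoop: a comparator that never returns a positive value does not define a total order, so no sort can rely on it. Below is a minimal, hypothetical demo (plain java.util sorting, not Hadoop's shuffle; the class name is made up):

import java.util.Arrays;
import java.util.Comparator;

public class BadComparatorDemo {
    public static void main(String[] args) {
        String[] keys = {"b", "a", "b", "c", "a"};
        // Like the compareTo discussed below: 0 for equal keys,
        // -1 otherwise, so unequal keys are never reported as "greater".
        Arrays.sort(keys, new Comparator<String>() {
            public int compare(String x, String y) {
                return x.equals(y) ? 0 : -1;
            }
        });
        // Equal keys are not guaranteed to end up adjacent, so a pass
        // that drops consecutive duplicates can miss some. (On larger
        // inputs Java 7's TimSort may instead throw "Comparison method
        // violates its general contract!".)
        System.out.println(Arrays.toString(keys));
    }
}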

On 06/05/2013 04:10 PM, Max Lebedev wrote:

I've taken your advice and made a wrapper class which implements WritableComparable. Thank you very much for your help. I believe everything is working fine on that front. I used Google's Gson for the comparison.


public int compareTo(Object o) {
    JsonElement o1 = PARSER.parse(this.json.toString());
    JsonElement o2 = PARSER.parse(o.toString());
    if (o2.equals(o1))
        return 0;
    else
        return -1;
}
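For comparison: a compareTo that only ever returns 0 or -1 never reports "greater than", so it does not define a total order, and the shuffle's sort cannot reliably bring equal keys together. A hedged sketch of a consistent alternative, comparing a canonical rendering of the JSON; canonicalize() here is a hypothetical helper, not part of Gson:

public int compareTo(Object o) {
    // Canonical forms are identical exactly when the JSON objects are
    // equal, and otherwise order consistently in both directions.
    String mine = canonicalize(PARSER.parse(this.json.toString()));
    String theirs = canonicalize(PARSER.parse(o.toString()));
    return mine.compareTo(theirs);
}

// Hypothetical helper: renders a JSON object with its keys in sorted
// order (recursively), so logically equal objects produce equal strings.
// Sketch only: keys are not escaped.
private static String canonicalize(JsonElement e) {
    if (!e.isJsonObject()) {
        return e.toString();
    }
    java.util.TreeMap<String, JsonElement> sorted =
            new java.util.TreeMap<String, JsonElement>();
    for (java.util.Map.Entry<String, JsonElement> en
            : e.getAsJsonObject().entrySet()) {
        sorted.put(en.getKey(), en.getValue());
    }
    StringBuilder sb = new StringBuilder("{");
    for (java.util.Map.Entry<String, JsonElement> en : sorted.entrySet()) {
        if (sb.length() > 1) sb.append(',');
        sb.append('"').append(en.getKey()).append("\":")
          .append(canonicalize(en.getValue()));
    }
    return sb.append('}').toString();
}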


The problem I have now is that only consecutive duplicates are detected. Given 6 lines:

{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}

{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}

{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":true}

{"ts":1368758947.291035,"isSecure":false,"version":2,"source":"sdk","debug":false}

{"ts":1368758947.291035, "source":"sdk","isSecure":false,"version":2,"debug":false}

{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false}


I get back 1, 3, 4, and 6. I should be getting 1, 3 and 4, as 6 is exactly equal to 1. If I switch 5 and 6, the original line 5 is no longer filtered (I get 1,3,4,5,6). I've noticed that the compareTo method is called a total of 13 times. I assume that in order for all 6 of the keys to be compared, 15 comparisons need to be made. Am I missing something here? I've tested compareTo manually, and lines 1 and 6 are interpreted as equal. My MapReduce code currently looks like this:


class DupFilter {

    private static final Gson GSON = new Gson();

    private static final JsonParser PARSER = new JsonParser();

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, JSONWrapper, IntWritable> {
        public void map(LongWritable key, Text value, OutputCollector<JSONWrapper, IntWritable> output, Reporter reporter) throws IOException {
            JsonElement je = PARSER.parse(value.toString());
            JSONWrapper jow = new JSONWrapper(value.toString());
            IntWritable one = new IntWritable(1);
            output.collect(jow, one);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<JSONWrapper, IntWritable, JSONWrapper, IntWritable> {
        public void reduce(JSONWrapper jow, Iterator<IntWritable> values, OutputCollector<JSONWrapper, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext())
                sum += values.next().get();
            output.collect(jow, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DupFilter.class);
        conf.setJobName("dupfilter");
        conf.setOutputKeyClass(JSONWrapper.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Thanks,

Max Lebedev



On Tue, Jun 4, 2013 at 10:58 PM, Rahul Bhattacharjee <rahul.rec.dgp@gmail.com> wrote:
I agree with Shahab: you have to ensure that the keys are WritableComparable and the values are Writable in order to be used in MR.

You can have a WritableComparable implementation wrapping the actual JSON object.
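The JSONWrapper class itself is never shown in this thread, so for concreteness here is a minimal sketch of such a wrapper. Names and structure are assumptions, not Max's actual code; PARSER and canonicalize() are as in the sketches earlier in the thread:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class JSONWrapper implements WritableComparable<JSONWrapper> {

    private Text json = new Text();

    public JSONWrapper() {}                        // Hadoop requires a no-arg constructor
    public JSONWrapper(String raw) { json.set(raw); }

    public void write(DataOutput out) throws IOException { json.write(out); }
    public void readFields(DataInput in) throws IOException { json.readFields(in); }

    public int compareTo(JSONWrapper other) {
        // Compare canonical renderings so field order in the raw text
        // does not matter.
        return canonicalize(PARSER.parse(json.toString())).compareTo(
               canonicalize(PARSER.parse(other.json.toString())));
    }

    public boolean equals(Object o) {
        return o instanceof JSONWrapper && compareTo((JSONWrapper) o) == 0;
    }

    public int hashCode() {
        // Must agree with equals: the default HashPartitioner routes keys by
        // hashCode(), so equal JSON must hash identically or duplicates can
        // land on different reducers.
        return canonicalize(PARSER.parse(json.toString())).hashCode();
    }

    public String toString() { return json.toString(); }
}

In practice you would probably cache the canonical form in a field rather than re-parse on every call, since the shuffle invokes compareTo many times per key.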

Thanks,
Rahul


On Wed, Jun 5, 2013 at 5:09 AM, Mischa Tuffield <mischa@mmt.me.uk> wrote:
Hello,

On 4 Jun 2013, at 23:49, Max Lebedev <max.l@actionx.com> wrote:

Hi. I've been trying to use JSONObjects to identify duplicates in JSON strings.
The duplicate strings contain the same data, but not necessarily in the same order. For example, the following two lines should be identified as duplicates (and filtered).

{"ts":1368758947.291035,"isSecure":true,"version":2,"source":"sdk","debug":false
{"ts":1368758947.291035,"version":2,"source":"sdk","isSecure":true,"debug":false}�

Can you not use the timestamp as a URI and emit them as URIs? Then you have your mapper emit the following kv:

output.collect(ts, value);

And you would have a straightforward reducer that can dedup based on the timestamps.
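A rough sketch of that shape, using the old API to match the rest of the thread. This assumes the "ts" field alone identifies a duplicate, which is only safe if distinct events never share a timestamp; imports as in the DupFilter code above, plus NullWritable and Gson's JsonParser:

public static class TsMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private static final JsonParser PARSER = new JsonParser();
    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Key on the timestamp so candidate duplicates meet in one reduce call.
        String ts = PARSER.parse(value.toString())
                .getAsJsonObject().get("ts").getAsString();
        output.collect(new Text(ts), value);
    }
}

public static class TsReduce extends MapReduceBase
        implements Reducer<Text, Text, NullWritable, Text> {
    public void reduce(Text ts, Iterator<Text> lines,
            OutputCollector<NullWritable, Text> output, Reporter reporter)
            throws IOException {
        output.collect(NullWritable.get(), lines.next()); // keep the first line only
    }
}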

If the above doesn't work for you, I would look at the Jackson library for mangling JSON in Java. Its method of using Java beans for JSON is clean from a code POV and comes with lots of nice features: http://stackoverflow.com/a/2255893
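For example, with Jackson 2.x, readTree() parses to a JsonNode tree whose object nodes compare equal regardless of field order (a small standalone sketch):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonEquals {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode a = mapper.readTree(
                "{\"ts\":1368758947.291035,\"isSecure\":true,\"version\":2}");
        JsonNode b = mapper.readTree(
                "{\"version\":2,\"ts\":1368758947.291035,\"isSecure\":true}");
        System.out.println(a.equals(b)); // prints: true
    }
}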

P.S. In your code you are using the older MapReduce API; I would look at using the newer APIs in the package org.apache.hadoop.mapreduce.

Mischa

This is the code:

class DupFilter {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, JSONObject, Text> {
        public void map(LongWritable key, Text value, OutputCollector<JSONObject, Text> output, Reporter reporter) throws IOException {
            JSONObject jo = null;
            try {
                jo = new JSONObject(value.toString());
            } catch (JSONException e) {
                e.printStackTrace();
            }
            output.collect(jo, value);
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<JSONObject, Text, NullWritable, Text> {
        public void reduce(JSONObject jo, Iterator<Text> lines, OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
            output.collect(null, lines.next());
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(DupFilter.class);
        conf.setOutputKeyClass(JSONObject.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

I get the following error:

java.lang.ClassCastException: class org.json.JSONObject
        at java.lang.Class.asSubclass(Class.java:3027)
        at org.apache.hadoop.mapred.JobConf.getOutputKeyComparator(JobConf.java:795)
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:817)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:383)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)


It looks like it has something to do with conf.setOutputKeyClass(). Am I doing something wrong here?


Thanks,

Max Lebedev


_______________________________
Mischa Tuffield PhD
http://mmt.me.uk/
@mischat







