hadoop-user mailing list archives

From Björn-Elmar Macek <ma...@cs.uni-kassel.de>
Subject Re: OutputValueGroupingComparator gets strange inputs (topic changed from "Logs cannot be created")
Date Thu, 09 Aug 2012 15:25:51 GMT
OK, I found a tutorial for this myself. For everybody who runs into the same
problem: here is a tutorial explaining WritableComparable types.

http://developer.yahoo.com/hadoop/tutorial/module5.html
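
On the question further down of how many leading bytes to cut off, here is a minimal sketch. It assumes that Text serializes itself as a variable-length integer (the byte length of the string) followed by the UTF-8 bytes, so the prefix is one byte for keys up to 127 bytes and two bytes for keys of 128 to 255 bytes. That would fit the dump, where "I" is byte value 73 and "^" is byte value 94, both plausible record lengths. The names TextPrefixSketch, vintSize, payload and serialize are made up for illustration; vintSize follows the same rule as Hadoop's WritableUtils.decodeVIntSize, and serialize is only a stand-in for Text.write so that the sketch runs without Hadoop on the classpath.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop: shows how to skip the length
// prefix that Text is assumed to put in front of its UTF-8 bytes.
final class TextPrefixSketch {

    // Same rule as WritableUtils.decodeVIntSize(byte): the first byte alone
    // tells how many bytes the variable-length integer occupies.
    static int vintSize(byte first) {
        if (first >= -112) {
            return 1;               // values -112..127 fit in a single byte
        } else if (first < -120) {
            return -119 - first;    // marker for a negative multi-byte value
        }
        return -111 - first;        // marker for a positive multi-byte value
    }

    // Decodes the string of one serialized record, given the
    // (buffer, start, length) triple that RawComparator.compare receives.
    static String payload(byte[] buf, int start, int length) {
        int skip = vintSize(buf[start]);
        return new String(buf, start + skip, length - skip, StandardCharsets.UTF_8);
    }

    // Stand-in for Text.write, only valid for strings of at most 255 bytes.
    static byte[] serialize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 255) {
            throw new IllegalArgumentException("sketch handles at most 255 bytes");
        }
        int prefix = utf8.length <= 127 ? 1 : 2;
        byte[] out = new byte[prefix + utf8.length];
        if (prefix == 1) {
            out[0] = (byte) utf8.length;   // the length itself is the prefix
        } else {
            out[0] = (byte) -113;          // marker: one length byte follows
            out[1] = (byte) utf8.length;
        }
        System.arraycopy(utf8, 0, out, prefix, utf8.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] rec = serialize("2009-06-12 04:33:20, aclouatre, , , , , 0, 0, "
                + "Bored out of my mind Watching food network, null");
        System.out.println("prefix bytes: " + vintSize(rec[0]));
        System.out.println(payload(rec, 0, rec.length));
    }
}
```

Inside a real job the same skip can be done with WritableUtils.decodeVIntSize(text1[start1]), or avoided entirely by extending WritableComparator and comparing deserialized Text objects, as Bertrand suggests below.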


On 09.08.2012 17:14, Björn-Elmar Macek wrote:
> Ah OK, I get the idea: I can use the abstract class instead of the
> low-level interface, though I am not sure how to use it. It would be
> nice if more complex mechanics like sorting had an up-to-date
> tutorial with some example code. If I find the time, I will write one,
> since I want to give a presentation on Hadoop anyway.
>
> Thanks for your help! I will try to use the abstract class.
>
>
> On 09.08.2012 17:03, Björn-Elmar Macek wrote:
>> Hi Bertrand,
>>
>> I am using RawComparator because it was used in a tutorial by
>> some famous (Hadoop) guy describing how to sort the input for the
>> reducer. Is there an easier alternative?
>>
>>
>> On 09.08.2012 16:57, Bertrand Dechoux wrote:
>>> I am just curious, but are you using Writable? If so, there is a
>>> WritableComparator...
>>> If you are going to interpret every byte (you create a String, so
>>> you do), there is no clear reason for choosing such a low-level API.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek
>>> <macek@cs.uni-kassel.de> wrote:
>>>
>>>     Hi again,
>>>
>>>     this is a direct follow-up to my previous posting titled
>>>     "Logs cannot be created", where logs could not be created (Spill
>>>     failed). I got the hint that I should check privileges, but that
>>>     was not the problem, because I own the folders that were used
>>>     for this.
>>>
>>>     I finally found an important hint in a log saying:
>>>     12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
>>>     12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
>>>     12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
>>>     java.io.IOException: Spill failed
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>>>             at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>>>             at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>>>             at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>>>             at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>             at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>>>             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>>>             at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>>>             at java.security.AccessController.doPrivileged(Native Method)
>>>             at javax.security.auth.Subject.doAs(Subject.java:396)
>>>             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>>>             at org.apache.hadoop.mapred.Child.main(Child.java:249)
>>>     Caused by: java.lang.NumberFormatException: For input string: ""
>>>             at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>>>             at java.lang.Integer.parseInt(Integer.java:468)
>>>             at java.lang.Integer.parseInt(Integer.java:497)
>>>             at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>>>             at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>>>             at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>>>             at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>>>
>>>
>>>
>>>     corresponding to the following lines of code within the class
>>>     TwitterValueGroupingComparator:
>>>
>>>     public class TwitterValueGroupingComparator implements RawComparator<Text> {
>>>     ...
>>>         public int compare(byte[] text1, int start1, int length1,
>>>                 byte[] text2, int start2, int length2) {
>>>
>>>             byte[] tweet1 = new byte[length1]; // length1-1 (???)
>>>             byte[] tweet2 = new byte[length2]; // length2-1 (???)
>>>
>>>             System.arraycopy(text1, start1, tweet1, 0, length1); // start1+1 (???)
>>>             System.arraycopy(text2, start2, tweet2, 0, length2); // start2+1 (???)
>>>
>>>             Tweet atweet1 = new Tweet(new String(tweet1));
>>>             Tweet atweet2 = new Tweet(new String(tweet2));
>>>
>>>             String key1 = atweet1.getAuthor();
>>>             String key2 = atweet2.getAuthor();
>>>             ////////////////////////////////////////////////////////////////
>>>             // THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
>>>             ////////////////////////////////////////////////////////////////
>>>             if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>>>                 key1 = atweet1.getMention();
>>>             if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>>>                 key2 = atweet2.getMention();
>>>
>>>             int realKeyCompare = key1.compareTo(key2);
>>>             return realKeyCompare;
>>>         }
>>>
>>>     }
>>>
>>>     Since I take the incoming bytes and interpret them as Tweets
>>>     by recreating the appropriate CSV strings and tokenizing them, I
>>>     was fairly sure that the problem is somehow the leading
>>>     bytes that Hadoop puts in front of the data being compared.
>>>     Since I never really understood what Hadoop does to the
>>>     strings when they are sent to the KeyComparator, I simply
>>>     appended all strings to a file in order to see for myself.
>>>
>>>     You can see the results here:
>>>
>>>     ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>>     http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>>     it's mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
>>>     ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>>     http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>>     it's mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
>>>     I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0,
>>>     ??????????????????, null
>>>     ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
>>>     http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but
>>>     it's mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
>>>     ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my
>>>     mind Watching food network, null
>>>     b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL
>>>     WORDUP ANT NOTHING LIKE THE HOOD, null
>>>
>>>
>>>     As you can see, there are different leading characters: sometimes
>>>     it is "??", other times it is "b" or "^", etc.
>>>
>>>     My question is now:
>>>     How many bytes do I have to cut off so that I get the original Text
>>>     as the String that I put into the key position of my mapper
>>>     output? What are the concepts behind this?
>>>
>>>     Thanks for your help in advance!
>>>
>>>     Best regards,
>>>     Elmar Macek
>>>
>>>
>>>
>>>
>>>
>>>
>>> -- 
>>> Bertrand Dechoux
>>
>>
>
>


