hadoop-user mailing list archives

From Bertrand Dechoux <decho...@gmail.com>
Subject Re: OutputValueGroupingComparator gets strange inputs (topic changed from "Logs cannot be created")
Date Thu, 09 Aug 2012 14:57:50 GMT
I am just curious, but are you using Writable? If so, there is a
WritableComparator...
If you are going to interpret every byte (you create a String, so you do),
there is no clear reason for choosing such a low-level API.
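
For reference, here is a minimal sketch of what the leading bytes in the
dumped records most likely are. It assumes the keys are Text, which is
serialized as a variable-length integer holding the byte length, followed
by the UTF-8 bytes. Under that assumption a short record starts with a
single length byte (shown as "^", "b", ...) and a longer one with a marker
byte plus a length byte (shown as "??"). The class TextKeyDecoder and its
methods are hypothetical names; vintSize re-implements the rule of Hadoop's
WritableUtils.decodeVIntSize in plain Java so the example runs without
Hadoop on the classpath.

```java
import java.nio.charset.StandardCharsets;

// Standalone sketch: no Hadoop classes are used; the length-prefix rule is re-implemented.
final class TextKeyDecoder {

    // Number of bytes taken by the variable-length integer whose first byte is `first`.
    // Assumption: same rule as Hadoop's WritableUtils.decodeVIntSize.
    static int vintSize(byte first) {
        if (first >= -112) {
            return 1;             // the value itself is stored in this single byte
        } else if (first < -120) {
            return -119 - first;  // negative value: marker byte plus data bytes
        }
        return -111 - first;      // positive value: marker byte plus data bytes
    }

    // Decodes the string held in a serialized Text record at buf[start, start + length).
    static String decodeKey(byte[] buf, int start, int length) {
        int prefix = vintSize(buf[start]);
        return new String(buf, start + prefix, length - prefix, StandardCharsets.UTF_8);
    }

    // Hypothetical demo helper: builds a record in the assumed Text layout.
    // Only handles strings of at most 255 UTF-8 bytes.
    static byte[] serialize(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length > 255) {
            throw new IllegalArgumentException("demo handles at most 255 bytes");
        }
        int prefix = utf8.length <= 127 ? 1 : 2;
        byte[] out = new byte[prefix + utf8.length];
        if (prefix == 1) {
            out[0] = (byte) utf8.length;  // short record: one length byte
        } else {
            out[0] = (byte) -113;         // marker: exactly one length byte follows
            out[1] = (byte) utf8.length;
        }
        System.arraycopy(utf8, 0, out, prefix, utf8.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] record = serialize("2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored, null");
        System.out.println(decodeKey(record, 0, record.length));
    }
}
```

Inside a RawComparator the same idea would apply to (text1, start1, length1)
and (text2, start2, length2). Extending WritableComparator for Text with
instance creation enabled should avoid the manual offset handling
altogether, since compare then receives deserialized objects.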

Regards

Bertrand

On Thu, Aug 9, 2012 at 4:47 PM, Björn-Elmar Macek <macek@cs.uni-kassel.de> wrote:

> Hi again,
>
> this is a direct response to my previous posting with the title "Logs
> cannot be created", where logs could not be created (Spill failed). I got
> the hint that I should check privileges, but that was not the problem,
> because I own the folders that were used for this.
>
> I finally found an important hint in a log saying:
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stdout
> 12/08/09 15:30:49 WARN mapred.JobClient: Error reading task outputhttp://its-cs229.its.uni-kassel.de:50060/tasklog?plaintext=true&attemptid=attempt_201208091516_0001_m_000048_0&filter=stderr
> 12/08/09 15:34:34 INFO mapred.JobClient: Task Id : attempt_201208091516_0001_m_000055_0, Status : FAILED
> java.io.IOException: Spill failed
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1029)
>         at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:592)
>         at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:26)
>         at uni.kassel.macek.rtprep.RetweetMapper.map(RetweetMapper.java:12)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: java.lang.NumberFormatException: For input string: ""
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
>         at java.lang.Integer.parseInt(Integer.java:468)
>         at java.lang.Integer.parseInt(Integer.java:497)
>         at uni.kassel.macek.rtprep.Tweet.getRT(Tweet.java:126)
>         at uni.kassel.macek.rtprep.TwitterValueGroupingComparator.compare(TwitterValueGroupingComparator.java:47)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1111)
>         at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:95)
>         at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:59)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1399)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:853)
>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1344)
>
>
>
> corresponding to the following lines of code within the class
> TwitterValueGroupingComparator:
>
> public class TwitterValueGroupingComparator implements RawComparator<Text>
> {
> ...
>     public int compare(byte[] text1, int start1, int length1, byte[] text2,
>         int start2, int length2) {
>
>     byte[] tweet1 = new byte[length1];// length1-1 (???)
>     byte[] tweet2 = new byte[length2];// length2-1 (???)
>
>     System.arraycopy(text1, start1, tweet1, 0, length1);// start1+1 (???)
>     System.arraycopy(text2, start2, tweet2, 0, length2);// start2+1 (???)
>
>     Tweet atweet1 = new Tweet(new String(tweet1));
>     Tweet atweet2 = new Tweet(new String(tweet2));
>
>
>     String key1 = atweet1.getAuthor();
>     String key2 = atweet2.getAuthor();
> ////////////////////////////////////////////////////////////////
> //THE FOLLOWING LINE IS THE ONE MENTIONED IN THE LOG (47)
> ////////////////////////////////////////////////////////////////
>     if (atweet1.getRT() > 0 && !atweet1.getMention().equals(""))
>         key1 = atweet1.getMention();
>     if (atweet2.getRT() > 0 && !atweet2.getMention().equals(""))
>         key2 = atweet2.getMention();
>
>     int realKeyCompare = key1.compareTo(key2);
>     return realKeyCompare;
>     }
>
> }
>
> As I take the incoming bytes and interpret them as Tweets by
> recreating the appropriate CSV strings and tokenizing them, I was fairly
> sure that the problem is somehow the leading bytes that Hadoop puts in
> front of the data being compared. Since I never really understood what
> Hadoop does to the strings when they are sent to the KeyComparator, I
> simply appended all strings to a file in order to see for myself.
>
> You can see the results here:
>
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
> I2009-06-12 04:33:19, ntmp, tsukunep, , , , 1, 0, ??????????????????, null
> ??2009-06-12 07:32:47, davedilbeck, tampabaycom, , ,
> http://tinyurl.com/nookw7, 1, 1, Lots of vivid imagery here but it's
> mostly bullshit   Alex Sink June Cleaver or Joan Crawford, null
> ^2009-06-12 04:33:20, aclouatre, , , , , 0, 0, Bored out of my mind
> Watching food network, null
> b2009-06-12 04:33:20, djnewera, adoremii369, , , , 1, 0, LOL WORDUP ANT
> NOTHING LIKE THE HOOD, null
>
>
> As you can see, there are different leading characters: sometimes it's "??",
> other times it's "b" or "^", etc.
>
> My question is now:
> How many bytes do I have to cut off so that I get the original Text as a
> String, as I put it into the key position of my mapper output? What are the
> concepts behind this?
>
> Thanks for your help in advance!
>
> Best regards,
> Elmar Macek
>
>
>
>


-- 
Bertrand Dechoux
