hadoop-common-user mailing list archives

From Per Stolpe <persto...@gmail.com>
Subject Letting the Mapper handle multiple lines.
Date Thu, 04 Jun 2009 16:18:11 GMT
I'm quite new to Hadoop programming, so to get a good start I began
writing my own program that summarizes a column in a large tab-separated
file (~100 000 000 lines). My first, naive implementation was simple: a
small rework of the WordCount example that comes with Hadoop. This program
did calculate the correct answer, but it performed quite badly, since every
line in the file triggers a separate call to map(). To solve this, I wrote my
own RecordReader, one that returns a List<Text> instead of a single Text. It
type-checks in Eclipse and all seems fine until I actually run the
program. When I do, I get the following error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
        at Summarizer$TokenizerMapper.map(Summarizer.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

(repeated several times)
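
A trace of this shape (a ClassCastException reported at line 1 of the mapper class, directly below Mapper.run) can arise when a generic map() method receives a value of a different runtime type than its signature declares. The cast target in the trace above is cut off, so this is only a guess at the mechanism, not a diagnosis. The sketch below reproduces the effect in plain Java without any Hadoop classes; BaseMapper, ListMapper and ErasureCastDemo are hypothetical names invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class ErasureCastDemo {

    // Hypothetical stand-in for a generic framework base class.
    static class BaseMapper<V> {
        void map(V value) { }

        // The framework only sees the erased type, so it hands over
        // whatever object the reader produced.
        @SuppressWarnings("unchecked")
        void run(Object valueFromReader) {
            map((V) valueFromReader); // unchecked: nothing is verified here
        }
    }

    // Hypothetical mapper that expects a batch of lines per call.
    static class ListMapper extends BaseMapper<List<String>> {
        int seen = 0;

        @Override
        void map(List<String> lines) {
            seen += lines.size();
        }
    }

    // Returns true if feeding 'value' to ListMapper fails with a
    // ClassCastException. The exception comes from the compiler-generated
    // bridge method map(Object), which casts its argument to List.
    static boolean failsWithClassCast(Object value) {
        try {
            new ListMapper().run(value);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A single line where a list is expected: the cast fails at run time.
        System.out.println(failsWithClassCast("a single line"));        // prints true
        // An actual list passes through without error.
        System.out.println(failsWithClassCast(Arrays.asList("a", "b"))); // prints false
    }
}
```

If this is what is happening, the code compiles cleanly because the mismatch is invisible to the type checker. It would then be worth checking which reader actually supplies the values at run time.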

What might be the problem?
Also, is there perhaps an existing InputFormat (one not marked as deprecated)
that already solves my problem?

Source code:
Summarizer: http://pastebin.com/m52876939
RecordReader: http://pastebin.com/m2c541a00
InputFormat: http://pastebin.com/m7714b0c
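
The linked sources are not reproduced here. As a rough illustration of the batching idea described above (handing several lines at a time to the processing step instead of one), here is a minimal sketch in plain Java. It does not use the Hadoop API and is not the code behind the links. LineBatcher and its methods are hypothetical names, and it assumes that "summarizes a column" means adding up an integer column.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups input lines into fixed-size batches.
final class LineBatcher {
    private final BufferedReader in;
    private final int batchSize;

    LineBatcher(BufferedReader in, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.in = in;
        this.batchSize = batchSize;
    }

    // Returns up to batchSize lines, or an empty list at end of input.
    List<String> nextBatch() throws IOException {
        List<String> batch = new ArrayList<String>(batchSize);
        String line;
        while (batch.size() < batchSize && (line = in.readLine()) != null) {
            batch.add(line);
        }
        return batch;
    }

    // Adds up the given zero-based column over one batch of tab-separated lines.
    static long sumColumn(List<String> batch, int column) {
        long sum = 0;
        for (String line : batch) {
            String[] fields = line.split("\t");
            sum += Long.parseLong(fields[column].trim());
        }
        return sum;
    }

    // Drives the batcher over a whole text and totals the per-batch sums.
    static long sumAll(String text, int column, int batchSize) {
        LineBatcher batcher =
                new LineBatcher(new BufferedReader(new StringReader(text)), batchSize);
        long total = 0;
        try {
            List<String> batch = batcher.nextBatch();
            while (!batch.isEmpty()) {
                total += sumColumn(batch, column);
                batch = batcher.nextBatch();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three lines, batches of two, summing the second column.
        System.out.println(sumAll("a\t1\nb\t2\nc\t3", 1, 2)); // prints 6
    }
}
```

In a real job the batch would have to be carried in a value type that the framework can handle, which this sketch does not address.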

Hadoop version: 0.20.0
Java JDK version: 1.6 u14

Per and Felix
