hadoop-common-user mailing list archives

From "Jason Grey" <jason.grey.w...@gmail.com>
Subject Using JavaSerialzation and SequenceFileInput
Date Tue, 16 Sep 2008 16:46:32 GMT
I'm trying to use JavaSerialization for a series of MapReduce jobs, and when
it comes to reading a SequenceFile using SequenceFileInputFormat with
JavaSerialized objects, something breaks down.

I've added "org.apache.hadoop.io.serializer.JavaSerialization" to the
io.serializations property in my config, and I'm using native Java types in
my mapper and reducer implementations, like so:

MyMapper implements Mapper<String,MyObject,String,MyObject>
MyReducer implements Reducer<String,MyObject,String,MyObject>

In my job configuration, I'm doing this:

conf.setInputFormat(SequenceFileInputFormat.class);
FileInputFormat.setInputPaths(conf, path1, path2);
conf.setOutputFormat(SequenceFileOutputFormat.class);
FileOutputFormat.setOutputPath(conf, path3);
conf.setOutputKeyClass(String.class);
conf.setOutputKeyComparatorClass(JavaSerializationComparator.class);
conf.setOutputValueClass(MyObject.class);
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);

When I run the job, and output the keys & values from the mapper to
System.out, it doesn't seem like the key & value are getting populated
correctly - the key is NULL, and the value is a new, empty instance of
MyObject.

The files this job reads were written by an earlier job that used a custom
InputFormat, so that job did not have the same problem. I have verified with
a SequenceFile.Reader that the data is actually there and non-null.
One strange thing I had to do to get the reader to work is shown below (see
the text between asterisks, which was bold in the original message): I had
to assign the return values of next() and getCurrentValue() for the values
to show up. I think this may have something to do with why
SequenceFileInputFormat is having trouble as well.

String key = new String();
while (*(key = (String) *r.next(key)) != null) {
     HeadlineDocument value = new HeadlineDocument();
     *value = (HeadlineDocument) *r.getCurrentValue(value);
     System.out.println("Key: " + key.toString());
     System.out.println("Value: " + value.toString());
}
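
To illustrate why the assignment might matter, here is a minimal sketch that
uses only java.io and no Hadoop classes. It assumes the deserializer behaves
like plain Java serialization, which cannot fill in an existing object and
instead returns a new one. The names ReuseVsReturn, Doc, serialize and
deserialize are hypothetical stand-ins, not Hadoop APIs.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class ReuseVsReturn {
    // Hypothetical stand-in for MyObject / HeadlineDocument.
    static class Doc implements Serializable {
        private static final long serialVersionUID = 1L;
        String title; // stays null in a freshly constructed instance

        Doc() {}

        Doc(String title) {
            this.title = title;
        }
    }

    // Write one object to a byte array with standard Java serialization.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(o);
        }
        return buf.toByteArray();
    }

    // Assumed contract: the 'reuse' argument is accepted but ignored,
    // because readObject() always builds a new instance.
    static Object deserialize(Object reuse, byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = serialize(new Doc("headline"));

        // Caller that expects in-place population and drops the return value.
        Doc ignored = new Doc();
        deserialize(ignored, bytes);
        System.out.println("ignored.title  = " + ignored.title);

        // Caller that keeps the returned instance.
        Doc assigned = new Doc();
        assigned = (Doc) deserialize(assigned, bytes);
        System.out.println("assigned.title = " + assigned.title);
    }
}
```

In this sketch the first caller is left with its empty Doc, while the second
sees the stored title. If the record reader behind SequenceFileInputFormat
drops the return value in the same way, that would match the empty key and
value seen in the mapper, but that is an assumption to check against the
Hadoop source.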

Anyone got any hints as to how one uses JavaSerialization properly in the
INPUT phase of a MapReduce job?

Thanks for any help

-jg-
