hadoop-mapreduce-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Bug in ORC file code? (OrcSerde)?
Date Wed, 19 Oct 2016 16:00:13 GMT
Hi, 
Since I am not on the ORC mailing list… and since the ORC Java code lives in the Hive APIs…
this seems like a good place to start. ;-)


So… 

Ran into a little problem…

One of my developers was writing a map/reduce job to read records from a source and, after
some filtering, write the result set to an ORC file.
There’s an example of how to do this at:
http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
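
For reference, the pattern in that post boils down to something like the following. (A sketch,
not our actual code: OrcRow, the tab-separated parsing, and the filter are made-up stand-ins,
while OrcSerde, the reflection ObjectInspector, and OrcNewOutputFormat are the real Hive/ORC
pieces. I'm using a NullWritable key since that's what OrcNewOutputFormat declares; ORC stores
no keys anyway.)

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;

public class OrcWriterMapper
    extends Mapper<LongWritable, Text, NullWritable, Writable> {

  // Made-up row shape: the reflection ObjectInspector derives the ORC
  // schema from this class's fields.
  public static class OrcRow {
    public int id;
    public String name;
    public OrcRow(int id, String name) { this.id = id; this.name = name; }
  }

  private final OrcSerde serde = new OrcSerde();
  private final ObjectInspector inspector =
      ObjectInspectorFactory.getReflectionObjectInspector(
          OrcRow.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] cols = value.toString().split("\t");
    if (cols.length >= 2) {  // stand-in for the real filter
      // serialize() hands back an OrcSerde$OrcSerdeRow, typed only as Writable.
      context.write(NullWritable.get(),
          serde.serialize(new OrcRow(Integer.parseInt(cols[0]), cols[1]),
              inspector));
    }
  }
}

With job.setOutputFormatClass(OrcNewOutputFormat.class) in the driver, this runs fine as a
map-only job.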

So far, so good. 
But now here’s the problem… Large source data means many mappers, and with the filter
applied, the output rows are only a small fraction of the input.
So we want to funnel everything through a single (identity) reducer so that we get only a
single file.
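
Wiring-wise, the plan looks roughly like this (again a sketch, not our actual driver, reusing
the made-up class names from the mapper sketch above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrcFilterDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "orc-filter");
    job.setJarByClass(OrcFilterDriver.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(OrcWriterMapper.class);
    job.setReducerClass(Reducer.class);  // identity reducer, pure pass-through
    job.setNumReduceTasks(1);            // one reducer -> one ORC file

    job.setMapOutputKeyClass(NullWritable.class);
    // OrcSerde$OrcSerdeRow is not public, so the Writable interface is the
    // most specific map output value class we can declare here.
    job.setMapOutputValueClass(Writable.class);

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Writable.class);
    job.setOutputFormatClass(OrcNewOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

That is the configuration that blows up at run time, as described below.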

Here’s the snag. 

We were using the OrcSerde class to serialize the data and generate an ORC row, which we then
wrote to the file.

Looking at the source code for OrcSerde, OrcSerde.serialize() returns an OrcSerdeRow.
see: http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
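
Worth noting from that source (quoting from memory, so caveat lector): the row bundle is
package-private, and its Writable methods are stubs that just throw… it looks like it was only
ever meant to be handed straight to the ORC RecordWriter:

class OrcSerdeRow implements Writable {
  Object realRow;
  ObjectInspector inspector;

  @Override
  public void write(DataOutput dataOutput) throws IOException {
    throw new UnsupportedOperationException("can't write the bundle");
  }

  @Override
  public void readFields(DataInput dataInput) throws IOException {
    throw new UnsupportedOperationException("can't read the bundle");
  }
  // ... accessors omitted ...
}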

OrcSerdeRow implements Writable, and as we can see in the example code… for a map-only job…
context.write(Text, Writable) works.

However… if we attempt to make this into a map/reduce job, we run into a problem at run
time. The context.write() throws the following exception:

"Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable,
received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"


The goal was to reduce the ORC rows and then write them out in the reducer.

I’m curious as to why context.write() fails.
The error is a bit cryptic: OrcSerdeRow implements Writable… so on its face the message
doesn’t make sense.
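
The closest thing I have to a lead: if I'm reading the Hadoop source right, the message comes
from MapTask.MapOutputBuffer.collect(), and the check there is exact class equality, not
instanceof (paraphrased):

// Paraphrased from Hadoop's MapTask.MapOutputBuffer.collect(): the shuffle
// buffer requires the value's runtime class to equal the configured map
// output value class; merely implementing Writable doesn't satisfy it.
if (value.getClass() != valClass) {
  throw new IOException("Type mismatch in value from map: expected "
      + valClass.getName()
      + ", received " + value.getClass().getName());
}

If that's right, it would explain the asymmetry: a map-only job hands the value straight to
the RecordWriter and never goes through this buffer. And since OrcSerdeRow is a
package-private inner class, there's no way to name it in setMapOutputValueClass() to satisfy
the check… and even if there were, the stubbed write() method above suggests it would die at
serialization time anyway.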


Now the quick fix is to borrow ArrayListWritable from Giraph, pack the list of fields into an
ArrayListWritable, and pass that to the reducer, which then uses it to generate the ORC file.
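
In code, the workaround looks something like this (a sketch: a small ArrayWritable subclass
stands in for Giraph's ArrayListWritable, and OrcRow is the made-up row class from the mapper
sketch above):

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// A concrete, public value class with a no-arg constructor: unlike
// OrcSerde$OrcSerdeRow, this one can be named in setMapOutputValueClass().
public class TextArrayWritable extends ArrayWritable {
  public TextArrayWritable() { super(Text.class); }
  public TextArrayWritable(Text[] values) { super(Text.class, values); }
}

class OrcWriterReducer
    extends Reducer<NullWritable, TextArrayWritable, NullWritable, Writable> {

  private final OrcSerde serde = new OrcSerde();
  private final ObjectInspector inspector =
      ObjectInspectorFactory.getReflectionObjectInspector(
          OrcWriterMapper.OrcRow.class,
          ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

  @Override
  protected void reduce(NullWritable key, Iterable<TextArrayWritable> values,
      Context context) throws IOException, InterruptedException {
    for (TextArrayWritable fields : values) {
      String[] f = fields.toStrings();
      // The OrcSerdeRow is now created on the reduce side, where it goes
      // straight to the ORC RecordWriter and never crosses the shuffle.
      context.write(NullWritable.get(), serde.serialize(
          new OrcWriterMapper.OrcRow(Integer.parseInt(f[0]), f[1]),
          inspector));
    }
  }
}

The mapper then emits TextArrayWritable values, and the driver declares
setMapOutputValueClass(TextArrayWritable.class), which the shuffle's exact-class check accepts.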

So I'm still trying to figure out why context.write() fails when sending to a reducer, while
it works as a map-side write.

The documentation on the ORC site is… well… to be polite… lacking. ;-)

I have some ideas why it doesn’t work (sketched above); however, I would like to confirm my suspicions.

Thx

-Mike

