hadoop-mapreduce-user mailing list archives

From Michael Segel <msegel_had...@hotmail.com>
Subject Re: Bug in ORC file code? (OrcSerde)?
Date Wed, 19 Oct 2016 18:29:19 GMT
Just to follow up… 

This appears to be a bug in the Hive version of the code that has been fixed in the ORC library. NOTE: these are two different libraries.

The documentation is a bit lax, but in terms of design it's better to do the build completely in the reducer, which keeps the mapper code cleaner.
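
As a rough illustration of building in the reducer: the mapper can emit plain field values in a small value type that serializes itself, and only the reducer turns them into ORC rows. This is a minimal sketch, assuming all fields are strings. FieldListSketch is a hypothetical name, not part of Hadoop, Hive, or ORC. It mirrors the write/readFields method shapes of org.apache.hadoop.io.Writable without implementing that interface, so it runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical map output value type. In a real job this would declare
// "implements org.apache.hadoop.io.Writable"; it is left off here so the
// sketch compiles with the JDK alone.
final class FieldListSketch {
    private final List<String> fields = new ArrayList<>();

    // A no-argument constructor is needed so the framework can create an
    // empty instance and then call readFields() on it.
    FieldListSketch() {}

    FieldListSketch(List<String> values) {
        fields.addAll(values);
    }

    List<String> getFields() {
        return Collections.unmodifiableList(fields);
    }

    // Serialize as a count followed by each field.
    // Assumption: fields are non-null and short enough for writeUTF.
    public void write(DataOutput out) throws IOException {
        out.writeInt(fields.size());
        for (String field : fields) {
            out.writeUTF(field);
        }
    }

    // Rebuild the list from the same layout that write() produced.
    public void readFields(DataInput in) throws IOException {
        fields.clear();
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            fields.add(in.readUTF());
        }
    }

    // Simulates the mapper-to-reducer hop: serialize, then deserialize.
    static FieldListSketch roundTrip(FieldListSketch original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            FieldListSketch copy = new FieldListSketch();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FieldListSketch sent =
            new FieldListSketch(Arrays.asList("42", "alice", "2016-10-19"));
        System.out.println(roundTrip(sent).getFields());
    }
}
```

The reducer would then read the fields back and hand them to whichever ORC writer is in use. That step is left out here because it depends on which of the two libraries the job is built against.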

> On Oct 19, 2016, at 11:00 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> Hi, 
> Since I am not on the ORC mailing list, and since the ORC Java code is in the Hive APIs, this seems like a good place to start. ;-)
> So…
> Ran into a little problem…
> One of my developers was writing a map/reduce job to read records from a source and, after some filtering, write the result set to an ORC file.
> There’s an example of how to do this at:
> http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
> So far, so good. 
> But now here’s the problem: large source data means many mappers, and after the filter the output rows are only a fraction of the input in terms of size.
> So we want to write to a single reducer (an identity reducer) so that we get only a single file.
> Here’s the snag. 
> We were using the OrcSerde class to serialize the data and generate an ORC row, which we then wrote to the file.
> Looking at the source code for OrcSerde, OrcSerde.serialize() returns an OrcSerdeRow.
> see: http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
> OrcSerdeRow implements Writable, and as we can see in the example code, for a map-only example context.write(Text, Writable) works.
> However, if we attempt to make this into a Map/Reduce job, we run into a problem at run time. The context.write() call throws the following exception:
> "Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"
> The goal was to reduce the ORC rows and then write them out in the reducer.
> I’m curious as to why the context.write() fails.
> The error is a bit cryptic: OrcSerdeRow implements Writable, so the error message doesn’t make sense.
> Now the quick fix is to borrow the ArrayListWritable from Giraph, put the list of fields into an ArrayListWritable, and pass that to the reducer, which will then use it to generate the ORC file.
> Trying to figure out why the context.write() fails when sending to a reducer while it works if it’s a map-side write.
> The documentation on the ORC site is… well… to be polite… lacking. ;-)
> I have some ideas why it doesn’t work; however, I would like to confirm my suspicions.

> Thx
> -Mike
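
On the cryptic error quoted above: as far as I can tell from the Hadoop source, when a job has reducers the map-side collector compares the runtime class of each value against the declared map output value class for identity, not with instanceof. A value whose class merely implements Writable is therefore rejected if the job declared Writable itself as the value class. A map-only job writes straight to the output format and skips this check. The sketch below is a simplified stand-in, not the actual Hadoop code. WritableLike, RowLike, and TypeCheckSketch are hypothetical names used so that it runs without Hadoop or Hive on the classpath.

```java
import java.io.IOException;

// Stand-in for org.apache.hadoop.io.Writable (assumption: only the type
// relationship matters for this illustration, not the methods).
interface WritableLike {}

// Stand-in for OrcSerde.OrcSerdeRow: it implements the interface, but its
// runtime class is its own concrete class.
final class RowLike implements WritableLike {}

final class TypeCheckSketch {
    // Returns null if the value passes the check, otherwise the error text.
    // The comparison is class identity, so subtypes and implementations of
    // the declared class do not pass.
    static String check(Class<?> declaredValueClass, Object value) {
        if (value.getClass() != declaredValueClass) {
            return "Type mismatch in value from map: expected "
                + declaredValueClass.getName()
                + ", received " + value.getClass().getName();
        }
        return null;
    }

    // Mirrors the failure mode in the quoted stack trace: the mismatch
    // surfaces as an IOException from the write call.
    static void collect(Class<?> declaredValueClass, Object value)
            throws IOException {
        String problem = check(declaredValueClass, value);
        if (problem != null) {
            throw new IOException(problem);
        }
    }

    public static void main(String[] args) throws IOException {
        // Declared class equals the runtime class: accepted.
        collect(RowLike.class, new RowLike());

        // Declared class is the interface: rejected, even though RowLike
        // implements it.
        try {
            collect(WritableLike.class, new RowLike());
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Declaring the concrete class as the map output value class would get past this particular check, but the value still has to serialize itself for the shuffle. I believe the Hive OrcSerdeRow does not support that, which is one more reason to ship plain field values and build the ORC rows in the reducer.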
