hive-user mailing list archives

From Ravi Prakash <ravihad...@gmail.com>
Subject Re: Bug in ORC file code? (OrcSerde)?
Date Wed, 19 Oct 2016 22:00:08 GMT
Michael!

Although there is a little overlap in the communities, I strongly suggest
you email user@orc.apache.org ( https://orc.apache.org/help/ ). I don't know
whether you have to be subscribed to a mailing list to get replies to your
email address.

Ravi



On Wed, Oct 19, 2016 at 11:29 AM, Michael Segel <msegel_hadoop@hotmail.com>
wrote:

> Just to follow up…
>
> This appears to be a bug in the Hive version of the code… fixed in the ORC
> library…  NOTE: There are two different libraries.
>
> Documentation is a bit lax… but in terms of design…
>
> It's better to do the build completely in the reducer, making the mapper
> code cleaner.
>
>
> > On Oct 19, 2016, at 11:00 AM, Michael Segel <msegel_hadoop@hotmail.com>
> wrote:
> >
> > Hi,
> > Since I am not on the ORC mailing list… and since the ORC Java code is
> > in the Hive APIs… this seems like a good place to start. ;-)
> >
> >
> > So…
> >
> > Ran into a little problem…
> >
> > One of my developers was writing a map/reduce job to read records from a
> > source and, after some filtering, write the result set to an ORC file.
> > There’s an example of how to do this at:
> > http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
> >
> > So far, so good.
> > But now here’s the problem… Large source data means many mappers, and
> > with the filter, the output is only a fraction of the input in size.
> > So we want to write to a single reducer (an identity reducer) so that
> > we get only a single file.
> >
> > Here’s the snag.
> >
> > We were using the OrcSerde class to serialize the data and generate an
> > ORC row which we then wrote to the file.
> >
> > Looking at the source code for OrcSerde, OrcSerde.serialize() returns an
> > OrcSerdeRow.
> > see: http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
> >
> > OrcSerdeRow implements Writable and as we can see in the example code…
> > for a map-only example… context.write(Text, Writable) works.
> >
> > However… if we attempt to make this into a Map/Reduce job, we run into
> > a problem at run time. The context.write() throws the following
> > exception:
> > "Error: java.io.IOException: Type mismatch in value from map: expected
> > org.apache.hadoop.io.Writable, received
> > org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"
> >
> >
> > The goal was to reduce the orc rows and then write out in the reducer.
> >
> > I’m curious as to why the context.write() fails?
> > The error is a bit cryptic since the OrcSerdeRow implements Writable… so
> > the error message doesn’t make sense.
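
A possible explanation for the error above, offered as an assumption rather than something confirmed in this thread: when a job has reducers, Hadoop's map-side collector compares the value's runtime class with the declared map output value class using exact class equality, not an instanceof test. A value whose class merely implements Writable is then rejected if Writable itself is the declared class. In a map-only job the record goes straight to the output format, so that check never runs. The sketch below reproduces the behaviour in plain Java; Writable, RowStandIn, and collect are hypothetical stand-ins, not the Hadoop or Hive classes.

```java
import java.io.IOException;

public class TypeMismatchSketch {
    // Hypothetical stand-in for org.apache.hadoop.io.Writable.
    interface Writable {}

    // Hypothetical stand-in for the row object returned by OrcSerde.serialize().
    static final class RowStandIn implements Writable {}

    // Mimics an exact-class comparison against the declared map output value class.
    static void collect(Class<?> declaredValueClass, Object value) throws IOException {
        if (value.getClass() != declaredValueClass) {
            throw new IOException("Type mismatch in value from map: expected "
                    + declaredValueClass.getName() + ", received "
                    + value.getClass().getName());
        }
    }

    public static void main(String[] args) throws IOException {
        Object row = new RowStandIn();

        // The subtype relationship holds, which is why the message looks wrong.
        System.out.println("instanceof Writable: " + (row instanceof Writable));

        // Declaring the concrete class passes the exact-class comparison.
        collect(RowStandIn.class, row);

        // Declaring the interface does not, even though the value implements it.
        try {
            collect(Writable.class, row);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

If this is indeed the cause, declaring the concrete value class would get past the comparison, but the value would still have to serialize itself for the shuffle, which is a separate question for the ORC row wrapper.
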
> >
> >
> > Now the quick fix is to borrow the ArrayListWritable from Giraph,
> > put the list of fields into an ArrayListWritable, and pass that to the
> > reducer, which will then use it to generate the ORC file.
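
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not Giraph's ArrayListWritable: FieldList is a hypothetical class holding string fields, and it mirrors the write(DataOutput) and readFields(DataInput) methods of Hadoop's Writable interface without depending on Hadoop. The round trip through a byte buffer stands in for the mapper-to-reducer shuffle.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FieldListSketch {
    // Hypothetical value type; in a real job it would implement
    // org.apache.hadoop.io.Writable, whose two methods have these signatures.
    static final class FieldList {
        final List<String> fields = new ArrayList<>();

        void write(DataOutput out) throws IOException {
            out.writeInt(fields.size());      // length prefix
            for (String f : fields) {
                out.writeUTF(f);              // writeUTF caps each field at 65535 encoded bytes
            }
        }

        void readFields(DataInput in) throws IOException {
            fields.clear();                   // instances may be reused between records
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                fields.add(in.readUTF());
            }
        }
    }

    // Serializes on the "mapper" side and deserializes on the "reducer" side.
    static List<String> roundTrip(List<String> input) throws IOException {
        FieldList sent = new FieldList();
        sent.fields.addAll(input);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        sent.write(new DataOutputStream(buf));

        FieldList received = new FieldList();
        received.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return received.fields;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip(Arrays.asList("42", "alice", "2016-10-19")));
    }
}
```

With a concrete value class like this declared as the map output value class, the reducer receives plain fields and can build the ORC rows itself, which matches the follow-up's point about doing the build entirely in the reducer.
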
> >
> > Trying to figure out why the context.write() fails… when sending to a
> > reducer, while it works if it's a map-side write.
> >
> > The documentation on the ORC site is … well… to be polite… lacking. ;-)
> >
> > I have some ideas why it doesn’t work, however I would like to confirm
> > my suspicions.
> >
> > Thx
> >
> > -Mike
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: user-help@hadoop.apache.org
>
