From: Ravi Prakash
Date: Wed, 19 Oct 2016 15:00:08 -0700
Subject: Re: Bug in ORC file code? (OrcSerde)?
To: Michael Segel
Cc: user, "user@hadoop.apache.org"

Michael!

Although there is a little overlap in the communities, I strongly suggest you email user@orc.apache.org ( https://orc.apache.org/help/ ). I don't know whether you have to be subscribed to the list to get replies to your email address.

Ravi

On Wed, Oct 19, 2016 at 11:29 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:

> Just to follow up…
>
> This appears to be a bug in the Hive version of the code… it is fixed in the ORC library… NOTE: there are two different libraries.
>
> Documentation is a bit lax… but in terms of design…
>
> It's better to do the build completely in the reducer, keeping the mapper code cleaner.
>
>
> > On Oct 19, 2016, at 11:00 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> >
> > Hi,
> > Since I am not on the ORC mailing list… and since the ORC Java code is in the Hive APIs… this seems like a good place to start. ;-)
> >
> > So…
> >
> > Ran into a little problem…
> >
> > One of my developers was writing a map/reduce job to read records from a source and, after some filtering, write the result set to an ORC file.
> > There's an example of how to do this at:
> > http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
> >
> > So far, so good.
> > But now here's the problem… Large source data means many mappers, and after the filter the output rows are only a fraction of the input in size.
> > So we want to write to a single reducer (an identity reducer) so that we get only a single file.
> >
> > Here's the snag.
> >
> > We were using the OrcSerde class to serialize the data and generate an ORC row, which we then wrote to the file.
> >
> > Looking at the source code for OrcSerde, OrcSerde.serialize() returns an OrcSerdeRow.
> > See: http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
> >
> > OrcSerdeRow implements Writable, and as we can see in the example code… for a map-only job… context.write(Text, Writable) works.
> >
> > However… if we attempt to turn this into a map/reduce job, we run into a problem at run time. The context.write() throws the following exception:
> > "Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"
> >
> > The goal was to reduce the ORC rows and then write them out in the reducer.
> >
> > I'm curious as to why the context.write() fails.
> > The error is a bit cryptic, since OrcSerdeRow implements Writable… so the error message doesn't make sense.
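As far as I can tell, that message comes from Hadoop's map-side output buffer rather than from ORC itself: when a job has reducers, the runtime class of every value handed to context.write() must be exactly the class declared with job.setMapOutputValueClass(), and an interface such as Writable.class can never satisfy an exact-class comparison. A map-only job skips that check entirely because map output goes straight to the output format's RecordWriter. Even with a matching declaration, OrcSerdeRow probably would not survive the shuffle; in the Hive builds I've looked at it is package-private and its write()/readFields() methods just throw. A rough driver sketch of the two configurations follows. The class names, the choice of Text as the intermediate value, and the use of OrcNewOutputFormat are mine for illustration, not taken from the original job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver, not the original job.
public class OrcFilterDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "orc-filter");
    job.setJarByClass(OrcFilterDriver.class);

    // Variant A (works): map-only. With zero reducers the mapper's output goes
    // straight to the output format's RecordWriter, so no intermediate-class
    // check runs and an OrcSerdeRow can be handed to context.write() directly.
    //   job.setNumReduceTasks(0);

    // Variant B (the failing setup): one reducer. Every value the mapper emits
    // must exactly match the declared map output value class *and* be
    // serializable for the shuffle. Declaring the interface Writable.class
    // guarantees the "Type mismatch in value from map" IOException, and
    // OrcSerdeRow itself is not a usable declaration either (see note above),
    // hence a concrete, shuffle-safe carrier type such as Text.
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setNumReduceTasks(1);

    // job.setMapperClass(FilterMapper.class);        // placeholder name
    // job.setReducerClass(OrcWritingReducer.class);  // see the reducer sketch below

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Writable.class);
    job.setOutputFormatClass(OrcNewOutputFormat.class); // mapreduce-API ORC output format in hive-exec

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}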
> > Now, the quick fix is to borrow ArrayListWritable from Giraph, put the list of fields into an ArrayListWritable, and pass that to the reducer, which then uses it to generate the ORC file.
> >
> > Still trying to figure out why the context.write() fails when sending to a reducer while it works for a map-side write.
> >
> > The documentation on the ORC site is… well… to be polite… lacking. ;-)
> >
> > I have some ideas about why it doesn't work; however, I would like to confirm my suspicions.
> >
> > Thx
> >
> > -Mike
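Since the workaround mentioned above is to rebuild the row on the reduce side, here is a rough sketch of what that reducer could look like: the mapper ships only the surviving fields in a shuffle-safe form (tab-separated Text here purely for illustration; Giraph's ArrayListWritable or an ArrayWritable would serve the same purpose), and OrcSerde is touched only in the reducer, so the OrcSerdeRow it returns goes straight to the ORC writer and never crosses the shuffle. Class, field, and column names are invented.

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer, not code from the original job.
public class OrcWritingReducer
    extends Reducer<NullWritable, Text, NullWritable, Writable> {

  // Plain bean describing one output row; the field names and types are invented
  // and become the ORC schema via the reflection object inspector below.
  public static class FilteredRecord {
    public String id;
    public long amount;
  }

  private final OrcSerde serde = new OrcSerde();
  private final ObjectInspector inspector =
      ObjectInspectorFactory.getReflectionObjectInspector(
          FilteredRecord.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

  @Override
  protected void reduce(NullWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text line : values) {
      // Rebuild the record from the shuffle-safe form the mapper emitted
      // (tab-separated purely as an example).
      String[] fields = line.toString().split("\t");
      FilteredRecord record = new FilteredRecord();
      record.id = fields[0];
      record.amount = Long.parseLong(fields[1]);

      // serialize() returns an OrcSerdeRow; this write goes straight to the ORC
      // RecordWriter (no shuffle, no map-output class check), which is why the
      // same call that failed in the mapper is fine here.
      context.write(NullWritable.get(), serde.serialize(record, inspector));
    }
  }
}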