From: Ravi Prakash
Date: Wed, 19 Oct 2016 15:00:08 -0700
Subject: Re: Bug in ORC file code? (OrcSerde)?
To: Michael Segel
Cc: user, "user@hadoop.apache.org"

Michael!

Although there is a little overlap in the communities, I strongly suggest you email user@orc.apache.org ( https://orc.apache.org/help/ ). I don't know whether you have to be subscribed to the list to get replies to your email address.

Ravi

On Wed, Oct 19, 2016 at 11:29 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:

> Just to follow up…
>
> This appears to be a bug in the Hive version of the code… it is fixed in the ORC library… NOTE: there are two different libraries.
>
> Documentation is a bit lax… but in terms of design…
>
> It's better to do the build completely in the reducer, keeping the mapper code cleaner.
>
>
> > On Oct 19, 2016, at 11:00 AM, Michael Segel <msegel_hadoop@hotmail.com> wrote:
> >
> > Hi,
> > Since I am not on the ORC mailing list… and since the ORC Java code is in the Hive APIs… this seems like a good place to start. ;-)
> >
> > So…
> >
> > Ran into a little problem…
> >
> > One of my developers was writing a map/reduce job to read records from a source and, after some filtering, write the result set to an ORC file.
> > There's an example of how to do this at:
> > http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
> >
> > So far, so good.
> > But now here's the problem… Large source data means many mappers, and after the filter the output rows are only a fraction of the input in size.
> > So we want to write to a single reducer (an identity reducer) so that we get only a single file.
> >
> > Here's the snag.
> >
> > We were using the OrcSerde class to serialize the data and generate an ORC row, which we then wrote to the file.
> >
> > Looking at the source code for OrcSerde, OrcSerde.serialize() returns an OrcSerdeRow.
> > See: http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
> >
> > OrcSerdeRow implements Writable, and as we can see in the example code… for a map-only job… context.write(Text, Writable) works.
> >
> > However… if we attempt to turn this into a map/reduce job, we run into a problem at run time. The context.write() throws the following exception:
> > "Error: java.io.IOException: Type mismatch in value from map: expected org.apache.hadoop.io.Writable, received org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"
> >
> > The goal was to reduce the ORC rows and then write them out in the reducer.
> >
> > I'm curious as to why the context.write() fails.
> > The error is a bit cryptic, since OrcSerdeRow implements Writable… so the error message doesn't make sense.
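As far as I can tell, that message comes from Hadoop's map-side output buffer rather than from ORC itself: when a job has reducers, the runtime class of every value handed to context.write() must be exactly the class declared with job.setMapOutputValueClass(), and an interface such as Writable.class can never satisfy an exact-class comparison. A map-only job skips that check entirely because map output goes straight to the output format's RecordWriter. Even with a matching declaration, OrcSerdeRow probably would not survive the shuffle; in the Hive builds I've looked at it is package-private and its write()/readFields() methods just throw. A rough driver sketch of the two configurations follows. The class names, the choice of Text as the intermediate value, and the use of OrcNewOutputFormat are mine for illustration, not taken from the original job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcNewOutputFormat;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver, not the original job.
public class OrcFilterDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "orc-filter");
    job.setJarByClass(OrcFilterDriver.class);

    // Variant A (works): map-only. With zero reducers the mapper's output goes
    // straight to the output format's RecordWriter, so no intermediate-class
    // check runs and an OrcSerdeRow can be handed to context.write() directly.
    //   job.setNumReduceTasks(0);

    // Variant B (the failing setup): one reducer. Every value the mapper emits
    // must exactly match the declared map output value class *and* be
    // serializable for the shuffle. Declaring the interface Writable.class
    // guarantees the "Type mismatch in value from map" IOException, and
    // OrcSerdeRow itself is not a usable declaration either (see note above),
    // hence a concrete, shuffle-safe carrier type such as Text.
    job.setMapOutputKeyClass(NullWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setNumReduceTasks(1);

    // job.setMapperClass(FilterMapper.class);        // placeholder name
    // job.setReducerClass(OrcWritingReducer.class);  // see the reducer sketch below

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Writable.class);
    job.setOutputFormatClass(OrcNewOutputFormat.class); // mapreduce-API ORC output format in hive-exec

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}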
> > Now, the quick fix is to borrow ArrayListWritable from Giraph, put the list of fields into an ArrayListWritable, and pass that to the reducer, which then uses it to generate the ORC file.
> >
> > Still trying to figure out why the context.write() fails when sending to a reducer while it works for a map-side write.
> >
> > The documentation on the ORC site is… well… to be polite… lacking. ;-)
> >
> > I have some ideas about why it doesn't work; however, I would like to confirm my suspicions.
> >
> > Thx
> >
> > -Mike
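Since the workaround mentioned above is to rebuild the row on the reduce side, here is a rough sketch of what that reducer could look like: the mapper ships only the surviving fields in a shuffle-safe form (tab-separated Text here purely for illustration; Giraph's ArrayListWritable or an ArrayWritable would serve the same purpose), and OrcSerde is touched only in the reducer, so the OrcSerdeRow it returns goes straight to the ORC writer and never crosses the shuffle. Class, field, and column names are invented.

import java.io.IOException;

import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer, not code from the original job.
public class OrcWritingReducer
    extends Reducer<NullWritable, Text, NullWritable, Writable> {

  // Plain bean describing one output row; the field names and types are invented
  // and become the ORC schema via the reflection object inspector below.
  public static class FilteredRecord {
    public String id;
    public long amount;
  }

  private final OrcSerde serde = new OrcSerde();
  private final ObjectInspector inspector =
      ObjectInspectorFactory.getReflectionObjectInspector(
          FilteredRecord.class, ObjectInspectorFactory.ObjectInspectorOptions.JAVA);

  @Override
  protected void reduce(NullWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text line : values) {
      // Rebuild the record from the shuffle-safe form the mapper emitted
      // (tab-separated purely as an example).
      String[] fields = line.toString().split("\t");
      FilteredRecord record = new FilteredRecord();
      record.id = fields[0];
      record.amount = Long.parseLong(fields[1]);

      // serialize() returns an OrcSerdeRow; this write goes straight to the ORC
      // RecordWriter (no shuffle, no map-output class check), which is why the
      // same call that failed in the mapper is fine here.
      context.write(NullWritable.get(), serde.serialize(record, inspector));
    }
  }
}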