From: java8964 <java8964@hotmail.com>
To: user@hive.apache.org
Subject: RE: python UDF and Avro tables
Date: Thu, 24 Jul 2014 17:53:30 -0400
Are you trying to read the Avro file directly in your UDF? If so, that is not the correct way to do it in a UDF.

Hive supports Avro files natively. I don't know your UDF's requirements, but here is what I would normally do.

Create the table in Hive using AvroContainerInputFormat:

    create external table foo
    row format serde 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    stored as
    inputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    location '/xxx.avro'
    tblproperties (
        'avro.schema.url'='hdfs://xxxx.avsc'
    );

In this case, Hive maps the table structure from the Avro schema file. Then you can register your UDF and start to use it.

Remember that in this setup, when your Python UDF is invoked, the Avro data is wrapped as a JSON string and passed to your Python UDF through STDIN. For example, if you run "select MYUDF(col1) from foo", the col1 data from Avro is passed to your Python script as a JSON string, even if col1 is a nested structure. It is then up to your Python script to handle the JSON string and return its output through STDOUT.

Yong

From: Kevin.Weiler@imc-chicago.com
To: user@hive.apache.org
Subject: python UDF and Avro tables
Date: Thu, 24 Jul 2014 15:52:03 +0000

Hi All,

I hope I'm not duplicating a previous question, but I couldn't find any search functionality for the user list archives.

I have written a relatively simple Python script that is meant to take a field from a Hive query and transform it (just some string processing through a dict), given that certain conditions are met.
After reading this guide:

http://blog.spryinc.com/2013/09/a-guide-to-user-defined-functions-in.html

it would appear that the Python script needs to read the native file format (in my case Avro) from STDIN and write to STDOUT. I implemented this using the Python fastavro deserializer and cStringIO for the STDIN/STDOUT bit. I then placed the appropriate Python modules on all the nodes (which I could probably do a bit better by simply storing them in HDFS). Unfortunately, I'm still getting errors while trying to transform my field, which are appended below. I believe the problem is that HDFS can end up splitting the files at arbitrary points, so a script can receive an Avro block with no schema header at the top. Has anyone had any luck running a Python UDF on an Avro table? Cheers!

Traceback (most recent call last):
  File "coltoskip.py", line 33, in <module>
    reader = avro.reader(avrofile)
  File "_reader.py", line 368, in fastavro._reader.iter_avro.__init__ (fastavro/_reader.c:6438)
ValueError: cannot read header - is it an avro file?
org.apache.hadoop.hive.ql.metadata.HiveException: [Error 20003]: An error occurred when trying to close the Operator running your custom script.
	at org.apache.hadoop.hive.ql.exec.ScriptOperator.close(ScriptOperator.java:514)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
	at org.apache.hadoop.hive.ql.exec.Operator.close(Operator.java:613)
	at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.close(ExecMapper.java:207)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
	at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:417)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
	at org.apache.hadoop.mapred.Child.main(Child.java:262)

--
Kevin Weiler
IT

IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.weiler@imc-chicago.com
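[List-archive note] The streaming contract Yong describes can be sketched as a minimal TRANSFORM-style script: Hive's AvroSerDe decodes the rows first, then streams them to the script as tab-separated columns on STDIN, one row per line, and reads tab-separated output back from STDOUT — so the script never parses Avro itself. This is only an illustrative sketch, not Kevin's actual coltoskip.py; the REMAP dict and the assumption that the remapped value is the first column are hypothetical.

```python
import sys

# Hypothetical remapping table; the real script does "some string
# processing through a dict" on one column.
REMAP = {"NYSE": "N", "NASDAQ": "Q"}

def transform_line(line):
    # Hive streams each row as tab-separated columns terminated by a
    # newline; nested columns arrive already serialized as JSON strings.
    cols = line.rstrip("\n").split("\t")
    cols[0] = REMAP.get(cols[0], cols[0])  # pass unknown values through
    return "\t".join(cols)

def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        stdout.write(transform_line(line) + "\n")

if __name__ == "__main__":
    main()
```

Under these assumptions it would be wired up from Hive with something like: ADD FILE remap.py; SELECT TRANSFORM(col1, col2) USING 'python remap.py' AS (col1, col2) FROM foo; — no fastavro or cStringIO needed, which also sidesteps the missing-header problem on split files.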