From: Todd Lipcon <todd@cloudera.com>
Date: Tue, 14 Feb 2017 10:44:49 -0800
Subject: Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'
To: user@kudu.apache.org

Hi Frank,

Could you try something like:

data = [(42, 2017, 'John')]
schema = StructType([
    StructField("id", ByteType(), True),
    StructField("year", ByteType(), True),
    StructField("name", StringType(), True)])
df = sqlContext.createDataFrame(data, schema)

That should explicitly set the types (based on my reading of the pyspark
docs for createDataFrame).

-Todd

On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh.ordix@gmail.com> wrote:

> Hello,
>
> here is a snippet which produces the error.
>
> Call from the shell:
> spark-submit --jars /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar test.py
>
> Snippet from the python code test.py:
>
> (..)
> builder = kudu.schema_builder()
> builder.add_column('id', kudu.int64, nullable=False)
> builder.add_column('year', kudu.int32)
> builder.add_column('name', kudu.string)
> (..)
>
> (..)
> data = [(42, 2017, 'John')]
> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
> df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
>     .option('kudu.table', kudu_table)\
>     .mode('append')\
>     .save()
> (..)
>
> Error:
> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096 bytes)
> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
>   at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
>   at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
>   at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
>   at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
>   at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>   at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
>   at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
>   at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
>   at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
>   at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>   at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
> Same result with kudu.int8 and kudu.int16. Only kudu.int64 works for me.
> The problem persists whether the attribute is part of the key or not.
>
> My greetings,
> Frank
>
>
> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <todd@cloudera.com>:
>
>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh.ordix@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> For quite a while I have worked successfully with
>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>
>>> For a bit I ignored a problem with the kudu datatype int8.
>>> With the connector I can't write int8, as an int in python will always
>>> bring up errors like
>>>
>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8,
>>> Type: unixtime_micros, size: 8], it's int8"
>>>
>>> As python isn't hard typed, the connector is trying to find a suitable
>>> type in java/kudu for the python int. Apparently the python int is matched
>>> to int64/unixtime_micros and not to int8, which kudu is expecting at this
>>> place.
>>>
>>> As a quick solution all my ints in kudu are int64 at the moment.
>>>
>>> In the long run I can't accept this waste of disk space or, even worse,
>>> I/O. Any idea when I will be able to store int8 from python/spark to kudu?
>>>
>>> With the "normal" python api everything works fine; only the
>>> spark/kudu/python connector brings up the problem.
>>>
>>
>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>> bit of sample code that reproduces the issue?
>>
>> -Todd
>>
>>
>>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh.ordix@gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> Within the impala-shell I can create an external table and thereafter
>>>> select and insert data from an underlying kudu table. Within the statement
>>>> for creation of the table a 'StorageHandler' is set to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, as
>>>> apparently there exists a *.jar containing the referenced class.
>>>>
>>>> When trying to select from a hive-shell there is an error that the
>>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>>> a sparkSession I also get a JavaClassNotFoundException, as the
>>>> KuduStorageHandler is not available.
>>>>
>>>> I then tried to find the jar on my system with the intention of copying
>>>> it to all my data nodes. Sadly I couldn't find the specific jar. I think
>>>> it exists on the system, as impala apparently is using it. For a test I
>>>> changed the 'StorageHandler' in the creation statement to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>>> worked, and so did the select from impala, but it didn't return any data.
>>>> There was no error, which I had expected. The test was just for the case
>>>> that impala would in some magic way select data from kudu without a
>>>> correct 'StorageHandler'. Apparently this is not the case, and impala has
>>>> access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>
>>>> Long story, short question:
>>>> In which *.jar can I find the 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>> Is copying the jar by hand to all nodes an appropriate way to put spark
>>>> in a position to work with kudu?
>>>> What about the beeline-shell from hive and the possibility to read from
>>>> kudu?
>>>>
>>>> My environment: Cloudera 5.7 with kudu and impala-kudu installed from
>>>> parcels. I built a working python-kudu library successfully from scratch
>>>> (git).
>>>>
>>>> Thanks a lot!
>>>> Frank
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

--
Todd Lipcon
Software Engineer, Cloudera
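
For reference, below is a minimal end-to-end sketch of the approach Todd
suggests, with the DataFrame schema types chosen to match the Kudu columns
from Frank's schema_builder snippet (LongType for the int64 id, IntegerType
for the int32 year), so that pyspark does not infer LongType for every
Python int. The SparkContext/SQLContext setup and the kudu_master/kudu_table
values are illustrative assumptions and not taken from the thread; it also
assumes the kudu-spark connector maps Spark SQL integer types to the Kudu
integer types of the same width.

    # Sketch only -- assumes a Spark 1.6-era pyspark job submitted with
    # kudu-spark_2.10-1.2.0.jar on the classpath, and an existing Kudu table
    # with the schema from the snippet above (id int64, year int32, name string).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import (StructType, StructField,
                                   LongType, IntegerType, StringType)

    sc = SparkContext(appName="kudu-write-test")
    sqlContext = SQLContext(sc)

    # Explicit schema: without it, createDataFrame infers LongType for every
    # Python int, and the connector then calls addLong() on the int32 column,
    # which raises the IllegalArgumentException seen above.
    schema = StructType([
        StructField("id", LongType(), False),      # Kudu int64
        StructField("year", IntegerType(), True),  # Kudu int32
        StructField("name", StringType(), True)])

    data = [(42, 2017, 'John')]
    df = sqlContext.createDataFrame(data, schema)

    # Placeholder connection settings -- replace with real values.
    kudu_master = "kudu-master.example.com:7051"
    kudu_table = "my_kudu_table"

    df.write.format('org.apache.kudu.spark.kudu') \
        .option('kudu.master', kudu_master) \
        .option('kudu.table', kudu_table) \
        .mode('append') \
        .save()

If the Kudu columns were declared as int8 or int16 instead, ByteType() or
ShortType() in the corresponding StructField should presumably map the same
way, which is what Frank's original int8 question is after.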