kudu-user mailing list archives

From Todd Lipcon <t...@cloudera.com>
Subject Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'
Date Tue, 14 Feb 2017 18:44:49 GMT
Hi Frank,

Could you try something like:

from pyspark.sql.types import StructType, StructField, LongType, IntegerType, StringType

data = [(42, 2017, 'John')]
# Declare the types explicitly so they match the Kudu column types
# from your schema builder (id is int64, year is int32):
schema = StructType([
    StructField("id", LongType(), True),
    StructField("year", IntegerType(), True),
    StructField("name", StringType(), True)])
df = sqlContext.createDataFrame(data, schema)

That should set the types explicitly, based on my reading of the pyspark
docs for createDataFrame.
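One Spark-free way to see why the declared width matters: Python ints are arbitrary precision, so some fixed-width target has to be chosen at write time, and a value can simply be out of range for the column. A small sketch using only the stdlib struct module (the format codes 'b'/'h'/'i'/'q' are the usual 8/16/32/64-bit signed widths, matching Kudu's int8/int16/int32/int64; nothing here is Kudu-specific):

```python
import struct

# Format codes for fixed-width signed integers:
# 'b' = 8-bit, 'h' = 16-bit, 'i' = 32-bit, 'q' = 64-bit.
def fits(value, fmt):
    """Return True if `value` fits in the fixed-width type `fmt`."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False

assert fits(42, 'b')        # 42 fits in an int8 column
assert not fits(2017, 'b')  # 2017 overflows int8
assert fits(2017, 'i')      # but fits in int32
```

So even with an explicit schema, a value like 2017 can never land in an
int8 column; the declared Spark type has to agree with both the Kudu
column type and the range of the data.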

-Todd


On Tue, Feb 14, 2017 at 1:11 AM, Frank Heimerzheim <fh.ordix@gmail.com>
wrote:

> Hello,
>
> here a snippet which produces the error.
>
> Call from the shell:
> spark-submit --jars /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar test.py
>
>
> Snippet from the python-code test.py:
>
> (..)
> builder = kudu.schema_builder()
> builder.add_column('id', kudu.int64, nullable=False)
> builder.add_column('year', kudu.int32)
> builder.add_column('name', kudu.string)
> (..)
>
> (..)
> data = [(42, 2017, 'John')]
> df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
> df.write.format('org.apache.kudu.spark.kudu').option('kudu.master', kudu_master)\
>                                              .option('kudu.table', kudu_table)\
>                                              .mode('append')\
>                                              .save()
> (..)
>
> Error:
> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 4.0 (TID 6, ls00152y.xxx.com, partition 1, PROCESS_LOCAL, 2096 bytes)
> 17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
> 17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 4.0 (TID 6, ls00152y.xx.com): java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8, Type: unixtime_micros, size: 8], it's int32
> 	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
> 	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
> 	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> 	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> 	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
> 	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
> 	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
>
>
> Same result with kudu.int8 and kudu.int16; only kudu.int64 works for me.
> The problem persists whether the attribute is part of the key or not.
>
> Greetings,
>
> Frank
>
>
> 2017-02-13 6:23 GMT+01:00 Todd Lipcon <todd@cloudera.com>:
>
>> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh.ordix@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> For quite a while I've worked successfully with
>>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>>
>>> For a while I ignored a problem with the Kudu datatype int8. With the
>>> connector I can't write int8, as an int in Python will always bring up
>>> errors like
>>>
>>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8,
>>> Type: unixtime_micros, size: 8], it's int8"
>>>
>>> As Python isn't statically typed, the connector has to find a suitable
>>> Java/Kudu type for a Python int. Apparently a Python int is matched
>>> to int64/unixtime_micros, and not to the int8 that Kudu expects at
>>> this place.
>>>
>>> As a quick solution, all my ints in Kudu are int64 at the moment.
>>>
>>> In the long run I can't accept this waste of disk space or, even worse,
>>> I/O. Any idea when I can store int8 from Python/Spark to Kudu?
>>>
>>> With the "normal" Python API everything works fine; only the
>>> Spark/Kudu/Python connector brings up the problem.
>>>
>>
>> Not 100% sure I'm following. You're using pyspark here? Can you post a
>> bit of sample code that reproduces the issue?
>>
>> -Todd
>>
>>
>>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh.ordix@gmail.com>:
>>>
>>>> Hello,
>>>>
>>>> within the impala-shell I can create an external table and thereafter
>>>> select and insert data from an underlying Kudu table. Within the
>>>> statement for creating the table, a 'StorageHandler' is set to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, as
>>>> there apparently exists a *.jar with the referenced class inside.
>>>>
>>>> When trying to select from a hive-shell, there is an error that the
>>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx
>>>> within a sparkSession, I also get a JavaClassNotFoundException because
>>>> the KuduStorageHandler is not available.
>>>>
>>>> I then tried to find a jar on my system with the intention of copying
>>>> it to all my data nodes. Sadly I couldn't find the specific jar. I
>>>> think it exists on the system, as Impala apparently is using it. For a
>>>> test I changed the 'StorageHandler' in the creation statement to
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>>> worked, and so did the select from Impala, but it didn't return any
>>>> data. There was no error, which I had expected to see. The test was
>>>> just for the case that Impala would in some magic way select data from
>>>> Kudu without a correct 'StorageHandler'. Apparently this is not the
>>>> case, and Impala has access to a
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>>
>>>> Long story, short questions:
>>>> In which *.jar can I find the
>>>> 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>>> Is copying the jar by hand to all nodes an appropriate way to put
>>>> Spark in a position to work with Kudu?
>>>> What about the beeline-shell from Hive and the possibility of reading
>>>> from Kudu?
>>>>
>>>> My environment: Cloudera 5.7 with Kudu and impala-kudu installed from
>>>> parcels. Built a working python-kudu library successfully from scratch
>>>> (git).
>>>>
>>>> Thanks a lot!
>>>> Frank
>>>>
>>>
>>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
