kudu-user mailing list archives

From Frank Heimerzheim <fh.or...@gmail.com>
Subject Re: Missing 'com.cloudera.kudu.hive.KuduStorageHandler'
Date Tue, 14 Feb 2017 09:11:04 GMT
Hello,

here is a snippet that reproduces the error.

Call from the shell:
spark-submit --jars \
  /opt/storage/data_nfs/cloudera/pyspark/libs/kudu-spark_2.10-1.2.0.jar \
  test.py


Snippet from the python-code test.py:

(..)
builder = kudu.schema_builder()
builder.add_column('id', kudu.int64, nullable=False)
builder.add_column('year', kudu.int32)
builder.add_column('name', kudu.string)
(..)

(..)
data = [(42, 2017, 'John')]
df = sqlContext.createDataFrame(data, ['id', 'year', 'name'])
df.write.format('org.apache.kudu.spark.kudu') \
        .option('kudu.master', kudu_master) \
        .option('kudu.table', kudu_table) \
        .mode('append') \
        .save()
(..)

Error:
17/02/13 12:59:24 INFO scheduler.TaskSetManager: Starting task 1.0 in
stage 4.0 (TID 6, ls00152y.xxx.com, partition 1,PROCESS_LOCAL, 2096
bytes)
17/02/13 12:59:24 INFO scheduler.TaskSetManager: Finished task 0.0 in
stage 4.0 (TID 5) in 113 ms on ls00152y.xxx.com (1/2)
17/02/13 12:59:24 WARN scheduler.TaskSetManager: Lost task 1.0 in
stage 4.0 (TID 6, ls00152y.xx.com):
java.lang.IllegalArgumentException: year isn't [Type: int64, size: 8,
Type: unixtime_micros, size: 8], it's int32
	at org.apache.kudu.client.PartialRow.checkColumn(PartialRow.java:462)
	at org.apache.kudu.client.PartialRow.addLong(PartialRow.java:217)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:215)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1$$anonfun$apply$2.apply(KuduContext.scala:205)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
	at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
	at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:205)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows$1.apply(KuduContext.scala:203)
	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
	at org.apache.kudu.spark.kudu.KuduContext.org$apache$kudu$spark$kudu$KuduContext$$writePartitionRows(KuduContext.scala:203)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:181)
	at org.apache.kudu.spark.kudu.KuduContext$$anonfun$writeRows$1.apply(KuduContext.scala:180)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1869)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)


Same result with kudu.int8 and kudu.int16; only kudu.int64 works for
me. The problem persists whether the attribute is part of the key or not.

Greetings

Frank


2017-02-13 6:23 GMT+01:00 Todd Lipcon <todd@cloudera.com>:

> On Tue, Feb 7, 2017 at 6:17 AM, Frank Heimerzheim <fh.ordix@gmail.com>
> wrote:
>
>> Hello,
>>
>> for quite a while I've worked successfully with
>> https://maven2repo.com/org.apache.kudu/kudu-spark_2.10/1.2.0/jar
>>
>> For a while I ignored a problem with the kudu datatype int8. With the
>> connector I can't write int8, as a Python int will always bring up
>> errors like
>>
>> "java.lang.IllegalArgumentException: id isn't [Type: int64, size: 8,
>> Type: unixtime_micros, size: 8], it's int8"
>>
>> As Python isn't statically typed, the connector has to find a suitable
>> java/kudu type for a Python int. Apparently a Python int is matched to
>> int64/unixtime_micros and not to the int8 that kudu expects at this place.
>>
>> As a quick solution, all my ints in kudu are int64 at the moment.
>>
>> In the long run I can't accept this waste of disk space, or even worse,
>> of I/O. Any idea when I will be able to store int8 from python/spark in
>> kudu?
>>
>> With the "normal" python API everything works fine; only the
>> spark/kudu/python connector brings up the problem.
>>
>
> Not 100% sure I'm following. You're using pyspark here? Can you post a bit
> of sample code that reproduces the issue?
>
> -Todd
>
>
>> 2016-12-13 12:12 GMT+01:00 Frank Heimerzheim <fh.ordix@gmail.com>:
>>
>>> Hello,
>>>
>>> within the impala-shell I can create an external table and thereafter
>>> select and insert data from an underlying kudu table. In the statement
>>> that creates the table, a 'StorageHandler' is set to
>>> 'com.cloudera.kudu.hive.KuduStorageHandler'. Everything works fine, as
>>> there apparently exists a *.jar containing the referenced class.
>>>
>>> When trying to select from a hive-shell, there is an error that the
>>> handler is not available. Trying to 'rdd.collect()' from a hiveCtx within
>>> a sparkSession, I also get a Java ClassNotFoundException, as the
>>> KuduStorageHandler is not available.
>>>
>>> I then tried to find the jar on my system with the intention of copying
>>> it to all my data nodes. Sadly, I couldn't find the specific jar. I think
>>> it exists on the system, as impala apparently is using it. As a test I
>>> changed the 'StorageHandler' in the creation statement to
>>> 'com.cloudera.kudu.hive.KuduStorageHandler_foo'. The create statement
>>> worked, as did the select from impala, but it didn't return any data.
>>> There was no error, although I had expected one. The test was just for
>>> the case that impala would in some magic way select data from kudu
>>> without a correct 'StorageHandler'. Apparently this is not the case, and
>>> impala does have access to a 'com.cloudera.kudu.hive.KuduStorageHandler'.
>>>
>>> Long story, short question:
>>> In which *.jar can I find the 'com.cloudera.kudu.hive.KuduStorageHandler'?
>>> Is copying the jar by hand to all nodes an appropriate way to bring
>>> spark into a position to work with kudu?
>>> What about the beeline-shell from hive and the possibility of reading
>>> from kudu?
>>>
>>> My environment: Cloudera 5.7 with kudu and impala-kudu from installed
>>> parcels. I built a working python-kudu library from scratch (git).
>>>
>>> Thanks a lot!
>>> Frank
>>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
