phoenix-dev mailing list archives

From "Josh Mahonin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-3333) can not load phoenix table by spark 2.0 which is working well under spark 1.6
Date Tue, 08 Nov 2016 19:34:58 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15648565#comment-15648565 ]

Josh Mahonin commented on PHOENIX-3333:
---------------------------------------

I've got a proof-of-concept version that works with Spark 2.0 here:
https://github.com/jmahonin/phoenix/tree/spark_2.0
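
For anyone wanting to try it, here's a minimal sketch of the equivalent load in
Scala on Spark 2.0, mirroring the PySpark reproduction quoted below (the
SparkSession setup is assumed; the table and zkUrl values are the reporter's):

import org.apache.spark.sql.SparkSession

object PhoenixLoadDemo {
  def main(args: Array[String]): Unit = {
    // Assumed local session for testing; when using spark-submit, drop master().
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("phoenix-load-demo")
      .getOrCreate()

    // Same DataSource call the reporter makes from PySpark; today it fails on
    // Spark 2.0 because the released connector was built against 1.6.
    val df = spark.read
      .format("org.apache.phoenix.spark")
      .option("table", "TABLE1")
      .option("zkUrl", "namenode:2181:/hbase-unsecure")
      .load()

    df.show()
    spark.stop()
  }
}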

Although this code compiles against either Spark 1.6 or Spark 2.0, the resulting
JAR unfortunately isn't binary compatible with Spark versions < 2.0, because Spark
changed the DataFrame API and also moved to a new Scala version.
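
To make the incompatibility concrete: Spark 1.6 shipped
org.apache.spark.sql.DataFrame as a real class, while 2.0 keeps it only as a type
alias for Dataset[Row] in the sql package object, so no DataFrame class file
exists at runtime. A minimal sketch, compiling only against Spark 2.0:

import org.apache.spark.sql.{Dataset, Row, SparkSession}

object DataFrameAliasDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-alias-demo")
      .getOrCreate()

    // Compiles on 2.0 because DataFrame = Dataset[Row] is a type alias.
    val df: org.apache.spark.sql.DataFrame = spark.range(3).toDF("id")
    val ds: Dataset[Row] = df  // identical type, no conversion needed

    // A connector built against 1.6 instead has the *class*
    // org/apache/spark/sql/DataFrame baked into its method signatures;
    // loading it on a 2.0 runtime throws exactly the NoClassDefFoundError
    // shown in the trace below.
    ds.show()
    spark.stop()
  }
}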

Other projects have wrestled with this in a variety of ways, e.g. HBase
[1|https://issues.apache.org/jira/browse/HBASE-16179], Cassandra
[2|https://github.com/datastax/spark-cassandra-connector/pull/996] and Elasticsearch
[3|https://github.com/elastic/elasticsearch-hadoop/commit/43017a2566f7b50ebca1e20e96820f0d037655ff].

In terms of simplicity, dropping all support for Spark 1.6 and below would be
easiest, but it's the least user-friendly option. Another option is to use Maven
profiles to switch which Spark version Phoenix gets compiled against; the downside
there is that it's not plainly obvious to those using the client JAR which version
of Spark it will be compatible with. Yet another option is to create two client
JARs, each compatible with a specific Spark version, but that adds more bloat and
complexity to the existing assembly process.
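
For illustration, a rough sketch of what such profiles could look like (the
profile ids, property names and version numbers here are hypothetical, not taken
from Phoenix's actual pom):

<profiles>
  <!-- Hypothetical sketch; ids and versions are illustrative only. -->
  <profile>
    <id>spark16</id>
    <properties>
      <spark.version>1.6.1</spark.version>
      <scala.binary.version>2.10</scala.binary.version>
    </properties>
  </profile>
  <profile>
    <id>spark2</id>
    <activation>
      <activeByDefault>true</activeByDefault>
    </activation>
    <properties>
      <spark.version>2.0.0</spark.version>
      <scala.binary.version>2.11</scala.binary.version>
    </properties>
  </profile>
</profiles>

Users needing 1.6 compatibility would then build with e.g. mvn package -Pspark16,
while the default build targets Spark 2.0.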

I'm leaning towards using a Maven profile that defaults to Spark 2.0+, but I'd be curious
if other users (vendors?) have any opinions here.

cc [~jamestaylor] [~sergey.soldatov] [~kalyanhadoop] [~ankit.singhal] [~devaraj]

> can not load phoenix table by spark 2.0 which is working well under spark 1.6
> -----------------------------------------------------------------------------
>
>                 Key: PHOENIX-3333
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3333
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 4.8.0
>         Environment: spark 2.0, phoenix 4.8.0, os is centos 6.7, hadoop is hdp 2.5
>            Reporter: dalin qin
>
> Spark version is 2.0.0.2.5.0.0-1245
> As mentioned by Josh, I believe Spark 2.0 changed its API in a way that breaks Phoenix. Please come up with an updated version that adapts to Spark's change.
> In [1]: df = sqlContext.read \
>    ...:   .format("org.apache.phoenix.spark") \
>    ...:   .option("table", "TABLE1") \
>    ...:   .option("zkUrl", "namenode:2181:/hbase-unsecure") \
>    ...:   .load()
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-1-e5dfb7bbb28b> in <module>()
> ----> 1 df = sqlContext.read   .format("org.apache.phoenix.spark")   .option("table", "TABLE1")   .option("zkUrl", "namenode:2181:/hbase-unsecure")   .load()
> /usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/readwriter.pyc in load(self, path, format, schema, **options)
>     151             return self._df(self._jreader.load(self._spark._sc._jvm.PythonUtils.toSeq(path)))
>     152         else:
> --> 153             return self._df(self._jreader.load())
>     154
>     155     @since(1.4)
> /usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     931         answer = self.gateway_client.send_command(command)
>     932         return_value = get_return_value(
> --> 933             answer, self.gateway_client, self.target_id, self.name)
>     934
>     935         for temp_arg in temp_args:
> /usr/hdp/2.5.0.0-1245/spark2/python/pyspark/sql/utils.pyc in deco(*a, **kw)
>      61     def deco(*a, **kw):
>      62         try:
> ---> 63             return f(*a, **kw)
>      64         except py4j.protocol.Py4JJavaError as e:
>      65             s = e.java_exception.toString()
> /usr/hdp/2.5.0.0-1245/spark2/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     310                 raise Py4JJavaError(
>     311                     "An error occurred while calling {0}{1}{2}.\n".
> --> 312                     format(target_id, ".", name), value)
>     313             else:
>     314                 raise Py4JError(
> Py4JJavaError: An error occurred while calling o43.load.
> : java.lang.NoClassDefFoundError: org/apache/spark/sql/DataFrame
>         at java.lang.Class.getDeclaredMethods0(Native Method)
>         at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
>         at java.lang.Class.getDeclaredMethod(Class.java:2128)
>         at java.io.ObjectStreamClass.getPrivateMethod(ObjectStreamClass.java:1475)
>         at java.io.ObjectStreamClass.access$1700(ObjectStreamClass.java:72)
>         at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:498)
>         at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:472)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:472)
>         at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:369)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
>         at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>         at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>         at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>         at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>         at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>         at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>         at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>         at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
>         at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>         at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>         at org.apache.spark.SparkContext.clean(SparkContext.scala:2037)
>         at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
>         at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
>         at org.apache.spark.rdd.RDD.map(RDD.scala:365)
>         at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:119)
>         at org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:59)
>         at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:40)
>         at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
>         at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
>         at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>         at py4j.Gateway.invoke(Gateway.java:280)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:211)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.DataFrame
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         ... 45 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
