phoenix-dev mailing list archives

From "Josh Mahonin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-4490) Phoenix Spark Module doesn't pass in user properties to create connection
Date Fri, 02 Feb 2018 15:16:00 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-4490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350498#comment-16350498 ]

Josh Mahonin commented on PHOENIX-4490:
---------------------------------------

FWIW, I think there should be a more elegant solution here. It would be nice if these sorts
of parameters could be passed in as options to the DataFrame / Dataset builder, and then carried
forward as needed.
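
Something along these lines, for example (just a sketch of the idea; the Kerberos-related
option shown is an assumption, not something the connector forwards today):

{code}
// Hypothetical usage, assuming the connector forwarded arbitrary reader options
// into the Configuration / Properties it builds internally.
val df = sqlContext.read
  .format("org.apache.phoenix.spark")
  .option("table", "MY_TABLE")
  .option("zkUrl", "zk-host:2181:/hbase-secure")
  // assumed pass-through option; not currently carried into the connection
  .option("hbase.security.authentication", "kerberos")
  .load()
{code}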

As I recall, the Configuration object itself is _not_ Serializable, which is a big challenge
for Spark, and why it gets re-created several times within the phoenix-spark module. Perhaps
there's another solution for that problem we could leverage?
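
For reference, one pattern I've seen used elsewhere (a rough sketch, not something in
phoenix-spark today) is a small wrapper that serializes the Configuration through its
Writable read/write methods, so it can travel with the closure instead of being rebuilt:

{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.hadoop.conf.Configuration

// Wraps the non-Serializable Hadoop Configuration using its Writable
// read/write methods so it can be shipped inside a Spark closure.
class SerializableConfiguration(@transient var value: Configuration) extends Serializable {

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()
    value.write(out)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    value = new Configuration(false)
    value.readFields(in)
  }
}
{code}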

Glad there's a workaround, but if anyone has time for a patch to the underlying issue, that
would be fantastic!

> Phoenix Spark Module doesn't pass in user properties to create connection
> -------------------------------------------------------------------------
>
>                 Key: PHOENIX-4490
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4490
>             Project: Phoenix
>          Issue Type: Bug
>            Reporter: Karan Mehta
>            Priority: Major
>
> The Phoenix Spark module doesn't work properly in a Kerberos environment. This is because
whenever a new {{PhoenixRDD}} is built, it is always built with new, default properties.
The following piece of code in {{PhoenixRelation}} is an example. This is the class used by
Spark to create a {{BaseRelation}} before executing a scan. 
> {code}
>     new PhoenixRDD(
>       sqlContext.sparkContext,
>       tableName,
>       requiredColumns,
>       Some(buildFilter(filters)),
>       Some(zkUrl),
>       new Configuration(),
>       dateAsTimestamp
>     ).toDataFrame(sqlContext).rdd
> {code}
> This works fine in most cases where the Spark code runs on the same cluster
as HBase, since the config object will pick up properties from classpath XML files. However, in an
external environment we should use the user-provided properties and merge them before creating
any {{PhoenixRelation}} or {{PhoenixRDD}}. As per my understanding, we should ideally provide
the properties in the {{DefaultSource#createRelation()}} method, as sketched below.
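> A rough sketch of what I mean (method-level only; imports, the surrounding class, and the
way the merged config would be carried into {{PhoenixRelation}} are all assumed here):
> {code}
>   // Sketch: start from the cluster config, then overlay whatever the user
>   // supplied through the reader options, instead of a bare new Configuration().
>   override def createRelation(sqlContext: SQLContext,
>                               parameters: Map[String, String]): BaseRelation = {
>     val conf = HBaseConfiguration.create()
>     parameters.foreach { case (k, v) => conf.set(k, v) }
>     // hypothetical extra parameter: PhoenixRelation would carry this conf forward
>     new PhoenixRelation(parameters("table"), parameters("zkUrl"), conf)(sqlContext)
>   }
> {code}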
> An example of where this fails: Spark tries to get the splits, to optimize MapReduce performance
when loading data from the table, in the {{PhoenixInputFormat#generateSplits()}} method. Ideally
it should get all the config parameters from the {{JobContext}} being passed in, but it defaults
to {{new Configuration()}} irrespective of what the user passes, and thus fails to create a
connection.
> [~jmahonin] [~maghamravikiran@gmail.com] 
> Any ideas or advice? Let me know if I am missing anything obvious here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
