phoenix-dev mailing list archives

From "Josh Mahonin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-3814) Unable to connect to Phoenix via Spark
Date Thu, 04 May 2017 15:23:04 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996911#comment-15996911 ]

Josh Mahonin commented on PHOENIX-3814:
---------------------------------------

I've managed to spend a bit of time looking at this, and it does seem that the fix for
PHOENIX-3721 should solve this issue as well, though I'm fine with keeping this one open
until that one is closed.

Re: SaveMode behaviour, these are good starting points to look at:

First point of contact from Spark into Phoenix when saving a DataFrame:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DefaultSource.scala#L40-L47
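
For orientation, here's a condensed sketch of that entry point, paraphrased from the linked file (exact signatures and defaults may differ slightly from what's in 4.10):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.phoenix.spark._ // brings the saveToPhoenix implicit into scope
import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}

class DefaultSource extends RelationProvider with CreatableRelationProvider {

  // Read path: build a PhoenixRelation from the user-supplied parameters
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    PhoenixRelation(parameters("table"), parameters("zkUrl"))(sqlContext)

  // Write path: Spark hands us the SaveMode here, so this is the one place
  // an Append/Overwrite distinction could be enforced. Today anything other
  // than Overwrite is rejected outright.
  override def createRelation(sqlContext: SQLContext, mode: SaveMode,
                              parameters: Map[String, String],
                              data: DataFrame): BaseRelation = {
    if (!mode.equals(SaveMode.Overwrite)) {
      throw new Exception("SaveMode other than SaveMode.Overwrite is not supported")
    }
    // saveToPhoenix comes from the implicit DataFrameFunctions wrapper
    data.saveToPhoenix(parameters("table"), zkUrl = parameters.get("zkUrl"),
      conf = new Configuration)
    createRelation(sqlContext, parameters)
  }
}
{code}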

The code that unwraps the DataFrame into an RDD, then uses the Phoenix MapReduce-style
classes to effect a distributed save() through Spark:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/main/scala/org/apache/phoenix/spark/DataFrameFunctions.scala#L31-L65
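
Condensed, that block does roughly the following (the real code also threads through tenant IDs and a user-supplied Configuration):

{code:scala}
import scala.collection.JavaConversions._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.NullWritable
import org.apache.phoenix.mapreduce.PhoenixOutputFormat
import org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil
import org.apache.phoenix.spark.{ConfigurationUtil, PhoenixRecordWritable}
import org.apache.phoenix.util.SchemaUtil
import org.apache.spark.sql.DataFrame

class DataFrameFunctions(data: DataFrame) extends Serializable {

  def saveToPhoenix(tableName: String, conf: Configuration = new Configuration,
                    zkUrl: Option[String] = None): Unit = {
    // Normalize the column names to Phoenix identifiers on the driver, since
    // the DataFrame schema isn't available inside mapPartitions
    val fieldArray = data.schema.fieldNames.map(SchemaUtil.normalizeIdentifier)

    // Build the output configuration (table, columns, ZooKeeper quorum)
    val outConfig = ConfigurationUtil.getOutputConfiguration(tableName, fieldArray, zkUrl)
    val zkUrlFinal = ConfigurationUtil.getZookeeperURL(outConfig)

    // Hadoop Configurations aren't serializable, so each partition rebuilds
    // its own, then wraps every Row in a PhoenixRecordWritable
    val phxRDD = data.rdd.mapPartitions { rows =>
      val partitionConfig =
        ConfigurationUtil.getOutputConfiguration(tableName, fieldArray, zkUrlFinal)
      val columns =
        PhoenixConfigurationUtil.getUpsertColumnMetadataList(partitionConfig).toList
      rows.map { row =>
        val rec = new PhoenixRecordWritable(columns)
        row.toSeq.foreach(rec.add)
        (null, rec)
      }
    }

    // Hand the RDD to the Phoenix MapReduce output format, which executes the
    // UPSERTs in parallel on the executors
    phxRDD.saveAsNewAPIHadoopFile("", classOf[NullWritable],
      classOf[PhoenixRecordWritable],
      classOf[PhoenixOutputFormat[PhoenixRecordWritable]], outConfig)
  }
}
{code}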

Relevant unit tests:
https://github.com/apache/phoenix/blob/5b099014446865c12779f3882fd8b407496717ea/phoenix-spark/src/it/scala/org/apache/phoenix/spark/PhoenixSparkIT.scala#L336-L342
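
For context, the user-facing path those tests exercise is the standard DataFrame writer; with placeholder table names and ZooKeeper quorum it looks like this:

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("phoenix-save-example").getOrCreate()

// Load an existing Phoenix table as a DataFrame
val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "INPUT_TABLE")
  .option("zkUrl", "localhost:2181")
  .load()

// Save it back out; any mode other than Overwrite currently throws in
// DefaultSource, even though the write itself is a series of UPSERTs
df.write
  .format("org.apache.phoenix.spark")
  .mode(SaveMode.Overwrite)
  .option("table", "OUTPUT_TABLE")
  .option("zkUrl", "localhost:2181")
  .save()
{code}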

According to PHOENIX-2745, the current 'Overwrite' acts more like 'Append' should. I'm
not entirely sure that's accurate, since the definitions in the Spark docs read:

{quote}
Append mode means that when saving a DataFrame to a data source, if data/table already exists,
contents of the DataFrame are expected to be appended to existing data.
Overwrite mode means that when saving a DataFrame to a data source, if data/table already
exists, existing data is expected to be overwritten by the contents of the DataFrame.
{quote}

Neither definition implies dropping or recreating the table. Also, under the hood, each row
of the DataFrame is turned into a Phoenix UPSERT statement, which by definition both inserts
and updates:
bq. Upsert: Inserts if not present and updates otherwise the value in the table
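
To make that concrete, here's a hypothetical JDBC-level walkthrough (table name and schema invented for illustration):

{code:scala}
import java.sql.DriverManager

// Hypothetical table OUTPUT_TABLE (K INTEGER PRIMARY KEY, V VARCHAR)
// containing rows (1, 'a') and (2, 'b') before the save
val conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")
val stmt = conn.createStatement()
stmt.executeUpdate("UPSERT INTO OUTPUT_TABLE VALUES (2, 'B')") // key exists: updated in place
stmt.executeUpdate("UPSERT INTO OUTPUT_TABLE VALUES (3, 'c')") // key is new: inserted
conn.commit()
// Final contents: (1, 'a'), (2, 'B'), (3, 'c'). Row 1 survives, so this isn't
// a true Overwrite; row 2 changed, so it isn't a pure Append either.
conn.close()
{code}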

However, if the current method is counter to how other data sources operate, I'm open to suggestions.
I suggest continuing this conversation over on PHOENIX-2745.

> Unable to connect to Phoenix via Spark
> --------------------------------------
>
>                 Key: PHOENIX-3814
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3814
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.10.0
>         Environment: Ubuntu 16.04.1, Apache Spark 2.1.0, Hbase 1.2.5, Phoenix 4.10.0
>            Reporter: Wajid Khattak
>
> Please see http://stackoverflow.com/questions/43640864/apache-phoenix-for-spark-not-working



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
