phoenix-dev mailing list archives

From "Josh Mahonin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PHOENIX-2328) "Unsupported filter" error for "like" when using Spark DataFrame API
Date Fri, 16 Oct 2015 14:22:05 GMT

    [ https://issues.apache.org/jira/browse/PHOENIX-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14960772#comment-14960772 ]

Josh Mahonin commented on PHOENIX-2328:
---------------------------------------

Thanks for the bug report. I've attached a patch which should fix this (and prevent future
unsupported filter issues, should Spark introduce them).

[~maghamravikiran@gmail.com] Please review.

Regarding your second query: the Spark JDBC format parallelizes queries by taking in three parameters:
lowerBound, upperBound, and numPartitions. The bounds must be of type LONG and bound to a
column in your query (ideally a primary key!). For some use cases this is entirely appropriate,
and it works across most types of JDBC databases, including Phoenix.
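To make that concrete, here's a rough sketch of a parallelized JDBC read against Phoenix. The table name, column names, ZooKeeper host, and bound values are made up for illustration; substitute your own:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PhoenixJdbcReadSketch {
    public static DataFrame load(SQLContext sqlContext) {
        // Hypothetical table, host, and bounds; substitute your own.
        Map<String, String> params = new HashMap<String, String>();
        params.put("url", "jdbc:phoenix:zk-host:2181");
        params.put("driver", "org.apache.phoenix.jdbc.PhoenixDriver");
        params.put("dbtable", "RRC_LOG");
        // Spark splits the scan into numPartitions ranges over partitionColumn,
        // between lowerBound and upperBound (both LONG values).
        params.put("partitionColumn", "ID");
        params.put("lowerBound", "0");
        params.put("upperBound", "1000000");
        params.put("numPartitions", "10");
        return sqlContext.read().format("jdbc").options(params).load();
    }
}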

The phoenix-spark plugin natively understands the partition splits that Phoenix provides,
and doesn't require a LONG field in the query to use as a partition key. All you need to do
is specify a table, and optionally SELECT columns and/or a WHERE predicate, and Phoenix does
the rest for you. There are limitations, though, such as not being able to push Phoenix
aggregations like GROUP BY or SUM down to the server. However, once the data is loaded into
Spark, those operations are available to you, at a trade-off in efficiency.
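For comparison, a minimal sketch of the plugin path, again with a made-up table, columns, and ZooKeeper quorum:

import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PhoenixSparkReadSketch {
    public static DataFrame load(SQLContext sqlContext) {
        // Hypothetical table, columns, and ZooKeeper quorum; substitute your own.
        Map<String, String> params = new HashMap<String, String>();
        params.put("table", "RRC_LOG");
        params.put("zkUrl", "zk-host:2181");

        DataFrame df = sqlContext.read()
            .format("org.apache.phoenix.spark")
            .options(params)
            .load();

        // Column pruning and simple predicates are pushed down to Phoenix;
        // aggregations (GROUP BY, SUM, etc.) run in Spark after the load.
        return df.select("ID", "MESSAGE").filter("ID = 'RrcLog1'");
    }
}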

The choice of which one to use very much comes down to your own specific use case.

> "Unsupported filter" error for "like" when using Spark DataFrame API
> --------------------------------------------------------------------
>
>                 Key: PHOENIX-2328
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-2328
>             Project: Phoenix
>          Issue Type: Bug
>    Affects Versions: 4.5.3
>            Reporter: Suhas Nalapure
>            Assignee: Josh Mahonin
>         Attachments: PHOENIX-2328.patch
>
>
> Hi, I'm using the Spark DataFrame API to connect to HBase 0.98 through Phoenix 4.5.3 and
> get an "Unsupported filter" error when the filter condition is 'like'. The error trail and
> the relevant lines from the source code are given below.
> Also I have another related query. Given that Phoenix can be accessed using the standard
> Java JDBC API, a Spark DataFrame can also be constructed using the "jdbc" format string (e.g. df
> = sqlContext.read().format("jdbc").options(params).load(); where params is a Map of the Phoenix
> JDBC connection URL and other relevant parameters). So of these two ways to work with Phoenix
> from Spark, i.e. 1. as a Spark datasource plugin, 2. as another RDBMS source, which one would
> be the recommended way and why?
> Exception:
> -------------
> 2015-10-16 17:25:42,944 DEBUG [main] com.dataken.utilities.DFHelper
> Filtering using expr: ID like 'RrcLog%'
> Exception in thread "main" java.lang.Exception: Unsupported filter
>         at org.apache.phoenix.spark.PhoenixRelation$$anonfun$buildFilter$1.apply(PhoenixRelation.scala:83)
>         at org.apache.phoenix.spark.PhoenixRelation$$anonfun$buildFilter$1.apply(PhoenixRelation.scala:70)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>         at org.apache.phoenix.spark.PhoenixRelation.buildFilter(PhoenixRelation.scala:70)
>         at org.apache.phoenix.spark.PhoenixRelation.buildScan(PhoenixRelation.scala:42)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:53)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:53)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:279)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:278)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:310)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:274)
>         at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:49)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>         at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:374)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>         at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>         at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:920)
>         at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:918)
>         at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:924)
>         at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:924)
>         at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53)
>         at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904)
>         at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385)
>         at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1315)
>         at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1378)
>         at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
>         at org.apache.spark.sql.DataFrame.show(DataFrame.scala:402)
>         at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
>         at org.apache.spark.sql.DataFrame.show(DataFrame.scala:371)
>         at com.dataken.designer.analytical.pojo.EvaluableExpressionTest.main(EvaluableExpressionTest.java:177)
> SOURCE CODE
> -----------------------
> DataFrame df = sqlContext.read().format("org.apache.phoenix.spark").options(params).load();
> df.filter("ID like 'RrcLog%'");
> Thanks,
> Suhas



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
