Hi Ted, thanks much for your help. So the fix is in JIRA SPARK-10671 and it is supposed to be released in Spark 1.6.0, right? Until 1.6.0 is released I won't be able to invoke callUdf using a string and percentile_approx with lit as an argument, right?

On Oct 14, 2015 03:26, "Ted Yu" <yuzhihong@gmail.com> wrote:
I modified DataFrameSuite, in the master branch, to call percentile_approx instead of simpleUDF:

- deprecated callUdf in SQLContext
- callUDF in SQLContext *** FAILED ***
  org.apache.spark.sql.AnalysisException: undefined function percentile_approx;
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
  at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)

SPARK-10671 is included.
For 1.5.1, I guess the absence of SPARK-10671 means that Spark SQL treats percentile_approx as a normal UDF.

Experts can correct me if there is any misunderstanding.


On Tue, Oct 13, 2015 at 6:09 AM, Umesh Kacha <umesh.kacha@gmail.com> wrote:

Hi Ted, I am using the following line of code. Sorry, I can't paste the entire code, but the following is the only line that doesn't compile in my Spark job:

 sourceframe.select(callUDF("percentile_approx",col("mycol"), lit(0.25)))

I am using the IntelliJ editor with Java, and Maven dependencies on Spark core, Spark SQL and Spark Hive, version 1.5.1.

On Oct 13, 2015 18:21, "Ted Yu" <yuzhihong@gmail.com> wrote:
Can you pastebin your Java code and the command you used to compile?


On Oct 13, 2015, at 1:42 AM, Umesh Kacha <umesh.kacha@gmail.com> wrote:

Hi Ted, if the fix went in after the 1.5.1 release, then how come it's working with the 1.5.1 binary in spark-shell?

On Oct 13, 2015 1:32 PM, "Ted Yu" <yuzhihong@gmail.com> wrote:
Looks like the fix went in after 1.5.1 was released. 

You may verify using master branch build. 


On Oct 13, 2015, at 12:21 AM, Umesh Kacha <umesh.kacha@gmail.com> wrote:

Hi Ted, thanks much. I tried using percentile_approx in spark-shell like you mentioned, and it works with 1.5.1, but it doesn't compile in Java using the 1.5.1 Maven libraries; it still complains the same way, that callUdf can take String and Column types only. Please guide.

On Oct 13, 2015 12:34 AM, "Ted Yu" <yuzhihong@gmail.com> wrote:
SQL context available as sqlContext.

scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: string, value: int]

scala> df.select(callUDF("percentile_approx",col("value"), lit(0.25))).show()
|                           1.0|

Can you upgrade to 1.5.1?


On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha <umesh.kacha@gmail.com> wrote:
Sorry, I forgot to mention that I am using Spark 1.4.1, as callUdf is available since Spark 1.4.0 as per the Javadoc.

On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha <umesh.kacha@gmail.com> wrote:
Hi Ted, thanks much for the detailed answer; I appreciate your efforts. Do we need to register Hive UDFs?

sqlContext.udf.register("percentile_approx");???//is it valid?

I am calling the Hive UDF percentile_approx in the following manner, which gives a compilation error:

df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile error

//compile error because callUdf() takes String and Column* as arguments.

Please guide. Thanks much.

On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu <yuzhihong@gmail.com> wrote:
Using spark-shell, I did the following exercise (master branch):

SQL context available as sqlContext.

scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: string, value: int]

scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v + cnst)
res0: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,List())

scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
| id|'simpleUDF(value,25)|
|id1|                  26|
|id2|                  41|
|id3|                  50|

Which Spark release are you using?

Can you pastebin the full stack trace where you got the error?


On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.kacha@gmail.com> wrote:
I have a doubt, Michael. I tried to use callUDF in the following code, but it does not work.


The above code does not compile because callUdf() takes only two arguments: the function name as a String and a Column. Please guide.

On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.kacha@gmail.com> wrote:
Thanks much, Michael, let me try.

On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <michael@databricks.com> wrote:
This is confusing because I made a typo...

callUDF("percentile_approx", col("mycol"), lit(0.25))

The first argument is the name of the UDF; all other arguments need to be columns that are passed in as arguments. lit just makes a literal column that always has the value 0.25.
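
A minimal standalone sketch of the call described above, assuming Spark 1.5.x (where callUDF(String, Column*) is available, as in Ted's spark-shell session further up) and a HiveContext, since percentile_approx comes from Hive. The object name, app name and sample data are made up for illustration; Ted's session reported 1.0 for this same data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.{callUDF, col, lit}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver object; not part of anyone's actual job in this thread.
object CallUdfSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("callUDF-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: a HiveContext is needed so that the Hive function
    // percentile_approx can be resolved by name.
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Same sample data as in the spark-shell sessions quoted in this thread.
    val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")

    // First argument: the UDF name. Remaining arguments: Columns.
    // lit(0.25) wraps the constant 0.25 in a literal Column.
    df.select(callUDF("percentile_approx", col("value"), lit(0.25))).show()

    sc.stop()
  }
}
```

With a plain SQLContext instead of a HiveContext, the lookup of percentile_approx may fail with an "undefined function" AnalysisException like the one in Ted's DataFrameSuite run.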

On Fri, Oct 9, 2015 at 12:16 PM, <Saif.A.Ellafi@wellsfargo.com> wrote:

Yes, but I mean, this is rather curious. How does def lit(literal: Any) become a percentile function, lit(25)?


Thanks for clarification



From: Umesh Kacha [mailto:umesh.kacha@gmail.com]
Sent: Friday, October 09, 2015 4:10 PM
To: Ellafi, Saif A.
Cc: Michael Armbrust; user

Subject: Re: How to calculate percentile of a column of DataFrame?


I found it in the 1.3 documentation; lit says something else, not percent:


public static Column lit(Object literal)

Creates a Column of literal value.

The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value.
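
A small sketch of the three cases in that description, assuming only that the Spark SQL library is on the classpath (no SparkContext is needed just to build Column objects). The object name and the column name "mycol" are illustrative.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical object name, used only for this illustration.
object LitSketch {
  def main(args: Array[String]): Unit = {
    // A plain value becomes a literal Column. By itself it has nothing
    // to do with percentiles; it is just the constant 0.25.
    val quarter: Column = lit(0.25)

    // An object that is already a Column is returned directly.
    val c: Column = col("mycol")
    val same: Column = lit(c)

    // A Scala Symbol is converted into a Column as well.
    val fromSymbol: Column = lit('mycol)

    println(quarter)
    println(same eq c) // reference equality: is it the very same Column object?
    println(fromSymbol)
  }
}
```

So lit(0.25) only supplies the second argument of percentile_approx as a column; the percentile itself is computed by the UDF named in the call.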


On Sat, Oct 10, 2015 at 12:39 AM, <Saif.A.Ellafi@wellsfargo.com> wrote:

Where can we find other available functions such as lit()? I can’t find lit in the API.




From: Michael Armbrust [mailto:michael@databricks.com]
Sent: Friday, October 09, 2015 4:04 PM
To: unk1102
Cc: user
Subject: Re: How to calculate percentile of a column of DataFrame?


You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from dataframes.


On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.kacha@gmail.com> wrote:

Hi, how do I calculate a percentile of a column in a DataFrame? I can't find any
percentile_approx function among the Spark aggregation functions. For example, in Hive
we have percentile_approx and we can use it in the following way:

hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");

I can see the ntile function, but I am not sure how it would give the same results as
the above query. Please guide.
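
For reference, a hedged sketch of running that Hive-style query against a DataFrame by registering it as a temporary table first. It assumes Spark 1.5.x with a HiveContext; the object name and sample data are invented for illustration, and myTable / mycol simply mirror the names in the query above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical driver object, for illustration only.
object PercentileSqlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("percentile-sql-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Assumption: HiveContext, so that Hive's percentile_approx is available in SQL.
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Made-up sample data with a numeric column named mycol.
    val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "mycol")

    // Expose the DataFrame to SQL under the table name used in the query.
    df.registerTempTable("myTable")

    // Note that mycol is an unquoted column reference inside the SQL string.
    hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable").show()

    sc.stop()
  }
}
```

This route goes through the SQL parser rather than the DataFrame function API, so it does not depend on which callUdf/callUDF overloads a given release exposes.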

View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html