spark-dev mailing list archives

From Cheng Lian <lian.cs....@gmail.com>
Subject Re: [sql] How to uniquely identify Dataframe?
Date Mon, 30 Mar 2015 12:38:37 GMT
This is because, unlike SchemaRDD, DataFrame itself is no longer an RDD. 
Meanwhile, DataFrame.rdd is a method ("def") that returns a new RDD on 
every call. I think you can use DataFrame.queryExecution.logical (the 
logical plan) as an ID. Maybe we should make it a "lazy val" rather than 
a "def". Personally, I don't see a good reason it has to be a "def", but 
maybe I'm missing something here.
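A minimal sketch of the caching idea above, assuming Spark 1.3 and a 
running spark-shell (which provides `sc` and `sqlContext`). The names 
`seen` and `cachedOnce` are hypothetical, for illustration only; the key 
is the DataFrame's logical plan, which is stable across calls, unlike 
DataFrame.rdd, which builds a fresh RDD each time:

```scala
import scala.collection.mutable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Cache keyed by logical plan rather than by RDD id.
val seen = mutable.Map.empty[LogicalPlan, DataFrame]

// Cache a DataFrame the first time we see its logical plan;
// return the previously seen instance on later calls.
def cachedOnce(df: DataFrame): DataFrame =
  seen.getOrElseUpdate(df.queryExecution.logical, { df.cache(); df })
```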

Filed JIRA ticket and PR for this:

- https://issues.apache.org/jira/browse/SPARK-6608
- https://github.com/apache/spark/pull/5265

Cheng

On 3/30/15 8:02 PM, Peter Rudenko wrote:
> Hi, I have some custom caching logic in my application. I need to 
> identify a DataFrame somehow, to check whether I have seen it 
> previously. Here’s the problem:
>
> scala> val data = sc.parallelize(1 to 1000)
> data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21
>
> scala> data.id
> res0: Int = 0
>
> scala> data.id
> res1: Int = 0
>
> scala> val dataDF = data.toDF
> dataDF: org.apache.spark.sql.DataFrame = [_1: int]
>
> scala> dataDF.rdd.id
> res3: Int = 2
>
> scala> dataDF.rdd.id
> res4: Int = 3
>
> For some reason it generates a new ID on each call. With SchemaRDD I 
> was able to call SchemaRDD.id.
>
> Thanks,
> Peter Rudenko
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

