spark-dev mailing list archives

From Peter Rudenko <petro.rude...@gmail.com>
Subject [sql] How to uniquely identify Dataframe?
Date Mon, 30 Mar 2015 12:02:37 GMT
Hi, I have some custom caching logic in my application. I need some way to
identify a DataFrame, so that I can check whether I have seen it before. Here's
the problem:

scala> val data = sc.parallelize(1 to 1000)
data: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:21

scala> data.id
res0: Int = 0

scala> data.id
res1: Int = 0

scala> val dataDF = data.toDF
dataDF: org.apache.spark.sql.DataFrame = [_1: int]

scala> dataDF.rdd.id
res3: Int = 2

scala> dataDF.rdd.id
res4: Int = 3

For some reason, dataDF.rdd.id returns a new ID on each call, whereas data.id
stays the same. With SchemaRDD I was able to call SchemaRDD.id directly.
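
One possible workaround, given only as a rough sketch: if the application keeps hold of the same DataFrame instance, the instance itself can serve as the key, compared by reference. The IdentityRegistry class below is a hypothetical helper, not part of Spark. It assumes that reference identity is good enough for the caching logic, and it does not treat two separately built DataFrames over the same data as equal.

```scala
import java.util.IdentityHashMap
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical helper (not a Spark API): assigns a stable integer id to each
// distinct object instance, compared by reference rather than by equals().
// Note: the registry holds strong references to everything it has seen.
class IdentityRegistry[T <: AnyRef] {
  private val ids = new IdentityHashMap[T, Integer]()
  private val next = new AtomicInteger(0)

  // Returns the id already assigned to `obj`, or assigns the next free one.
  def idOf(obj: T): Int = synchronized {
    val existing = ids.get(obj)
    if (existing != null) existing.intValue
    else {
      val id = next.getAndIncrement()
      ids.put(obj, Integer.valueOf(id))
      id
    }
  }

  // True if this exact instance has been registered before.
  def seen(obj: T): Boolean = synchronized { ids.containsKey(obj) }
}
```

In the shell session above this would be used as `val registry = new IdentityRegistry[org.apache.spark.sql.DataFrame]` followed by `registry.idOf(dataDF)`, which should return the same value on repeated calls for as long as `dataDF` refers to the same object.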

Thanks,
Peter Rudenko

