spark-user mailing list archives

From Chengi Liu <>
Subject distinct in data frame in spark
Date Mon, 24 Mar 2014 17:21:37 GMT
I have a very simple use case.

I have an RDD like the following:

d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]

Now I want to remove the rows that have a duplicate value in a given column
and return the remaining rows.
For example, suppose I want to remove duplicates based on column 1.
Rows 1 and 2 both have the same element (1) in column 1, so they are
duplicates of each other, and I would drop either row 1 or row 2 from my
final result.
So, a possible result is:

output = [[1,2,3,4],[2,3,4,5]]
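
To pin down the behavior I mean, here is a minimal plain-Python sketch (not
Spark code). The helper name dedupe_by_column is made up for illustration, and
keeping the first row seen for each value is an arbitrary choice, since
either duplicate would be acceptable:

```python
def dedupe_by_column(rows, col):
    """Keep one row per distinct value in column `col` (0-based index).

    Hypothetical helper: it keeps the first row seen for each value,
    which is only one of the acceptable outcomes.
    """
    seen = set()      # column values already encountered
    result = []       # rows kept so far, in input order
    for row in rows:
        key = row[col]
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


d = [[1, 2, 3, 4], [1, 5, 2, 3], [2, 3, 4, 5]]

# "Column 1" in the text above is index 0 here.
output = dedupe_by_column(d, 0)
print(output)  # [[1, 2, 3, 4], [2, 3, 4, 5]]
```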

How do I do this in Spark?
