spark-dev mailing list archives

From Suniti Singh <suniti.si...@gmail.com>
Subject Compare a column in two different tables/find the distance between column data
Date Tue, 15 Mar 2016 03:46:38 GMT
Hi All,

I have two tables with the same schema but different data. I need to join the
tables on one column and then group by that same column.

However, the values in that column might not match exactly between the two
tables. For example, the column is "title", with Table1.title = "doctor" and
Table2.title = "doc"; "doctor" and "doc" are actually the same title.

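To make the mismatch concrete, here is a minimal sketch in plain Python (not
Spark). The sample rows and the helper name edit_distance are made up for
illustration, and edit distance is only one possible way to measure how far
apart two titles are.

```python
# Hypothetical sample rows (title, count); the real tables are in the TB range.
table1 = [("doctor", 10), ("nurse", 5)]
table2 = [("doc", 7), ("nurse", 3)]


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


# An exact-equality join only pairs identical titles,
# so "doctor" and "doc" never meet.
exact = [(t1, t2) for t1, _ in table1 for t2, _ in table2 if t1 == t2]

# Distance between the two titles I would like to treat as the same.
d = edit_distance("doctor", "doc")
```

Here the exact join pairs only the "nurse" rows, while "doctor" and "doc" are
three single-character edits apart.
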
From a performance point of view, with data volumes in the TB range, I am not
sure whether I can achieve this with a SQL statement. What would be the best
approach to solving this problem? Should I look at the MLlib APIs?

Spark gurus, any pointers?

Thanks,
Suniti
