spark-user mailing list archives

From Andy Davidson <A...@SantaCruzIntegration.com>
Subject Re: how to use sparkR or spark MLlib load csv file on hdfs then calculate covariance
Date Mon, 28 Dec 2015 18:23:45 GMT
Hi Yanbo

I use spark-csv to load my data set. I work with both Java and Python. I
would recommend you print the first couple of rows and also print the schema
to make sure your data is loaded as you expect. You might find the following
code example helpful. You may need to set the schema programmatically,
depending on what your data looks like.


import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadTidyDataFrame {

    /**
     * Loads a CSV file (with a header row) into a DataFrame,
     * letting spark-csv infer the column types.
     */
    static DataFrame fromCSV(SQLContext sqlContext, String file) {
        DataFrame df = sqlContext.read()
                .format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load(file);
        return df;
    }
}
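As a sketch of the sanity check suggested above, here is how you might call fromCSV and then print the schema and the first few rows. This assumes Spark 1.x with the spark-csv package on the classpath; the class name LoadTidyDataFrameCheck and the HDFS path are placeholders for illustration only.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class LoadTidyDataFrameCheck {
    public static void main(String[] args) {
        // Local master for a quick check; point this at your cluster in production.
        SparkConf conf = new SparkConf().setAppName("csv-check").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // Placeholder path; substitute the actual HDFS location of your CSV file.
        DataFrame df = LoadTidyDataFrame.fromCSV(sqlContext, "hdfs:///path/to/data.csv");

        df.printSchema();   // verify inferred column names and types
        df.show(5);         // eyeball the first few rows

        sc.stop();
    }
}
```

If the inferred types look wrong (e.g. numeric columns read as strings), that is the point at which you would build a StructType schema by hand and pass it via schema() instead of relying on inferSchema.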




From:  Yanbo Liang <ybliang8@gmail.com>
Date:  Monday, December 28, 2015 at 2:30 AM
To:  zhangjp <592426860@qq.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: how to use sparkR or spark MLlib load csv file on hdfs then
calculate covariance

> Load csv file:
> df <- read.df(sqlContext, "file-path", source = "com.databricks.spark.csv",
> header = "true")
> Calculate covariance:
> cov <- cov(df, "col1", "col2")
> 
> Cheers
> Yanbo
> 
> 
> 2015-12-28 17:21 GMT+08:00 zhangjp <592426860@qq.com>:
>> hi  all,
>>     I want  to use sparkR or spark MLlib  load csv file on hdfs then
>> calculate  covariance, how to do it .
>>     thks.
> 


