spark-user mailing list archives

From Olivier Chapelle <oliv...@chapelle.cc>
Subject Dataset announcement
Date Thu, 16 Apr 2015 00:58:14 GMT

Dear Spark users,

I would like to draw your attention to a dataset that we recently released,
which is, as of now, the largest machine learning dataset ever made public;
see the following blog announcements:
 - http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
 - http://blogs.technet.com/b/machinelearning/archive/2015/04/01/now-available-on-azure-ml-criteo-39-s-1tb-click-prediction-dataset.aspx

The characteristics of this dataset are:
 - 1 TB of data
 - binary classification
 - 13 integer features
 - 26 categorical features, some of them taking millions of values
 - 4 billion rows
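
As a rough illustration of the record layout listed above, here is a minimal
Python sketch of how one row might be parsed. It assumes (this is not stated
in this message) that each row is a tab-separated line holding the label
first, then the 13 integer features, then the 26 categorical features, with
missing values left empty. The function name parse_line, the bucket count
NUM_BUCKETS, and the use of CRC32 for feature hashing are hypothetical choices
made only for the example.

```python
import zlib

NUM_INT = 13           # integer features per row (from the list above)
NUM_CAT = 26           # categorical features per row (from the list above)
NUM_BUCKETS = 2 ** 20  # hypothetical size of the hashed feature space


def parse_line(line, num_buckets=NUM_BUCKETS):
    """Parse one row into (label, integer features, hashed categorical features).

    Assumed layout: label, 13 integers, 26 categorical tokens, tab-separated.
    Empty fields are treated as missing and returned as None.
    """
    fields = line.rstrip("\n").split("\t")
    expected = 1 + NUM_INT + NUM_CAT
    if len(fields) != expected:
        raise ValueError("expected %d fields, got %d" % (expected, len(fields)))

    label = int(fields[0])
    ints = [int(v) if v else None for v in fields[1:1 + NUM_INT]]

    cats = []
    for i, v in enumerate(fields[1 + NUM_INT:]):
        if v:
            # Hashing trick: prefix the token with its column index so equal
            # tokens in different columns land in different buckets, then map
            # the result to a fixed-size index with a deterministic hash.
            key = "%d=%s" % (i, v)
            cats.append(zlib.crc32(key.encode("utf-8")) % num_buckets)
        else:
            cats.append(None)
    return label, ints, cats
```

Hashing is used here because some categorical features take millions of
values, so a fixed number of buckets keeps the feature space bounded. In
Spark, such a function could be applied per line, for example with
sc.textFile(path).map(parse_line), where path is wherever the files are
stored.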

Hopefully this dataset will be useful to assess and push further the
scalability of Spark and MLlib.

Cheers,
Olivier


