spark-issues mailing list archives

From "Xiangrui Meng (JIRA)" <>
Subject [jira] [Commented] (SPARK-10388) Public dataset loader interface
Date Tue, 15 Sep 2015 16:57:45 GMT


Xiangrui Meng commented on SPARK-10388:

[~lewuathe] Thanks for the discussion! I agree that caching the data locally and other
enhancements would be great. But let's design an MVP version first. Improvements could be done as

For example, I don't think JSON and ORC are commonly used for ML datasets; LIBSVM and CSV
are more common. But all of them depend on fetching data over HTTP. A proper implementation
would implement HTTP as a Hadoop FileSystem. The initial version might not support file splits.
A hacky implementation would be `sc.parallelize(Seq(1)).flatMap( ... // download and generate
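
As a rough illustration of the "download and generate" step in the hacky approach, the sketch below reads a LIBSVM-formatted file over HTTP and parses it line by line. The names `LabeledRow`, `HttpLibsvm`, `parseLine`, and `fetch` are hypothetical and not part of Spark; the code assumes the usual LIBSVM text layout (`<label> <index>:<value> ...` with 1-based indices) and makes no attempt at splitting, caching, or retrying.

```scala
import scala.io.Source

// Hypothetical record type used only in this sketch; not a Spark class.
final case class LabeledRow(label: Double, indices: Array[Int], values: Array[Double])

object HttpLibsvm {
  /** Parse one LIBSVM line of the form "<label> <index>:<value> ...". */
  def parseLine(line: String): LabeledRow = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val (indices, values) = tokens.tail.map { item =>
      val Array(i, v) = item.split(":", 2)
      (i.toInt - 1, v.toDouble) // LIBSVM indices are 1-based; store them 0-based
    }.unzip
    LabeledRow(label, indices, values)
  }

  /** Download the whole file in the calling thread and parse every non-empty line. */
  def fetch(url: String): Vector[LabeledRow] = {
    val source = Source.fromURL(url, "UTF-8")
    try source.getLines().map(_.trim).filter(_.nonEmpty).map(parseLine).toVector
    finally source.close()
  }
}
```

In Spark, a function like this would presumably be called from inside the single-element `flatMap`, for example `sc.parallelize(Seq(url), 1).flatMap(HttpLibsvm.fetch)`, so that the download runs in one task on an executor rather than on the driver. That usage is an untested assumption, and because the whole file is read by one task it gives no parallelism on load, which matches the note that the initial version might not support file splits.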

It would be great if you could help with the design. Please keep the features minimal. Thanks!

> Public dataset loader interface
> -------------------------------
>                 Key: SPARK-10388
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
> It is very useful to have a public dataset loader to fetch ML datasets from popular repos,
> e.g., libsvm and UCI. This JIRA is to discuss the design, requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets ="libsvm") // returns a local DataFrame
> // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the API and
> implementation are pending discussion. Note that this requires HTTP and HTTPS support.
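
Since the registration API is still pending discussion, the following is only one possible shape for letting third-party packages register repos. Everything here is hypothetical: the `DatasetRepo` trait, the `DatasetRepos` registry, the `StaticRepo` helper, and the `https://example.org/datasets/` placeholder URL are illustrative names, not a proposed or existing Spark API.

```scala
import scala.collection.mutable

// Hypothetical interface: one implementation per repository (e.g. libsvm, UCI).
trait DatasetRepo {
  def name: String
  def list(): Seq[String]          // names of the datasets the repo offers
  def url(dataset: String): String // HTTP(S) location of a single dataset
}

// Hypothetical process-wide registry that third-party packages could call into.
object DatasetRepos {
  private val repos = mutable.LinkedHashMap.empty[String, DatasetRepo]

  /** Add a repo, replacing any earlier repo registered under the same name. */
  def register(repo: DatasetRepo): Unit = synchronized { repos(repo.name) = repo }

  def lookup(name: String): Option[DatasetRepo] = synchronized { repos.get(name) }

  /** Registered repo names, in registration order. */
  def names: Seq[String] = synchronized { repos.keys.toList }
}

// Simple repo backed by a fixed list of file names under one base URL.
final class StaticRepo(val name: String, baseUrl: String, datasets: Seq[String])
    extends DatasetRepo {
  def list(): Seq[String] = datasets

  def url(dataset: String): String = {
    require(datasets.contains(dataset), s"unknown dataset '$dataset' in repo '$name'")
    baseUrl.stripSuffix("/") + "/" + dataset
  }
}
```

A package would then call something like `DatasetRepos.register(new StaticRepo("demo", "https://example.org/datasets/", Seq("a.txt")))` at load time, and a loader could resolve a `(repo, dataset)` pair to a URL through `DatasetRepos.lookup`. Keeping the repo interface limited to listing and URL resolution leaves the actual fetching and parsing to a separate layer.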
