spark-issues mailing list archives

From "Kai Sasaki (JIRA)" <>
Subject [jira] [Commented] (SPARK-10388) Public dataset loader interface
Date Tue, 15 Sep 2015 13:08:45 GMT


Kai Sasaki commented on SPARK-10388:

This seems very useful for beginners who want to try Spark ML on their own projects and
see how the Pipeline API behaves. I have several comments.

* It might be better to download lazily. Some datasets are very large, so it would be good
to download them only when they are really needed. In the above example, datasets are downloaded at {{}}.
* Once datasets are downloaded, it would be better to cache them locally. This
requires the repository API to publish the latest update, so that the public dataset loader can
update its local cache properly.
* I agree with the idea of allowing 3rd parties to create their own repositories. This requires settling
the design of the repository itself. We can create a specification and also an SDK if possible.
(Should these be included in the Spark project?)
* We should not restrict the formats that the public dataset loader can load. The current {{DataFrameReader}}
can read formats such as json, libsvm, or orc. Public datasets may come in various formats,
so it may be reasonable to also support, in the future, formats that are not currently
supported.
* Although this is just a passing idea, integrating the public dataset loader with Kaggle datasets
would increase the use cases of Spark ML.
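
The lazy-download and local-cache points above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed Spark API: {{DatasetRepository}}, {{LazyDatasetLoader}}, {{DatasetHandle}}, {{latestVersion}}, and {{FakeRepo}} are hypothetical names, the version tag stands in for whatever "latest update" signal a repository would publish, and the fake repository replaces real http/https access so the sketch runs without a network.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.collection.mutable

// Hypothetical interface that a (possibly 3rd-party) repository would implement.
trait DatasetRepository {
  def name: String
  // Version tag of the latest published copy (assumed: e.g. a timestamp or checksum).
  def latestVersion(dataset: String): String
  // Returns the raw bytes; a real repository would issue an http/https request here.
  def fetch(dataset: String): Array[Byte]
}

// Hypothetical loader: get() only returns a handle, nothing is downloaded yet.
class LazyDatasetLoader(cacheDir: Path) {
  private val repos = mutable.Map.empty[String, DatasetRepository]

  def register(repo: DatasetRepository): Unit = { repos(repo.name) = repo }

  // Throws NoSuchElementException if the repository was never registered.
  def get(repoName: String, dataset: String): DatasetHandle =
    new DatasetHandle(repos(repoName), dataset, cacheDir)
}

class DatasetHandle(repo: DatasetRepository, dataset: String, cacheDir: Path) {
  private val dataFile = cacheDir.resolve(s"${repo.name}-$dataset")
  private val versionFile = cacheDir.resolve(s"${repo.name}-$dataset.version")

  // Downloads on first use, then reuses the cached file until the
  // repository reports a different version.
  def path(): Path = {
    val latest = repo.latestVersion(dataset)
    val cached =
      if (Files.exists(dataFile) && Files.exists(versionFile))
        Some(new String(Files.readAllBytes(versionFile), StandardCharsets.UTF_8))
      else None
    if (!cached.contains(latest)) {
      Files.createDirectories(cacheDir)
      Files.write(dataFile, repo.fetch(dataset))
      Files.write(versionFile, latest.getBytes(StandardCharsets.UTF_8))
    }
    dataFile
  }
}

// In-memory stand-in for a remote repository that counts how often it is hit.
class FakeRepo extends DatasetRepository {
  var fetchCount = 0
  var version = "v1"
  val name = "libsvm"
  def latestVersion(dataset: String): String = version
  def fetch(dataset: String): Array[Byte] = {
    fetchCount += 1
    s"$dataset@$version".getBytes(StandardCharsets.UTF_8)
  }
}
```

With this shape, calling {{get}} costs nothing, the first {{path()}} call triggers the download, and later calls hit the repository again only when the published version changes. The returned path could then be handed to a {{DataFrameReader}} in whatever format the dataset uses.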

In general, searching for data and loading it are troublesome. This feature would make both easier
for developers. I would like to help with the design and implementation. Thank you.

> Public dataset loader interface
> -------------------------------
>                 Key: SPARK-10388
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
> It is very useful to have a public dataset loader to fetch ML datasets from popular repos,
> e.g., libsvm and UCI. This JIRA is to discuss the design, requirements, and initial implementation.
> {code}
> val loader = new DatasetLoader(sqlContext)
> val df = loader.get("libsvm", "rcv1_train.binary")
> {code}
> Users should be able to list (or preview) datasets, e.g.
> {code}
> val datasets ="libsvm") // returns a local DataFrame
> // list all datasets under libsvm repo
> {code}
> It would be nice to allow 3rd-party packages to register new repos. Both the API and
> implementation are pending discussion. Note that this requires http and https support.

This message was sent by Atlassian JIRA
