spark-issues mailing list archives

From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-22666) Spark datasource for image format
Date Tue, 04 Sep 2018 10:04:02 GMT


Apache Spark reassigned SPARK-22666:

    Assignee:     (was: Apache Spark)

> Spark datasource for image format
> ---------------------------------
>                 Key: SPARK-22666
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Timothy Hunter
>            Priority: Major
> The current API for the new image format is implemented as a standalone feature so that
it can reside within the mllib package. As discussed in SPARK-21866, users should be able
to load images through the more common Spark data source reader interface.
> This ticket covers adding image reading support to the Spark data source API, through
either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{}}
> The output is a DataFrame that contains the images (and, for example, their file names),
following the semantics already discussed in SPARK-21866.
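
A minimal sketch of what a call through the generic reader might look like, assuming the source ends up registered under the short name "image" (the name quoted in the ticket). It is written in Python against the usual reader chain (spark.read.format(...).load(...)). The helper name load_images and the example path are hypothetical, and nothing here should be read as the final implementation.

```python
def load_images(spark, path):
    """Load images under `path` through the generic DataFrame reader.

    `spark` is expected to be a SparkSession. The "image" format name is
    an assumption taken from the ticket text, not a confirmed API.
    """
    return spark.read.format("image").load(path)
```

With a live session this would be called as load_images(spark, "/data/images"), where the path is a placeholder. Per the first technical note in the ticket, such a call may fail at runtime if spark-mllib is not on the classpath.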
> A few technical notes:
> * Since the functionality is implemented in {{mllib}}, calling this function may fail
at runtime if users have not added the {{spark-mllib}} dependency.
> * How should very flat directories be handled? It is common to have millions of files in a
single "directory" (as in S3), which seems to have caused issues for some users. If
this issue is too complex to handle in this ticket, it can be dealt with separately.
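
To make the flat-directory concern concrete, here is a rough sketch of one way to avoid holding a multi-million-entry listing in memory: stream the entries and hand them out in fixed-size batches. This is only an illustration on a local filesystem using the Python standard library. It is not how Spark lists files, it does not cover object stores such as S3, and the function name iter_file_batches and the default batch size are assumptions made for the example.

```python
import os
from itertools import islice


def iter_file_batches(directory, batch_size=10000):
    """Yield lists of file paths from `directory`, at most `batch_size` each.

    os.scandir returns a lazy iterator, so the full listing is never
    materialized at once; subdirectories are skipped.
    """
    with os.scandir(directory) as entries:
        files = (entry.path for entry in entries if entry.is_file())
        while True:
            batch = list(islice(files, batch_size))
            if not batch:
                return
            yield batch
```

Each batch could then be processed independently, which keeps memory use bounded by the batch size rather than by the size of the directory.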

This message was sent by Atlassian JIRA
