spark-dev mailing list archives

From Tim Hunter <timhun...@databricks.com>
Subject Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
Date Thu, 28 Sep 2017 15:31:16 GMT
Thank you everyone for the comments and the votes. We will follow up
shortly with a pull request.

On Wed, Sep 27, 2017 at 6:32 PM, Joseph Bradley <joseph@databricks.com>
wrote:

> This vote passes with 11 +1s (4 binding) and no +0s or -1s.
>
> +1:
> Sean Owen (binding)
> Holden Karau
> Denny Lee
> Reynold Xin (binding)
> Joseph Bradley (binding)
> Noman Khan
> Weichen Xu
> Yanbo Liang
> Dongjoon Hyun
> Matei Zaharia (binding)
> Vaquar Khan
>
> Thanks everyone!
> Joseph
>
> On Sat, Sep 23, 2017 at 4:23 PM, vaquar khan <vaquar.khan@gmail.com>
> wrote:
>
>> +1 looks good,
>>
>> Regards,
>> Vaquar khan
>>
>> On Sat, Sep 23, 2017 at 12:22 PM, Matei Zaharia <matei.zaharia@gmail.com>
>> wrote:
>>
>>> +1; we should consider something similar for multi-dimensional tensors
>>> too.
>>>
>>> Matei
>>>
>>> > On Sep 23, 2017, at 7:27 AM, Yanbo Liang <ybliang8@gmail.com> wrote:
>>> >
>>> > +1
>>> >
>>> > On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <nomanbplmp@live.com>
>>> wrote:
>>> > +1
>>> >
>>> > Regards
>>> > Noman
>>> > From: Denny Lee <denny.g.lee@gmail.com>
>>> > Sent: Friday, September 22, 2017 2:59:33 AM
>>> > To: Apache Spark Dev; Sean Owen; Tim Hunter
>>> > Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
>>> > Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>>> >
>>> > +1
>>> >
>>> > On Thu, Sep 21, 2017 at 11:15 Sean Owen <sowen@cloudera.com> wrote:
>>> > Am I right that this doesn't mean other packages would use this
>>> representation, but that they could?
>>> >
>>> > The representation looked fine to me w.r.t. what DL frameworks need.
>>> >
>>> > My previous comment was that this is actually quite lightweight. It's
>>> kind of like how I/O support is provided for CSV and JSON, so makes enough
>>> sense to add to Spark. It doesn't really preclude other solutions.
>>> >
>>> > For those reasons I think it's fine. +1
>>> >
>>> > On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter <timhunter@databricks.com>
>>> wrote:
>>> > Hello community,
>>> >
>>> > I would like to call for a vote on SPARK-21866. It is a short proposal
>>> that has important applications for image processing and deep learning.
>>> Joseph Bradley has offered to be the shepherd.
>>> >
>>> > JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
>>> > PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>>> >
>>> > Background and motivation
>>> > As Apache Spark is being used more and more in the industry, some new
>>> use cases are emerging for different data formats beyond the traditional
>>> SQL types or the numerical types (vectors and matrices). Deep Learning
>>> applications commonly deal with image processing. A number of projects add
>>> some Deep Learning capabilities to Spark (see list below), but they
>>> struggle to communicate with each other or with MLlib pipelines because
>>> there is no standard way to represent an image in Spark DataFrames. We
>>> propose to federate efforts for representing images in Spark by defining a
>>> representation that caters to the most common needs of users and library
>>> developers.
>>> > This SPIP proposes a specification to represent images in Spark
>>> DataFrames and Datasets (based on existing industrial standards), and an
>>> interface for loading sources of images. It is not meant to be a
>>> full-fledged image processing library, but rather the core description that
>>> other libraries and users can rely on. Several packages already offer
>>> various processing facilities for transforming images or doing more complex
>>> operations, and each has various design tradeoffs that make them better as
>>> standalone solutions.
>>> > This project is a joint collaboration between Microsoft and
>>> Databricks, which have been testing this design in two open source
>>> packages: MMLSpark and Deep Learning Pipelines.
>>> > The proposed image format is an in-memory, decompressed representation
>>> that targets low-level applications. It is significantly more liberal in
>>> memory usage than compressed image representations such as JPEG, PNG, etc.,
>>> but it allows easy communication with popular image processing libraries
>>> and has no decoding overhead.
>>> > Target users and personas:
>>> > Data scientists, data engineers, library developers.
>>> > The following libraries define primitives for loading and representing
>>> images, and will gain from a common interchange format (in alphabetical
>>> order):
>>> >       • BigDL
>>> >       • DeepLearning4J
>>> >       • Deep Learning Pipelines
>>> >       • MMLSpark
>>> >       • TensorFlow (Spark connector)
>>> >       • TensorFlowOnSpark
>>> >       • TensorFrames
>>> >       • Thunder
>>> > Goals:
>>> >       • Simple representation of images in Spark DataFrames, based on
>>> pre-existing industrial standards (OpenCV)
>>> >       • This format should eventually allow the development of
>>> high-performance integration points with image processing libraries such as
>>> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>>> >       • The reader should be able to read popular formats of images
>>> from distributed sources.
>>> > Non-Goals:
>>> > Images are a versatile medium and encompass a very wide range of
>>> formats and representations. This SPIP explicitly aims at the most common
>>> use case in the industry currently: multi-channel matrices of binary,
>>> int32, int64, float or double data that can fit comfortably in the heap of
>>> the JVM:
>>> >       • the total size of an image should be restricted to less than
>>> 2GB (roughly)
>>> >       • the meaning of color channels is application-specific and is
>>> not mandated by the standard (in line with the OpenCV standard)
>>> >       • specialized formats used in meteorology, the medical field,
>>> etc. are not supported
>>> >       • this format is specialized to images and does not attempt to
>>> solve the more general problem of representing n-dimensional tensors in
>>> Spark
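
To make the size bound in the non-goals above concrete, here is a minimal Python sketch of the arithmetic. It is not part of the proposal: the names `BYTES_PER_DEPTH`, `MAX_IMAGE_BYTES`, `decompressed_size` and `fits_in_limit` are hypothetical, and the exact limit of 2**31 - 1 bytes is an assumption standing in for the "roughly 2GB" figure. The depth names are the standard OpenCV ones.

```python
# Bytes per channel value for the standard OpenCV depth names.
BYTES_PER_DEPTH = {"8U": 1, "8S": 1, "16U": 2, "16S": 2, "32S": 4, "32F": 4, "64F": 8}

# Assumption for this sketch: "less than 2GB (roughly)" is taken to mean
# the largest signed 32-bit value.
MAX_IMAGE_BYTES = 2**31 - 1


def decompressed_size(height, width, n_channels, depth="8U"):
    """Return the size in bytes of the raw, decompressed pixel buffer."""
    return height * width * n_channels * BYTES_PER_DEPTH[depth]


def fits_in_limit(height, width, n_channels, depth="8U"):
    """Return True if the raw buffer stays within the assumed size bound."""
    return decompressed_size(height, width, n_channels, depth) <= MAX_IMAGE_BYTES
```

Under these assumptions, a 1920x1080 image with 3 unsigned-byte channels takes 6,220,800 bytes decompressed, while a 30000x30000 image with 3 channels would not fit.
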
>>> > Proposed API changes
>>> > We propose to add a new package in the package structure, under the
>>> MLlib project:
>>> > org.apache.spark.image
>>> > Data format
>>> > We propose to add the following structure:
>>> > imageSchema = StructType([
>>> >       • StructField("mode", StringType(), False),
>>> >               • The exact representation of the data.
>>> >               • The values are described in the following OpenCV
>>> convention. Basically, the type has both "depth" and "number of channels"
>>> info: in particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA
>>> format would be CV_8UC4 (value 32 in the table) with the channel order
>>> specified by convention.
>>> >               • The exact channel ordering and meaning of each channel
>>> is dictated by convention. By default, the order is RGB (3 channels) and
>>> BGRA (4 channels).
>>> > If the image failed to load, the value is the empty string "".
>>> >       • StructField("origin", StringType(), True),
>>> >               • Some information about the origin of the image. The
>>> content of this is application-specific.
>>> >               • When the image is loaded from files, users should
>>> expect to find the file name in this field.
>>> >       • StructField("height", IntegerType(), False),
>>> >               • The height of the image, in pixels.
>>> >               • If the image fails to load, the value is -1.
>>> >       • StructField("width", IntegerType(), False),
>>> >               • The width of the image, in pixels.
>>> >               • If the image fails to load, the value is -1.
>>> >       • StructField("nChannels", IntegerType(), False),
>>> >               • The number of channels in this image: it is typically
>>> a value of 1 (B&W), 3 (RGB), or 4 (BGRA)
>>> >               • If the image fails to load, the value is -1.
>>> >       • StructField("data", BinaryType(), False)
>>> >               • Packed array content. Due to an implementation
>>> limitation, it cannot currently store more than 2 billion pixels.
>>> >               • The data is stored in a pixel-by-pixel BGR row-wise
>>> order. This follows the OpenCV convention.
>>> >               • If the image fails to load, this array is empty.
>>> > For more information about image types, here is an OpenCV guide on
>>> types: http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates
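
The "mode" and "data" fields described above can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the reference implementation: `parse_mode`, `pixel_offset` and `get_pixel` are hypothetical helper names, and the offset formula assumes one byte per channel value (an 8-bit depth such as CV_8UC3) in the row-wise, pixel-by-pixel layout.

```python
import re


def parse_mode(mode):
    """Split an OpenCV-style type name such as "CV_8UC3" into (depth, channels)."""
    match = re.fullmatch(r"CV_(8U|8S|16U|16S|32S|32F|64F)C(\d+)", mode)
    if match is None:
        raise ValueError(f"not an OpenCV type name: {mode!r}")
    return match.group(1), int(match.group(2))


def pixel_offset(row, col, channel, width, n_channels):
    """Byte offset of one channel value, assuming one byte per channel value."""
    return (row * width + col) * n_channels + channel


def get_pixel(data, row, col, width, n_channels):
    """Return the channel values of one pixel as a tuple, in stored order."""
    start = pixel_offset(row, col, 0, width, n_channels)
    return tuple(data[start:start + n_channels])
```

For a 2x2, 3-channel image stored as `bytes([255, 0, 0, 0, 255, 0, 0, 0, 255, 10, 20, 30])`, `get_pixel(data, 1, 1, 2, 3)` returns `(10, 20, 30)`. With the BGR ordering described above, the first pixel `(255, 0, 0)` would be pure blue.
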
>>> > The reference implementation provides some functions to convert
>>> popular formats (JPEG, PNG, etc.) to the image specification above, and
>>> some functions to verify if an image is valid.
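
As a rough illustration of what such a validity check might look like, the sketch below tests a row against the load-failure values listed in the data format (empty mode, -1 for height, width and nChannels, empty data). It is a guess at the idea, not the reference implementation's code: rows are modeled as plain Python dicts, and `failed_image` and `is_valid_image` are hypothetical names.

```python
def failed_image(origin):
    """Build a row carrying the load-failure values described in the data format."""
    return {
        "mode": "",
        "origin": origin,
        "height": -1,
        "width": -1,
        "nChannels": -1,
        "data": b"",
    }


def is_valid_image(image):
    """Return False if any field holds its load-failure value."""
    failed = (
        image["mode"] == ""
        or image["height"] == -1
        or image["width"] == -1
        or image["nChannels"] == -1
        or len(image["data"]) == 0
    )
    return not failed
```

A row produced by `failed_image("broken.png")` is reported as invalid, while a row with a non-empty mode, non-negative dimensions and non-empty data passes this check. The check does not verify that the data length matches the dimensions.
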
>>> > Image ingest API
>>> > We propose the following function to load images from a remote
>>> distributed source as a DataFrame. Here is the signature in Scala. The
>>> Python interface is similar. For compatibility with Java, this function
>>> should be made available through a builder pattern or through the
>>> DataSource API. The exact mechanics can be discussed during implementation;
>>> the goal of the proposal below is to propose a specification of the
>>> behavior and of the options:
>>> > def readImages(
>>> >     path: String,
>>> >     session: SparkSession = null,
>>> >     recursive: Boolean = false,
>>> >     numPartitions: Int = 0,
>>> >     dropImageFailures: Boolean = false,
>>> >     // Experimental options
>>> >     sampleRatio: Double = 1.0): DataFrame
>>> >
>>> > The type of the returned DataFrame should be the structure type above,
>>> with the expectation that all the file names be filled.
>>> > Mandatory parameters:
>>> >       • path: a directory in a file system that contains images
>>> > Optional parameters:
>>> >       • session (SparkSession, default null): the Spark Session to use
>>> to create the dataframe. If not provided, it will use the current default
>>> Spark session via SparkSession.getOrCreate().
>>> >       • recursive (bool, default false): take the top-level images or
>>> look into directory recursively
>>> >       • numPartitions (int, default 0): the number of partitions of
>>> the final dataframe. By default uses the default number of partitions from
>>> Spark.
>>> >       • dropImageFailures (bool, default false): drops the files that
>>> failed to load. If false (do not drop), some invalid images are kept.
>>> > Parameters that are experimental/may be quickly deprecated. These
>>> would be useful to have but are not critical for a first cut:
>>> >       • sampleRatio (float, in (0,1], default 1): if less than 1,
>>> returns a fraction of the data. There is no statistical guarantee about how
>>> the sampling is performed. This proved to be very helpful for fast
>>> prototyping. Marked as experimental since it should be pushed to the Spark
>>> core.
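
The option semantics above can be modeled on a local file system with a few lines of plain Python. This is a toy, non-Spark sketch and not the proposed API: `list_image_files` and `apply_drop_image_failures` are hypothetical names, the `seed` parameter is an addition for reproducibility, and per-file random sampling is only one possible reading of sampleRatio.

```python
import random
from pathlib import Path


def list_image_files(path, recursive=False, sample_ratio=1.0, seed=None):
    """Hypothetical local stand-in for the file-selection part of readImages."""
    root = Path(path)
    # recursive=False keeps only top-level files; recursive=True walks subdirectories.
    candidates = root.rglob("*") if recursive else root.iterdir()
    files = sorted(p for p in candidates if p.is_file())
    if sample_ratio < 1.0:
        rng = random.Random(seed)
        # Keep each file independently with probability sample_ratio, so the
        # number of files returned is not guaranteed.
        files = [p for p in files if rng.random() < sample_ratio]
    return files


def apply_drop_image_failures(rows, drop_image_failures=False):
    """Remove rows marked as failed (height == -1) only when the option is set."""
    if not drop_image_failures:
        return list(rows)
    return [row for row in rows if row["height"] != -1]
```

With the default `drop_image_failures=False`, failed rows stay in the result and can be recognized by their failure values.
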
>>> > The implementation is expected to be in Scala for performance, with a
>>> wrapper for Python.
>>> > This function should be lazy to the extent possible: it should not
>>> trigger access to the data when called. Ideally, any file system supported
>>> by Spark should be supported when loading images. There may be restrictions
>>> for some options such as zip files, etc.
>>> > The reference implementation also has some experimental options
>>> (undocumented here).
>>> > Reference implementation
>>> > A reference implementation is available as an open-source Spark
>>> package in this repository (Apache 2.0 license):
>>> > https://github.com/Microsoft/spark-images
>>> > This Spark package will also be published in a binary form on
>>> spark-packages.org .
>>> > Comments about the API should be addressed in this ticket.
>>> > Optional Rejected Designs
>>> > The use of User-Defined Types was considered. It adds some burden to
>>> the implementation of various languages and does not provide significant
>>> advantages.
>>> >
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>>
>>>
>>
>>
>> --
>> Regards,
>> Vaquar Khan
>> +1 -224-436-0783
>> Greater Chicago
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] <http://databricks.com/>
>
