spark-issues mailing list archives

From "Tyson Condie (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-27006) SPIP: .NET bindings for Apache Spark
Date Thu, 14 Mar 2019 18:23:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-27006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792939#comment-16792939 ]

Tyson Condie commented on SPARK-27006:
--------------------------------------

I would like to briefly explain what I think this SPIP is trying to accomplish. I have
worked in the Apache community for the better part of my career: early on doing research at
UC Berkeley related to Hadoop, then joining the Pig team at Yahoo! Research, and later being
part of the Microsoft CISL team that created Apache REEF, which became Microsoft's
first ever top-level Apache project and remains so to this day. I also had the brief pleasure
of working with the Structured Streaming team at Databricks, where I witnessed first-hand
some of the exceptional minds behind Apache Spark.

So, what is this SPIP about? In my honest opinion, it is about bringing two very large communities
together under a common shared goal: *to democratize data for all developers*. Given my roots,
I am a Java developer at heart, but I see tremendous value in the .NET stack and its
languages. Not surprisingly, then, I see a significant barrier to entry when telling longtime
.NET developers that if they want to use Apache Spark, they must code in Scala/Java,
Python, or R. The .NET team conducted a survey, with 1000+ responses, revealing a strong desire
in the .NET developer community to learn and use Spark. This SPIP is about making that process
much more familiar, but that's not all it's about.

This SPIP is about the Microsoft community wanting to learn from and contribute to the Apache
Spark community, and we are fully funded to do just that. Our leadership team includes Michael
Rys and Rahul Potharaju from the Big Data organization, along with Ankit Asthana and Dan Moseley
from the .NET organization. Our development team includes Terry Kim, Steve Suh, Stephen Toub,
Eric Erhardt, Aaron Robinson, and me, where I am again in the company of equally exceptional
minds. Together, our goal is to develop .NET bindings for Spark in accordance with best practices
from the Apache Software Foundation and Spark guidelines. We would welcome the opportunity to
partner with leaders in the Apache Spark community, not only for their guidance on the work items
described in this SPIP, but also on engagements that will bring our communities closer together
and lead us to mutually beneficial outcomes.

Regarding the work items in this SPIP, as recommended by earlier comments, we will develop
externally (and openly) on a fork of Apache Spark. We only ask that a shepherd be available
to provide us with occasional guidance towards getting our fork in a state that is acceptable
for a contribution back to Apache Spark master. We recognize that such a contribution will
not happen overnight, and that we will need to prove to the Spark community that we will continue
to maintain it for the foreseeable future. That is why building a +diverse+ community is
a very high priority for us, as it will secure future investment in .NET bindings for
Apache Spark. All of this will take time. For now, we ask only whether there is a Spark PMC
member who is willing to step up and be our shepherd.

Thank you for reading this far, and we look forward to seeing you at the SF Spark Summit in
April, where we will be presenting our early progress on enabling .NET bindings for Apache
Spark.

 

> SPIP: .NET bindings for Apache Spark
> ------------------------------------
>
>                 Key: SPARK-27006
>                 URL: https://issues.apache.org/jira/browse/SPARK-27006
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Terry Kim
>            Priority: Minor
>   Original Estimate: 4,032h
>  Remaining Estimate: 4,032h
>
> h4. Background and Motivation: 
> Apache Spark provides programming language support for Scala/Java (native), with extensions
for Python and R. While a variety of other language extensions could be included in
Apache Spark, .NET would bring one of the largest developer communities to the table.
Presently, no good open-source Big Data solution exists for .NET developers. This
SPIP aims at discussing how we can bring Apache Spark goodness to the .NET development
platform.
> .NET is a free, cross-platform, open source developer platform for building many different
types of applications. With .NET, you can use multiple languages, editors, and libraries
to build web, mobile, desktop, gaming, and IoT applications. Even with .NET
serving millions of developers, no good Big Data solution exists for them today, a gap
this SPIP aims to address.
> The .NET developer community is one of the largest programming language communities in
the world. Its flagship programming language C# is listed as one of the most popular programming
languages in a variety of articles and statistics: 
>  * Most popular Technologies on Stack Overflow: [https://insights.stackoverflow.com/survey/2018/#most-popular-technologies|https://insights.stackoverflow.com/survey/2018/]  
>  * Most popular languages on GitHub 2018: [https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10#2-java-9|https://www.businessinsider.com/the-10-most-popular-programming-languages-according-to-github-2018-10] 
>  * 1M+ new developers last 1 year  
>  * Second most demanded technology on LinkedIn 
>  * Top 30 High velocity OSS projects on GitHub 
> Including a C# language extension in Apache Spark will enable millions of .NET developers
to author Big Data applications in their preferred programming language, developer environment,
and tooling support. We aim to promote the .NET bindings for Spark through engagements with
the Spark community (e.g., we are scheduled to present an early prototype at the SF Spark
Summit 2019) and the .NET developer community (e.g., similar presentations will be held at
.NET developer conferences this year).  As such, we believe that our efforts will help grow
the Spark community by making it accessible to the millions of .NET developers. 
> Furthermore, our early discussions with some large .NET development teams were met with an
enthusiastic reception.
> We recognize that earlier attempts at this goal (specifically Mobius [https://github.com/Microsoft/Mobius])
were unsuccessful primarily due to the lack of communication with the Spark community. Therefore,
another goal of this proposal is to not only develop .NET bindings for Spark in open source,
but also to continuously seek feedback from the Spark community via posted JIRAs (like this
one) and the Spark developer mailing list. Our hope is that through these engagements, we
can build a community of developers that are eager to contribute to this effort or want to
leverage the resulting .NET bindings for Spark in their respective Big Data applications. 
> h4. Target Personas: 
> .NET developers looking to build big data solutions.  
> h4. Goals: 
> Our primary goal is to help grow Apache Spark by making it accessible to the large
.NET developer base and ecosystem. We will also look for opportunities to generalize the
interop layers for Spark to ease adding other language extensions in the future. [SPARK-26257|https://issues.apache.org/jira/browse/SPARK-26257]
proposes such a generalized interop layer, which we hope to address over the course of this
project.  
> Another important goal for us is to not only enable Spark as an application solution
for .NET developers, but also opening the door for .NET developers to make contributions
to Apache Spark itself.   
> Lastly, we aim to develop a .NET extension in the open, while continually engaging
with the Spark community for feedback on designs and code. We will welcome PRs from the
Spark community throughout this project and aim to grow a community of developers that want
to contribute to this project.  
> h4. Non-Goals: 
> This proposal focuses on adding .NET bindings to Apache Spark and leaves any
performance-related tasks for future work. Further, we aim to provide support only at
the DataFrame level.
> h4. Proposed API Changes: 
> This work mostly involves introducing new .NET binding APIs. For example, we would
introduce .NET UDF related classes such as DotnetUDF, UserDefinedDotnetFunction, etc., along
with classes responsible for running .NET UDFs such as DotnetRunner, DotnetWorkerFactory,
etc. 
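To make the shape of these classes concrete, here is a minimal, hypothetical sketch (Python stands in for the eventual .NET/Scala implementation, since none of these classes exist yet; the class name is invented): a UDF object carries an opaque serialized function, and evaluation deserializes and applies it row by row, much as PySpark ships a pickled command into the physical plan today.

```python
import pickle

def plus_one(x):
    return x + 1

class SerializedUDF:
    """Toy stand-in for the proposed UserDefinedDotnetFunction: it carries
    a name plus an opaque serialized function body, mirroring how PySpark
    attaches a pickled command to its PythonUDF expression."""
    def __init__(self, name, func):
        self.name = name
        self.command = pickle.dumps(func)  # opaque payload for the worker

    def eval(self, rows):
        # In Spark, a physical operator (the proposed DotnetRunner analogue)
        # would ship self.command to an external worker process; here we
        # simply deserialize and apply it in-process, row by row.
        func = pickle.loads(self.command)
        return [func(row) for row in rows]

udf = SerializedUDF("plus_one", plus_one)
print(udf.eval([1, 2, 3]))  # -> [2, 3, 4]
```

The point of the sketch is only the data flow: the expression node never holds a live function, just bytes that cross the process boundary.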
> This work should have minimal impact on existing Spark APIs. However, in order to provide
a clean solution, we foresee the possibility of introducing .NET specific hooks in the Dataset
API for collecting data in the driver program, for example. 
> We also will be introducing Catalyst rules that will plan the physical operator (that
we will introduce) for the DotnetUDF expression in the logical plan. 
> On the C# side, similar to existing language extensions, we will introduce proxy artifacts
that mimic the SparkSession, DataFrame, and other APIs related to Spark SQL (e.g., columns
and functions native to Spark SQL).
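As an illustration of the proxy pattern (a sketch only; the class and method names here are invented, and Python stands in for C#), each proxy holds nothing but an id for a remote object and forwards every method call as a serialized message, similar in spirit to how PySpark's Py4J-based wrappers delegate to the JVM:

```python
import json

class Bridge:
    """Toy stand-in for the JVM side of the interop bridge: it owns the
    real objects and dispatches serialized method calls onto them."""
    def __init__(self):
        self._objects = {}
        self._next_id = 0

    def register(self, obj):
        self._next_id += 1
        self._objects[self._next_id] = obj
        return self._next_id

    def call(self, message):
        # In the real design this message would arrive over a socket from
        # the .NET process; here it is an in-process JSON string.
        req = json.loads(message)
        target = self._objects[req["object_id"]]
        result = getattr(target, req["method"])(*req["args"])
        return json.dumps({"result": result})

class Proxy:
    """Mimics the proposed C# proxy classes (SparkSession, DataFrame, ...):
    no local state beyond a remote object id; every call is forwarded."""
    def __init__(self, bridge, object_id):
        self._bridge = bridge
        self._object_id = object_id

    def invoke(self, method, *args):
        msg = json.dumps({"object_id": self._object_id,
                          "method": method, "args": list(args)})
        return json.loads(self._bridge.call(msg))["result"]

bridge = Bridge()
oid = bridge.register([10, 20, 30])  # pretend this list is a JVM-side object
df = Proxy(bridge, oid)
print(df.invoke("count", 10))        # forwards list.count(10) -> 1
```

A real implementation would also track object lifetimes (releasing JVM objects when their proxies are collected), which this sketch omits.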
> We will also look into augmenting the existing spark-submit and spark-shell scripts
with the ability to recognize a .NET environment.  
> h4. Optional Design Sketch: 
> Our design will largely follow the design of Python Spark support, including how worker
orchestration is performed (i.e., two-process solution, IPC communication). As such, we will
introduce “Runners” specific to executing Dotnet driver and UDF workers.  
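The two-process shape can be sketched with nothing but the standard library (a toy illustration, not the proposed implementation: PySpark actually uses sockets and a daemon/worker protocol, and a .NET worker would speak a similar binary protocol rather than plain text over pipes). The driver launches a worker process and streams rows to it; the worker applies the UDF and streams results back:

```python
import subprocess
import sys

# Worker program: reads integers line by line, applies the "UDF",
# and writes one result per line.
WORKER = r"""
import sys
for line in sys.stdin:
    print(int(line) + 1)
    sys.stdout.flush()
"""

def run_with_worker(values):
    # Launch the worker as a separate process and talk to it over pipes,
    # the same two-process shape PySpark uses for its Python workers.
    proc = subprocess.Popen([sys.executable, "-c", WORKER],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                            text=True)
    out, _ = proc.communicate("\n".join(str(v) for v in values) + "\n")
    return [int(line) for line in out.split()]

print(run_with_worker([1, 2, 3]))  # -> [2, 3, 4]
```

The "Runner" pieces in the proposal would sit on the JVM side of exactly this boundary, serializing rows out and deserializing results back.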
> h4. Optional Rejected Designs: 
> The clear alternative is the status quo: developers who want to leverage Apache Spark
do so through one of the existing supported languages, i.e., Scala/Java, Python, or R. This
has some costly consequences, such as:
>  * Learning a new programming language and development environment. 
>  * Integrating with existing .NET technologies through complex interop. 
>  * Migrating legacy code and library dependencies to a supported language. 
> Another alternative is that third-party languages should only interact with Spark via
pure SQL, possibly via REST. However, this does not enable UDFs or UDAFs written in C#,
which are a key desideratum in this effort. The most notable case is legacy code/UDFs that
would need to be ported to a supported language (e.g., Scala). That exercise is extremely
cumbersome and not always feasible, since the source code may no longer be available (i.e.,
only the compiled library exists). As mentioned earlier, the .NET developer community is one
of the largest in the world, and as such there exist many instances of legacy code (e.g.,
machine learning routines) that would be difficult to port without the existing .NET library
dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

