spark-dev mailing list archives

From Cody Koeninger <c...@koeninger.org>
Subject Re: Odp.: Spark Improvement Proposals
Date Mon, 31 Oct 2016 17:34:09 GMT
Now that Spark Summit Europe is over, are any committers interested in
moving forward with this?

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Or are we going to let this discussion die on the vine?

On Mon, Oct 17, 2016 at 10:05 AM, Tomasz Gawęda
<tomasz.gaweda@outlook.com> wrote:
> Maybe my mail was not clear enough.
>
>
> I didn't want to write "let's focus on Flink" or any other framework. The
> idea with benchmarks was to show two things:
>
> - why some people are giving Spark bad PR
>
> - how - in an easy way - we can change that and show that Spark is still on
> top
>
>
> No more, no less. Benchmarks will be helpful, but I don't think they're the
> most important thing in Spark :) The Spark main page still shows the "Spark
> vs Hadoop" chart. It is important to show that the framework is not just the
> same Spark with another API, but much faster and better optimized, comparable
> to or even faster than other frameworks.
>
>
> About real-time streaming, I think it would simply be good to see it in Spark.
> I really like the current Spark model, but there are many voices saying "we
> need more" - the community should listen to them too and try to help them.
> With SIPs this would be easier; I just posted this example as a "thing that
> may be changed with a SIP".
>
>
> I really like the unification via Datasets, but there are a lot of algorithms
> inside - let's make an easy API, but with a strong background (articles,
> benchmarks, descriptions, etc.) that shows that Spark is still a modern
> framework.
>
>
> Maybe now my intention is clearer :) As I said, organizational ideas were
> already mentioned and I agree with them; my mail was just meant to show some
> aspects from my side, that is, from the side of a developer and a person who
> is trying to help others with Spark (via StackOverflow or other ways).
>
>
> Best regards,
>
> Tomasz
>
>
> ________________________________
> From: Cody Koeninger <cody@koeninger.org>
> Sent: 17 October 2016 16:46
> To: Debasish Das
> Cc: Tomasz Gawęda; dev@spark.apache.org
> Subject: Re: Spark Improvement Proposals
>
> I think narrowly focusing on Flink or benchmarks is missing my point.
>
> My point is evolve or die.  Spark's governance and organization are
> hampering its ability to evolve technologically, and they need to
> change.
>
> On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das <debasish.das83@gmail.com>
> wrote:
>> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
>> soon as I looked into it, since compared to writing Java map-reduce and
>> Cascading code, Spark made writing distributed code fun... But now, as we
>> have gone deeper with Spark and the real-time streaming use case gets more
>> prominent, I think it is time to bring in a messaging model in conjunction
>> with the batch/micro-batch API that Spark is good at... Close integration of
>> akka-streams with Spark's micro-batching APIs looks like a great direction to
>> stay in the game with Apache Flink... Spark 2.0 integrated streaming with
>> batch on the assumption that micro-batching is sufficient to run SQL
>> commands on streams, but do we really have time to do SQL processing on
>> streaming data within 1-2 seconds?
>>
>> After reading the email chain, I started to look into the Flink documentation,
>> and if you compare it with the Spark documentation, I think we have major work
>> to do detailing Spark internals so that more people from the community
>> start to take an active role in fixing the issues, so that Spark stays strong
>> compared to Flink.
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>>
>> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>>
>> Spark is no longer just an engine for micro-batch and batch... We (and
>> I am sure many others) are pushing Spark as an engine for stream and query
>> processing... we need to make it a state-of-the-art engine for high-speed
>> streaming data and user queries as well!
>>
>> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaweda@outlook.com>
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> I'm quite late with my answer, but I think my suggestions may help a
>>> little bit. :) Many technical and organizational topics were mentioned,
>>> but I want to focus on these negative posts about Spark and about
>>> "haters"
>>>
>>> I really like Spark. Ease of use, speed, a very good community - it has
>>> everything. But every project has to "fight" on the "framework market"
>>> to stay no. 1. I'm following many Spark and Big Data communities;
>>> maybe my mail will inspire someone :)
>>>
>>> You (every Spark developer; so far I haven't had enough time to start
>>> contributing to Spark) have done an excellent job. So why are some people
>>> saying that Flink (or another framework) is better, as was posted on
>>> this mailing list? No, not because that framework is better in all
>>> cases. In my opinion, many of these discussions were started after
>>> Flink marketing-like posts. Please look at StackOverflow "Flink vs ...."
>>> posts: almost every one is "won" by Flink. Answers sometimes say
>>> nothing about other frameworks; Flink's users (often PMC members) are
>>> just posting the same information about real-time streaming, about delta
>>> iterations, etc. It looks smart and very often it is marked as the answer,
>>> even if - in my opinion - the whole truth wasn't told.
>>>
>>>
>>> My suggestion: I don't have enough money and knowledge to perform a huge
>>> performance test. Maybe some company that supports Spark (Databricks,
>>> Cloudera? - just saying you're the most visible in the community :) ) could
>>> perform a performance test of:
>>>
>>> - streaming engine - Spark will probably lose because of the mini-batch
>>> model, however the difference should now be much smaller than in
>>> previous versions
>>>
>>> - Machine Learning models
>>>
>>> - batch jobs
>>>
>>> - Graph jobs
>>>
>>> - SQL queries
>>>
>>> People will see that Spark is evolving and is also a modern framework,
>>> because after reading the posts mentioned above people may think "it is
>>> outdated, the future is in framework X".
>>>
>>> Matei Zaharia posted an excellent blog post about how Spark Structured
>>> Streaming beats every other framework in terms of ease of use and
>>> reliability. Performance tests, done in various environments (for
>>> example: laptop, small 2-node cluster, 10-node cluster, 20-node
>>> cluster), could also be very good marketing material to say "hey, you're
>>> saying that you're better, but Spark is still faster and is
>>> getting even faster!". This would be based on facts (just numbers),
>>> not opinions. It would be good for companies, for marketing purposes and
>>> for every Spark developer.
>>>
>>>
>>> Second: real-time streaming. I wrote some time ago about real-time
>>> streaming support in Spark Structured Streaming. Some work should be
>>> done to make SSS lower-latency, but I think it's possible. Maybe
>>> Spark could look at Gearpump, which is also built on top of Akka? I don't
>>> know yet; it is a good topic for a SIP. However, I think that Spark should
>>> have real-time streaming support. Currently I see many posts/comments
>>> saying that "Spark has too much latency". Spark Streaming is doing a very
>>> good job with micro-batches, however I think it is possible to also add
>>> more real-time processing.
>>>
>>> Other people said much more and I agree with the SIP proposal. I'm also
>>> happy that PMC members are not saying that they will not listen to users, but
>>> that they really want to make Spark better for every user.
>>>
>>>
>>> What do you think about these two topics? I'm especially looking at Cody
>>> (who started this topic) and the PMC members :)
>>>
>>> Best regards,
>>>
>>> Tomasz
>>>
>>>
>>> On 2016-10-07 at 04:51, Cody Koeninger wrote:
>>> > I love Spark.  3 or 4 years ago it was the first distributed computing
>>> > environment that felt usable, and the community was welcoming.
>>> >
>>> > But I just got back from the Reactive Summit, and this is what I
>>> > observed:
>>> >
>>> > - Industry leaders on stage making fun of Spark's streaming model
>>> > - Open source project leaders saying they looked at Spark's governance
>>> > as a model to avoid
>>> > - Users saying they chose Flink because it was technically superior
>>> > and they couldn't get any answers on the Spark mailing lists
>>> >
>>> > Whether you agree with the substance of any of this, when this stuff
>>> > gets repeated enough people will believe it.
>>> >
>>> > Right now Spark is suffering from its own success, and I think
>>> > something needs to change.
>>> >
>>> > - We need a clear process for planning significant changes to the
>>> > codebase.
>>> > I'm not saying you need to adopt Kafka Improvement Proposals exactly,
>>> > but you need a documented process with a clear outcome (e.g. a vote).
>>> > Passing around google docs after an implementation has largely been
>>> > decided on doesn't cut it.
>>> >
>>> > - All technical communication needs to be public.
>>> > Things getting decided in private chat, or when 1/3 of the committers
>>> > work for the same company and can just talk to each other...
>>> > Yes, it's convenient, but it's ultimately detrimental to the health of
>>> > the project.
>>> > The way structured streaming has played out has shown that there are
>>> > significant technical blind spots (myself included).
>>> > One way to address that is to get the people who have domain knowledge
>>> > involved, and listen to them.
>>> >
>>> > - We need more committers, and more committer diversity.
>>> > Per committer there are, what, more than 20 contributors and 10 new
>>> > jira tickets a month?  It's too much.
>>> > There are people (I am _not_ referring to myself) who have been around
>>> > for years, contributed thousands of lines of code, helped educate the
>>> > public around Spark... and yet are never going to be voted in.
>>> >
>>> > - We need a clear process for managing volunteer work.
>>> > Too many tickets sit around unowned, unclosed, uncertain.
>>> > If someone proposed something and it isn't up to snuff, tell them and
>>> > close it.  It may be blunt, but it's clearer than "silent no".
>>> > If someone wants to work on something, let them own the ticket and set
>>> > a deadline. If they don't meet it, close it or reassign it.
>>> >
>>> > This is not me putting on an Apache Bureaucracy hat.  This is me
>>> > saying, as a fellow hacker and loyal dissenter, something is wrong
>>> > with the culture and process.
>>> >
>>> > Please, let's change it.
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
>>> >
>>
>>
>
>


