spark-dev mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject Re: [VOTE] Apache Spark 2.2.0 (RC1)
Date Wed, 03 May 2017 23:45:23 GMT
I'm going to -1 this given the number of small bug fixes that have gone
into the release branch.  I'll follow with another RC shortly.

On Tue, May 2, 2017 at 7:35 AM, Nick Pentreath <nick.pentreath@gmail.com>
wrote:

> I won't +1 just given that it seems certain there will be another RC and
> there are the outstanding ML QA blocker issues.
>
> But clean build and test for JVM and Python tests LGTM on CentOS Linux
> 7.2.1511, OpenJDK 1.8.0_111
>
>
> On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft <fnothaft@berkeley.edu>
> wrote:
>
>> Hi Ryan,
>>
>> IMO, the problem is that the Spark Avro version conflicts with the
>> Parquet Avro version. As discussed upthread, I don’t think there’s a way to
>> *reliably* make sure that Avro 1.8 is on the classpath first while using
>> spark-submit. Relocating Avro in our project wouldn’t solve the problem,
>> because the NoSuchMethodError is thrown from the internals of the
>> ParquetAvroOutputFormat, not from code in our project.
>>
>> Regards,
>>
>> Frank Austin Nothaft
>> fnothaft@berkeley.edu
>> fnothaft@eecs.berkeley.edu
>> 202-340-0466
>>
>> On May 1, 2017, at 12:33 PM, Ryan Blue <rblue@netflix.com> wrote:
>>
>> Michael, I think that the problem is with your classpath.
>>
>> Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is
>> what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime
>> dependency on Avro 1.8. It is understandably annoying that using the same
>> version of Parquet for your parquet-avro dependency is what causes your
>> project to depend on Avro 1.8, but Spark's dependencies aren't a problem
>> because its Parquet dependency doesn't bring in Avro.
>>
>> There are a few ways around this:
>> 1. Make sure Avro 1.8 is found in the classpath first
>> 2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
>> 3. Use parquet-avro 1.8.1 in your project, which I think should work with
>> Parquet 1.8.2 and avoid the Avro change
>>
>> The work-around in Spark is for tests, which do use parquet-avro. We can
>> look at a Parquet 1.8.3 that avoids this issue, but I think this is
>> reasonable for the 2.2.0 release.
>>
>> rb
>>
>> On Mon, May 1, 2017 at 12:08 PM, Michael Heuer <heuermh@gmail.com> wrote:
>>
>>> Please excuse me if I'm misunderstanding -- the problem is not with our
>>> library or our classpath.
>>>
>>> There is a conflict within Spark itself, in that Parquet 1.8.2 expects
>>> to find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead.  Spark
>>> already has to work around this for unit tests to pass.
>>>
>>>
>>>
>>> On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rblue@netflix.com> wrote:
>>>
>>>> Thanks for the extra context, Frank. I agree that it sounds like your
>>>> problem comes from the conflict between your Jars and what comes with
>>>> Spark. It's the same concern that makes everyone shudder when anything has
>>>> a public dependency on Jackson. :)
>>>>
>>>> What we usually do to get around situations like this is to relocate
>>>> the problem library inside the shaded Jar. That way, Spark uses its version
>>>> of Avro and your classes use a different version of Avro. This works if you
>>>> don't need to share classes between the two. Would that work for your
>>>> situation?
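
The relocation idea can be sketched roughly as follows. This only illustrates the kind of renaming a shading tool performs and is not the configuration of any particular build plugin; the `shaded.` prefix, the `relocate` helper, and the `com.example.app` class are hypothetical names chosen for the example.

```python
def relocate(class_name, pattern="org.apache.avro", shaded_prefix="shaded."):
    """Move a class under `pattern` into a private package; leave others alone."""
    if class_name == pattern or class_name.startswith(pattern + "."):
        return shaded_prefix + class_name
    return class_name


# Classes bundled in a hypothetical application jar before shading.
bundled = [
    "org.apache.avro.Schema",
    "org.apache.avro.generic.GenericRecord",
    "com.example.app.Main",
]

# After relocation the bundled Avro classes no longer share a name with the
# Avro classes that ship with Spark, so both copies can be loaded side by side.
relocated = [relocate(name) for name in bundled]

for before, after in zip(bundled, relocated):
    print(f"{before} -> {after}")
```

As noted in the message, this only helps when Avro classes do not need to be shared between the application and Spark.
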
>>>>
>>>> rb
>>>>
>>>> On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <koert@tresata.com>
>>>> wrote:
>>>>
>>>>> sounds like you are running into the fact that you cannot really put
>>>>> your classes before spark's on classpath? spark's switches to support
>>>>> this never really worked for me either.
>>>>>
>>>>> inability to control the classpath + inconsistent jars => trouble?
>>>>>
>>>>> On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <
>>>>> fnothaft@berkeley.edu> wrote:
>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>> We do set Avro to 1.8 in our downstream project. We also set Spark as
>>>>>> a provided dependency, and build an überjar. We run via spark-submit,
>>>>>> which builds the classpath with our überjar and all of the Spark deps.
>>>>>> This leads to Avro 1.7.7 getting picked off of the classpath at runtime,
>>>>>> which causes the NoSuchMethodError to occur.
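
A rough sketch of the lookup order being described, assuming a flat classpath that is searched front to back so that the first jar providing a class wins. The jar list, its ordering, and the helper names are hypothetical, and treating each jar as a single `artifact-version.jar` glosses over the überjar case, where Avro 1.8 classes sit inside the application jar.

```python
import re

# Hypothetical classpath: Spark's jars listed ahead of the application's.
CLASSPATH = [
    "spark-core_2.11-2.2.0.jar",
    "avro-1.7.7.jar",
    "parquet-avro-1.8.2.jar",
    "avro-1.8.0.jar",
]

JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d[\w.]*)\.jar$")


def parse_jar(name):
    """Split 'artifact-version.jar' into (artifact, version)."""
    match = JAR_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised jar name: {name}")
    return match.group("artifact"), match.group("version")


def resolve(classpath):
    """Return {artifact: version that is actually loaded} (first match wins)."""
    loaded = {}
    for jar in classpath:
        artifact, version = parse_jar(jar)
        loaded.setdefault(artifact, version)
    return loaded


def shadowed(classpath):
    """Return {artifact: [versions present on the classpath but never loaded]}."""
    loaded = resolve(classpath)
    hidden = {}
    for jar in classpath:
        artifact, version = parse_jar(jar)
        if version != loaded[artifact]:
            hidden.setdefault(artifact, []).append(version)
    return hidden


print(resolve(CLASSPATH)["avro"])
print(shadowed(CLASSPATH))
```

With this ordering the older Avro is the one that gets loaded and the newer one is shadowed, which matches the failure mode reported here: code compiled against the newer API finds the older classes at runtime.
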
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Frank Austin Nothaft
>>>>>> fnothaft@berkeley.edu
>>>>>> fnothaft@eecs.berkeley.edu
>>>>>> 202-340-0466
>>>>>>
>>>>>> On May 1, 2017, at 11:31 AM, Ryan Blue <rblue@netflix.com> wrote:
>>>>>>
>>>>>> Frank,
>>>>>>
>>>>>> The issue you're running into is caused by using parquet-avro with
>>>>>> Avro 1.7. Can't your downstream project set the Avro dependency to 1.8?
>>>>>> Spark can't update Avro because it is a breaking change that would force
>>>>>> users to rebuild specific Avro classes in some cases. But you should be
>>>>>> free to use Avro 1.8 to avoid the problem.
>>>>>>
>>>>>> On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <
>>>>>> fnothaft@berkeley.edu> wrote:
>>>>>>
>>>>>>> Hi Ryan et al,
>>>>>>>
>>>>>>> The issue we’ve seen using a build of the Spark 2.2.0 branch from a
>>>>>>> downstream project is that parquet-avro uses one of the new Avro 1.8.0
>>>>>>> methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a
>>>>>>> dependency. My colleague Michael (who posted earlier on this thread)
>>>>>>> documented this in SPARK-19697
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-19697>. I know that
>>>>>>> Spark has unit tests that check this compatibility issue, but it looks like
>>>>>>> there was a recent change that sets a test scope dependency on Avro
>>>>>>> 1.8.0
>>>>>>> <https://github.com/apache/spark/commit/0077bfcb93832d93009f73f4b80f2e3d98fd2fa4>,
>>>>>>> which masks this issue in the unit tests. With this error, you can’t use
>>>>>>> the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Frank Austin Nothaft
>>>>>>> fnothaft@berkeley.edu
>>>>>>> fnothaft@eecs.berkeley.edu
>>>>>>> 202-340-0466
>>>>>>>
>>>>>>> On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:
>>>>>>>
>>>>>>> I agree with Sean. Spark only pulls in parquet-avro for tests. For
>>>>>>> execution, it implements the record materialization APIs in Parquet to go
>>>>>>> directly to Spark SQL rows. This doesn't actually leak an Avro 1.8
>>>>>>> dependency into Spark as far as I can tell.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>> On Mon, May 1, 2017 at 8:34 AM, Sean Owen <sowen@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> See discussion at https://github.com/apache/spark/pull/17163 -- I
>>>>>>>> think the issue is that fixing this trades one problem for a slightly
>>>>>>>> bigger one.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heuermh@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Version 2.2.0 bumps the dependency version for parquet to 1.8.2
>>>>>>>>> but does not bump the dependency version for avro (currently at 1.7.7).
>>>>>>>>> Though perhaps not clear from the issue I reported [0], this means that
>>>>>>>>> Spark is internally inconsistent, in that a call through parquet (which
>>>>>>>>> depends on avro 1.8.0 [1]) may throw errors at runtime when it hits avro
>>>>>>>>> 1.7.7 on the classpath.  Avro 1.8.0 is not binary compatible with 1.7.7.
>>>>>>>>>
>>>>>>>>> [0] - https://issues.apache.org/jira/browse/SPARK-19697
>>>>>>>>> [1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96
>>>>>>>>>
>>>>>>>>> On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <sowen@cloudera.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have one more issue that, if it needs to be fixed, needs to be
>>>>>>>>>> fixed for 2.2.0.
>>>>>>>>>>
>>>>>>>>>> I'm fixing build warnings for the release and noticed that
>>>>>>>>>> checkstyle actually complains there are some Java methods named in
>>>>>>>>>> TitleCase, like `ProcessingTimeTimeout`:
>>>>>>>>>>
>>>>>>>>>> https://github.com/apache/spark/pull/17803/files#r113934080
>>>>>>>>>>
>>>>>>>>>> Easy enough to fix and it's right, that's not conventional.
>>>>>>>>>> However I wonder if it was done on purpose to match a class name?
>>>>>>>>>>
>>>>>>>>>> I think this is one for @tdas
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust
>>>>>>>>>> <michael@databricks.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Please vote on releasing the following candidate as Apache
>>>>>>>>>>> Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017
>>>>>>>>>>> at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>>>>>> cast.
>>>>>>>>>>>
>>>>>>>>>>> [ ] +1 Release this package as Apache Spark 2.2.0
>>>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>>>>> http://spark.apache.org/
>>>>>>>>>>>
>>>>>>>>>>> The tag to be voted on is v2.2.0-rc1
>>>>>>>>>>> <https://github.com/apache/spark/tree/v2.2.0-rc1>
>>>>>>>>>>> (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)
>>>>>>>>>>>
>>>>>>>>>>> List of JIRA tickets resolved can be found with this filter
>>>>>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>.
>>>>>>>>>>>
>>>>>>>>>>> The release files, including signatures, digests, etc. can be
>>>>>>>>>>> found at:
>>>>>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
>>>>>>>>>>>
>>>>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>>>>
>>>>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1235/
>>>>>>>>>>>
>>>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *FAQ*
>>>>>>>>>>>
>>>>>>>>>>> *How can I help test this release?*
>>>>>>>>>>>
>>>>>>>>>>> If you are a Spark user, you can help us test this release by
>>>>>>>>>>> taking an existing Spark workload and running on this release candidate,
>>>>>>>>>>> then reporting any regressions.
>>>>>>>>>>>
>>>>>>>>>>> *What should happen to JIRA tickets still targeting 2.2.0?*
>>>>>>>>>>>
>>>>>>>>>>> Committers should look at those and triage. Extremely important
>>>>>>>>>>> bug fixes, documentation, and API tweaks that impact compatibility should
>>>>>>>>>>> be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.
>>>>>>>>>>>
>>>>>>>>>>> *But my bug isn't fixed!??!*
>>>>>>>>>>>
>>>>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>>>>> release unless the bug in question is a regression from 2.1.1.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Netflix
