Subject: Re: [VOTE] Apache Spark 2.2.0 (RC1)
From: Frank Austin Nothaft <fnothaft@berkeley.edu>
Date: Mon, 1 May 2017 13:42:23 -0700
To: rblue@netflix.com
Cc: Michael Heuer, Koert Kuipers, Sean Owen, Michael Armbrust, dev@spark.apache.org

Hi Ryan,

IMO, the problem is that the Spark Avro version conflicts with the Parquet Avro version. As discussed upthread, I don't think there's a way to reliably make sure that Avro 1.8 is on the classpath first while using spark-submit. Relocating Avro in our project wouldn't solve the problem, because the NoSuchMethodError is thrown from the internals of the ParquetAvroOutputFormat, not from code in our project.

Regards,

Frank Austin Nothaft
202-340-0466

On May 1, 2017, at 12:33 PM, Ryan Blue <rblue@netflix.com> wrote:

Michael, I think that the problem is with your classpath.

Spark has a dependency on Avro 1.7.7, which can't be changed. Your project is what pulls in parquet-avro and transitively Avro 1.8. Spark has no runtime dependency on Avro 1.8. It is understandably annoying that using the same version of Parquet for your parquet-avro dependency is what causes your project to depend on Avro 1.8, but Spark's dependencies aren't a problem because its Parquet dependency doesn't bring in Avro.

There are a few ways around this:
1. Make sure Avro 1.8 is found in the classpath first
2. Shade Avro 1.8 in your project (assuming Avro classes aren't shared)
3. Use parquet-avro 1.8.1 in your project, which I think should work with 1.8.2 and avoid the Avro change

The work-around in Spark is for tests, which do use parquet-avro. We can look at a Parquet 1.8.3 that avoids this issue, but I think this is reasonable for the 2.2.0 release.
rb
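
For illustration, option 3 from the list above might look like the following in an sbt build (a sketch under that assumption; the coordinates are the ones discussed in this thread, not a recipe anyone here prescribed):

    // Stay on parquet-avro 1.8.1, which still targets the Avro 1.7.x API that
    // Spark 2.2.0 provides on the runtime classpath.
    libraryDependencies += "org.apache.parquet" % "parquet-avro" % "1.8.1"

    // If another dependency drags in Avro 1.8, pin Avro back to Spark's version
    // so the compile-time classpath matches what spark-submit supplies at runtime.
    dependencyOverrides += "org.apache.avro" % "avro" % "1.7.7"

A Maven build would express the same pins through dependencyManagement.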

On Mon, May 1, 2017 at 12:08 PM, Michael Heuer <heuermh@gmail.com> wrote:
Please excuse me if I'm misunderstanding -- the problem is not with our library or our classpath.

There is a conflict within Spark itself, in that Parquet 1.8.2 expects to find Avro 1.8.0 on the runtime classpath and sees 1.7.7 instead. Spark already has to work around this for unit tests to pass.



On Mon, May 1, 2017 at 2:00 PM, Ryan Blue <rblue@netflix.com> wrote:
Thanks for the extra context, Frank. I agree that it sounds like your problem comes from the conflict between your Jars and what comes with Spark. It's the same concern that makes everyone shudder when anything has a public dependency on Jackson. :)

What we usually do to get around situations like this is to relocate the problem library inside the shaded Jar. That way, Spark uses its version of Avro and your classes use a different version of Avro. This works if you don't need to share classes between the two. Would that work for your situation?

rb
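
The relocation Ryan describes might look like the following, assuming the überjar is built with sbt-assembly (a sketch only; a Maven build would use the maven-shade-plugin's relocation support instead):

    // Rewrite Avro's package names inside the überjar so the application's Avro 1.8
    // classes can no longer collide with the Avro 1.7.7 that ships with Spark.
    // Requires the sbt-assembly plugin in project/plugins.sbt.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("org.apache.avro.**" -> "myproject.shaded.avro.@1").inAll
    )

As Ryan notes, this only helps when Avro classes are not shared across the boundary, and Frank's reply at the top of the thread explains why it does not help in this particular case: the failing call happens inside parquet-avro itself, not in application code.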

On Mon, May 1, 2017 at 11:55 AM, Koert Kuipers <koert@tresata.com> wrote:
Sounds like you are running into the fact that you cannot really put your classes before Spark's on the classpath? Spark's switches to support this never really worked for me either.

inability to control the classpath + inconsistent jars => trouble?
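
The switches Koert refers to are presumably the spark.driver.userClassPathFirst and spark.executor.userClassPathFirst settings, both documented as experimental, which matches his experience that they do not always behave as hoped. A sketch of how they are supplied; the driver-side flag has to be set before the driver JVM starts, so in practice it goes through spark-submit or spark-defaults.conf rather than application code:

    import org.apache.spark.SparkConf

    // The executor-side flag can be set programmatically; the driver-side flag is
    // only honored if it reaches spark-submit, e.g.
    //   --conf spark.driver.userClassPathFirst=true
    //   --conf spark.executor.userClassPathFirst=true
    val conf = new SparkConf()
      .set("spark.executor.userClassPathFirst", "true") // executors try user jars first
      .set("spark.driver.userClassPathFirst", "true")   // too late if set here; pass via spark-submit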

On Mon, May 1, 2017 at 2:36 PM, Frank Austin Nothaft <fnothaft@berkeley.edu> wrote:
Hi Ryan,

We do set Avro to 1.8 in our downstream project. We also set Spark as a provided dependency, and build an überjar. We run via spark-submit, which builds the classpath with our überjar and all of the Spark deps. This leads to Avro 1.7.7 getting picked off of the classpath at runtime, which causes the NoSuchMethodError to occur.
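
To make the failure mode concrete, a hypothetical reconstruction of the setup Frank describes (sbt syntax assumed; the actual build is not shown in this thread):

    libraryDependencies ++= Seq(
      "org.apache.spark"   %% "spark-sql"    % "2.2.0" % Provided, // not packaged; supplied by spark-submit
      "org.apache.parquet" %  "parquet-avro" % "1.8.2",
      "org.apache.avro"    %  "avro"         % "1.8.0"             // wins at compile time, loses at runtime
    )

Because Spark is provided, the überjar carries Avro 1.8 classes, but spark-submit also puts Spark's own jars on the application classpath and by default those win, so Avro 1.7.7 is what parquet-avro finds when it calls the new 1.8.0 method.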

Regards,


On May 1, 2017, at 11:31 AM, Ryan Blue <rblue@netflix.com> wrote:

Frank,

The issue you're running into is caused by using parquet-avro with Avro 1.7. Can't your downstream project set the Avro dependency to 1.8? Spark can't update Avro because it is a breaking change that would force users to rebuild specific Avro classes in some cases. But you should be free to use Avro 1.8 to avoid the problem.

On Mon, May 1, 2017 at 11:08 AM, Frank Austin Nothaft <fnothaft@berkeley.edu> wrote:
Hi Ryan et al,

The issue we've seen using a build of the Spark 2.2.0 branch from a downstream project is that parquet-avro uses one of the new Avro 1.8.0 methods, and you get a NoSuchMethodError since Spark puts Avro 1.7.7 as a dependency. My colleague Michael (who posted earlier on this thread) documented this in SPARK-19697. I know that Spark has unit tests that check this compatibility issue, but it looks like there was a recent change that sets a test-scope dependency on Avro 1.8.0, which masks this issue in the unit tests. With this error, you can't use the ParquetAvroOutputFormat from an application running on Spark 2.2.0.
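
A small diagnostic, not from the thread, that can confirm which Avro jar actually wins inside a running driver or executor by printing where org.apache.avro.Schema was loaded from:

    // Under spark-submit this typically points at Spark's avro-1.7.7 jar rather than
    // the 1.8.0 copy bundled in the application überjar. getCodeSource can be null
    // for classes loaded by the bootstrap classloader.
    val avroJar = Option(classOf[org.apache.avro.Schema].getProtectionDomain.getCodeSource)
      .map(_.getLocation.toString)
      .getOrElse("unknown (no code source)")
    println(s"Avro loaded from: $avroJar")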

Regards,


On May 1, 2017, at 10:02 AM, Ryan Blue <rblue@netflix.com.INVALID> wrote:

I agree with Sean. Spark only pulls in parquet-avro for tests. For execution, it implements the record materialization APIs in Parquet to go directly to Spark SQL rows. This doesn't actually leak an Avro 1.8 dependency into Spark as far as I can tell.

rb

On Mon, May 1, 2017 at 8:34 AM, Sean Owen <sowen@cloudera.com> wrote:
See discussion at https://github.com/apache/spark/pull/17163 -- I think the issue is that fixing this trades one problem for a slightly bigger one.


On Mon, May 1, 2017 at 4:13 PM Michael Heuer <heuermh@gmail.com> wrote:
Version 2.2.0 bumps the dependency version for parquet to 1.8.2 but does not bump the dependency version for avro (currently at 1.7.7). Though perhaps not clear from the issue I reported [0], this means that Spark is internally inconsistent, in that a call through parquet (which depends on avro 1.8.0 [1]) may throw errors at runtime when it hits avro 1.7.7 on the classpath. Avro 1.8.0 is not binary compatible with 1.7.7.

[0] - https://issues.apache.org/jira/browse/SPARK-19697
[1] - https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/pom.xml#L96

On Sun, Apr 30, 2017 at 3:28 AM, Sean Owen <sowen@cloudera.com> wrote:
I have one more issue that, if it needs to be fixed, needs to be fixed for 2.2.0.

I'm fixing build warnings for the release and noticed that checkstyle actually complains there are some Java methods named in TitleCase, like `ProcessingTimeTimeout`:

https://github.com/apache/spark/pull/17803/files#r113934080

Easy enough to fix and it's right, that's not conventional. However I wonder if it was done on purpose to match a class name?

I think this is one for @tdas

On Thu, Apr 27, 2017 at 7:31 PM Michael Armbrust <michael@databricks.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 2.2.0. The vote is open until Tues, May 2nd, 2017 at 12:00 PST and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.2.0
[ ] -1 Do not release this package because ...


To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is v2.2.0-rc1 (8ccb4a57c82146c1a8f8966c7e64010cf5632cb6)

List of JIRA tickets resolved can be found with this filter.

The release files, including signatures, digests, etc. can be found at:
http://home.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-bin/
Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc
The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1235/
The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.2.0-rc1-docs/

FAQ

How can I help test this release?

If you are a Spark user, you can help us test this release by taking an existing Spark workload and running it on this release candidate, then reporting any regressions.

What should happen to JIRA tickets still targeting 2.2.0?

Committers should look at those and triage. Extremely important bug fixes, documentation, and API tweaks that impact compatibility should be worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1.

But my bug isn't fixed!??!

In order to make timely releases, we will typically not hold the release unless the bug in question is a regression from 2.1.1.




--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix





--
Ryan Blue
Software Engineer
Netflix




--
Ryan Blue
Software Engineer
Netflix
