spark-dev mailing list archives

From Eric Liang <...@databricks.com>
Subject Re: [VOTE] Apache Spark 2.1.1 (RC3)
Date Mon, 24 Apr 2017 19:18:36 GMT
-1 (non-binding)

I also agree with using NEVER_INFER for 2.1.1. The migration cost is
unexpected for a point release.
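
For anyone who wants to pin the setting themselves rather than rely on the shipped default, here is a minimal sketch in Python of how the entry could be rendered for spark-defaults.conf or for spark-submit. The helper names (check_mode, conf_line, submit_args) are hypothetical and only for illustration; the conf key and the three mode names are the ones Spark documents, and "--conf key=value" is the usual spark-submit form.

```python
from typing import List

# The key and the three mode names are Spark's; everything else here is a
# hypothetical helper written for illustration.
CONF_KEY = "spark.sql.hive.caseSensitiveInferenceMode"
VALID_MODES = ("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER")


def check_mode(mode: str) -> str:
    """Reject transposed spellings such as INFER_NEVER before they are written out."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    return mode


def conf_line(mode: str) -> str:
    """Render a spark-defaults.conf entry (key and value separated by whitespace)."""
    return f"{CONF_KEY} {check_mode(mode)}"


def submit_args(mode: str) -> List[str]:
    """Render the equivalent spark-submit arguments."""
    return ["--conf", f"{CONF_KEY}={check_mode(mode)}"]


if __name__ == "__main__":
    print(conf_line("NEVER_INFER"))
    print(" ".join(submit_args("NEVER_INFER")))
```

The validation step is there because the mode names are easy to transpose when typed by hand.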

On Mon, Apr 24, 2017 at 11:08 AM Holden Karau <holden@pigscanfly.ca> wrote:

> Whoops, sorry, my finger slipped on that last message.
> It sounds like whatever we do is going to break some existing users
> (either through the case sensitivity of their tables or through the
> unexpected scan).
>
> Personally I agree with Michael Allman on this, I believe we should
> use NEVER_INFER for 2.1.1.
>
> On Mon, Apr 24, 2017 at 11:01 AM, Holden Karau <holden@pigscanfly.ca>
> wrote:
>
>> It
>>
>> On Mon, Apr 24, 2017 at 10:33 AM, Michael Allman <michael@videoamp.com>
>> wrote:
>>
>>> The trouble we ran into is that this upgrade was blocking access to our
>>> tables, and we didn't know why. This sounds like a kind of migration
>>> operation, but it was not apparent that this was the case. It took an
>>> expert examining a stack trace and source code to figure this out. Would a
>>> more naive end user be able to debug this issue? Maybe we're an unusual
>>> case, but our particular experience was pretty bad. I have my doubts that
>>> the schema inference on our largest tables would ever complete without
>>> throwing some kind of timeout (which we were in fact receiving) or the end
>>> user just giving up and killing our job. We ended up doing a rollback while
>>> we investigated the source of the issue. In our case, NEVER_INFER is
>>> clearly the best configuration. We're going to add that to our default
>>> configuration files.
>>>
>>> My expectation is that a minor point release is a pretty safe bug fix
>>> release. We were a bit hasty in not doing better due diligence pre-upgrade.
>>>
>>> One suggestion the Spark team might consider is releasing 2.1.1 with
>>> NEVER_INFER and 2.2.0 with INFER_AND_SAVE. Clearly some kind of
>>> up-front migration notes would help in identifying this new behavior in 2.2.
>>>
>>> Thanks,
>>>
>>> Michael
>>>
>>>
>>> On Apr 24, 2017, at 2:09 AM, Wenchen Fan <wenchen@databricks.com> wrote:
>>>
>>> see https://issues.apache.org/jira/browse/SPARK-19611
>>>
>>> On Mon, Apr 24, 2017 at 2:22 PM, Holden Karau <holden@pigscanfly.ca>
>>> wrote:
>>>
>>>> What's the regression this fixed in 2.1 from 2.0?
>>>>
>>>> On Fri, Apr 21, 2017 at 7:45 PM, Wenchen Fan <wenchen@databricks.com>
>>>> wrote:
>>>>
>>>>> IIRC, the new "spark.sql.hive.caseSensitiveInferenceMode" stuff will
>>>>> scan all table files only once, and write back the inferred schema to
>>>>> the metastore so that we don't need to do the schema inference again.
>>>>>
>>>>> So technically this will introduce a performance regression for the
>>>>> first query, but compared to branch-2.0, it's not a performance
>>>>> regression. And this patch fixed a regression in branch-2.1 for queries
>>>>> that can run in branch-2.0. Personally, I think we should keep
>>>>> INFER_AND_SAVE as the default mode.
>>>>>
>>>>> + [Eric], what do you think?
>>>>>
>>>>> On Sat, Apr 22, 2017 at 1:37 AM, Michael Armbrust
>>>>> <michael@databricks.com> wrote:
>>>>>
>>>>>> Thanks for pointing this out, Michael.  Based on the conversation on
>>>>>> the PR
>>>>>> <https://github.com/apache/spark/pull/16944#issuecomment-285529275>
>>>>>> this seems like a risky change to include in a release branch with a
>>>>>> default other than NEVER_INFER.
>>>>>>
>>>>>> +Wenchen?  What do you think?
>>>>>>
>>>>>> On Thu, Apr 20, 2017 at 4:14 PM, Michael Allman <michael@videoamp.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We've identified the cause of the change in behavior. It is related
>>>>>>> to the SQL conf key "spark.sql.hive.caseSensitiveInferenceMode". This
>>>>>>> key and its related functionality were absent from our previous
>>>>>>> build. The default setting in the current build was causing Spark to
>>>>>>> attempt to scan all table files during query analysis. Changing this
>>>>>>> setting to NEVER_INFER disabled this operation and resolved the issue
>>>>>>> we had.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> On Apr 20, 2017, at 3:42 PM, Michael Allman <michael@videoamp.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I want to caution that in testing a build from this morning's
>>>>>>> branch-2.1 we found that Hive partition pruning was not working. We
>>>>>>> found that Spark SQL was fetching all Hive table partitions for a
>>>>>>> very simple query whereas in a build from several weeks ago it was
>>>>>>> fetching only the required partitions. I cannot currently think of a
>>>>>>> reason for the regression outside of some difference between
>>>>>>> branch-2.1 from our previous build and branch-2.1 from this morning.
>>>>>>>
>>>>>>> That's all I know right now. We are actively investigating to find
>>>>>>> the root cause of this problem, and specifically whether this is a
>>>>>>> problem in the Spark codebase or not. I will report back when I have
>>>>>>> an answer to that question.
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>>> On Apr 18, 2017, at 11:59 AM, Michael Armbrust
>>>>>>> <michael@databricks.com> wrote:
>>>>>>>
>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>> version 2.1.1. The vote is open until Fri, April 21st, 2017 at
>>>>>>> 13:00 PST and passes if a majority of at least 3 +1 PMC votes are
>>>>>>> cast.
>>>>>>>
>>>>>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>
>>>>>>>
>>>>>>> To learn more about Apache Spark, please see
>>>>>>> http://spark.apache.org/
>>>>>>>
>>>>>>> The tag to be voted on is v2.1.1-rc3
>>>>>>> <https://github.com/apache/spark/tree/v2.1.1-rc3> (
>>>>>>> 2ed19cff2f6ab79a718526e5d16633412d8c4dd4)
>>>>>>>
>>>>>>> List of JIRA tickets resolved can be found with this filter
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-20134?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.1.1>
>>>>>>> .
>>>>>>>
>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>> at:
>>>>>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-bin/
>>>>>>>
>>>>>>> Release artifacts are signed with the following key:
>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>
>>>>>>> The staging repository for this release can be found at:
>>>>>>>
>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1230/
>>>>>>>
>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>
>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc3-docs/
>>>>>>>
>>>>>>>
>>>>>>> *FAQ*
>>>>>>>
>>>>>>> *How can I help test this release?*
>>>>>>>
>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>> an existing Spark workload and running on this release candidate,
>>>>>>> then reporting any regressions.
>>>>>>>
>>>>>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>>>>>
>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>>>> be worked on immediately. Everything else please retarget to 2.1.2 or
>>>>>>> 2.2.0.
>>>>>>>
>>>>>>> *But my bug isn't fixed!??!*
>>>>>>>
>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>> release unless the bug in question is a regression from 2.1.0.
>>>>>>>
>>>>>>> *What happened to RC1?*
>>>>>>>
>>>>>>> There were issues with the release packaging and as a result it was
>>>>>>> skipped.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Cell : 425-233-8271
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>
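
As a rough mental model of the three inference modes discussed in this thread, the following Python sketch encodes what each mode does when the metastore has no case-sensitive schema for a table. It is a toy model based on the documented descriptions of the modes, not Spark's implementation. The names Action, action_for and scans_over_queries are hypothetical, and the sketch assumes that the write-back always succeeds and that every query starts without an in-session cached schema.

```python
from dataclasses import dataclass

MODES = ("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER")


@dataclass(frozen=True)
class Action:
    scan_files: bool  # infer the schema by reading the table's data files
    write_back: bool  # save the inferred schema to the metastore


def action_for(mode: str, metastore_has_case_sensitive_schema: bool) -> Action:
    """Toy decision table for one query against one Hive table."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}")
    if metastore_has_case_sensitive_schema or mode == "NEVER_INFER":
        # Nothing to infer, or inference is disabled: use the metastore schema.
        return Action(scan_files=False, write_back=False)
    return Action(scan_files=True, write_back=(mode == "INFER_AND_SAVE"))


def scans_over_queries(mode: str, n_queries: int) -> int:
    """Count file scans over n queries on a table that starts with no saved schema.

    Assumption: write-back always succeeds and nothing is cached between queries.
    """
    saved = False
    scans = 0
    for _ in range(n_queries):
        action = action_for(mode, saved)
        scans += int(action.scan_files)
        saved = saved or action.write_back
    return scans


if __name__ == "__main__":
    for mode in MODES:
        print(mode, scans_over_queries(mode, 3))
```

Under these assumptions INFER_AND_SAVE pays the scan once per table and NEVER_INFER never pays it, which is the trade-off the thread is weighing for a point release.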
