spark-dev mailing list archives

From Michael Armbrust <mich...@databricks.com>
Subject [VOTE] Release Apache Spark 1.6.0 (RC2)
Date Sat, 12 Dec 2015 17:39:21 GMT
Please vote on releasing the following candidate as Apache Spark version
1.6.0!

The vote is open until Tuesday, December 15, 2015 at 6:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.0
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see http://spark.apache.org/

The tag to be voted on is *v1.6.0-rc2
(23f8dfd45187cb8f2216328ab907ddb5fbdffd0b)
<https://github.com/apache/spark/tree/v1.6.0-rc2>*

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1169/

The test repository (versioned as v1.6.0-rc2) for this release can be found
at:
https://repository.apache.org/content/repositories/orgapachespark-1168/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc2-docs/

=======================================
== How can I help test this release? ==
=======================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload, running it on this release candidate, and
reporting any regressions.

================================================
== What justifies a -1 vote for this release? ==
================================================
This vote is happening towards the end of the 1.6 QA period, so -1 votes
should only occur for significant regressions from 1.5. Bugs already
present in 1.5, minor regressions, or bugs related to new features will not
block this release.

===============================================================
== What should happen to JIRA tickets still targeting 1.6.0? ==
===============================================================
1. It is OK for documentation patches to target 1.6.0 and still go into
branch-1.6, since documentation will be published separately from the
release.
2. New features for non-alpha-modules should target 1.7+.
3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
version.


==================================================
== Major changes to help you focus your testing ==
==================================================

Spark 1.6.0 Preview

Notable changes since 1.6 RC1

Spark Streaming

   - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629>
   trackStateByKey has been renamed to mapWithState

Spark SQL

   - SPARK-12165 <https://issues.apache.org/jira/browse/SPARK-12165>
   SPARK-12189 <https://issues.apache.org/jira/browse/SPARK-12189> Fix bugs
   in eviction of storage memory by execution.
   - SPARK-12258 <https://issues.apache.org/jira/browse/SPARK-12258> correct
   passing null into ScalaUDF

Notable Features Since 1.5

Spark SQL

   - SPARK-11787 <https://issues.apache.org/jira/browse/SPARK-11787> Parquet
   Performance - Improve Parquet scan performance when using flat schemas.
   - SPARK-10810 <https://issues.apache.org/jira/browse/SPARK-10810>
   Session Management - Isolated default database (i.e. USE mydb) even on
   shared clusters.
   - SPARK-9999  <https://issues.apache.org/jira/browse/SPARK-9999> Dataset
   API - A type-safe API (similar to RDDs) that performs many operations on
   serialized binary data and uses code generation (i.e. Project Tungsten).
   - SPARK-10000 <https://issues.apache.org/jira/browse/SPARK-10000> Unified
   Memory Management - Shared memory for execution and caching instead of
   exclusive division of the regions.
   - SPARK-11197 <https://issues.apache.org/jira/browse/SPARK-11197> SQL
   Queries on Files - Concise syntax for running SQL queries over files of
   any supported format without registering a table.
   - SPARK-11745 <https://issues.apache.org/jira/browse/SPARK-11745> Reading
   non-standard JSON files - Added options to read non-standard JSON files
   (e.g. single-quotes, unquoted attributes)
   - SPARK-10412 <https://issues.apache.org/jira/browse/SPARK-10412>
   Per-operator Metrics for SQL Execution - Display statistics on a
   per-operator basis for memory usage and spilled data size.
   - SPARK-11329 <https://issues.apache.org/jira/browse/SPARK-11329> Star
   (*) expansion for StructTypes - Makes it easier to nest and unnest
   arbitrary numbers of columns
   - SPARK-10917 <https://issues.apache.org/jira/browse/SPARK-10917>,
   SPARK-11149 <https://issues.apache.org/jira/browse/SPARK-11149> In-memory
   Columnar Cache Performance - Significant (up to 14x) speed up when
   caching data that contains complex types in DataFrames or SQL.
   - SPARK-11111 <https://issues.apache.org/jira/browse/SPARK-11111> Fast
   null-safe joins - Joins using null-safe equality (<=>) will now execute
   using SortMergeJoin instead of computing a Cartesian product.
   - SPARK-11389 <https://issues.apache.org/jira/browse/SPARK-11389> SQL
   Execution Using Off-Heap Memory - Support for configuring query
   execution to occur using off-heap memory to avoid GC overhead
   - SPARK-10978 <https://issues.apache.org/jira/browse/SPARK-10978> Datasource
   API Avoid Double Filter - When implementing a datasource with filter
   pushdown, developers can now tell Spark SQL to avoid double evaluating a
   pushed-down filter.
   - SPARK-4849  <https://issues.apache.org/jira/browse/SPARK-4849> Advanced
   Layout of Cached Data - storing partitioning and ordering schemes in
   In-memory table scan, and adding distributeBy and localSort to DF API
   - SPARK-9858  <https://issues.apache.org/jira/browse/SPARK-9858> Adaptive
   query execution - Initial support for automatically selecting the number
   of reducers for joins and aggregations.
   - SPARK-9241  <https://issues.apache.org/jira/browse/SPARK-9241> Improved
   query planner for queries having distinct aggregations - Query plans of
   distinct aggregations are more robust when distinct columns have high
   cardinality.
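
For anyone testing the null-safe join change (SPARK-11111), the
plain-Python sketch below contrasts null-safe equality (<=>) with ordinary
SQL equality. It models SQL NULL as Python None and only illustrates the
comparison semantics, not Spark's join implementation. The function names
are made up for this example.

```python
from typing import Any, Optional


def sql_equal(a: Any, b: Any) -> Optional[bool]:
    """Ordinary SQL '=': the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_equal(a: Any, b: Any) -> bool:
    """SQL '<=>': never returns NULL, and two NULLs compare as equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


print(sql_equal(None, None))        # None
print(null_safe_equal(None, None))  # True
print(null_safe_equal(None, 1))     # False
```

Because <=> always yields true or false, rows with NULL join keys can match
each other, which is the case worth covering when comparing results
against 1.5.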

Spark Streaming

   - API Updates
      - SPARK-2629  <https://issues.apache.org/jira/browse/SPARK-2629> New
      improved state management - mapWithState - a DStream transformation
      for stateful stream processing, supersedes updateStateByKey in
      functionality and performance.
      - SPARK-11198 <https://issues.apache.org/jira/browse/SPARK-11198> Kinesis
      record deaggregation - Kinesis streams have been upgraded to use KCL
      1.4.0 and support transparent deaggregation of KPL-aggregated records.
      - SPARK-10891 <https://issues.apache.org/jira/browse/SPARK-10891> Kinesis
      message handler function - Allows an arbitrary function to be applied
      to a Kinesis record in the Kinesis receiver to customize what data
      is stored in memory.
      - SPARK-6328  <https://issues.apache.org/jira/browse/SPARK-6328> Python
      Streaming Listener API - Get streaming statistics (scheduling delays,
      batch processing times, etc.) in streaming.


   - UI Improvements
      - Made failures visible in the streaming tab, in the timelines, batch
      list, and batch details page.
      - Made output operations visible in the streaming tab as progress
      bars.

MLlib

New algorithms/models

   - SPARK-8518  <https://issues.apache.org/jira/browse/SPARK-8518> Survival
   analysis - Log-linear model for survival analysis
   - SPARK-9834  <https://issues.apache.org/jira/browse/SPARK-9834> Normal
   equation for least squares - Normal equation solver, providing R-like
   model summary statistics
   - SPARK-3147  <https://issues.apache.org/jira/browse/SPARK-3147> Online
   hypothesis testing - A/B testing in the Spark Streaming framework
   - SPARK-9930  <https://issues.apache.org/jira/browse/SPARK-9930> New
   feature transformers - ChiSqSelector, QuantileDiscretizer, SQL
   transformer
   - SPARK-6517  <https://issues.apache.org/jira/browse/SPARK-6517> Bisecting
   K-Means clustering - Fast top-down clustering variant of K-Means

API improvements

   - ML Pipelines
      - SPARK-6725  <https://issues.apache.org/jira/browse/SPARK-6725> Pipeline
      persistence - Save/load for ML Pipelines, with partial coverage of
      spark.ml algorithms
      - SPARK-5565  <https://issues.apache.org/jira/browse/SPARK-5565> LDA
      in ML Pipelines - API for Latent Dirichlet Allocation in ML Pipelines
   - R API
      - SPARK-9836  <https://issues.apache.org/jira/browse/SPARK-9836> R-like
      statistics for GLMs - (Partial) R-like stats for ordinary least
      squares via summary(model)
      - SPARK-9681  <https://issues.apache.org/jira/browse/SPARK-9681> Feature
      interactions in R formula - Interaction operator ":" in R formula
   - Python API - Many improvements to Python API to approach feature parity

Misc improvements

   - SPARK-7685  <https://issues.apache.org/jira/browse/SPARK-7685>,
   SPARK-9642  <https://issues.apache.org/jira/browse/SPARK-9642> Instance
   weights for GLMs - Logistic and Linear Regression can take instance
   weights
   - SPARK-10384 <https://issues.apache.org/jira/browse/SPARK-10384>,
   SPARK-10385 <https://issues.apache.org/jira/browse/SPARK-10385> Univariate
   and bivariate statistics in DataFrames - Variance, stddev, correlations,
   etc.
   - SPARK-10117 <https://issues.apache.org/jira/browse/SPARK-10117> LIBSVM
   data source - LIBSVM as a SQL data source

Documentation improvements

   - SPARK-7751  <https://issues.apache.org/jira/browse/SPARK-7751> @since
   versions - Documentation includes initial version when classes and
   methods were added
   - SPARK-11337 <https://issues.apache.org/jira/browse/SPARK-11337> Testable
   example code - Automated testing for code in user guide examples

Deprecations

   - In spark.mllib.clustering.KMeans, the "runs" parameter has been
   deprecated.
   - In spark.ml.classification.LogisticRegressionModel and
   spark.ml.regression.LinearRegressionModel, the "weights" field has been
   deprecated in favor of the new name "coefficients". This helps
   disambiguate from instance (row) weights given to algorithms.

Changes of behavior

   - spark.mllib.tree.GradientBoostedTrees validationTol has changed
   semantics in 1.6. Previously, it was a threshold for absolute change in
   error. Now, it resembles the behavior of GradientDescent convergenceTol:
   For large errors, it uses relative error (relative to the previous error);
   for small errors (< 0.01), it uses absolute error.
   - spark.ml.feature.RegexTokenizer: Previously, it did not convert
   strings to lowercase before tokenizing. Now, it converts to lowercase by
   default, with an option not to. This matches the behavior of the simpler
   Tokenizer transformer.
   - Spark SQL's partition discovery has been changed to only discover
   partition directories that are children of the given path. (e.g. if
   path="/my/data/x=1", then x=1 will no longer be considered a partition;
   only children of x=1 will be.) This behavior can be overridden by
   manually specifying the basePath that partition discovery should start
   with (SPARK-11678 <https://issues.apache.org/jira/browse/SPARK-11678>).
   - When casting a value of an integral type to timestamp (e.g. casting a
   long value to timestamp), the value is treated as being in seconds instead
   of milliseconds (SPARK-11724
   <https://issues.apache.org/jira/browse/SPARK-11724>).
   - With the improved query planner for queries having distinct
   aggregations (SPARK-9241
   <https://issues.apache.org/jira/browse/SPARK-9241>), the plan of a query
   having a single distinct aggregation has been changed to a more robust
   version. To switch back to the plan generated by Spark 1.5's planner,
   please set spark.sql.specializeSingleDistinctAggPlanning to true (
   SPARK-12077 <https://issues.apache.org/jira/browse/SPARK-12077>).
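
To make three of the behavior changes above concrete when checking an
existing workload, here is a rough plain-Python sketch of the new
RegexTokenizer default, of partition discovery relative to a base path,
and of integral-to-timestamp casting. None of this is Spark code: the
function names, the to_lowercase and unit parameters, the whitespace
split pattern, and the example file path are all assumptions made for
illustration. Partition values are also kept as strings here, which is
a simplification.

```python
import re
from datetime import datetime, timezone


def regex_tokenize(text, pattern=r"\s+", to_lowercase=True):
    """Split text on a regex, lowercasing first unless told not to."""
    if to_lowercase:
        text = text.lower()
    return [token for token in re.split(pattern, text) if token]


def discover_partitions(file_path, base_path):
    """Collect key=value directory names found strictly below base_path."""
    base = base_path.rstrip("/")
    if not file_path.startswith(base + "/"):
        raise ValueError("file_path must be under base_path")
    relative = file_path[len(base) + 1:]
    directories = relative.split("/")[:-1]  # drop the file name itself
    partitions = {}
    for name in directories:
        if "=" in name:
            key, value = name.split("=", 1)
            partitions[key] = value
    return partitions


def cast_long_to_timestamp(value, unit="seconds"):
    """Interpret an integral value as time since the Unix epoch, in UTC."""
    seconds = value if unit == "seconds" else value / 1000.0
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


data_file = "/my/data/x=1/y=2/part-0.parquet"  # hypothetical file

print(regex_tokenize("Spark SQL"))                      # ['spark', 'sql']
print(regex_tokenize("Spark SQL", to_lowercase=False))  # ['Spark', 'SQL']
print(discover_partitions(data_file, "/my/data/x=1"))   # {'y': '2'}
print(discover_partitions(data_file, "/my/data"))       # {'x': '1', 'y': '2'}
print(cast_long_to_timestamp(86400))                    # one day after the epoch
print(cast_long_to_timestamp(86400, "milliseconds"))    # under 90 s after the epoch
```

The last two lines show why the casting change matters: the same long
value lands a full day after the epoch when read as seconds, but barely
past it when read as milliseconds.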
