spark-dev mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: [VOTE] Release Apache Spark 1.5.0 (RC2)
Date Tue, 01 Sep 2015 14:17:00 GMT
Any 1.5 RC comes from the latest state of the 1.5 branch at some point
in time. The next RC will be cut from whatever the latest commit is.
You can see the tags in git for the specific commits for each RC.
There's no such thing as "1.5.1-SNAPSHOT" commits, just commits to
branch-1.5. I would ignore the "SNAPSHOT" version for your purposes.

You can always build from the exact commit that an RC used by looking
at its tag. There is no 1.5.0 yet, so you can't build that, but once
it's released, you will be able to find its tag as well. You can always
build the latest state of the 1.5.x branch by building from its HEAD.
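
The two options above can be sketched as shell commands. This is a minimal sketch: the tag name v1.5.0-rc2 and branch name branch-1.5 follow Spark's naming conventions, and the Maven profiles shown are just one common build configuration, not the only one.

```shell
# List the RC tags to find the exact commit each candidate was cut from.
git ls-remote --tags https://github.com/apache/spark.git "v1.5.0-rc*"

# Option 1: build the exact source of a release candidate from its tag.
git clone https://github.com/apache/spark.git
cd spark
git checkout v1.5.0-rc2
build/mvn -Pyarn -Phadoop-2.6 -DskipTests clean package

# Option 2: build the latest state of the maintenance branch from its HEAD.
git checkout branch-1.5
build/mvn -Pyarn -Phadoop-2.6 -DskipTests clean package
```

Checking out a tag puts git in a detached-HEAD state, which is fine for a one-off build; the tag pins you to the exact commit the RC vote was held on.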

On Tue, Sep 1, 2015 at 3:13 PM,  <chester@alpinenow.com> wrote:
> Thanks for the explanation. Since 1.5.0 RC3 is not yet released, I assume it
> would be cut from the 1.5 branch; doesn't that bring in 1.5.1-SNAPSHOT code?
>
> The reason I am asking these questions is that I would like to know: if I want
> to build 1.5.0 myself, which commit should I use?
>
> Sent from my iPad
>
>> On Sep 1, 2015, at 6:57 AM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> The head of branch 1.5 will always be a "1.5.x-SNAPSHOT" version. Yeah
>> technically you would expect it to be 1.5.0-SNAPSHOT until 1.5.0 is
>> released. In practice I think it's simpler to follow the defaults of
>> the Maven release plugin, which will set this to 1.5.1-SNAPSHOT after
>> any 1.5.0-rc is released. It doesn't affect later RCs. This has
>> nothing to do with what commits go into 1.5.0; it's an ignorable
>> detail of the version in POMs in the source tree, which don't mean
>> much anyway as the source tree itself is not a released version.
>>
>>> On Tue, Sep 1, 2015 at 2:48 PM,  <chester@alpinenow.com> wrote:
>>> Sorry, I am still not following. I assume the release would be built from
>>> 1.5.0 before moving to 1.5.1. Are you saying that 1.5.0 RC3 could be built
>>> from the 1.5.1-SNAPSHOT during the release? Or that 1.5.0 RC3 would be built
>>> from the last commit of 1.5.0 (before the change to 1.5.1-SNAPSHOT)?
>>>
>>>
>>>
>>> Sent from my iPad
>>>
>>>> On Sep 1, 2015, at 1:52 AM, Sean Owen <sowen@cloudera.com> wrote:
>>>>
>>>> That's correct for the 1.5 branch, right? This doesn't mean that the
>>>> next RC would have this value. You choose the release version during
>>>> the release process.
>>>>
>>>>> On Tue, Sep 1, 2015 at 2:40 AM, Chester Chen <chester@alpinenow.com> wrote:
>>>>> Seems that GitHub branch-1.5 has already changed the version to 1.5.1-SNAPSHOT.
>>>>>
>>>>> I am a bit confused: are we still on 1.5.0 RC3, or are we on 1.5.1?
>>>>>
>>>>> Chester
>>>>>
>>>>>> On Mon, Aug 31, 2015 at 3:52 PM, Reynold Xin <rxin@databricks.com> wrote:
>>>>>>
>>>>>> I'm going to -1 the release myself since the issue @yhuai identified is
>>>>>> pretty serious. It basically OOMs the driver for reading any files with
>>>>>> a large number of partitions. Looks like the patch for that has already
>>>>>> been merged.
>>>>>>
>>>>>> I'm going to cut rc3 momentarily.
>>>>>>
>>>>>>
>>>>>> On Sun, Aug 30, 2015 at 11:30 AM, Sandy Ryza <sandy.ryza@cloudera.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>> built from source and ran some jobs against YARN
>>>>>>>
>>>>>>> -Sandy
>>>>>>>
>>>>>>> On Sat, Aug 29, 2015 at 5:50 AM, vaquar khan <vaquar.khan@gmail.com>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> +1 (1.5.0 RC2). Compiled on Windows with YARN.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Vaquar khan
>>>>>>>>
>>>>>>>> +1 (non-binding, of course)
>>>>>>>>
>>>>>>>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 42:36 min
>>>>>>>>    mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>>>>>> 2. Tested pyspark, mllib
>>>>>>>> 2.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>>>>>> 2.3. Decision Tree, Naive Bayes OK
>>>>>>>> 2.4. KMeans OK
>>>>>>>>      Center And Scale OK
>>>>>>>> 2.5. RDD operations OK
>>>>>>>>     State of the Union Texts - MapReduce, Filter, sortByKey (word
>>>>>>>> count)
>>>>>>>> 2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>>>      Model evaluation/optimization (rank, numIter, lambda) with
>>>>>>>> itertools OK
>>>>>>>> 3. Scala - MLlib
>>>>>>>> 3.1. statistics (min,max,mean,Pearson,Spearman) OK
>>>>>>>> 3.2. LinearRegressionWithSGD OK
>>>>>>>> 3.3. Decision Tree OK
>>>>>>>> 3.4. KMeans OK
>>>>>>>> 3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK
>>>>>>>> 3.6. saveAsParquetFile OK
>>>>>>>> 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
>>>>>>>> registerTempTable, sql OK
>>>>>>>> 3.8. result = sqlContext.sql("SELECT
>>>>>>>> OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders
>>>>>>>> INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>>>>>> 4.0. Spark SQL from Python OK
>>>>>>>> 4.1. result = sqlContext.sql("SELECT * from people WHERE State = 'WA'")
>>>>>>>> OK
>>>>>>>> 5.0. Packages
>>>>>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>>>>> (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn't work,
>>>>>>>> but com.databricks:spark-csv_2.11:1.2.0 worked)
>>>>>>>> 6.0. DataFrames
>>>>>>>> 6.1. cast,dtypes OK
>>>>>>>> 6.2. groupBy,avg,crosstab,corr,isNull,na.drop OK
>>>>>>>> 6.3. joins,sql,set operations,udf OK
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> <k/>
>>>>>>>>
>>>>>>>> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin <rxin@databricks.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>>> version 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC
>>>>>>>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>>>>>>
>>>>>>>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>>
>>>>>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The tag to be voted on is v1.5.0-rc2:
>>>>>>>>>
>>>>>>>>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>>>>>>>>
>>>>>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>>>>>>>>
>>>>>>>>> Release artifacts are signed with the following key:
>>>>>>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>>>>>>
>>>>>>>>> The staging repository for this release (published as 1.5.0-rc2) can
>>>>>>>>> be found at:
>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>>>>>>>>
>>>>>>>>> The staging repository for this release (published as 1.5.0) can be
>>>>>>>>> found at:
>>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>>>>>>>>
>>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> =======================================
>>>>>>>>> How can I help test this release?
>>>>>>>>> =======================================
>>>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>>>> an existing Spark workload, running it on this release candidate, and
>>>>>>>>> reporting any regressions.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ================================================
>>>>>>>>> What justifies a -1 vote for this release?
>>>>>>>>> ================================================
>>>>>>>>> This vote is happening towards the end of the 1.5 QA period, so -1
>>>>>>>>> votes should only occur for significant regressions from 1.4. Bugs
>>>>>>>>> already present in 1.4, minor regressions, or bugs related to new
>>>>>>>>> features will not block this release.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ===============================================================
>>>>>>>>> What should happen to JIRA tickets still targeting 1.5.0?
>>>>>>>>> ===============================================================
>>>>>>>>> 1. It is OK for documentation patches to target 1.5.0 and still go
>>>>>>>>> into branch-1.5, since documentation will be packaged separately from
>>>>>>>>> the release.
>>>>>>>>> 2. New features for non-alpha modules should target 1.6+.
>>>>>>>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>>>>>>>> target version.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> ==================================================
>>>>>>>>> Major changes to help you focus your testing
>>>>>>>>> ==================================================
>>>>>>>>>
>>>>>>>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>>>>>>>> contributors. I've curated a list of important changes for 1.5. For
>>>>>>>>> the complete list, please refer to the Apache JIRA changelog.
>>>>>>>>>
>>>>>>>>> RDD/DataFrame/SQL APIs
>>>>>>>>>
>>>>>>>>> - New UDAF interface
>>>>>>>>> - DataFrame hints for broadcast join
>>>>>>>>> - expr function for turning a SQL expression into a DataFrame column
>>>>>>>>> - Improved support for NaN values
>>>>>>>>> - StructType now supports ordering
>>>>>>>>> - TimestampType precision is reduced to 1us
>>>>>>>>> - 100 new built-in expressions, including date/time, string, and math
>>>>>>>>> - Memory and local-disk-only checkpointing
>>>>>>>>>
>>>>>>>>> DataFrame/SQL Backend Execution
>>>>>>>>>
>>>>>>>>> - Code generation on by default
>>>>>>>>> - Improved join, aggregation, shuffle, and sorting with cache-friendly
>>>>>>>>> and external algorithms
>>>>>>>>> - Improved window function performance
>>>>>>>>> - Better metrics instrumentation and reporting for DF/SQL execution
>>>>>>>>> plans
>>>>>>>>>
>>>>>>>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>>>>>>>
>>>>>>>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>>>>>>>> Standalone)
>>>>>>>>> - Improved Mesos support (framework authentication, roles, dynamic
>>>>>>>>> allocation, constraints)
>>>>>>>>> - Improved YARN support (dynamic allocation with preferred locations)
>>>>>>>>> - Improved Hive support (metastore partition pruning, metastore
>>>>>>>>> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
>>>>>>>>> - Support persisting data in Hive compatible format in metastore
>>>>>>>>> - Support data partitioning for JSON data sources
>>>>>>>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>>>>>>>> metadata discovery and schema merging, support reading non-standard
>>>>>>>>> legacy Parquet files generated by other libraries)
>>>>>>>>> - Faster and more robust dynamic partition insert
>>>>>>>>> - DataSourceRegister interface for external data sources to specify
>>>>>>>>> short names
>>>>>>>>>
>>>>>>>>> SparkR
>>>>>>>>>
>>>>>>>>> - YARN cluster mode in R
>>>>>>>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>>>>>>>> regularization
>>>>>>>>> - Improved error messages
>>>>>>>>> - Aliases to make DataFrame functions more R-like
>>>>>>>>>
>>>>>>>>> Streaming
>>>>>>>>>
>>>>>>>>> - Backpressure for handling bursty input streams
>>>>>>>>> - Improved Python support for streaming sources (Kafka offsets,
>>>>>>>>> Kinesis, MQTT, Flume)
>>>>>>>>> - Improved Python streaming machine learning algorithms (K-Means,
>>>>>>>>> linear regression, logistic regression)
>>>>>>>>> - Native reliable Kinesis stream support
>>>>>>>>> - Input metadata like Kafka offsets made visible in the batch details
>>>>>>>>> UI
>>>>>>>>> - Better load balancing and scheduling of receivers across the cluster
>>>>>>>>> - Include streaming storage in the web UI
>>>>>>>>>
>>>>>>>>> Machine Learning and Advanced Analytics
>>>>>>>>>
>>>>>>>>> - Feature transformers: CountVectorizer, Discrete Cosine
>>>>>>>>> transformation, MinMaxScaler, NGram, PCA, RFormula, StopWordsRemover,
>>>>>>>>> and VectorSlicer.
>>>>>>>>> - Estimators under pipeline APIs: naive Bayes, k-means, and isotonic
>>>>>>>>> regression.
>>>>>>>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>>>>>>>> sequential pattern mining, association rule generation, 1-sample
>>>>>>>>> Kolmogorov-Smirnov test.
>>>>>>>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>>>>>>>> - More efficient Pregel API implementation for GraphX
>>>>>>>>> - Model summary for linear and logistic regression.
>>>>>>>>> - Python API: distributed matrices, streaming k-means and linear
>>>>>>>>> models, LDA, power iteration clustering, etc.
>>>>>>>>> - Tuning and evaluation: train-validation split and multiclass
>>>>>>>>> classification evaluator.
>>>>>>>>> - Documentation: document the release version of public API methods
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

