Subject: Re: [VOTE] Release Apache Spark 1.5.0 (RC3)
From: Davies Liu
To: Krishna Sankar
Cc: Yin Huai, Tom Graves, Reynold Xin, dev@spark.apache.org
Date: Fri, 4 Sep 2015 21:57:34 -0700

Could you update the notebook to use the built-in SQL functions month and
year instead of the Python UDFs? (They were introduced in 1.5.) Once those
two UDFs are removed, the notebook runs successfully, and much faster.
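[Editor's note: a minimal sketch of the kind of change Davies suggests,
replacing per-row Python UDFs with the month()/year() expressions added to
pyspark.sql.functions in 1.5. It assumes a live sqlContext as in the PySpark
shell; the orders DataFrame, its OrderDate column, and the CSV path are
hypothetical stand-ins for the notebook's actual data.]

    # Sketch only: swap Python UDFs for the built-in month()/year()
    # expressions (new in Spark 1.5). "orders", "OrderDate", and the CSV
    # path are hypothetical stand-ins for the notebook's data.
    from pyspark.sql.functions import month, year

    # Hypothetical load via spark-csv, as in the test report below.
    orders = (sqlContext.read
              .format("com.databricks.spark.csv")
              .options(header="true", inferSchema="true")
              .load("orders.csv"))

    # Before: a Python UDF such as udf(lambda d: d.year, IntegerType())
    # forces every row through the Python worker. After: built-in
    # expressions are evaluated inside the JVM.
    orders_3 = (orders
                .withColumn("Year", year(orders["OrderDate"]))
                .withColumn("Month", month(orders["OrderDate"])))
    orders_3.groupBy("Year", "Month").sum("Total").show()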
On Fri, Sep 4, 2015 at 2:22 PM, Krishna Sankar wrote:
> Yin,
>   It is the
> https://github.com/xsankar/global-bd-conf/blob/master/004-Orders.ipynb.
> Cheers
>
> On Fri, Sep 4, 2015 at 9:58 AM, Yin Huai wrote:
>>
>> Hi Krishna,
>>
>> Can you share your code to reproduce the memory allocation issue?
>>
>> Thanks,
>>
>> Yin
>>
>> On Fri, Sep 4, 2015 at 8:00 AM, Krishna Sankar wrote:
>>>
>>> Thanks Tom. Interestingly, it happened between RC2 and RC3.
>>> Now my vote is +1/2 unless the memory error is known and has a
>>> workaround.
>>>
>>> Cheers
>>>
>>> On Fri, Sep 4, 2015 at 7:30 AM, Tom Graves wrote:
>>>>
>>>> The upper/lower-case thing is known:
>>>> https://issues.apache.org/jira/browse/SPARK-9550
>>>> I assume it was decided to be OK and it is going to be in the release
>>>> notes, but Reynold or Josh can probably speak to it more.
>>>>
>>>> Tom
>>>>
>>>> On Thursday, September 3, 2015 10:21 PM, Krishna Sankar wrote:
>>>>
>>>> +?
>>>>
>>>> 1. Compiled on OS X 10.10 (Yosemite) OK. Total time: 26:09 min
>>>>      mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
>>>> 2. Tested pyspark, mllib
>>>> 2.1. statistics (min, max, mean, Pearson, Spearman) OK
>>>> 2.2. Linear/Ridge/Lasso Regression OK
>>>> 2.3. Decision Tree, Naive Bayes OK
>>>> 2.4. KMeans OK
>>>>      Center and Scale OK
>>>> 2.5. RDD operations OK
>>>>      State of the Union Texts - MapReduce, Filter, sortByKey (word count)
>>>> 2.6. Recommendation (MovieLens medium dataset, ~1M ratings) OK
>>>>      Model evaluation/optimization (rank, numIter, lambda) with
>>>>      itertools OK
>>>> 3. Scala - MLlib
>>>> 3.1. statistics (min, max, mean, Pearson, Spearman) OK
>>>> 3.2. LinearRegressionWithSGD OK
>>>> 3.3. Decision Tree OK
>>>> 3.4. KMeans OK
>>>> 3.5. Recommendation (MovieLens medium dataset, ~1M ratings) OK
>>>> 3.6. saveAsParquetFile OK
>>>> 3.7. Read and verify the 3.6 save (above) - sqlContext.parquetFile,
>>>>      registerTempTable, sql OK
>>>> 3.8. result = sqlContext.sql("SELECT
>>>>      OrderDetails.OrderID,ShipCountry,UnitPrice,Qty,Discount FROM Orders
>>>>      INNER JOIN OrderDetails ON Orders.OrderID = OrderDetails.OrderID") OK
>>>> 4.0. Spark SQL from Python OK
>>>> 4.1. result = sqlContext.sql("SELECT * FROM people WHERE State = 'WA'") OK
>>>> 5.0. Packages
>>>> 5.1. com.databricks.spark.csv - read/write OK
>>>>      (--packages com.databricks:spark-csv_2.11:1.2.0-s_2.11 didn't work,
>>>>      but com.databricks:spark-csv_2.11:1.2.0 did)
>>>> 6.0. DataFrames
>>>> 6.1. cast, dtypes OK
>>>> 6.2. groupBy, avg, crosstab, corr, isNull, na.drop OK
>>>> 6.3. All joins, sql, set operations, udf OK
>>>>
>>>> Two problems:
>>>>
>>>> 1. The synthetic column names are now lowercase (e.g. now
>>>>    'sum(OrderPrice)' where previously 'SUM(OrderPrice)'; now 'avg(Total)'
>>>>    where previously 'AVG(Total)'), so programs that depend on the case of
>>>>    the synthetic column names would fail.
>>>> 2. orders_3.groupBy("Year","Month").sum('Total').show()
>>>>    fails with the error 'java.io.IOException: Unable to acquire 4194304
>>>>    bytes of memory'.
>>>>    orders_3.groupBy("CustomerID","Year").sum('Total').show() fails
>>>>    with the same error.
>>>>    Is this a known bug?
>>>> Cheers
>>>>
>>>> P.S.: Sorry for the spam, forgot Reply All
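[Editor's note: one way to insulate notebook code from problem 1 above (the
SPARK-9550 lowercasing of generated aggregate names) is to alias aggregates
explicitly rather than rely on the synthetic name. A hypothetical sketch
against the same orders_3 frame from the report; the alias "TotalSales" is
an arbitrary choice.]

    # Sketch: name aggregates explicitly so code never depends on the
    # auto-generated (and, per SPARK-9550, now lowercased) column names.
    # "orders_3" and its columns are the hypothetical frame from above.
    from pyspark.sql import functions as F

    totals = (orders_3
              .groupBy("Year", "Month")
              .agg(F.sum("Total").alias("TotalSales")))  # stable name
    totals.select("Year", "Month", "TotalSales").show()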
>>>> On Tue, Sep 1, 2015 at 1:41 PM, Reynold Xin wrote:
>>>>
>>>> Please vote on releasing the following candidate as Apache Spark
>>>> version 1.5.0. The vote is open until Friday, Sep 4, 2015 at 21:00 UTC
>>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 1.5.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v1.5.0-rc3:
>>>> https://github.com/apache/spark/commit/908e37bcc10132bb2aa7f80ae694a9df6e40f31a
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-bin/
>>>>
>>>> Release artifacts are signed with the following key:
>>>> https://people.apache.org/keys/committer/pwendell.asc
>>>>
>>>> The staging repository for this release (published as 1.5.0-rc3) can be
>>>> found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1143/
>>>>
>>>> The staging repository for this release (published as 1.5.0) can be
>>>> found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1142/
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/
>>>>
>>>> ========================================
>>>> How can I help test this release?
>>>> ========================================
>>>> If you are a Spark user, you can help us test this release by taking an
>>>> existing Spark workload, running it on this release candidate, and then
>>>> reporting any regressions.
>>>>
>>>> ================================================
>>>> What justifies a -1 vote for this release?
>>>> ================================================
>>>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>>>> should only occur for significant regressions from 1.4. Bugs already
>>>> present in 1.4, minor regressions, and bugs related to new features will
>>>> not block this release.
>>>>
>>>> ================================================================
>>>> What should happen to JIRA tickets still targeting 1.5.0?
>>>> ================================================================
>>>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>>>>    branch-1.5, since documentation will be packaged separately from the
>>>>    release.
>>>> 2. New features for non-alpha modules should target 1.6+.
>>>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the
>>>>    target version.
>>>>
>>>> ==================================================
>>>> Major changes to help you focus your testing
>>>> ==================================================
>>>>
>>>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>>>> contributors. I've curated a list of important changes for 1.5. For the
>>>> complete list, please refer to the Apache JIRA changelog.
>>>>
>>>> RDD/DataFrame/SQL APIs
>>>>
>>>> - New UDAF interface
>>>> - DataFrame hints for broadcast join
>>>> - expr function for turning a SQL expression into a DataFrame column
>>>> - Improved support for NaN values
>>>> - StructType now supports ordering
>>>> - TimestampType precision is reduced to 1us
>>>> - 100 new built-in expressions, including date/time, string, math
>>>> - Memory- and local-disk-only checkpointing
>>>>
>>>> DataFrame/SQL Backend Execution
>>>>
>>>> - Code generation on by default
>>>> - Improved join, aggregation, shuffle, and sorting with cache-friendly
>>>>   and external algorithms
>>>> - Improved window function performance
>>>> - Better metrics instrumentation and reporting for DF/SQL execution
>>>>   plans
>>>>
>>>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>>>
>>>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>>>>   Standalone)
>>>> - Improved Mesos support (framework authentication, roles, dynamic
>>>>   allocation, constraints)
>>>> - Improved YARN support (dynamic allocation with preferred locations)
>>>> - Improved Hive support (metastore partition pruning, metastore
>>>>   connectivity to Hive 0.13 through 1.2, internal Hive upgrade to 1.2)
>>>> - Support for persisting data in a Hive-compatible format in the
>>>>   metastore
>>>> - Support for data partitioning for JSON data sources
>>>> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
>>>>   metadata discovery and schema merging, support for reading
>>>>   non-standard legacy Parquet files generated by other libraries)
>>>> - Faster and more robust dynamic partition insert
>>>> - DataSourceRegister interface for external data sources to specify
>>>>   short names
>>>>
>>>> SparkR
>>>>
>>>> - YARN cluster mode in R
>>>> - GLMs with R formula, binomial/Gaussian families, and elastic-net
>>>>   regularization
>>>> - Improved error messages
>>>> - Aliases to make DataFrame functions more R-like
>>>>
>>>> Streaming
>>>>
>>>> - Backpressure for handling bursty input streams (see the sketch after
>>>>   this list)
>>>> - Improved Python support for streaming sources (Kafka offsets,
>>>>   Kinesis, MQTT, Flume)
>>>> - Improved Python streaming machine learning algorithms (K-Means,
>>>>   linear regression, logistic regression)
>>>> - Native reliable Kinesis stream support
>>>> - Input metadata such as Kafka offsets made visible in the batch
>>>>   details UI
>>>> - Better load balancing and scheduling of receivers across the cluster
>>>> - Streaming storage included in the web UI
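[Editor's note: a minimal, hedged sketch of turning on the new backpressure
mechanism from PySpark. The spark.streaming.backpressure.enabled key is the
1.5 switch for it; the app name and 10-second batch interval are arbitrary.]

    # Sketch: enable the 1.5 streaming backpressure so receiving rates
    # adapt to bursty input. App name and batch interval are arbitrary.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("backpressure-demo")
            .set("spark.streaming.backpressure.enabled", "true"))  # new in 1.5
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)  # 10-second batches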
>>>>
>>>> Machine Learning and Advanced Analytics
>>>>
>>>> - Feature transformers: CountVectorizer, Discrete Cosine
>>>>   transformation, MinMaxScaler, NGram, PCA, RFormula,
>>>>   StopWordsRemover, and VectorSlicer
>>>> - Estimators under the pipeline API: naive Bayes, k-means, and
>>>>   isotonic regression
>>>> - Algorithms: multilayer perceptron classifier, PrefixSpan for
>>>>   sequential pattern mining, association rule generation, 1-sample
>>>>   Kolmogorov-Smirnov test
>>>> - Improvements to existing algorithms: LDA, trees/ensembles, GMMs
>>>> - More efficient Pregel API implementation for GraphX
>>>> - Model summaries for linear and logistic regression
>>>> - Python API: distributed matrices, streaming k-means and linear
>>>>   models, LDA, power iteration clustering, etc.
>>>> - Tuning and evaluation: train-validation split and multiclass
>>>>   classification evaluator
>>>> - Documentation: document the release version of public API methods

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org