From: gjhkael
To: reviews@spark.apache.org
Reply-To: reviews@spark.apache.org
Subject: [GitHub] spark pull request #22886: Hadoop config should overwrite by users conf
Content-Type: text/plain
Date: Tue, 30 Oct 2018 07:26:54 +0000 (UTC)

GitHub user gjhkael opened a pull request:

    https://github.com/apache/spark/pull/22886

    Hadoop config should overwrite by users conf

## What changes were proposed in this pull request?

Hadoop configuration that the user sets through Spark SQL's SET command should not be overwritten by the SparkContext configuration read from spark-defaults.conf.

## How was this patch tested?

Manually verified with 2.2.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gjhkael/spark hadoopConfigShouldOverwriteByUsersConf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22886.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22886

----

commit 0d4ef2f690e378cade0a3ec84d535a535dc20dfc
Author: WeichenXu
Date: 2017-08-28T06:41:42Z

[SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generate negative result

Because of numerical error, `MultivariateOnlineSummarizer.variance` can produce a negative variance. **This is a serious bug because many algorithms in MLlib use the standard deviation computed from `sqrt(variance)`; a negative variance yields NaN and crashes the whole algorithm.**

We can reproduce this bug with the following code:

```
val summarizer1 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.7)
val summarizer2 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.4)
val summarizer3 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.5)
val summarizer4 = (new MultivariateOnlineSummarizer)
  .add(Vectors.dense(3.0), 0.4)

val summarizer = summarizer1
  .merge(summarizer2)
  .merge(summarizer3)
  .merge(summarizer4)

println(summarizer.variance(0))
```

This PR fixes the bug in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, as well as several places in `WeightedLeastSquares`; test cases are added.

Author: WeichenXu

Closes #19029 from WeichenXu123/fix_summarizer_var_bug.

(cherry picked from commit 0456b4050817e64f27824720e695bbfff738d474)
Signed-off-by: Sean Owen
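A brief note on the failure mode described in the commit above: when accumulated floating-point rounding error pushes the computed variance slightly below zero, `math.sqrt` on the JVM returns NaN rather than raising an error, which is why downstream algorithms silently break. The standalone Scala sketch below only illustrates that effect and the general clamp-at-zero style of guard; it is not the actual patch, and all names in it are made up.

```scala
// Illustration only (not the Spark patch): a tiny negative variance produced by
// rounding error makes sqrt() return NaN, while clamping the value at zero keeps
// downstream stddev computations well defined.
object VarianceGuardSketch {
  def safeVariance(rawVariance: Double): Double =
    math.max(rawVariance, 0.0) // treat a small negative rounding error as zero

  def main(args: Array[String]): Unit = {
    val raw = -1.0e-16                    // stand-in for an accumulated rounding error
    println(math.sqrt(raw))               // prints NaN
    println(math.sqrt(safeVariance(raw))) // prints 0.0
  }
}
```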
commit 59bb7ebfb83c292cea853d6cd6fdf9748baa6ce2
Author: pgandhi
Date: 2017-08-28T13:51:22Z

[SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config for launching daemons like History Server

History Server launch uses SparkClassCommandBuilder for launching the server. It is observed that SPARK_CLASSPATH has been removed and deprecated. For spark-submit this takes a different route, and spark.driver.extraClasspath takes care of specifying additional jars in the classpath that were previously specified in SPARK_CLASSPATH. Right now the only way to specify additional jars for launching daemons such as the history server is SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html), but that, I presume, is a distribution classpath. It would be nice to have a config similar to spark.driver.extraClasspath for launching daemons such as the history server.

Added a new environment variable SPARK_DAEMON_CLASSPATH to set the classpath for launching daemons. Tested and verified for the History Server and Standalone mode.

## How was this patch tested?

Initially, the history server start script would fail because it could not find the required jars for launching the server on the Java classpath. The same was true for running the Master and Worker in standalone mode. After adding the environment variable SPARK_DAEMON_CLASSPATH to the Java classpath, both kinds of daemons (History Server, standalone daemons) start up and run.

Author: pgandhi
Author: pgandhi999

Closes #19047 from pgandhi999/master.

(cherry picked from commit 24e6c187fbaa6874eedbdda6b3b5dc6ff9e1de36)
Signed-off-by: Tom Graves

commit 59529b21a99f3c4db16b31da9dc7fce62349cf11
Author: jerryshao
Date: 2017-08-29T17:50:03Z

[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resources in yarn client mode

## What changes were proposed in this pull request?

This is a backport PR to fix the issue of re-uploading remote resources in YARN client mode. The original PR is #18962.

## How was this patch tested?

Tested in local UT.

Author: jerryshao

Closes #19074 from jerryshao/SPARK-21714-2.2-backport.

commit 917fe6635891ea76b22a3bcba282040afd14651d
Author: Marcelo Vanzin
Date: 2017-08-29T19:51:27Z

Revert "[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resources in yarn client mode"

This reverts commit 59529b21a99f3c4db16b31da9dc7fce62349cf11.

commit a6a9944140bbb336146d0d868429cb01839375c7
Author: Dmitry Parfenchik
Date: 2017-08-30T08:42:15Z

[SPARK-21254][WEBUI] History UI performance fixes

## This is a backport of PR #18783 to the latest released branch 2.2.

## What changes were proposed in this pull request?

As described in the JIRA ticket, the History page takes ~1 min to load when there are 10k+ jobs. Most of that time is currently spent on DOM manipulation and the additional costs it implies (browser repaints and reflows). The PR's goal is not to change any behavior but to optimize the History UI rendering time:

1. The most costly operation is setting `innerHTML` for the `duration` column within a loop, which is [extremely unperformant](https://jsperf.com/jquery-append-vs-html-list-performance/24). [Refactoring](https://github.com/criteo-forks/spark/commit/b7e56eef4d66af977bd05af58a81e14faf33c211) this helped to get page load time **down to 10-15s**.
2. The second big gain, bringing page load time **down to 4s**, [was achieved](https://github.com/criteo-forks/spark/commit/3630ca212baa94d60c5fe7e4109cf6da26288cec) by detaching the table's DOM before parsing it with the DataTables jQuery plugin.
3. Another chunk of improvements ([1](https://github.com/criteo-forks/spark/commit/aeeeeb520d156a7293a707aa6bc053a2f83b9ac2), [2](https://github.com/criteo-forks/spark/commit/e25be9a66b018ba0cc53884f242469b515cb2bf4), [3](https://github.com/criteo-forks/spark/commit/91697079a29138b7581e64f2aa79247fa1a4e4af)) was focused on removing unnecessary DOM manipulations that in total contributed ~250ms to page load time.

## How was this patch tested?

Tested by existing Selenium tests in `org.apache.spark.deploy.history.HistoryServerSuite`. Changes were also tested on Criteo's spark-2.1 fork with 20k+ rows in the table, reducing load time to 4s.

Author: Dmitry Parfenchik

Closes #18860 from 2ooom/history-ui-perf-fix-2.2.

commit d10c9dc3f631a26dbbbd8f5c601ca2001a5d7c80
Author: jerryshao
Date: 2017-08-30T19:30:24Z

[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resources in yarn client mode

## What changes were proposed in this pull request?

This is a backport PR to fix the issue of re-uploading remote resources in YARN client mode. The original PR is #18962.

## How was this patch tested?

Tested in local UT.

Author: jerryshao

Closes #19074 from jerryshao/SPARK-21714-2.2-backport.

commit 14054ffc5fd3399d04d69e26efb31d8b24b60bdc
Author: Sital Kedia
Date: 2017-08-30T21:19:13Z

[SPARK-21834] Incorrect executor request in case of dynamic allocation

## What changes were proposed in this pull request?

The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself.

## How was this patch tested?

Ran a job on the cluster and made sure the executor request is correct.

Author: Sital Kedia

Closes #19081 from sitalkedia/skedia/oss_fix_executor_allocation.

(cherry picked from commit 6949a9c5c6120fdde1b63876ede661adbd1eb15e)
Signed-off-by: Marcelo Vanzin

commit 50f86e1fe2aad67e4472b24d910ea519b9ad746f
Author: gatorsmile
Date: 2017-09-01T20:48:50Z

[SPARK-21884][SPARK-21477][BACKPORT-2.2][SQL] Mark LocalTableScanExec's input data transient

This PR backports https://github.com/apache/spark/pull/18686 to resolve the issue in https://github.com/apache/spark/pull/19094

---

## What changes were proposed in this pull request?

This PR marks the parameters `rows` and `unsafeRow` of LocalTableScanExec transient, which avoids serializing the unneeded objects.

## How was this patch tested?

N/A

Author: gatorsmile

Closes #19101 from gatorsmile/backport-21477.

commit fb1b5f08adaf4ec7c786b7a8b6283b62683f1324
Author: Sean Owen
Date: 2017-09-04T21:02:59Z

[SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true

## What changes were proposed in this pull request?

If no SparkConf is available to Utils.redact, simply don't redact.

## How was this patch tested?

Existing tests.

Author: Sean Owen

Closes #19123 from srowen/SPARK-21418.

(cherry picked from commit ca59445adb30ed796189532df2a2898ecd33db68)
Signed-off-by: Herman van Hovell
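For context on the one-line behavior change above ("if no SparkConf is available, simply don't redact"), the pattern amounts to treating the configuration as optional and passing the values through untouched when it is absent. The following self-contained Scala sketch shows that pattern with made-up names; it is not Spark's actual `Utils.redact` implementation.

```scala
// Standalone sketch of a "no conf, no redaction" guard; the names and the
// secret-key list are illustrative, not Spark's real implementation.
object RedactSketch {
  private val secretKeyHints = Seq("password", "secret", "token")

  // `conf` stands in for an optional SparkConf-like key/value map.
  def redact(conf: Option[Map[String, String]],
             kvs: Seq[(String, String)]): Seq[(String, String)] =
    conf match {
      case None => kvs // no configuration available: return the values unchanged
      case Some(_) =>
        kvs.map {
          case (k, _) if secretKeyHints.exists(h => k.toLowerCase.contains(h)) =>
            (k, "*********(redacted)")
          case other => other
        }
    }

  def main(args: Array[String]): Unit = {
    val props = Seq("spark.app.name" -> "demo", "spark.my.password" -> "hunter2")
    println(redact(None, props))            // unchanged: no conf to read patterns from
    println(redact(Some(Map.empty), props)) // the password entry is masked
  }
}
```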
commit 1f7c4869b811f9a05cd1fb54e168e739cde7933f
Author: Burak Yavuz
Date: 2017-09-05T20:10:32Z

[SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark 2.2

Forgot to update the docs with the behavior change.

Author: Burak Yavuz

Closes #19138 from brkyvz/trigger-doc-fix.

(cherry picked from commit 8c954d2cd10a2cf729d2971fbeb19b2dd751a178)
Signed-off-by: Tathagata Das

commit 7da8fbf08b492ae899bef5ea5a08e2bcf4c6db93
Author: Dongjoon Hyun
Date: 2017-09-05T21:35:09Z

[MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources

## What changes were proposed in this pull request?

All built-in data sources support `Partition Discovery`. We had better update the document to make this benefit clear to users.

**AFTER** (screenshot in the original PR)

## How was this patch tested?

```
SKIP_API=1 jekyll serve --watch
```

Author: Dongjoon Hyun

Closes #19139 from dongjoon-hyun/partitiondiscovery.

(cherry picked from commit 9e451bcf36151bf401f72dcd66001b9ceb079738)
Signed-off-by: gatorsmile

commit 9afab9a524c287a5c87c0ff54e5c1b757b32747c
Author: Riccardo Corbella
Date: 2017-09-06T07:22:57Z

[SPARK-21924][DOCS] Update structured streaming programming guide doc

## What changes were proposed in this pull request?

Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." as follows: "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." in the structured streaming programming guide.

Author: Riccardo Corbella

Closes #19137 from riccardocorbella/bugfix.

(cherry picked from commit 4ee7dfe41b27abbd4c32074ecc8f268f6193c3f4)
Signed-off-by: Sean Owen

commit 342cc2a4cad4b8491f4689b66570d14e5fcba33b
Author: Jacek Laskowski
Date: 2017-09-06T22:48:48Z

[SPARK-21901][SS] Define toString for StateOperatorProgress

## What changes were proposed in this pull request?

Just `StateOperatorProgress.toString` plus a few formatting fixes.

## How was this patch tested?

Local build. Waiting for OK from Jenkins.

Author: Jacek Laskowski

Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.

(cherry picked from commit fa0092bddf695a757f5ddaed539e55e2dc9fccb7)
Signed-off-by: Shixiong Zhu

commit 49968de526e76a75abafb636cbd5ed84f9a496e9
Author: Tucker Beck
Date: 2017-09-07T00:38:00Z

Fixed pandoc dependency issue in python/setup.py

## Problem Description

When pyspark is listed as a dependency of another package, installing the other package will cause an install failure in pyspark. When the other package is being installed, pyspark's setup_requires requirements are installed, including pypandoc. Thus, the exception handling on setup.py:152 does not work, because the pypandoc module is indeed available. However, the pypandoc.convert() function fails if pandoc itself is not installed (in our use cases it is not). This raises an OSError that is not handled, and setup fails.
The following is a sample failure:

```
$ which pandoc
$ pip freeze | grep pypandoc
pypandoc==1.4
$ pip install pyspark
Collecting pyspark
  Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
    100% |████████████████████████████████| 188.3MB 16.8MB/s
    Complete output from command python setup.py egg_info:
    Maybe try:
        sudo apt-get install pandoc
    See http://johnmacfarlane.net/pandoc/installing.html
    for installation options
    ---------------------------------------------------------------

    Traceback (most recent call last):
      File "", line 1, in
      File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in
        long_description = pypandoc.convert('README.md', 'rst')
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 69, in convert
        outputfile=outputfile, filters=filters)
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 260, in _convert_input
        _ensure_pandoc_path()
      File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py", line 544, in _ensure_pandoc_path
        raise OSError("No pandoc was found: either install pandoc and add it\n"
    OSError: No pandoc was found: either install pandoc and add it
    to your PATH or or call pypandoc.download_pandoc(...) or
    install pypandoc wheels with included pandoc.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/
```

## What changes were proposed in this pull request?

This change simply adds an additional exception handler for the OSError that is raised. This allows pyspark to be installed client-side without requiring pandoc to be installed.

## How was this patch tested?

I tested this by building a wheel package of pyspark with the change applied. Then, in a clean virtual environment with pypandoc installed but pandoc not available on the system, I installed pyspark from the wheel. Here is the output:

```
$ pip freeze | grep pypandoc
pypandoc==1.4
$ which pandoc
$ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages (from pyspark==2.3.0.dev0)
Installing collected packages: pyspark
Successfully installed pyspark-2.3.0.dev0
```

Author: Tucker Beck

Closes #18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.

(cherry picked from commit aad2125475dcdeb4a0410392b6706511db17bac4)
Signed-off-by: hyukjinkwon

commit 0848df1bb6f27fc7182e0e52efeef1407fd532d2
Author: Sanket Chintapalli
Date: 2017-09-07T17:20:39Z

[SPARK-21890] Credentials not being passed to add the tokens

## What changes were proposed in this pull request?

I observed this while running an Oozie job that tries to connect to HBase via Spark. It looks like the credentials are not being passed in https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release.

More info on why it fails on a secure grid: the Oozie client gets the necessary tokens the application needs before launching. It passes those tokens along to the Oozie launcher job (an MR job), which will then actually call the Spark client to launch the Spark app and pass the tokens along. The Oozie launcher job cannot get any more tokens, because all it has is tokens (you can't get tokens with tokens; you need a TGT or keytab).
The error here occurs because the launcher job runs the Spark client to submit the Spark job, but the Spark client doesn't see that it already has the HDFS tokens, so it tries to get more, which ends with the exception. There was a change in SPARK-19021 to generalize the HDFS credentials provider; it changed things so that the existing credentials are not passed into the call that gets tokens, so it doesn't realize it already has the necessary tokens.

https://issues.apache.org/jira/browse/SPARK-21890

Modified to pass the credentials when getting delegation tokens.

## How was this patch tested?

Manual testing on our secure cluster.

Author: Sanket Chintapalli

Closes #19103 from redsanket/SPARK-21890.

commit 4304d0bf05eb51c13ae1b9ee9a2970a945b51cac
Author: Takuya UESHIN
Date: 2017-09-08T05:26:07Z

[SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext.

## What changes were proposed in this pull request?

`pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in the test, which might affect the following tests. This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN

Closes #19158 from ueshin/issues/SPARK-21950.

(cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36)
Signed-off-by: Takuya UESHIN

commit 781a1f83c538a80ce1f1876e4786b02cb7984e16
Author: MarkTab marktab.net
Date: 2017-09-08T07:08:09Z

[SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing

dongjoon-hyun HyukjinKwon

There is an error in the PySpark example code /examples/src/main/python/ml/estimator_transformer_param_example.py. The original Scala code says

    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)

The parent is `lr`. There is no method for accessing `parent` as is done in Scala. This code has been tested in Python, and returns values consistent with Scala.

## What changes were proposed in this pull request?

Proposing to call the `lr` variable instead of `model1` or `model2`.

## How was this patch tested?

This patch was tested with Spark 2.1.0, comparing the Scala and PySpark results. PySpark currently returns nothing for those two print lines. The output for model2 in PySpark should be:

{Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction column name.'): 'prediction',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features column name.'): 'features',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column name.'): 'label',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'myProbability',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of family which is a description of the label distribution to be used in the model.
Supported options: auto, binomial, multinomial'): 'auto',
Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether to fit an intercept term.'): True,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match. e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.55,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number of iterations (>= 0).'): 30,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization parameter (>= 0).'): 0.1,
Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether to standardize the training features before fitting the model.'): True}

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: MarkTab marktab.net

Closes #19152 from marktab/branch-2.2.

commit 08cb06af20f87d40b78b521f82774cf1b6f9c80a
Author: Wenchen Fan
Date: 2017-09-08T16:35:41Z

[SPARK-21936][SQL][2.2] backward compatibility test framework for HiveExternalCatalog

Backport https://github.com/apache/spark/pull/19148 to 2.2.

Author: Wenchen Fan

Closes #19163 from cloud-fan/test.

commit 9ae7c96ce33d3d67f49059b5b83ef1d9d3d8e8e5
Author: Kazuaki Ishizaki
Date: 2017-09-08T16:39:20Z

[SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite

## What changes were proposed in this pull request?

This PR fixes the flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`. Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.

## How was this patch tested?

Uses the existing test case.

Author: Kazuaki Ishizaki

Closes #19159 from kiszk/SPARK-21946.

(cherry picked from commit 8a4f228dc0afed7992695486ecab6bc522f1e392)
Signed-off-by: gatorsmile

commit 9876821603ec12e77ee58e8ef6f5841c9c310c93
Author: hyukjinkwon
Date: 2017-09-08T16:47:45Z

[SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "metastore_db" before listing files in R tests

## What changes were proposed in this pull request?

This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of the R tests passes fine. This is sometimes a bit annoying.

## How was this patch tested?

Manually running the R tests multiple times via `./R/run-tests.sh`.

**Before**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
....................................................................................................1234.......................

Failed -------------------------------------------------------------------------
1. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

2. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3384)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

3. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
length(list1) not equal to length(list2).
1/1 mismatches
[1] 25 - 23 == 2

4. Failure: No extra files are created in SPARK_HOME by starting session and making calls (test_sparkSQL.R#3388)
sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
10/25 mismatches
x[16]: "metastore_db"
y[16]: "pkg"

x[17]: "pkg"
y[17]: "R"

x[18]: "R"
y[18]: "README.md"

x[19]: "README.md"
y[19]: "run-tests.sh"

x[20]: "run-tests.sh"
y[20]: "SparkR_2.2.0.tar.gz"

x[21]: "metastore_db"
y[21]: "pkg"

x[22]: "pkg"
y[22]: "R"

x[23]: "R"
y[23]: "README.md"

x[24]: "README.md"
y[24]: "run-tests.sh"

x[25]: "run-tests.sh"
y[25]: "SparkR_2.2.0.tar.gz"

DONE ===========================================================================
```

**After**

Second run:

```
SparkSQL functions: Spark package found in SPARK_HOME: .../spark
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................................................
...............................................................................................................................
```

Author: hyukjinkwon

Closes #18335 from HyukjinKwon/SPARK-21128.

Author: hyukjinkwon

Closes #19166 from felixcheung/rbackport21128.

commit 182478e030688b602bf95edfd82f700d6f5678d1
Author: Liang-Chi Hsieh
Date: 2017-09-09T10:10:52Z

[SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type

## What changes were proposed in this pull request?

`JacksonUtils.verifySchema` verifies whether a data type can be converted to JSON. For `MapType`, it currently verifies the key type.
However, in `JacksonGenerator`, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling `toString` on them. Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.

## How was this patch tested?

Added tests.

Author: Liang-Chi Hsieh

Closes #19167 from viirya/test-jacksonutils.

(cherry picked from commit 6b45d7e941eba8a36be26116787322d9e3ae25d0)
Signed-off-by: hyukjinkwon

commit b1b5a7fdc0f8fabfb235f0b31bde0f1bfb71591a
Author: Peter Szalai
Date: 2017-09-10T08:47:45Z

[SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?

The `typeName` classmethod has been fixed by using a type -> typeName map.

## How was this patch tested?

Local build.

Author: Peter Szalai

Closes #17435 from szalai1/datatype-gettype-fix.

(cherry picked from commit 520d92a191c3148498087d751aeeddd683055622)
Signed-off-by: hyukjinkwon

commit 10c68366e5474f131f7ea294e6abee4e02fca9f3
Author: FavioVazquez
Date: 2017-09-12T09:33:35Z

[SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.

## What changes were proposed in this pull request?

Fixed wrong documentation for Mean Absolute Error. The code is correct for the MAE:

```scala
@Since("1.2.0")
def meanAbsoluteError: Double = {
  summary.normL1(1) / summary.count
}
```

but in the documentation the division by N is missing.

## How was this patch tested?

All of the Spark tests were run.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: FavioVazquez
Author: faviovazquez
Author: Favio André Vázquez

Closes #19190 from FavioVazquez/mae-fix.

(cherry picked from commit e2ac2f1c71a0f8b03743d0d916dc0ef28482a393)
Signed-off-by: Sean Owen

commit 63098dc3170bf4289091d97b7beb63dd0e2356c5
Author: Kousuke Saruta
Date: 2017-09-12T14:07:04Z

[DOCS] Fix unreachable links in the document

## What changes were proposed in this pull request?

Recently, I found two unreachable links in the documentation and fixed them. Because these are small documentation-only changes, I didn't file a JIRA issue, but please let me know if you think I should.

## How was this patch tested?

Tested manually.

Author: Kousuke Saruta

Closes #19195 from sarutak/fix-unreachable-link.

(cherry picked from commit 957558235b7537c706c6ab4779655aa57838ebac)
Signed-off-by: Sean Owen

commit b606dc177e177bdbf99e72638eb8baec12e9fb53
Author: Zheng RuiFeng
Date: 2017-09-12T18:37:05Z

[SPARK-18608][ML] Fix double caching

## What changes were proposed in this pull request?

`df.rdd.getStorageLevel` => `df.storageLevel`

Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms involved in this issue are fixed.

Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014

## How was this patch tested?

Existing tests.

Author: Zheng RuiFeng

Closes #19197 from zhengruifeng/double_caching.

(cherry picked from commit c5f9b89dda40ffaa4622a7ba2b3d0605dbe815c0)
Signed-off-by: Joseph K. Bradley
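A note on why the `df.rdd.getStorageLevel` => `df.storageLevel` change above matters: `df.rdd` materializes a separate RDD whose storage level does not reflect whether the Dataset itself is cached, so a check based on it can conclude that already-cached input is uncached and cache it a second time. The sketch below shows the Dataset-level check; it assumes a running SparkSession, and the helper name is made up rather than taken from Spark's code.

```scala
// Sketch of a "cache only if the caller has not already cached" guard using the
// Dataset-level storage level. The helper name is illustrative.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.storage.StorageLevel

object CacheIfNeededSketch {
  // Returns true if this call performed the caching, so the caller knows to unpersist later.
  def cacheIfNeeded(df: DataFrame): Boolean =
    if (df.storageLevel == StorageLevel.NONE) { // the Dataset is not cached by the user
      df.persist(StorageLevel.MEMORY_AND_DISK)
      true
    } else {
      false // already cached; avoid caching the same data twice
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("x")
    println(cacheIfNeeded(df)) // true: this call cached the data
    println(cacheIfNeeded(df)) // false: the second call sees the existing cache
    spark.stop()
  }
}
```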
commit 3a692e355a786260c4a9c2ef210fe14e409af37a
Author: donnyzone
Date: 2017-09-13T17:06:53Z

[SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals

## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-21980

This PR fixes an issue in the ResolveGroupingAnalytics rule, which indexes the column references in grouping functions without considering the case-sensitivity configuration. The problem can be reproduced by:

```
val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
df.cube("a").agg(grouping("A")).show()
```

## How was this patch tested?

Unit tests.

Author: donnyzone

Closes #19202 from DonnyZone/ResolveGroupingAnalytics.

(cherry picked from commit 21c4450fb24635fab6481a3756fefa9c6f6d6235)
Signed-off-by: gatorsmile

commit 51e5a821dcaa1d5f529afafc88cb8cfb4ad48e09
Author: Yanbo Liang
Date: 2017-09-14T06:09:44Z

[SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.

## What changes were proposed in this pull request?

#19197 fixed double caching for MLlib algorithms but missed PySpark `OneVsRest`; this PR fixes it.

## How was this patch tested?

Existing tests.

Author: Yanbo Liang

Closes #19220 from yanboliang/SPARK-18608.

(cherry picked from commit c76153cc7dd25b8de5266fe119095066be7f78f5)
Signed-off-by: Yanbo Liang

commit 42852bb17121fb8067a4aea3e56d56f76a2e0d1d
Author: Andrew Ray
Date: 2017-09-17T17:46:27Z

[SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs

## What changes were proposed in this pull request?

(edited) Fixes a bug introduced in #16121. In PairDeserializer, convert each batch of keys and values to lists (if they do not already have `__len__`) so that we can check that they are the same size. Normally they are already lists, so this should not have a performance impact, but it is needed when repeated `zip`s are done.

## How was this patch tested?

Additional unit test.

Author: Andrew Ray

Closes #19226 from aray/SPARK-21985.

(cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038)
Signed-off-by: hyukjinkwon

commit 309c401a5b3c76cc1b6b5aef97d03034fe4e1ce4
Author: Andrew Ash
Date: 2017-09-18T02:42:24Z

[SPARK-21953] Show both memory and disk bytes spilled if either is present

As written now, there must be both memory and disk bytes spilled to show either of them. If only one of those types of spill is recorded, it will be hidden.

Author: Andrew Ash

Closes #19164 from ash211/patch-3.

(cherry picked from commit 6308c65f08b507408033da1f1658144ea8c1491f)
Signed-off-by: Wenchen Fan

commit a86831d618b05c789c2cea0afe5488c3234a14bc
Author: hyukjinkwon
Date: 2017-09-18T04:20:11Z

[SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles

## What changes were proposed in this pull request?

This PR proposes to improve the error message from:

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
    self.profiler_collector.show_profiles()
AttributeError: 'NoneType' object has no attribute 'show_profiles'
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
    self.profiler_collector.dump_profiles(path)
AttributeError: 'NoneType' object has no attribute 'dump_profiles'
```

to

```
>>> sc.show_profiles()
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
>>> sc.dump_profiles("/tmp/abc")
Traceback (most recent call last):
  File "", line 1, in
  File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
    raise RuntimeError("'spark.python.profile' configuration must be set "
RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python profile.
```

## How was this patch tested?

Unit tests added in `python/pyspark/tests.py` and manual tests.

Author: hyukjinkwon

Closes #19260 from HyukjinKwon/profile-errors.

(cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac)
Signed-off-by: hyukjinkwon

----

---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org