spark-reviews mailing list archives

From gjhkael <...@git.apache.org>
Subject [GitHub] spark pull request #22886: Hadoop config should be overwritten by user's conf
Date Tue, 30 Oct 2018 07:26:54 GMT
GitHub user gjhkael opened a pull request:

    https://github.com/apache/spark/pull/22886

    Hadoop config should be overwritten by user's conf

    ## What changes were proposed in this pull request?
    Hadoop conf that is set by the user via Spark SQL's SET command should not be overwritten by the SparkContext's conf, which is read from spark-defaults.conf.
    
    
    ## How was this patch tested?
    Manually verified with 2.2.0


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gjhkael/spark hadoopConfigShouldOverwriteByUsersConf

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22886.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22886
    
----
commit 0d4ef2f690e378cade0a3ec84d535a535dc20dfc
Author: WeichenXu <weichenxu123@...>
Date:   2017-08-28T06:41:42Z

    [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generating negative results
    
    Because of numerical error, MultivariateOnlineSummarizer.variance can generate a negative variance.
    
    **This is a serious bug: many algorithms in MLlib use the stddev computed from `sqrt(variance)`, and it will generate NaN and crash the whole algorithm.**
    
    We can reproduce this bug with the following code:
    ```scala
        val summarizer1 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.7)
        val summarizer2 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
        val summarizer3 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.5)
        val summarizer4 = (new MultivariateOnlineSummarizer)
          .add(Vectors.dense(3.0), 0.4)
    
        val summarizer = summarizer1
          .merge(summarizer2)
          .merge(summarizer3)
          .merge(summarizer4)
    
        println(summarizer.variance(0))
    ```
    This PR fixes the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and in several places in `WeightedLeastSquares`.
    
    test cases added.
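The numerical issue and the guard can be sketched outside Spark in plain Python (a minimal illustration, not Spark's actual implementation, which reworks the Scala summarizers):

```python
import math

def naive_weighted_variance(values, weights):
    # Textbook formula E[x^2] - E[x]^2 with weights: numerically fragile,
    # since two nearly equal quantities are subtracted.
    total_weight = sum(weights)
    mean = sum(v * w for v, w in zip(values, weights)) / total_weight
    mean_of_squares = sum(v * v * w for v, w in zip(values, weights)) / total_weight
    return mean_of_squares - mean * mean

def safe_variance(values, weights):
    # The guard: clamp tiny negative rounding errors to zero so that
    # sqrt(variance) can never produce NaN downstream.
    return max(0.0, naive_weighted_variance(values, weights))

# Same constant data and weights as the Spark repro above: true variance is 0.
values, weights = [3.0] * 4, [0.7, 0.4, 0.5, 0.4]
raw = naive_weighted_variance(values, weights)   # ~0, may be slightly negative
std = math.sqrt(safe_variance(values, weights))  # always well-defined
```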
    
    Author: WeichenXu <WeichenXu123@outlook.com>
    
    Closes #19029 from WeichenXu123/fix_summarizer_var_bug.
    
    (cherry picked from commit 0456b4050817e64f27824720e695bbfff738d474)
    Signed-off-by: Sean Owen <sowen@cloudera.com>

commit 59bb7ebfb83c292cea853d6cd6fdf9748baa6ce2
Author: pgandhi <pgandhi@...>
Date:   2017-08-28T13:51:22Z

    [SPARK-21798] No config to replace deprecated SPARK_CLASSPATH config for launching daemons
like History Server
    
    The History Server launch uses SparkClassCommandBuilder to launch the server. SPARK_CLASSPATH has been deprecated and removed. For spark-submit this takes a different route, and spark.driver.extraClassPath takes care of specifying additional jars in the classpath that were previously specified in SPARK_CLASSPATH. Right now the only way to specify additional jars for launching daemons such as the History Server is SPARK_DIST_CLASSPATH (https://spark.apache.org/docs/latest/hadoop-provided.html), but that, I presume, is a distribution classpath. It would be nice to have a config like spark.driver.extraClassPath for launching daemons such as the History Server.
    
    Added a new environment variable, SPARK_DAEMON_CLASSPATH, to set the classpath for launching daemons. Tested and verified for the History Server and Standalone mode.
    
    ## How was this patch tested?
    Initially, the History Server start script would fail because it could not find the required jars for launching the server in the java classpath. The same was true for running the Master and Worker in standalone mode. After adding the environment variable SPARK_DAEMON_CLASSPATH to the java classpath, both sets of daemons (History Server, Standalone daemons) start up and run.
    
    Author: pgandhi <pgandhi@yahoo-inc.com>
    Author: pgandhi999 <parthkgandhi9@gmail.com>
    
    Closes #19047 from pgandhi999/master.
    
    (cherry picked from commit 24e6c187fbaa6874eedbdda6b3b5dc6ff9e1de36)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

commit 59529b21a99f3c4db16b31da9dc7fce62349cf11
Author: jerryshao <sshao@...>
Date:   2017-08-29T17:50:03Z

    [SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resources in yarn client
mode
    
    ## What changes were proposed in this pull request?
    
    This is a backport PR to fix issue of re-uploading remote resource in yarn client mode.
The original PR is #18962.
    
    ## How was this patch tested?
    
    Tested in local UT.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #19074 from jerryshao/SPARK-21714-2.2-backport.

commit 917fe6635891ea76b22a3bcba282040afd14651d
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-08-29T19:51:27Z

    Revert "[SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resources in yarn
client mode"
    
    This reverts commit 59529b21a99f3c4db16b31da9dc7fce62349cf11.

commit a6a9944140bbb336146d0d868429cb01839375c7
Author: Dmitry Parfenchik <d.parfenchik@...>
Date:   2017-08-30T08:42:15Z

    [SPARK-21254][WEBUI] History UI performance fixes
    
    ## This is a backport of PR #18783 to the latest released branch 2.2.
    
    ## What changes were proposed in this pull request?
    
    As described in the JIRA ticket, the History page takes ~1 min to load when there are 10k+ jobs.
    Most of the time is currently spent on DOM manipulation and the additional costs it implies (browser repaints and reflows).
    The PR's goal is not to change any behavior but to optimize the History UI's rendering time:
    
    1. The most costly operation is setting `innerHTML` for `duration` column within a loop,
which is [extremely unperformant](https://jsperf.com/jquery-append-vs-html-list-performance/24).
[Refactoring ](https://github.com/criteo-forks/spark/commit/b7e56eef4d66af977bd05af58a81e14faf33c211)
this helped to get page load time **down to 10-15s**
    
    2. The second big gain, bringing page load time **down to 4s**, [was achieved](https://github.com/criteo-forks/spark/commit/3630ca212baa94d60c5fe7e4109cf6da26288cec) by detaching the table's DOM before parsing it with the DataTables jQuery plugin.
    
    3. Another chunk of improvements ([1](https://github.com/criteo-forks/spark/commit/aeeeeb520d156a7293a707aa6bc053a2f83b9ac2), [2](https://github.com/criteo-forks/spark/commit/e25be9a66b018ba0cc53884f242469b515cb2bf4), [3](https://github.com/criteo-forks/spark/commit/91697079a29138b7581e64f2aa79247fa1a4e4af)) was focused on removing unnecessary DOM manipulations that in total contributed ~250ms to page load time.
    
    ## How was this patch tested?
    
    Tested by existing Selenium tests in `org.apache.spark.deploy.history.HistoryServerSuite`.
    
    Changes were also tested on Criteo's spark-2.1 fork with 20k+ rows in the table, reducing load time to 4s.
    
    Author: Dmitry Parfenchik <d.parfenchik@criteo.com>
    
    Closes #18860 from 2ooom/history-ui-perf-fix-2.2.

commit d10c9dc3f631a26dbbbd8f5c601ca2001a5d7c80
Author: jerryshao <sshao@...>
Date:   2017-08-30T19:30:24Z

    [SPARK-21714][CORE][BACKPORT-2.2] Avoiding re-uploading remote resources in yarn client
mode
    
    ## What changes were proposed in this pull request?
    
    This is a backport PR to fix issue of re-uploading remote resource in yarn client mode.
The original PR is #18962.
    
    ## How was this patch tested?
    
    Tested in local UT.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #19074 from jerryshao/SPARK-21714-2.2-backport.

commit 14054ffc5fd3399d04d69e26efb31d8b24b60bdc
Author: Sital Kedia <skedia@...>
Date:   2017-08-30T21:19:13Z

    [SPARK-21834] Incorrect executor request in case of dynamic allocation
    
    ## What changes were proposed in this pull request?
    
    The killExecutor API currently does not allow killing an executor without updating the total number of executors needed. When dynamic allocation is turned on and the allocator tries to kill an executor, the scheduler reduces the total number of executors needed (see https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L635), which is incorrect because the allocator already takes care of setting the required number of executors itself.
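A toy model of the bookkeeping described above (hypothetical Python; Spark's real `killExecutors` is Scala code with an `adjustTargetNumExecutors` flag):

```python
class SchedulerBackend:
    """Toy model of the executor-kill bookkeeping; names are illustrative."""

    def __init__(self, target_num_executors):
        self.target = target_num_executors

    def kill_executors(self, executor_ids, adjust_target=True):
        # When the dynamic-allocation manager initiates the kill, it has
        # already lowered its own target, so the scheduler must not
        # decrement the target a second time (the bug being fixed here).
        if adjust_target:
            self.target -= len(executor_ids)
        return list(executor_ids)

backend = SchedulerBackend(target_num_executors=10)
backend.kill_executors(["exec-1"], adjust_target=False)  # allocator-driven kill
# backend.target stays at 10: the allocator owns the target count.
```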
    
    ## How was this patch tested?
    
    Ran a job on the cluster and made sure the executor request is correct
    
    Author: Sital Kedia <skedia@fb.com>
    
    Closes #19081 from sitalkedia/skedia/oss_fix_executor_allocation.
    
    (cherry picked from commit 6949a9c5c6120fdde1b63876ede661adbd1eb15e)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

commit 50f86e1fe2aad67e4472b24d910ea519b9ad746f
Author: gatorsmile <gatorsmile@...>
Date:   2017-09-01T20:48:50Z

    [SPARK-21884][SPARK-21477][BACKPORT-2.2][SQL] Mark LocalTableScanExec's input data transient
    
    This PR is to backport https://github.com/apache/spark/pull/18686 for resolving the issue
in https://github.com/apache/spark/pull/19094
    
    ---
    
    ## What changes were proposed in this pull request?
    This PR is to mark the parameters `rows` and `unsafeRow` of LocalTableScanExec as transient, which avoids serializing unneeded objects.
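The idea translates to Python roughly as follows (a sketch of keeping a rebuildable field out of serialization; not Spark's Scala code, which uses the `@transient` annotation):

```python
import pickle

class LocalTableScan:
    """Toy plan node: `rows` is large and only needed where it was built."""

    def __init__(self, rows):
        self.rows = rows              # heavyweight, rebuildable
        self.num_rows = len(rows)     # cheap metadata, safe to ship

    def __getstate__(self):
        # The "transient" marker: drop the bulky field when serializing.
        state = self.__dict__.copy()
        state["rows"] = None
        return state

plan = LocalTableScan(list(range(1000)))
clone = pickle.loads(pickle.dumps(plan))  # metadata survives, bulk does not
```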
    
    ## How was this patch tested?
    N/A
    
    Author: gatorsmile <gatorsmile@gmail.com>
    
    Closes #19101 from gatorsmile/backport-21477.

commit fb1b5f08adaf4ec7c786b7a8b6283b62683f1324
Author: Sean Owen <sowen@...>
Date:   2017-09-04T21:02:59Z

    [SPARK-21418][SQL] NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true
    
    ## What changes were proposed in this pull request?
    
    If no SparkConf is available to Utils.redact, simply don't redact.
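The guard amounts to roughly the following (illustrative Python sketch, not Spark's Scala `Utils.redact`; the regex parameter stands in for the conf's redaction pattern):

```python
import re

REDACTED = "*********(redacted)"

def redact(redaction_regex, kv_pairs):
    # With no conf available there is no redaction pattern to consult,
    # so pass the pairs through unchanged instead of failing.
    if redaction_regex is None:
        return kv_pairs
    return [(k, REDACTED if re.search(redaction_regex, k) else v)
            for k, v in kv_pairs]

pairs = [("spark.hadoop.secret.key", "hunter2"), ("spark.app.name", "demo")]
no_conf = redact(None, pairs)        # untouched
with_conf = redact("secret", pairs)  # secret key's value is masked
```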
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #19123 from srowen/SPARK-21418.
    
    (cherry picked from commit ca59445adb30ed796189532df2a2898ecd33db68)
    Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

commit 1f7c4869b811f9a05cd1fb54e168e739cde7933f
Author: Burak Yavuz <brkyvz@...>
Date:   2017-09-05T20:10:32Z

    [SPARK-21925] Update trigger interval documentation in docs with behavior change in Spark
2.2
    
    Forgot to update docs with behavior change.
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #19138 from brkyvz/trigger-doc-fix.
    
    (cherry picked from commit 8c954d2cd10a2cf729d2971fbeb19b2dd751a178)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

commit 7da8fbf08b492ae899bef5ea5a08e2bcf4c6db93
Author: Dongjoon Hyun <dongjoon@...>
Date:   2017-09-05T21:35:09Z

    [MINOR][DOC] Update `Partition Discovery` section to enumerate all available file sources
    
    ## What changes were proposed in this pull request?
    
    All built-in data sources support `Partition Discovery`. We should update the document to state this clearly for users.
    
    **AFTER**
    
    <img width="906" alt="1" src="https://user-images.githubusercontent.com/9700541/30083628-14278908-9244-11e7-98dc-9ad45fe233a9.png">
    
    ## How was this patch tested?
    
    ```
    SKIP_API=1 jekyll serve --watch
    ```
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #19139 from dongjoon-hyun/partitiondiscovery.
    
    (cherry picked from commit 9e451bcf36151bf401f72dcd66001b9ceb079738)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 9afab9a524c287a5c87c0ff54e5c1b757b32747c
Author: Riccardo Corbella <r.corbella@...>
Date:   2017-09-06T07:22:57Z

    [SPARK-21924][DOCS] Update structured streaming programming guide doc
    
    ## What changes were proposed in this pull request?
    
    Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." to read "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." in the structured streaming programming guide.
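As a sanity check, the corrected window assignment can be reproduced with a small sketch (plain Python; window size 10 minutes sliding by 5, as in the guide's example):

```python
def sliding_windows(minute_of_day, size=10, slide=5):
    """Return the [start, end) minute windows containing the event."""
    start = (minute_of_day // slide) * slide  # latest window covering the event
    windows = []
    while start > minute_of_day - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

# Event time 12:09 -> minute 729 of the day.
windows = sliding_windows(12 * 60 + 9)
# [(720, 730), (725, 735)], i.e. 12:00 - 12:10 and 12:05 - 12:15.
```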
    
    Author: Riccardo Corbella <r.corbella@reply.it>
    
    Closes #19137 from riccardocorbella/bugfix.
    
    (cherry picked from commit 4ee7dfe41b27abbd4c32074ecc8f268f6193c3f4)
    Signed-off-by: Sean Owen <sowen@cloudera.com>

commit 342cc2a4cad4b8491f4689b66570d14e5fcba33b
Author: Jacek Laskowski <jacek@...>
Date:   2017-09-06T22:48:48Z

    [SPARK-21901][SS] Define toString for StateOperatorProgress
    
    ## What changes were proposed in this pull request?
    
    Just `StateOperatorProgress.toString` + few formatting fixes
    
    ## How was this patch tested?
    
    Local build. Waiting for OK from Jenkins.
    
    Author: Jacek Laskowski <jacek@japila.pl>
    
    Closes #19112 from jaceklaskowski/SPARK-21901-StateOperatorProgress-toString.
    
    (cherry picked from commit fa0092bddf695a757f5ddaed539e55e2dc9fccb7)
    Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>

commit 49968de526e76a75abafb636cbd5ed84f9a496e9
Author: Tucker Beck <tucker.beck@...>
Date:   2017-09-07T00:38:00Z

    Fixed pandoc dependency issue in python/setup.py
    
    ## Problem Description
    
    When pyspark is listed as a dependency of another package, installing
    the other package will cause an install failure in pyspark. When the
    other package is being installed, pyspark's setup_requires requirements
    are installed including pypandoc. Thus, the exception handling on
    setup.py:152 does not work because the pypandoc module is indeed
    available. However, the pypandoc.convert() function fails if pandoc
    itself is not installed (in our use cases it is not). This raises an
    OSError that is not handled, and setup fails.
    
    The following is a sample failure:
    ```
    $ which pandoc
    $ pip freeze | grep pypandoc
    pypandoc==1.4
    $ pip install pyspark
    Collecting pyspark
      Downloading pyspark-2.2.0.post0.tar.gz (188.3MB)
        100% |████████████████████████████████|
188.3MB 16.8MB/s
        Complete output from command python setup.py egg_info:
        Maybe try:
    
            sudo apt-get install pandoc
        See http://johnmacfarlane.net/pandoc/installing.html
        for installation options
        ---------------------------------------------------------------
    
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-build-mfnizcwa/pyspark/setup.py", line 151, in <module>
            long_description = pypandoc.convert('README.md', 'rst')
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
line 69, in convert
            outputfile=outputfile, filters=filters)
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
line 260, in _convert_input
            _ensure_pandoc_path()
          File "/home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages/pypandoc/__init__.py",
line 544, in _ensure_pandoc_path
            raise OSError("No pandoc was found: either install pandoc and add it\n"
        OSError: No pandoc was found: either install pandoc and add it
        to your PATH or or call pypandoc.download_pandoc(...) or
        install pypandoc wheels with included pandoc.
    
        ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mfnizcwa/pyspark/
    ```
    
    ## What changes were proposed in this pull request?
    
    This change simply adds an additional exception handler for the OSError
    that is raised. This allows pyspark to be installed client-side without requiring pandoc
to be installed.
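The shape of the fix is roughly the following (a sketch, not the exact setup.py diff; the injectable `converter` parameter exists only to make the sketch testable):

```python
def read_long_description(path, converter=None):
    """Render `path` to rST if pandoc is usable, else return it verbatim."""
    try:
        if converter is None:
            import pypandoc                 # may raise ImportError
            converter = pypandoc.convert
        return converter(path, "rst")       # may raise OSError: no pandoc binary
    except (ImportError, OSError):
        # The fix: a missing pandoc *binary* (OSError) now falls back
        # just like a missing pypandoc *module* (ImportError) always did.
        with open(path) as f:
            return f.read()
```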
    
    ## How was this patch tested?
    
    I tested this by building a wheel package of pyspark with the change applied. Then, in
a clean virtual environment with pypandoc installed but pandoc not available on the system,
I installed pyspark from the wheel.
    
    Here is the output
    
    ```
    $ pip freeze | grep pypandoc
    pypandoc==1.4
    $ which pandoc
    $ pip install --no-cache-dir ../spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
    Processing /home/tbeck/work/spark/python/dist/pyspark-2.3.0.dev0-py2.py3-none-any.whl
    Requirement already satisfied: py4j==0.10.6 in /home/tbeck/.virtualenvs/cem/lib/python3.5/site-packages
(from pyspark==2.3.0.dev0)
    Installing collected packages: pyspark
    Successfully installed pyspark-2.3.0.dev0
    ```
    
    Author: Tucker Beck <tucker.beck@rentrakmail.com>
    
    Closes #18981 from dusktreader/dusktreader/fix-pandoc-dependency-issue-in-setup_py.
    
    (cherry picked from commit aad2125475dcdeb4a0410392b6706511db17bac4)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>

commit 0848df1bb6f27fc7182e0e52efeef1407fd532d2
Author: Sanket Chintapalli <schintap@...>
Date:   2017-09-07T17:20:39Z

    [SPARK-21890] Credentials not being passed to add the tokens
    
    ## What changes were proposed in this pull request?
    I observed this while running an oozie job trying to connect to hbase via spark.
    It looks like the creds are not being passed in https://github.com/apache/spark/blob/branch-2.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/HadoopFSCredentialProvider.scala#L53 for the 2.2 release.
    More Info as to why it fails on secure grid:
    Oozie client gets the necessary tokens the application needs before launching. It passes
those tokens along to the oozie launcher job (MR job) which will then actually call the Spark
client to launch the spark app and pass the tokens along.
    The oozie launcher job cannot get any more tokens because all it has is tokens (you can't get tokens with tokens; you need a TGT or keytab).
    The error here is because the launcher job runs the Spark Client to submit the spark job, but the Spark Client doesn't see that it already has the hdfs tokens, so it tries to get more, which ends with the exception.
    SPARK-19021 generalized the hdfs credentials provider and, in doing so, stopped passing the existing credentials into the call that gets tokens, so it doesn't realize it already has the necessary tokens.
    
    https://issues.apache.org/jira/browse/SPARK-21890
    Modified to pass the creds along when getting delegation tokens
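The gist of the change can be sketched like this (hypothetical Python names; the real change is in the Scala `HadoopFSCredentialProvider`):

```python
def obtain_delegation_tokens(existing_creds, fetch_from_namenode):
    # The fix: consult the credentials the launcher already holds; a
    # token-only process (no TGT/keytab) must not try to fetch fresh ones.
    if existing_creds.get("hdfs_token"):
        return existing_creds["hdfs_token"]
    return fetch_from_namenode()

def fetch_requires_tgt():
    # Stand-in for a fetch the launcher cannot perform with tokens alone.
    raise RuntimeError("cannot get tokens with only tokens")

token = obtain_delegation_tokens({"hdfs_token": "HDFS_DELEGATION_TOKEN#1"},
                                 fetch_requires_tgt)
```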
    
    ## How was this patch tested?
    Manual testing on our secure cluster
    
    Author: Sanket Chintapalli <schintap@yahoo-inc.com>
    
    Closes #19103 from redsanket/SPARK-21890.

commit 4304d0bf05eb51c13ae1b9ee9a2970a945b51cac
Author: Takuya UESHIN <ueshin@...>
Date:   2017-09-08T05:26:07Z

    [SPARK-21950][SQL][PYTHON][TEST] pyspark.sql.tests.SQLTests2 should stop SparkContext.
    
    ## What changes were proposed in this pull request?
    
    `pyspark.sql.tests.SQLTests2` doesn't stop the newly created SparkContext in the test, which might affect the following tests.
    This PR makes `pyspark.sql.tests.SQLTests2` stop the `SparkContext`.
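The cleanup pattern looks roughly like this (sketched with a stand-in context object, since real pyspark isn't assumed here):

```python
import unittest

class FakeSparkContext:
    """Stand-in for SparkContext's process-wide singleton behaviour."""
    active = None

    def __init__(self):
        FakeSparkContext.active = self

    def stop(self):
        FakeSparkContext.active = None

class SQLTests2Sketch(unittest.TestCase):
    def test_uses_fresh_context(self):
        sc = FakeSparkContext()
        try:
            self.assertIs(FakeSparkContext.active, sc)
        finally:
            sc.stop()   # the fix: never leak the context to later suites
```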
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Takuya UESHIN <ueshin@databricks.com>
    
    Closes #19158 from ueshin/issues/SPARK-21950.
    
    (cherry picked from commit 57bc1e9eb452284cbed090dbd5008eb2062f1b36)
    Signed-off-by: Takuya UESHIN <ueshin@databricks.com>

commit 781a1f83c538a80ce1f1876e4786b02cb7984e16
Author: MarkTab marktab.net <marktab@...>
Date:   2017-09-08T07:08:09Z

    [SPARK-21915][ML][PYSPARK] Model 1 and Model 2 ParamMaps Missing
    
    dongjoon-hyun HyukjinKwon
    
    Error in PySpark example code:
    /examples/src/main/python/ml/estimator_transformer_param_example.py
    
    The original Scala code says
    println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)
    
    The parent is lr
    
    In PySpark there is no method for accessing the parent as is done in Scala.
    
    This code has been tested in Python, and returns values consistent with Scala
    
    ## What changes were proposed in this pull request?
    
    Proposing to call the lr variable instead of model1 or model2
    
    ## How was this patch tested?
    
    This patch was tested with Spark 2.1.0, comparing the Scala and PySpark results. PySpark currently prints nothing for those two print lines.
    
    The output for model2 in PySpark should be
    
    {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the convergence
tolerance for iterative algorithms (>= 0).'): 1e-06,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='elasticNetParam', doc='the
ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty.
For alpha = 1, it is an L1 penalty.'): 0.0,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', doc='prediction
column name.'): 'prediction',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', doc='features
column name.'): 'features',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', doc='label column
name.'): 'label',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='probabilityCol', doc='Column
name for predicted class conditional probabilities. Note: Not all models output well-calibrated
probability estimates! These probabilities should be treated as confidences, not precise probabilities.'):
'myProbability',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='rawPredictionCol', doc='raw
prediction (a.k.a. confidence) column name.'): 'rawPrediction',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', doc='The name of
family which is a description of the label distribution to be used in the model. Supported
options: auto, binomial, multinomial'): 'auto',
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', doc='whether
to fit an intercept term.'): True,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', doc='Threshold
in binary classification prediction, in range [0, 1]. If threshold and thresholds are both
set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'):
0.55,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='aggregationDepth', doc='suggested
depth for treeAggregate (>= 2).'): 2,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', doc='max number
of iterations (>= 0).'): 30,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', doc='regularization
parameter (>= 0).'): 0.1,
    Param(parent='LogisticRegression_4187be538f744d5a9090', name='standardization', doc='whether
to standardize the training features before fitting the model.'): True}
    
    
    Author: MarkTab marktab.net <marktab@users.noreply.github.com>
    
    Closes #19152 from marktab/branch-2.2.

commit 08cb06af20f87d40b78b521f82774cf1b6f9c80a
Author: Wenchen Fan <wenchen@...>
Date:   2017-09-08T16:35:41Z

    [SPARK-21936][SQL][2.2] backward compatibility test framework for HiveExternalCatalog
    
    backport https://github.com/apache/spark/pull/19148 to 2.2
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #19163 from cloud-fan/test.

commit 9ae7c96ce33d3d67f49059b5b83ef1d9d3d8e8e5
Author: Kazuaki Ishizaki <ishizaki@...>
Date:   2017-09-08T16:39:20Z

    [SPARK-21946][TEST] fix flaky test: "alter table: rename cached table" in InMemoryCatalogedDDLSuite
    
    ## What changes were proposed in this pull request?
    
    This PR fixes the flaky test `InMemoryCatalogedDDLSuite "alter table: rename cached table"`.
    Since this test validates a distributed DataFrame, the result should be checked with `checkAnswer`. The original version used the `df.collect().Seq` method, which does not guarantee the order of the elements in the result.
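The order-insensitive comparison amounts to something like this (plain Python sketch; `checkAnswer` itself is Spark's Scala test helper):

```python
def check_answer(actual_rows, expected_rows):
    # Distributed results arrive in no particular order, so compare
    # as sorted multisets rather than as sequences.
    assert sorted(actual_rows) == sorted(expected_rows), (
        f"{actual_rows} != {expected_rows} (ignoring order)")

# Passes despite the differing row order:
check_answer([("t2", True), ("t1", False)],
             [("t1", False), ("t2", True)])
```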
    
    ## How was this patch tested?
    
    Use existing test case
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #19159 from kiszk/SPARK-21946.
    
    (cherry picked from commit 8a4f228dc0afed7992695486ecab6bc522f1e392)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 9876821603ec12e77ee58e8ef6f5841c9c310c93
Author: hyukjinkwon <gurwls223@...>
Date:   2017-09-08T16:47:45Z

    [SPARK-21128][R][BACKPORT-2.2] Remove both "spark-warehouse" and "metastore_db" before
listing files in R tests
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to list the files in the test _after_ removing both "spark-warehouse" and "metastore_db" so that the next run of the R tests passes fine. The current behavior is sometimes a bit annoying.
    
    ## How was this patch tested?
    
    Manually running multiple times R tests via `./R/run-tests.sh`.
    
    **Before**
    
    Second run:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ....................................................................................................1234.......................
    
    Failed -------------------------------------------------------------------------
    1. Failure: No extra files are created in SPARK_HOME by starting session and making calls
(test_sparkSQL.R#3384)
    length(list1) not equal to length(list2).
    1/1 mismatches
    [1] 25 - 23 == 2
    
    2. Failure: No extra files are created in SPARK_HOME by starting session and making calls
(test_sparkSQL.R#3384)
    sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
    10/25 mismatches
    x[16]: "metastore_db"
    y[16]: "pkg"
    
    x[17]: "pkg"
    y[17]: "R"
    
    x[18]: "R"
    y[18]: "README.md"
    
    x[19]: "README.md"
    y[19]: "run-tests.sh"
    
    x[20]: "run-tests.sh"
    y[20]: "SparkR_2.2.0.tar.gz"
    
    x[21]: "metastore_db"
    y[21]: "pkg"
    
    x[22]: "pkg"
    y[22]: "R"
    
    x[23]: "R"
    y[23]: "README.md"
    
    x[24]: "README.md"
    y[24]: "run-tests.sh"
    
    x[25]: "run-tests.sh"
    y[25]: "SparkR_2.2.0.tar.gz"
    
    3. Failure: No extra files are created in SPARK_HOME by starting session and making calls
(test_sparkSQL.R#3388)
    length(list1) not equal to length(list2).
    1/1 mismatches
    [1] 25 - 23 == 2
    
    4. Failure: No extra files are created in SPARK_HOME by starting session and making calls
(test_sparkSQL.R#3388)
    sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
    10/25 mismatches
    x[16]: "metastore_db"
    y[16]: "pkg"
    
    x[17]: "pkg"
    y[17]: "R"
    
    x[18]: "R"
    y[18]: "README.md"
    
    x[19]: "README.md"
    y[19]: "run-tests.sh"
    
    x[20]: "run-tests.sh"
    y[20]: "SparkR_2.2.0.tar.gz"
    
    x[21]: "metastore_db"
    y[21]: "pkg"
    
    x[22]: "pkg"
    y[22]: "R"
    
    x[23]: "R"
    y[23]: "README.md"
    
    x[24]: "README.md"
    y[24]: "run-tests.sh"
    
    x[25]: "run-tests.sh"
    y[25]: "SparkR_2.2.0.tar.gz"
    
    DONE ===========================================================================
    ```
    
    **After**
    
    Second run:
    
    ```
    SparkSQL functions: Spark package found in SPARK_HOME: .../spark
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................................................
    ...............................................................................................................................
    ```
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #18335 from HyukjinKwon/SPARK-21128.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19166 from felixcheung/rbackport21128.

commit 182478e030688b602bf95edfd82f700d6f5678d1
Author: Liang-Chi Hsieh <viirya@...>
Date:   2017-09-09T10:10:52Z

    [SPARK-21954][SQL] JacksonUtils should verify MapType's value type instead of key type
    
    ## What changes were proposed in this pull request?
    
    `JacksonUtils.verifySchema` verifies if a data type can be converted to JSON. For `MapType`,
it now verifies the key type. However, in `JacksonGenerator`, when converting a map to JSON,
we only care about its values and create a writer for the values. The keys in a map are treated
as strings by calling `toString` on the keys.
    
    Thus, we should change `JacksonUtils.verifySchema` to verify the value type of `MapType`.
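The behaviour is easy to demonstrate with any JSON serializer; Python's `json` is shown here as an analogy (Spark's generator likewise stringifies map keys via `toString`):

```python
import json

# JSON object keys are always strings: non-string keys get stringified,
# so a schema check only needs to validate the map's *value* type.
print(json.dumps({1: [1, 2], 2: [3]}, sort_keys=True))
# {"1": [1, 2], "2": [3]}
```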
    
    ## How was this patch tested?
    
    Added tests.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #19167 from viirya/test-jacksonutils.
    
    (cherry picked from commit 6b45d7e941eba8a36be26116787322d9e3ae25d0)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>

commit b1b5a7fdc0f8fabfb235f0b31bde0f1bfb71591a
Author: Peter Szalai <szalaipeti.vagyok@...>
Date:   2017-09-10T08:47:45Z

    [SPARK-20098][PYSPARK] dataType's typeName fix
    
    ## What changes were proposed in this pull request?
    The `typeName` classmethod has been fixed by using a type -> typeName map.
    
    ## How was this patch tested?
    local build
    
    Author: Peter Szalai <szalaipeti.vagyok@gmail.com>
    
    Closes #17435 from szalai1/datatype-gettype-fix.
    
    (cherry picked from commit 520d92a191c3148498087d751aeeddd683055622)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>

commit 10c68366e5474f131f7ea294e6abee4e02fca9f3
Author: FavioVazquez <favio.vazquezp@...>
Date:   2017-09-12T09:33:35Z

    [SPARK-21976][DOC] Fix wrong documentation for Mean Absolute Error.
    
    ## What changes were proposed in this pull request?
    
    Fixed wrong documentation for Mean Absolute Error.
    
    Even though the code is correct for the MAE:
    
    ```scala
    @Since("1.2.0")
      def meanAbsoluteError: Double = {
        summary.normL1(1) / summary.count
      }
    ```
    In the documentation, the division by N is missing.
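For reference, the corrected formula, with N the number of samples (`summary.count`):

```latex
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert
```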
    
    ## How was this patch tested?
    
    All of spark tests were run.
    
    Author: FavioVazquez <favio.vazquezp@gmail.com>
    Author: faviovazquez <favio.vazquezp@gmail.com>
    Author: Favio André Vázquez <favio.vazquezp@gmail.com>
    
    Closes #19190 from FavioVazquez/mae-fix.
    
    (cherry picked from commit e2ac2f1c71a0f8b03743d0d916dc0ef28482a393)
    Signed-off-by: Sean Owen <sowen@cloudera.com>

commit 63098dc3170bf4289091d97b7beb63dd0e2356c5
Author: Kousuke Saruta <sarutak@...>
Date:   2017-09-12T14:07:04Z

    [DOCS] Fix unreachable links in the document
    
    ## What changes were proposed in this pull request?
    
    Recently, I found two unreachable links in the document and fixed them.
    Because these are small, documentation-only changes, I didn't file a JIRA issue, but please let me know if you think one is needed.
    
    ## How was this patch tested?
    
    Tested manually.
    
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    
    Closes #19195 from sarutak/fix-unreachable-link.
    
    (cherry picked from commit 957558235b7537c706c6ab4779655aa57838ebac)
    Signed-off-by: Sean Owen <sowen@cloudera.com>

commit b606dc177e177bdbf99e72638eb8baec12e9fb53
Author: Zheng RuiFeng <ruifengz@...>
Date:   2017-09-12T18:37:05Z

    [SPARK-18608][ML] Fix double caching
    
    ## What changes were proposed in this pull request?
    `df.rdd.getStorageLevel` => `df.storageLevel`
    
    The command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` was used to make sure all algorithms involved in this issue are fixed.
    
    Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014
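    The essence of the fix can be sketched as follows (a plain-Python stand-in for the DataFrame API; the class and attribute names are illustrative):

```python
class FakeDataFrame:
    """Stand-in illustrating the double-caching check."""
    def __init__(self):
        self.storage_level = "NONE"   # analogue of df.storageLevel

    def cache(self):
        self.storage_level = "MEMORY_AND_DISK"
        return self

def fit_with_handle_persistence(df):
    # The buggy pattern inspected df.rdd.getStorageLevel, which reflects the
    # underlying RDD rather than the DataFrame itself, so an already-cached
    # DataFrame could be cached a second time. The fixed pattern consults
    # the DataFrame's own storage level.
    handle_persistence = df.storage_level == "NONE"
    if handle_persistence:
        df.cache()
    return handle_persistence

df = FakeDataFrame().cache()
print(fit_with_handle_persistence(df))  # False: already cached, no re-caching
```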
    
    ## How was this patch tested?
    existing tests
    
    Author: Zheng RuiFeng <ruifengz@foxmail.com>
    
    Closes #19197 from zhengruifeng/double_caching.
    
    (cherry picked from commit c5f9b89dda40ffaa4622a7ba2b3d0605dbe815c0)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

commit 3a692e355a786260c4a9c2ef210fe14e409af37a
Author: donnyzone <wellfengzhu@...>
Date:   2017-09-13T17:06:53Z

    [SPARK-21980][SQL] References in grouping functions should be indexed with semanticEquals
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/browse/SPARK-21980
    
    This PR fixes the issue in the ResolveGroupingAnalytics rule, which indexes the column references
in grouping functions without considering the case-sensitivity configuration.
    
    The problem can be reproduced by:
    
    ```scala
    val df = spark.createDataFrame(Seq((1, 1), (2, 1), (2, 2))).toDF("a", "b")
    df.cube("a").agg(grouping("A")).show()
    ```
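    A toy sketch of the resolution logic (not Spark's actual `semanticEquals` code; function names here are illustrative): references should be matched against grouping expressions using the configured case sensitivity, not raw string equality.

```python
def index_of_ref(grouping_exprs, ref, case_sensitive=False):
    # Index a grouping reference the way a semanticEquals-style comparison
    # would: honor the case-sensitivity setting instead of comparing raw
    # strings. Raw comparison would fail to find "A" among ["a"].
    norm = (lambda s: s) if case_sensitive else str.lower
    targets = [norm(e) for e in grouping_exprs]
    return targets.index(norm(ref))

# With case-insensitive resolution (Spark SQL's default), grouping("A")
# resolves against cube("a"):
print(index_of_ref(["a"], "A"))  # 0
```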
    
    ## How was this patch tested?
    unit tests
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #19202 from DonnyZone/ResolveGroupingAnalytics.
    
    (cherry picked from commit 21c4450fb24635fab6481a3756fefa9c6f6d6235)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 51e5a821dcaa1d5f529afafc88cb8cfb4ad48e09
Author: Yanbo Liang <ybliang8@...>
Date:   2017-09-14T06:09:44Z

    [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
    
    ## What changes were proposed in this pull request?
    #19197 fixed double caching for MLlib algorithms but missed PySpark `OneVsRest`;
this PR fixes it.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #19220 from yanboliang/SPARK-18608.
    
    (cherry picked from commit c76153cc7dd25b8de5266fe119095066be7f78f5)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

commit 42852bb17121fb8067a4aea3e56d56f76a2e0d1d
Author: Andrew Ray <ray.andrew@...>
Date:   2017-09-17T17:46:27Z

    [SPARK-21985][PYSPARK] PairDeserializer is broken for double-zipped RDDs
    
    ## What changes were proposed in this pull request?
    (edited)
    Fixes a bug introduced in #16121
    
    In `PairDeserializer`, convert each batch of keys and values to lists (if they do not have
`__len__` already) so that we can check that they are the same size. Normally they are already
lists, so this should not have a performance impact, but the conversion is needed when repeated
`zip`s are done.
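    The conversion can be sketched as follows (a simplified stand-in for the deserializer logic, not PySpark's exact code):

```python
def load_pair_batch(key_batch, val_batch):
    # Materialize each batch as a list if it does not support len(), so the
    # size check below also works for the generators produced by nested zips.
    if not hasattr(key_batch, "__len__"):
        key_batch = list(key_batch)
    if not hasattr(val_batch, "__len__"):
        val_batch = list(val_batch)
    if len(key_batch) != len(val_batch):
        raise ValueError(
            "Can not deserialize PairRDD with different number of items "
            f"in batches: ({len(key_batch)}, {len(val_batch)})")
    return list(zip(key_batch, val_batch))

# Generators (as produced by repeated zips) have no __len__:
print(load_pair_batch((k for k in [1, 2]), (v for v in ["a", "b"])))
```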
    
    ## How was this patch tested?
    
    Additional unit test
    
    Author: Andrew Ray <ray.andrew@gmail.com>
    
    Closes #19226 from aray/SPARK-21985.
    
    (cherry picked from commit 6adf67dd14b0ece342bb91adf800df0a7101e038)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>

commit 309c401a5b3c76cc1b6b5aef97d03034fe4e1ce4
Author: Andrew Ash <andrew@...>
Date:   2017-09-18T02:42:24Z

    [SPARK-21953] Show both memory and disk bytes spilled if either is present
    
    As written now, both memory and disk bytes spilled must be present for either of them to be
shown. If only one of those types of spill is recorded, it is hidden.
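    The fixed display logic amounts to testing each metric independently (an illustrative sketch, not the actual UI code):

```python
def spill_labels(memory_spilled, disk_spilled):
    # Old behavior: the metrics were shown only when *both* were non-zero.
    # Fixed behavior: show each metric whenever it alone is present.
    labels = []
    if memory_spilled > 0:
        labels.append(f"Shuffle spill (memory): {memory_spilled} B")
    if disk_spilled > 0:
        labels.append(f"Shuffle spill (disk): {disk_spilled} B")
    return labels

print(spill_labels(1024, 0))  # the memory metric alone, no longer hidden
```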
    
    Author: Andrew Ash <andrew@andrewash.com>
    
    Closes #19164 from ash211/patch-3.
    
    (cherry picked from commit 6308c65f08b507408033da1f1658144ea8c1491f)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit a86831d618b05c789c2cea0afe5488c3234a14bc
Author: hyukjinkwon <gurwls223@...>
Date:   2017-09-18T04:20:11Z

    [SPARK-22043][PYTHON] Improves error message for show_profiles and dump_profiles
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to improve error message from:
    
    ```
    >>> sc.show_profiles()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1000, in show_profiles
        self.profiler_collector.show_profiles()
    AttributeError: 'NoneType' object has no attribute 'show_profiles'
    >>> sc.dump_profiles("/tmp/abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1005, in dump_profiles
        self.profiler_collector.dump_profiles(path)
    AttributeError: 'NoneType' object has no attribute 'dump_profiles'
    ```
    
    to
    
    ```
    >>> sc.show_profiles()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1003, in show_profiles
        raise RuntimeError("'spark.python.profile' configuration must be set "
    RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python
profile.
    >>> sc.dump_profiles("/tmp/abc")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File ".../spark/python/pyspark/context.py", line 1012, in dump_profiles
        raise RuntimeError("'spark.python.profile' configuration must be set "
    RuntimeError: 'spark.python.profile' configuration must be set to 'true' to enable Python
profile.
    ```
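    The improvement boils down to raising an explicit, actionable error when the profiler collector is absent, instead of letting the `None` attribute access fail (a hypothetical sketch; the class and attribute names are stand-ins for the real `SparkContext`):

```python
class SparkContextSketch:
    """Hypothetical stand-in illustrating the improved error handling."""
    def __init__(self, profile_enabled=False):
        # profiler_collector stays None unless spark.python.profile is true.
        self.profiler_collector = object() if profile_enabled else None

    def show_profiles(self):
        if self.profiler_collector is None:
            # Explicit, actionable message instead of an AttributeError
            # on NoneType.
            raise RuntimeError(
                "'spark.python.profile' configuration must be set to 'true' "
                "to enable Python profile.")

sc = SparkContextSketch()
try:
    sc.show_profiles()
except RuntimeError as e:
    print(e)
```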
    
    ## How was this patch tested?
    
    Unit tests added in `python/pyspark/tests.py` and manual tests.
    
    Author: hyukjinkwon <gurwls223@gmail.com>
    
    Closes #19260 from HyukjinKwon/profile-errors.
    
    (cherry picked from commit 7c7266208a3be984ac1ce53747dc0c3640f4ecac)
    Signed-off-by: hyukjinkwon <gurwls223@gmail.com>

----

