spark-reviews mailing list archives

From felixalbani <...@git.apache.org>
Subject [GitHub] spark pull request #20867: Spark 23759
Date Tue, 20 Mar 2018 22:51:30 GMT
GitHub user felixalbani opened a pull request:

    https://github.com/apache/spark/pull/20867

    Spark 23759

    ## What changes were proposed in this pull request?
    
    This pull request fixes SPARK-23759.
    
    The problem was caused by calling connector.setHost(hostName) after connector.start()
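
The ordering bug can be sketched with a toy model (plain Python; the Connector class and its methods are hypothetical stand-ins for the pattern, not the actual Jetty API):

```python
class Connector:
    """Toy connector: the bind host is read exactly once, when start() binds."""

    def __init__(self):
        self.host = "0.0.0.0"       # default bind address
        self.bound_host = None      # address actually bound at start()

    def set_host(self, host):
        self.host = host

    def start(self):
        # The bind address is captured here; later set_host calls are ignored.
        self.bound_host = self.host


# Buggy order: setting the host after start() has no effect on the bound address.
buggy = Connector()
buggy.start()
buggy.set_host("10.1.2.3")
assert buggy.bound_host == "0.0.0.0"

# Fixed order: configure the host before starting the connector.
fixed = Connector()
fixed.set_host("10.1.2.3")
fixed.start()
assert fixed.bound_host == "10.1.2.3"
```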
    
    ## How was this patch tested?
    
    The patch was tested after a build and deployment. It requires the SPARK_LOCAL_IP environment
variable to be set in spark-env.sh
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixalbani/spark SPARK-23759

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20867.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20867
    
----
commit edcd9fbc92683753d55ed0c69f391bf3bed59da4
Author: Shixiong Zhu <shixiong@...>
Date:   2017-07-11T03:26:17Z

    [SPARK-21369][CORE] Don't use Scala Tuple2 in common/network-*
    
    ## What changes were proposed in this pull request?
    
    Remove all usages of Scala Tuple2 from common/network-* projects. Otherwise, Yarn users
cannot use `spark.reducer.maxReqSizeShuffleToMem`.
    
    ## How was this patch tested?
    
    Jenkins.
    
    Author: Shixiong Zhu <shixiong@databricks.com>
    
    Closes #18593 from zsxwing/SPARK-21369.
    
    (cherry picked from commit 833eab2c9bd273ee9577fbf9e480d3e3a4b7d203)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit 399aa016e8f44fea4e5ef4b71a9a80484dd755f8
Author: Xingbo Jiang <xingbo.jiang@...>
Date:   2017-07-11T13:52:54Z

    [SPARK-21366][SQL][TEST] Add sql test for window functions
    
    ## What changes were proposed in this pull request?
    
    Add sql test for window functions, also remove unnecessary test cases in `WindowQuerySuite`.
    
    ## How was this patch tested?
    
    Added `window.sql` and the corresponding output file.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18591 from jiangxb1987/window.
    
    (cherry picked from commit 66d21686556681457aab6e44e19f5614c5635f0c)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit cb6fc89ba20a427fa7d66fa5036b17c1a5d5d87f
Author: Eric Vandenberg <ericvandenberg@...>
Date:   2017-07-12T06:49:15Z

    [SPARK-21219][CORE] Task retry occurs on same executor due to race condition with blacklisting
    
    There's a race condition in the current TaskSetManager: a failed task is added for
retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist
state update (updateBlacklistForFailedTask); as a result, the task might re-execute on the same
executor.  This is particularly problematic if the executor is shutting down, since the retry
task immediately becomes a lost task (ExecutorLostFailure).  Another side effect is that the
actual failure reason gets obscured by the retry task, which never actually executed.  Sample
logs showing the issue are in https://issues.apache.org/jira/browse/SPARK-21219
    
    The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask
calls in TaskSetManager.handleFailedTask
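
The importance of the ordering can be sketched with a toy model (hypothetical Python names, not the real TaskSetManager code):

```python
# Toy scheduler state illustrating why the call ordering matters.
blacklisted = set()   # (executor, task) pairs barred from re-running
pending = []          # tasks eligible for resource offers

def update_blacklist_for_failed_task(executor, task):
    blacklisted.add((executor, task))

def add_pending_task(task):
    pending.append(task)

def offer(executor):
    """An asynchronous resource offer: assigns the first non-blacklisted task."""
    for task in pending:
        if (executor, task) not in blacklisted:
            pending.remove(task)
            return task
    return None

# Fixed ordering: blacklist first, then re-queue, so an offer racing in
# between cannot hand the task back to the failing executor.
update_blacklist_for_failed_task("exec-1", "task-0")
add_pending_task("task-0")
assert offer("exec-1") is None
```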
    
    Implemented a unit test that verifies the task is blacklisted before it is added to the
pending list.  The unit test fails without the fix and passes with it.
    
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes #18427 from ericvandenbergfb/blacklistFix.
    
    ## What changes were proposed in this pull request?
    
    This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.
    
    ## How was this patch tested?
    
    Ran TaskSetManagerSuite tests locally.
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes #18604 from jsoltren/branch-2.2.

commit 39eba3053ac99f03d9df56471bae5fc5cc9f4462
Author: Kohki Nishio <taroplus@...>
Date:   2017-07-13T00:22:40Z

    [SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader
    
    ## What changes were proposed in this pull request?
    
    `ClassLoader` preferentially loads classes from its `parent`. Only when `parent` is null
or that load fails does it call the overridden `findClass` method. To avoid the potential
issue of loading classes with an inappropriate class loader, we should set the `parent`
of the `ClassLoader` to null, so that we fully control which class loader is used.
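
The delegation behavior can be sketched with a toy model (hypothetical Python classes, not the JVM ClassLoader API):

```python
class ClassLoader:
    """Toy parent-delegating loader: try the parent first, then find_class."""

    def __init__(self, parent=None):
        self.parent = parent

    def load_class(self, name):
        if self.parent is not None:
            try:
                return self.parent.load_class(name)
            except KeyError:
                pass                      # parent failed; fall through
        return self.find_class(name)

    def find_class(self, name):
        raise KeyError(name)


class SystemLoader(ClassLoader):
    def find_class(self, name):
        return "system:" + name           # resolves everything locally


class ExecutorClassLoader(ClassLoader):
    def find_class(self, name):
        return "remote:" + name           # e.g. fetch from the driver/REPL


# With a parent set, the parent wins and find_class is never consulted.
assert ExecutorClassLoader(parent=SystemLoader()).load_class("Foo") == "system:Foo"

# With parent=None (this PR), the overridden find_class is in full control.
assert ExecutorClassLoader(parent=None).load_class("Foo") == "remote:Foo"
```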
    
    This is a take-over of #17074; the primary author of this PR is taroplus.
    
    #17074 should be closed after this PR gets merged.
    
    ## How was this patch tested?
    
    Add test case in `ExecutorClassLoaderSuite`.
    
    Author: Kohki Nishio <taroplus@me.com>
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18614 from jiangxb1987/executor_classloader.
    
    (cherry picked from commit e08d06b37bc96cc48fec1c5e40f73e0bca09c616)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit cf0719b5e99333b28bb4066b304dbcf8400c80ea
Author: Wenchen Fan <wenchen@...>
Date:   2017-07-13T00:34:42Z

    Revert "[SPARK-18646][REPL] Set parent classloader as null for ExecutorClassLoader"
    
    This reverts commit 39eba3053ac99f03d9df56471bae5fc5cc9f4462.

commit bfe3ba86936ffaabff9f89d03018eb368d246b4d
Author: jerryshao <sshao@...>
Date:   2017-07-13T22:25:38Z

    [SPARK-21376][YARN] Fix yarn client token expire issue when cleaning the staging files
in long running scenario
    
    ## What changes were proposed in this pull request?
    
    This issue happens in long-running applications in yarn cluster mode: yarn#client
doesn't sync its token with the AM, so it always keeps the initial token. That token may expire
in a long-running scenario, so when yarn#client tries to clean up the staging directory after
the application finishes, it uses the expired token and hits a token-expiry error.
    
    ## How was this patch tested?
    
    Manual verification in a secure cluster.
    
    Author: jerryshao <sshao@hortonworks.com>
    
    Closes #18617 from jerryshao/SPARK-21376.
    
    (cherry picked from commit cb8d5cc90ff8d3c991ff33da41b136ab7634f71b)

commit 1cb4369a5b894619582e0d5ccc8c1f4ecb8ae36a
Author: Kazuaki Ishizaki <ishizaki@...>
Date:   2017-07-15T03:16:04Z

    [SPARK-21344][SQL] BinaryType comparison does signed byte array comparison
    
    ## What changes were proposed in this pull request?
    
    This PR fixes a wrong comparison for `BinaryType`. It enables unsigned comparison
and unsigned prefix generation for arrays of `BinaryType`. The previous implementation used
signed operations.
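
The signed-vs-unsigned difference can be illustrated with a small sketch (plain Python; the comparison helpers are illustrative, not Spark code):

```python
def to_signed(b):
    """Interpret a byte as Java's signed byte (-128..127)."""
    return b - 256 if b >= 128 else b

def compare_signed(a, b):
    """Lexicographic comparison treating each byte as signed (the old bug)."""
    for x, y in zip(a, b):
        if to_signed(x) != to_signed(y):
            return -1 if to_signed(x) < to_signed(y) else 1
    return (len(a) > len(b)) - (len(a) < len(b))

def compare_unsigned(a, b):
    """Python bytes already compare byte-wise unsigned (the fixed behavior)."""
    return (a > b) - (a < b)

# 0x80 (128) must sort after 0x01 as unsigned, but a signed comparison
# treats it as -128 and misorders it first.
assert compare_unsigned(b"\x80", b"\x01") == 1
assert compare_signed(b"\x80", b"\x01") == -1
```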
    
    ## How was this patch tested?
    
    Added a test suite in `OrderingSuite`.
    
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    
    Closes #18571 from kiszk/SPARK-21344.
    
    (cherry picked from commit ac5d5d795909061a17e056696cf0ef87d9e65dd1)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 8e85ce625127f62b7e2abdfab81c7bcbebcc8448
Author: Sean Owen <sowen@...>
Date:   2017-07-15T08:21:29Z

    [SPARK-21267][DOCS][MINOR] Follow up to avoid referencing programming-guide redirector
    
    ## What changes were proposed in this pull request?
    
    Update internal references from programming-guide to rdd-programming-guide
    
    See https://github.com/apache/spark-website/commit/5ddf243fd84a0f0f98a5193a207737cea9cdc083
and https://github.com/apache/spark/pull/18485#issuecomment-314789751
    
    Let's keep the redirector even if it's problematic to build, but not rely on it internally.
    
    ## How was this patch tested?
    
    (Doc build)
    
    Author: Sean Owen <sowen@cloudera.com>
    
    Closes #18625 from srowen/SPARK-21267.2.
    
    (cherry picked from commit 74ac1fb081e9532d77278a4edca9f3f129fd62eb)
    Signed-off-by: Sean Owen <sowen@cloudera.com>

commit 0ef98fd435ff77196780c2cad6e1bda377b2642f
Author: John Lee <jlee2@...>
Date:   2017-07-17T18:13:35Z

    [SPARK-21321][SPARK CORE] Spark very verbose on shutdown
    
    ## What changes were proposed in this pull request?
    
    The current code is very verbose on shutdown.
    
    The proposed change is to lower the log level when the driver is shutting down and
the RPC connections are closed (RpcEnvStoppedException).
    
    ## How was this patch tested?
    
    Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB
of data.
    
    Author: John Lee <jlee2@yahoo-inc.com>
    
    Closes #18547 from yoonlee95/SPARK-21321.
    
    (cherry picked from commit 0e07a29cf4a5587f939585e6885ed0f7e68c31b5)
    Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

commit 83bdb04871248357ddbb665198c538f2df449006
Author: aokolnychyi <anton.okolnychyi@...>
Date:   2017-07-18T04:07:50Z

    [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions
    
    ## What changes were proposed in this pull request?
    
    This PR changes the direction of expression transformation in the DecimalPrecision rule.
Previously, the expressions were transformed down, which led to incorrect result types when
decimal expressions had other decimal expressions as their operands. The root cause of this
issue was in visiting outer nodes before their children. Consider the example below:
    
    ```
        val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
        val sc = spark.sparkContext
        val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
        val df = spark.createDataFrame(rdd, inputSchema)
    
        // Works correctly since no nested decimal expression is involved
        // Expected result type: (26, 6) * (26, 6) = (38, 12)
        df.select($"col" * $"col").explain(true)
        df.select($"col" * $"col").printSchema()
    
        // Gives a wrong result since there is a nested decimal expression that should be
visited first
        // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38,
18)
        df.select($"col" * $"col" * $"col").explain(true)
        df.select($"col" * $"col" * $"col").printSchema()
    ```
    
    The example above gives the following output:
    
    ```
    // Correct result without sub-expressions
    == Parsed Logical Plan ==
    'Project [('col * 'col) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    (col * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1
as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- (col * col): decimal(38,12) (nullable = true)
    
    // Incorrect result with sub-expressions
    == Parsed Logical Plan ==
    'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Analyzed Logical Plan ==
    ((col * col) * col): decimal(38,12)
    Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1
as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12))
as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12))
AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Optimized Logical Plan ==
    Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6))
* col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- LogicalRDD [col#1]
    
    == Physical Plan ==
    *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6))
* col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
    +- Scan ExistingRDD[col#1]
    
    // Schema
    root
     |-- ((col * col) * col): decimal(38,12) (nullable = true)
    ```
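
The expected result types quoted above follow a bounded multiplication rule; assuming the rule is (p1+p2+1, s1+s2) with each component capped at 38, a quick check (plain Python, not Spark code):

```python
MAX_PRECISION = 38

def multiply_type(t1, t2):
    """Result type of decimal multiplication: (p1+p2+1, s1+s2), each bounded
    at 38, matching the expected types in the example above."""
    (p1, s1), (p2, s2) = t1, t2
    return (min(p1 + p2 + 1, MAX_PRECISION), min(s1 + s2, MAX_PRECISION))

col = (26, 6)
# (26,6) * (26,6) = (38,12)
assert multiply_type(col, col) == (38, 12)
# Visiting children first: ((26,6) * (26,6)) * (26,6) = (38,12) * (26,6) = (38,18)
assert multiply_type(multiply_type(col, col), col) == (38, 18)
```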
    
    ## How was this patch tested?
    
    This PR was tested with available unit tests. Moreover, there are tests to cover previously
failing scenarios.
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #18583 from aokolnychyi/spark-21332.
    
    (cherry picked from commit 0be5fb41a6b7ef4da9ba36f3604ac646cb6d4ae3)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 99ce551a13f0918b440ddc094c3a32167d7ab3dd
Author: Burak Yavuz <brkyvz@...>
Date:   2017-07-18T04:09:07Z

    [SPARK-21445] Make IntWrapper and LongWrapper in UTF8String Serializable
    
    ## What changes were proposed in this pull request?
    
    Making those two classes Serializable will avoid serialization issues like the one below:
    ```
    Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
    Serialization stack:
        - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper,
value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@326450e)
        - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1,
name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
        - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1,
<function1>)
    ```
    
    ## How was this patch tested?
    
    - [x] Manual testing
    - [ ] Unit test
    
    Author: Burak Yavuz <brkyvz@gmail.com>
    
    Closes #18660 from brkyvz/serializableutf8.
    
    (cherry picked from commit 26cd2ca0402d7d49780116d45a5622a45c79f661)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit df061fd5f93c8110107198a94e68a4e29248e345
Author: Wenchen Fan <wenchen@...>
Date:   2017-07-18T22:56:16Z

    [SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle partition values
with dot
    
    ## What changes were proposed in this pull request?
    
    When we list partitions from the hive metastore with a partial partition spec, we expect
exact matching on the partition values. However, hive treats dot specially and matches
any single character for it. We should apply an extra filter to drop unexpected partitions.
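
The dot behavior and the extra filter can be sketched as follows (toy Python model of the described matching, not the metastore API):

```python
import re

# Toy partition list; hive-style listing treats '.' in the requested value
# as matching any single character (regex-like).
partitions = ["part=1.1", "part=1x1", "part=121"]

def list_partitions_hive(spec_value):
    """Model of hive's listing: '.' in the spec matches any character."""
    pattern = re.compile("part=" + spec_value + "$")
    return [p for p in partitions if pattern.match(p)]

def list_partitions_fixed(spec_value):
    """The extra exact-value filter this PR adds on top of hive's result."""
    return [p for p in list_partitions_hive(spec_value) if p == "part=" + spec_value]

# Without the fix, unrelated partitions leak into the result.
assert list_partitions_hive("1.1") == ["part=1.1", "part=1x1", "part=121"]
# With the extra filter, only the exact match survives.
assert list_partitions_fixed("1.1") == ["part=1.1"]
```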
    
    ## How was this patch tested?
    
    new regression test.
    
    Author: Wenchen Fan <wenchen@databricks.com>
    
    Closes #18671 from cloud-fan/hive.
    
    (cherry picked from commit f18b905f6cace7686ef169fda7de474079d0af23)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 5a0a76f1648729dfa7ed0522dd2cb41ba805a2cd
Author: jinxing <jinxing6042@...>
Date:   2017-07-19T13:35:26Z

    [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM.
    
    ## What changes were proposed in this pull request?
    
    In `SlidingWindowFunctionFrame`, the current code adds to the buffer all rows for which the
input row value is equal to or less than the output row upper bound, then drops from the
buffer all rows for which the input row value is smaller than the output row lower bound.
    This can make the buffer very large even though the window is small.
    For example:
    ```
    select a, b, sum(a)
    over (partition by b order by a range between 1000000 following and 1000001 following)
    from table
    ```
    We can refine the logic and add only the qualified rows to the buffer.
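
The refinement can be sketched as follows (toy Python model over a sorted column of ints; not the actual SlidingWindowFunctionFrame code):

```python
def frame_sums(values, lo, hi):
    """For each row v of a sorted column, sum the rows x with v+lo <= x <= v+hi
    (range between lo following and hi following).  The refinement: rows below
    the lower bound are skipped rather than buffered then dropped, so with a
    far-ahead frame (e.g. '1000000 following') the buffer stays frame-sized."""
    out = []
    buf = []                  # holds only qualified rows
    i = 0                     # next input row to inspect
    for v in values:
        # Evict rows that fell below the new lower bound.
        while buf and buf[0] < v + lo:
            buf.pop(0)
        # Skip (never buffer) rows still below the lower bound; the column is
        # sorted, so they can never qualify for a later frame either.
        while i < len(values) and values[i] < v + lo:
            i += 1
        # Admit rows up to the upper bound.
        while i < len(values) and values[i] <= v + hi:
            buf.append(values[i])
            i += 1
        out.append(sum(buf))
    return out

# range between 2 following and 3 following over [1..5]
assert frame_sums([1, 2, 3, 4, 5], 2, 3) == [7, 9, 5, 0, 0]
```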
    
    ## How was this patch tested?
    Manual test:
    Run sql
    `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue
range between 100 following and 200 following) from revenueList limit 10`
against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue:
Int). The biggest partition is around 2GB, containing 200k rows.
    The executor was configured with 2GB of memory.
    With the change in this PR, it works fine. Without the change, the exception below is
thrown.
    ```
    java.lang.OutOfMemoryError: Java heap space
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
    	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
    	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
    	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
    	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    	at org.apache.spark.scheduler.Task.run(Task.scala:108)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ```
    
    Author: jinxing <jinxing6042@126.com>
    
    Closes #18634 from jinxing64/SPARK-21414.
    
    (cherry picked from commit 4eb081cc870a9d1c42aae90418535f7d782553e9)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit 4c212eed1a4a75756216b13aab211d945e14d89b
Author: donnyzone <wellfengzhu@...>
Date:   2017-07-19T13:48:54Z

    [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases
    
    ## What changes were proposed in this pull request?
    
    https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
    
    This issue can be reproduced by the following example:
    
    ```
    val spark = SparkSession
       .builder()
       .appName("smj-codegen")
       .master("local")
       .config("spark.sql.autoBroadcastJoinThreshold", "1")
       .getOrCreate()
    val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
    val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
    val df = df1.join(df2, df1("key") === df2("key"))
       .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
       .select("int")
       df.show()
    ```
    
    To conclude, the issue happens when:
    (1) the SortMergeJoin condition contains CodegenFallback expressions, and
    (2) in the physical plan tree, the SortMergeJoin node is the child of the root node, e.g.,
the Project in the above example.
    
    This patch fixes the logic in `CollapseCodegenStages` rule.
    
    ## How was this patch tested?
    Unit test and manual verification in our cluster.
    
    Author: donnyzone <wellfengzhu@gmail.com>
    
    Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
    
    (cherry picked from commit 6b6dd682e84d3b03d0b15fbd81a0d16729e521d2)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit 86cd3c08871618441c0c297da0f48ac284595697
Author: Tathagata Das <tathagata.das1565@...>
Date:   2017-07-19T18:02:07Z

    [SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class
    
    ## What changes were proposed in this pull request?
    
    Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark
2.2. However, uses of ProcessingTime still cause deprecation warnings during compilation.
This cannot be avoided entirely: even though it is deprecated as a public API, ProcessingTime
instances are used internally in TriggerExecutor. This PR minimizes the warnings by removing
its uses from tests as much as possible.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Tathagata Das <tathagata.das1565@gmail.com>
    
    Closes #18678 from tdas/SPARK-21464.
    
    (cherry picked from commit 70fe99dc62ef636a99bcb8a580ad4de4dca95181)
    Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

commit 308bce0eb60649b15836614567532460ea73bd12
Author: DFFuture <albert.zhang23@...>
Date:   2017-07-19T21:45:11Z

    [SPARK-21446][SQL] Fix setAutoCommit never executed
    
    ## What changes were proposed in this pull request?
    JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
    options.asConnectionProperties cannot contain fetchsize, because fetchsize is a Spark-only
option, and Spark-only options are excluded from the connection properties.
    So the properties passed to beforeFetch are changed from options.asConnectionProperties.asScala.toMap
to options.asProperties.asScala.toMap
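
The distinction between the two property maps can be sketched as follows (toy Python model; the option names besides fetchsize are hypothetical):

```python
# Hypothetical set of Spark-only option names; the commit's point is only
# that fetchsize is among the options excluded from connection properties.
SPARK_ONLY_OPTIONS = {"fetchsize", "url", "dbtable"}

def as_connection_properties(options):
    """Spark-only options are excluded, so fetchsize never appears here."""
    return {k: v for k, v in options.items() if k not in SPARK_ONLY_OPTIONS}

def as_properties(options):
    """All options, including Spark-only ones such as fetchsize."""
    return dict(options)

opts = {"fetchsize": "1000", "user": "alice"}
assert "fetchsize" not in as_connection_properties(opts)   # old call site: lost
assert as_properties(opts)["fetchsize"] == "1000"          # new call site: kept
```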
    
    ## How was this patch tested?
    
    Author: DFFuture <albert.zhang23@gmail.com>
    
    Closes #18665 from DFFuture/sparksql_pg.
    
    (cherry picked from commit c9729187bcef78299390e53cd9af38c3e084060e)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield <coreywoodfield@...>
Date:   2017-07-19T22:21:38Z

    [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith
    
    ## What changes were proposed in this pull request?
    
    Two invalid join types were mistakenly listed in the javadoc for joinWith in the Dataset
class. I presume these were copied from the javadoc of join, but since joinWith returns a
Dataset\<Tuple2\>, left_semi and left_anti are invalid, as they only return values from
one of the datasets instead of from both.
    
    ## How was this patch tested?
    
    I ran the following code :
    ```
    public static void main(String[] args) {
    	SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
    	Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2),
new Bean(3), new Bean(4), new Bean(5)), Bean.class);
    	Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5),
new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);
    
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception
e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception
e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception
e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception
e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch
(Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception
e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch
(Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception
e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch
(Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch
(Exception e) {e.printStackTrace();}
    	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch
(Exception e) {e.printStackTrace();}
    }
    ```
which tests all the different join types; the last two (left_semi and left_anti) threw
exceptions. The same code using join instead of joinWith worked fine. The Bean class was just
a java bean with a single int field, x.
    
    Author: Corey Woodfield <coreywoodfield@gmail.com>
    
    Closes #18462 from coreywoodfield/master.
    
    (cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar <dhruveashar@...>
Date:   2017-07-21T19:03:46Z

    [SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch
    
    For configurations with external shuffle enabled, we have observed that if a very large
number of blocks is fetched from a remote host, it puts the NodeManager under extra pressure and
can crash it. This change introduces a configuration, `spark.reducer.maxBlocksInFlightPerAddress`,
to limit the number of map outputs fetched from a given remote address. The change applies
both when external shuffle is enabled and when it is disabled.
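
The per-address limit can be sketched as follows (toy Python model; the function and parameter names are hypothetical, not the actual shuffle-fetch code):

```python
def batch_fetches(blocks_by_host, max_blocks_per_address):
    """Split each remote host's block list into fetch requests of at most
    max_blocks_per_address blocks, so no single NodeManager is asked for
    an unbounded number of blocks at once."""
    requests = []
    for host, blocks in blocks_by_host.items():
        for i in range(0, len(blocks), max_blocks_per_address):
            requests.append((host, blocks[i:i + max_blocks_per_address]))
    return requests

reqs = batch_fetches({"nm1": list(range(250)), "nm2": list(range(30))}, 100)
# nm1's 250 blocks become 3 requests (100 + 100 + 50); nm2's 30 stay as 1.
assert [len(b) for h, b in reqs if h == "nm1"] == [100, 100, 50]
assert [len(b) for h, b in reqs if h == "nm2"] == [30]
```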
    
    Ran the job with the default configuration, which does not change the existing behavior,
and ran it with a few lower values (10, 20, 50, 100). The job ran fine and there is
no change in the output. (I will update the metrics related to the NM in some time.)
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes #18487 from dhruve/impr/SPARK-21243.
    
    Author: Dhruve Ashar <dhruveashar@gmail.com>
    
    Closes #18691 from dhruve/branch-2.2.

commit da403b95353f064c24da25236fa7f905fa8ddca1
Author: Holden Karau <holden@...>
Date:   2017-07-21T23:50:47Z

    [SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.
    
    Update the Quickstart and RDD programming guides to mention pip.
    
    Built docs locally.
    
    Author: Holden Karau <holden@us.ibm.com>
    
    Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
    
    (cherry picked from commit cc00e99d5396893b2d3d50960161080837cf950a)
    Signed-off-by: Holden Karau <holden@us.ibm.com>

commit 62ca13dcaf79b85fca02de5628b607196534c605
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-07-23T15:23:13Z

    [SPARK-20904][CORE] Don't report task failures to driver during shutdown.
    
    Executors run a thread pool with daemon threads to run tasks. This means
    that those threads remain active when the JVM is shutting down, meaning
    those tasks are affected by code that runs in shutdown hooks.
    
    So if a shutdown hook messes with something that the task is using (e.g.
    an HDFS connection), the task will fail and will report that failure to
    the driver. That will make the driver mark the task as failed regardless
    of what caused the executor to shut down. So, for example, if YARN pre-empted
    that executor, the driver would consider that task failed when it should
    instead ignore the failure.
    
    This change avoids reporting failures to the driver when shutdown hooks
    are executing; this fixes the YARN preemption accounting, and doesn't really
    change things much for other scenarios, other than reporting a more generic
    error ("Executor lost") when the executor shuts down unexpectedly - which
    is arguably more correct.
    
    Tested with a hacky app running on spark-shell that tried to cause failures
    only when shutdown hooks were running, verified that preemption didn't cause
    the app to fail because of task failures exceeding the threshold.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18594 from vanzin/SPARK-20904.
    
    (cherry picked from commit cecd285a2aabad4e7db5a3d18944b87fbc4eee6c)
    Signed-off-by: Wenchen Fan <wenchen@databricks.com>

commit e5ec3390cbbef87fca8a27bea701a225e18b98ea
Author: DjvuLee <lihu@...>
Date:   2017-07-25T17:21:18Z

    [SPARK-21383][YARN] Fix the YarnAllocator allocates more Resource
    
    When NodeManagers are slow to launch executors, the `missing` value can exceed the
real value, which can lead YARN to allocate more resources than needed.
    
    We add `numExecutorsRunning` when calculating `missing` to avoid this.
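
The described fix can be sketched as follows (toy Python model with hypothetical names):

```python
def missing_executors(target, pending, running):
    """Executors still to request: subtract both pending *and* running, so
    executors that launched slowly are not requested a second time."""
    return max(0, target - pending - running)

# Leaving the running count out (the bug) would over-request: with 10
# targeted, 2 pending, and 5 already running, 8 more would be requested
# instead of the 3 actually missing.
assert missing_executors(10, 2, 5) == 3
```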
    
    Tested by experiment.
    
    Author: DjvuLee <lihu@bytedance.com>
    
    Closes #18651 from djvulee/YarnAllocate.
    
    (cherry picked from commit 8de080d9f9d3deac7745f9b3428d97595975701d)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

commit c91191bed186f816b760af98218392f9a178942b
Author: Eric Vandenberg <ericvandenberg@...>
Date:   2017-07-25T18:45:35Z

    [SPARK-21447][WEB UI] Spark history server fails to render compressed inprogress
history file in some cases.
    
    Add failure handling for an EOFException that can be thrown during
    decompression of an inprogress spark history file; treat it the same as
    the case where the last line can't be parsed.
    
    ## What changes were proposed in this pull request?
    
    Failure handling for an EOFException thrown within the ReplayListenerBus.replay method,
analogous to the JSON parse failure case.  This path can arise in compressed inprogress
history files, since an incomplete compression block can be read (not flushed by the writer on
a block boundary).  See the stack trace of this occurrence in the jira ticket (https://issues.apache.org/jira/browse/SPARK-21447)
    
    ## How was this patch tested?
    
    Added a unit test that specifically validates the failure-handling path when maybeTruncated
is true and when it is false.
    
    Author: Eric Vandenberg <ericvandenberg@fb.com>
    
    Closes #18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
    
    (cherry picked from commit 06a9793793ca41dcef2f10ca06af091a57c721c4)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

commit 1bfd1a83b5e18f42bf76c1d72cd0347ff578e9cd
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-07-26T00:57:26Z

    [SPARK-21494][NETWORK] Use correct app id when authenticating to external service.
    
    There was some code based on the old SASL handler in the new auth client that
    was incorrectly using the SASL user as the user to authenticate against the
    external shuffle service. This caused the external service to not be able to
    find the correct secret to authenticate the connection, failing the connection.
    
    In the course of debugging, I found that some log messages from the YARN shuffle
    service were a little noisy, so I silenced some of them, and also added a couple
    of new ones that helped find this issue. On top of that, I found that a check
    in the code that records app secrets was wrong, causing more log spam and also
    using an O(n) operation instead of an O(1) call.
    
    Also added a new integration suite for the YARN shuffle service with auth on,
    and verified it failed before, and passes now.
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18706 from vanzin/SPARK-21494.
    
    (cherry picked from commit 300807c6e3011e4d78c6cf750201d0ab8e5bdaf5)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

commit 06b2ef01ed87add681144fe1d801718caba271af
Author: aokolnychyi <anton.okolnychyi@...>
Date:   2017-07-27T23:49:42Z

    [SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API
    
    ## What changes were proposed in this pull request?
    
    This PR contains a tiny update that removes an attribute resolution inconsistency in the
Dataset API. The following example is taken from the ticket description:
    
    ```scala
    spark.range(1).withColumnRenamed("id", "x").sort(col("id"))  // works
    spark.range(1).withColumnRenamed("id", "x").sort($"id")  // works
    spark.range(1).withColumnRenamed("id", "x").sort('id) // works
    spark.range(1).withColumnRenamed("id", "x").sort("id") // fails with:
    org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
    ```
    The above `AnalysisException` happens because the last case calls `Dataset.apply()` to
convert strings into columns, which triggers attribute resolution. To make the API consistent
between overloaded methods, this PR defers the resolution and constructs columns directly.
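
    A minimal sketch of the consistent behavior after this change (assuming a `spark` session is in scope; this is an illustration, not the actual `Dataset` source):

    ```scala
    import org.apache.spark.sql.functions.col

    val df = spark.range(1).withColumnRenamed("id", "x")

    // The String overload now builds an unresolved Column (as col("id")
    // does) instead of calling Dataset.apply(), so all four overloads
    // go through the same analyzer-based resolution path.
    df.sort(col("id"))
    df.sort("id")  // no longer fails eagerly with AnalysisException
    ```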
    
    Author: aokolnychyi <anton.okolnychyi@sap.com>
    
    Closes #18740 from aokolnychyi/spark-21538.
    
    (cherry picked from commit f44ead89f48f040b7eb9dfc88df0ec995b47bfe9)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 93790313b2e36e5e5ac4dfe13b285f03c42da111
Author: Yan Facai (颜发才) <facai.yan@...>
Date:   2017-07-28T02:10:35Z

    [SPARK-21306][ML] OneVsRest should support setWeightCol
    
    ## What changes were proposed in this pull request?
    
    Add a `setWeightCol` method for `OneVsRest`.
    
    `weightCol` is ignored if the classifier doesn't inherit the `HasWeightCol` trait.
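
    A usage sketch (the `training` DataFrame with a `weight` column is hypothetical; `LogisticRegression` mixes in `HasWeightCol`, so the weight column is propagated to each binary classifier):

    ```scala
    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    val ovr = new OneVsRest()
      .setClassifier(new LogisticRegression())
      .setWeightCol("weight")      // added by this patch
    val model = ovr.fit(training)  // `training` is assumed to exist
    ```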
    
    ## How was this patch tested?
    
    + [x] add a unit test.
    
    Author: Yan Facai (颜发才) <facai.yan@gmail.com>
    
    Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
    
    (cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c)
    Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

commit df6cd35ecb710b99911f39b9d7d16cac08468b4d
Author: Remis Haroon <remis.haroon@...>
Date:   2017-07-29T12:26:10Z

    [SPARK-21508][DOC] Fix example code provided in Spark Streaming Documentation
    
    ## What changes were proposed in this pull request?
    
    JIRA ticket : [SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)
    
    This corrects a mistake in the example code in the 'Spark Streaming Custom Receivers'
documentation.
    doc link : https://spark.apache.org/docs/latest/streaming-custom-receivers.html
    
    ```scala
    // Assuming ssc is the StreamingContext
    val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
    val words = lines.flatMap(_.split(" "))
    ...
    ```
    
    instead of `lines.flatMap(_.split(" "))`
    it should be `customReceiverStream.flatMap(_.split(" "))`
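
    The corrected example reads:

    ```scala
    // Assuming ssc is the StreamingContext
    val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
    val words = customReceiverStream.flatMap(_.split(" "))
    ```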
    
    ## How was this patch tested?
    This documentation change was tested manually with a Jekyll build, by running the commands below:
    ```
    jekyll build
    jekyll serve --watch
    ```
    screen-shots provided below
    ![screenshot1](https://user-images.githubusercontent.com/8828470/28744636-a6de1ac6-7482-11e7-843b-ff84b5855ec0.png)
    ![screenshot2](https://user-images.githubusercontent.com/8828470/28744637-a6def496-7482-11e7-9512-7f4bbe027c6a.png)
    
    Author: Remis Haroon <Remis.Haroon@insdc01.pwc.com>
    
    Closes #18770 from remisharoon/master.
    
    (cherry picked from commit c14382030b373177cf6aa3c045e27d754368a927)
    Signed-off-by: Sean Owen <sowen@cloudera.com>

commit 24a9bace131465bf6a177f304cf8f05b0e4fe6ed
Author: Liang-Chi Hsieh <viirya@...>
Date:   2017-07-29T17:02:56Z

    [SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by its canonicalized
child
    
    ## What changes were proposed in this pull request?
    
    When there are aliases (added for nested fields) as parameters in `RuntimeReplaceable`,
they are not among the children expressions, so those aliases can't be cleaned up by the analyzer
rule `CleanupAliases`.
    
    An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions
in a group by query because they contain different aliases.
    
    Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`,
we can't trim them out by simply transforming the expressions in `CleanupAliases`.
    
    If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add
more code to `RuntimeReplaceable` and modify all its expressions. That
makes the interface ugly IMO.
    
    Since those aliases will be replaced later during optimization and so do no harm,
this patch chooses to simply override `canonicalized` in `RuntimeReplaceable`.
    
    One concern is `CleanupAliases`: it actually cannot clean up ALL aliases
inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.
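
    A rough sketch of the idea (simplified; not the actual Spark source):

    ```scala
    // Comparing RuntimeReplaceable expressions by the canonicalized form
    // of the replacement child means stray non-child aliases no longer
    // break semantic equality in, e.g., a group-by query.
    trait RuntimeReplaceable extends UnaryExpression {
      override lazy val canonicalized: Expression = child.canonicalized
    }
    ```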
    
    ## How was this patch tested?
    
    Added test.
    
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    
    Closes #18761 from viirya/SPARK-21555.
    
    (cherry picked from commit 9c8109ef414c92553335bb1e90e9681e142128a4)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit 66fa6bd6d48b08625ecedfcb5a976678141300bd
Author: Xingbo Jiang <xingbo.jiang@...>
Date:   2017-07-29T17:11:31Z

    [SPARK-19451][SQL] rangeBetween method should accept Long value as boundary
    
    ## What changes were proposed in this pull request?
    
    Long values can be passed to `rangeBetween` as range frame boundaries, but we silently
convert them to Int values; this can cause wrong results, so we should fix it.
    
    Furthermore, we should accept any legal literal value as a range frame boundary. In
this PR, we make that possible for Long values and make accepting other DataTypes easy
to add.
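
    A usage sketch (the `df` DataFrame with `ts` and `value` columns is hypothetical):

    ```scala
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Long boundaries are no longer silently truncated to Int, so frames
    // wider than Int.MaxValue behave correctly.
    val w = Window.orderBy("ts").rangeBetween(-3000000000L, 0L)
    val result = df.withColumn("running", sum("value").over(w))
    ```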
    
    This PR is mostly based on Herman's previous amazing work: https://github.com/hvanhovell/spark/commit/596f53c339b1b4629f5651070e56a8836a397768
    
    After this has been merged, we can close #16818.
    
    ## How was this patch tested?
    
    Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18540 from jiangxb1987/rangeFrame.
    
    (cherry picked from commit 92d85637e7f382aae61c0f26eb1524d2b4c93516)
    Signed-off-by: gatorsmile <gatorsmile@gmail.com>

commit e2062b9c1106433799d2874dfe17e181fe1ecb5e
Author: gatorsmile <gatorsmile@...>
Date:   2017-07-30T03:35:22Z

    Revert "[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary"
    
    This reverts commit 66fa6bd6d48b08625ecedfcb5a976678141300bd.

commit 174543466934c6ced5812e2dfc7e1a18793cf0b1
Author: Marcelo Vanzin <vanzin@...>
Date:   2017-08-01T17:06:03Z

    [SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.
    
    Handle the case where the server closes the socket before the full message
    has been written by the client.
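    
    The race can be sketched as follows (names are hypothetical; the real fix lives in `LauncherServerSuite`):
    
    ```scala
    // If the server closes the socket before the client finishes writing,
    // the write may throw; the test should treat that as an expected
    // outcome instead of a spurious failure.
    try {
      out.write(messageBytes)  // `out` and `messageBytes` are hypothetical
      out.flush()
    } catch {
      case _: java.io.IOException => // server already closed the socket
    }
    ```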
    
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    
    Closes #18727 from vanzin/SPARK-21522.
    
    (cherry picked from commit b133501800b43fa5c538a4e5ad597c9dc7d8378e)
    Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

----

