spark-issues mailing list archives

From "Charles Drotar (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-13156) JDBC using multiple partitions creates additional tasks but only executes on one
Date Wed, 03 Feb 2016 05:11:39 GMT

     [ https://issues.apache.org/jira/browse/SPARK-13156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charles Drotar updated SPARK-13156:
-----------------------------------
    Description: 
I can successfully kick off a query to Teradata through JDBC, and when it runs it creates a task
for each partition across the executors. The problem is that all of the tasks except one complete
within a couple of seconds, while the final task handles the entire dataset.

Example Code:
private val properties = new java.util.Properties()
properties.setProperty("driver", "com.teradata.jdbc.TeraDriver")
properties.setProperty("username", "foo")
properties.setProperty("password", "bar")
val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
val numPartitions = 5
// Subquery adds a "modulo" column so it can serve as the partition column.
val dbTableTemp =
  s"""(SELECT id MOD $numPartitions AS modulo, id
     |   FROM db.table
     |) AS TEMP_TABLE""".stripMargin
val partitionColumn = "modulo"
val lowerBound = 0L
val upperBound = (numPartitions - 1).toLong
val df = sqlContext.read.jdbc(url, dbTableTemp, partitionColumn, lowerBound, upperBound, numPartitions, properties)
df.write.parquet("/output/path/for/df/")

When I look at the Spark UI I see 5 tasks, but only 1 is actually querying.
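
For reference, here is a minimal, hypothetical Scala sketch (not the code above and not Spark source) of how the per-partition WHERE clauses would come out if the JDBC reader splits the lowerBound..upperBound range on the partition column into numPartitions equal integer strides rather than filtering on the bounds. Under that assumption, lowerBound = 0, upperBound = 4 and numPartitions = 5 give a stride of 0, which would leave one partition scanning the whole table while the others cover nothing:

// Illustrative only: names and clause shapes are assumptions, not Spark internals.
val lowerBound = 0L
val upperBound = 4L
val numPartitions = 5
val stride = (upperBound - lowerBound) / numPartitions  // integer division => 0

val predicates = (0 until numPartitions).map { i =>
  val lo = lowerBound + i * stride
  val hi = lo + stride
  if (i == 0) s"modulo < $hi"                        // first partition: open below
  else if (i == numPartitions - 1) s"modulo >= $lo"  // last partition: open above
  else s"modulo >= $lo AND modulo < $hi"
}
predicates.foreach(println)
// modulo < 0
// modulo >= 0 AND modulo < 0   (empty)
// modulo >= 0 AND modulo < 0   (empty)
// modulo >= 0 AND modulo < 0   (empty)
// modulo >= 0                  (entire table)

Under these assumptions the generated clauses would reproduce the observed pattern of several trivial tasks and one task handling the entire dataset.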

  was:
I can successfully kick off a query to Teradata through JDBC, and when it runs it creates a task
for each partition across the executors. The problem is that all of the tasks except one complete
within a couple of seconds, while the final task handles the entire dataset.

Example Code:
private val properties = new java.util.Properties()
properties.setProperty("driver", this.driver)
properties.setProperty("username", "foo")
properties.setProperty("password", "bar")
val url = "jdbc:teradata://oneview/, TMODE=TERA,TYPE=FASTEXPORT,SESSIONS=10"
val numPartitions = 5
val dbTableTemp =
  s"""(SELECT id MOD $numPartitions AS modulo, id
     |   FROM db.table
     |) AS TEMP_TABLE""".stripMargin
val partitionColumn = "modulo"
val lowerBound = 0L
val upperBound = (numPartitions - 1).toLong
val df = sqlContext.read.jdbc(url, dbTableTemp, partitionColumn, lowerBound, upperBound, numPartitions, properties)
df.write.parquet("/output/path/for/df/")

When I look at the Spark UI I see 5 tasks, but only 1 is actually querying.


> JDBC using multiple partitions creates additional tasks but only executes on one
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-13156
>                 URL: https://issues.apache.org/jira/browse/SPARK-13156
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.6.0-cdh5.4.0, Teradata, yarn-client
>            Reporter: Charles Drotar
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

