From: Micael Capitão (JIRA)
To: issues@spark.apache.org
Date: Wed, 15 Apr 2015 09:19:58 +0000 (UTC)
Subject: [jira] [Comment Edited] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

    [ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495908#comment-14495908 ]

Micael Capitão edited comment on SPARK-6800 at 4/15/15 9:19 AM:
----------------------------------------------------------------

In the example I've provided there are no repeated rows read, because of my particular dataset. But if the generated queries are like this:

(0) age < 4,0
(1) age >= 4 AND age < 8,1
(2) age >= 8 AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9

... the ranges for partitions 1 to 8 overlap. In a real, more complex scenario, that would mean the same record appearing in multiple partitions. For this case it seems not to be such an issue, because the 'age' column is an Int and the DB discards the fractional part, but what if it were a double column? I'll try that case and post the results; a sketch of that experiment follows below.

The title of the issue does not state that problem, but the description does...

was (Author: capitao):
In the example I've provided there are no repeated rows read, because of my particular dataset. But if the generated queries are like this:

(0) age < 4,0
(1) age >= 4 AND age < 8,1
(2) age >= 8 AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9

... the ranges for partitions 1 to 8 overlap. In a real, more complex scenario, that would mean the same record appearing in multiple partitions.

The title of the issue does not state that problem, but the description does...
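A quick sketch of that double-column experiment, assuming the same in-memory Derby database as in the quoted description below. The PersonD table, its sample rows, and the expectations in the comments are illustrative only, not posted results:

{code}
// Sketch only: a DOUBLE age column read with the same bounds as in the
// description below. The PersonD table and its sample ages are invented;
// sqlContext is assumed to be in scope, as in the original snippets.
import java.sql.DriverManager

val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE PersonD (name VARCHAR(50), age DOUBLE)")
stmt.execute("INSERT INTO PersonD(name, age) VALUES('Fractional Low', 7.5)")
stmt.execute("INSERT INTO PersonD(name, age) VALUES('On Boundary', 8.0)")
stmt.execute("INSERT INTO PersonD(name, age) VALUES('Above Upper', 40.5)")
stmt.close()

// With the generated clauses listed above, 40.5 should still be read by
// partition 9 ("age >= 36") even though upperBound is 40, and one can
// inspect which partitions pick up 7.5 and 8.0.
val people = sqlContext.jdbc(url = jdbcUrl, table = "PersonD",
  columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
people.show()
{code}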
> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6800
>                 URL: https://issues.apache.org/jira/browse/SPARK-6800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10
>            Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID  NAME              AGE
> 3          Ana Rita Costa    12
> 5          Miguel Costa      15
> 6          Anabela Sintra    13
> 2          Lurdes Pereira    23
> 4          Armando Pereira   32
> 1          Armando Carvalho  50
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!).
> Digging into the code, I've found that the WHERE clauses {{JDBCRelation.columnPartition}} generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4 AND age < 8,1
> (2) age >= 8 AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound, and the other ones may result in repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID  NAME             AGE
> 3          Ana Rita Costa   12
> 5          Miguel Costa     15
> 6          Anabela Sintra   13
> 2          Lurdes Pereira   23
> 4          Armando Pereira  32
> {noformat}
> Which is correct!
> Checking the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
> {code}
> (0) age >= 0 AND age <= 3
> (1) age >= 4 AND age <= 7
> (2) age >= 8 AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version.
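For reference, a minimal self-contained sketch (not the actual JdbcRDD source) of stride arithmetic that reproduces the inclusive, bound-respecting ranges listed just above:

{code}
// Sketch: divide the closed interval [lowerBound, upperBound] into
// numPartitions inclusive ranges that neither overlap nor leak past the
// bounds. BigInt avoids overflow in the intermediate multiplication.
def inclusiveRanges(lowerBound: Long, upperBound: Long,
                    numPartitions: Int): Seq[(Long, Long)] = {
  // Total number of values in the interval, e.g. 41 for [0, 40].
  val length = BigInt(1) + upperBound - lowerBound
  (0 until numPartitions).map { i =>
    val start = lowerBound + (BigInt(i) * length / numPartitions)
    val end   = lowerBound + (BigInt(i + 1) * length / numPartitions) - 1
    (start.toLong, end.toLong)
  }
}

inclusiveRanges(0, 40, 10).zipWithIndex.foreach { case ((lo, hi), i) =>
  println(s"($i) age >= $lo AND age <= $hi")
}
// Prints (0) age >= 0 AND age <= 3 ... (9) age >= 36 AND age <= 40,
// matching the list above: every value in [0, 40] falls in exactly one
// range, and nothing outside the bounds is read.
{code}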
> Is the Spark SQL version buggy, or is this some weird expected behaviour?
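As an illustration of the expected behaviour on the Spark SQL side, a hypothetical sketch of bound-respecting clause generation. This is not Spark's actual {{JDBCRelation.columnPartition}} code, and the whereClauses helper is invented for this example:

{code}
// Hypothetical sketch only: generate WHERE clauses that respect both
// bounds, unlike the open-ended "age >= 36" clause reported above.
// Assumes (upperBound - lowerBound) divides evenly by numPartitions,
// as in the 0/40/10 example.
def whereClauses(column: String, lowerBound: Long, upperBound: Long,
                 numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lowerBound + i * stride
    // Clamp the last partition to the upper bound instead of leaving it open.
    if (i == numPartitions - 1) s"$column >= $lo AND $column <= $upperBound"
    else s"$column >= $lo AND $column < ${lo + stride}"
  }
}

whereClauses("age", 0, 40, 10).zipWithIndex.foreach { case (clause, i) =>
  println(s"($i) $clause")
}
// (0) age >= 0 AND age < 4 ... (9) age >= 36 AND age <= 40, so ages
// outside [0, 40] are no longer silently included.
{code}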