From: Micael Capitão (JIRA)
To: issues@spark.apache.org
Date: Wed, 15 Apr 2015 09:19:58 +0000 (UTC)
Subject: [jira] [Comment Edited] (SPARK-6800) Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.

    [ https://issues.apache.org/jira/browse/SPARK-6800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14495908#comment-14495908 ]

Micael Capitão edited comment on SPARK-6800 at 4/15/15 9:19 AM:
----------------------------------------------------------------

In the example I've provided there are no repeated rows read, because of my particular dataset. But if the generated queries are like this:

(0) age < 4,0
(1) age >= 4 AND age < 8,1
(2) age >= 8 AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9

... the ranges for partitions 1 to 8 overlap. In a real, more complex scenario, that would mean the same record appearing in multiple partitions. For this case it seems not to be such an issue, because the 'age' column is an Int and the DB discards the fractional part, but what if it were a double column? I'll try that case and post the results; a sketch of that experiment follows below.

The title of the issue does not state that problem, but the description does...

was (Author: capitao):
In the example I've provided there are no repeated rows read, because of my particular dataset. But if the generated queries are like this:

(0) age < 4,0
(1) age >= 4 AND age < 8,1
(2) age >= 8 AND age < 12,2
(3) age >= 12 AND age < 16,3
(4) age >= 16 AND age < 20,4
(5) age >= 20 AND age < 24,5
(6) age >= 24 AND age < 28,6
(7) age >= 28 AND age < 32,7
(8) age >= 32 AND age < 36,8
(9) age >= 36,9

... the ranges for partitions 1 to 8 overlap. In a real, more complex scenario, that would mean the same record appearing in multiple partitions.

The title of the issue does not state that problem, but the description does...
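A quick sketch of that double-column experiment, assuming the same in-memory Derby database as in the quoted description below. The PersonD table, its sample rows, and the expectations in the comments are illustrative only, not posted results:

{code}
// Sketch only: a DOUBLE age column read with the same bounds as in the
// description below. The PersonD table and its sample ages are invented;
// sqlContext is assumed to be in scope, as in the original snippets.
import java.sql.DriverManager

val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.execute("CREATE TABLE PersonD (name VARCHAR(50), age DOUBLE)")
stmt.execute("INSERT INTO PersonD(name, age) VALUES('Fractional Low', 7.5)")
stmt.execute("INSERT INTO PersonD(name, age) VALUES('On Boundary', 8.0)")
stmt.execute("INSERT INTO PersonD(name, age) VALUES('Above Upper', 40.5)")
stmt.close()

// With the generated clauses listed above, 40.5 should still be read by
// partition 9 ("age >= 36") even though upperBound is 40, and one can
// inspect which partitions pick up 7.5 and 8.0.
val people = sqlContext.jdbc(url = jdbcUrl, table = "PersonD",
  columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
people.show()
{code}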
> Reading from JDBC with SQLContext, using lower/upper bounds and numPartitions gives incorrect results.
> -------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6800
>                 URL: https://issues.apache.org/jira/browse/SPARK-6800
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0
>         Environment: Windows 8.1, Apache Derby DB, Spark 1.3.0 CDH5.4.0, Scala 2.10
>            Reporter: Micael Capitão
>
> Having a Derby table with people info (id, name, age) defined like this:
> {code}
> val jdbcUrl = "jdbc:derby:memory:PeopleDB;create=true"
> val conn = DriverManager.getConnection(jdbcUrl)
> val stmt = conn.createStatement()
> stmt.execute("CREATE TABLE Person (person_id INT NOT NULL GENERATED ALWAYS AS IDENTITY CONSTRAINT person_pk PRIMARY KEY, name VARCHAR(50), age INT)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Carvalho', 50)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Lurdes Pereira', 23)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Ana Rita Costa', 12)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Armando Pereira', 32)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Miguel Costa', 15)")
> stmt.execute("INSERT INTO Person(name, age) VALUES('Anabela Sintra', 13)")
> {code}
> If I try to read that table from Spark SQL with lower/upper bounds, like this:
> {code}
> val people = sqlContext.jdbc(url = jdbcUrl, table = "Person",
>   columnName = "age", lowerBound = 0, upperBound = 40, numPartitions = 10)
> people.show()
> {code}
> I get this result:
> {noformat}
> PERSON_ID  NAME              AGE
> 3          Ana Rita Costa    12
> 5          Miguel Costa      15
> 6          Anabela Sintra    13
> 2          Lurdes Pereira    23
> 4          Armando Pereira   32
> 1          Armando Carvalho  50
> {noformat}
> Which is wrong, considering the defined upper bound has been ignored (I get a person with age 50!).
> Digging into the code, I've found that the WHERE clauses {{JDBCRelation.columnPartition}} generates are the following:
> {code}
> (0) age < 4,0
> (1) age >= 4 AND age < 8,1
> (2) age >= 8 AND age < 12,2
> (3) age >= 12 AND age < 16,3
> (4) age >= 16 AND age < 20,4
> (5) age >= 20 AND age < 24,5
> (6) age >= 24 AND age < 28,6
> (7) age >= 28 AND age < 32,7
> (8) age >= 32 AND age < 36,8
> (9) age >= 36,9
> {code}
> The last condition ignores the upper bound, and the other ones may result in repeated rows being read.
> Using the JdbcRDD (and converting it to a DataFrame) I would have something like this:
> {code}
> val jdbcRdd = new JdbcRDD(sc, () => DriverManager.getConnection(jdbcUrl),
>   "SELECT * FROM Person WHERE age >= ? and age <= ?", 0, 40, 10,
>   rs => (rs.getInt(1), rs.getString(2), rs.getInt(3)))
> val people = jdbcRdd.toDF("PERSON_ID", "NAME", "AGE")
> people.show()
> {code}
> Resulting in:
> {noformat}
> PERSON_ID  NAME             AGE
> 3          Ana Rita Costa   12
> 5          Miguel Costa     15
> 6          Anabela Sintra   13
> 2          Lurdes Pereira   23
> 4          Armando Pereira  32
> {noformat}
> Which is correct!
> Checking the WHERE clauses generated by the JdbcRDD in {{getPartitions}}, I've found it generates the following:
> {code}
> (0) age >= 0 AND age <= 3
> (1) age >= 4 AND age <= 7
> (2) age >= 8 AND age <= 11
> (3) age >= 12 AND age <= 15
> (4) age >= 16 AND age <= 19
> (5) age >= 20 AND age <= 23
> (6) age >= 24 AND age <= 27
> (7) age >= 28 AND age <= 31
> (8) age >= 32 AND age <= 35
> (9) age >= 36 AND age <= 40
> {code}
> This is the behaviour I was expecting from the Spark SQL version.
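For reference, a minimal self-contained sketch (not the actual JdbcRDD source) of stride arithmetic that reproduces the inclusive, bound-respecting ranges listed just above:

{code}
// Sketch: divide the closed interval [lowerBound, upperBound] into
// numPartitions inclusive ranges that neither overlap nor leak past the
// bounds. BigInt avoids overflow in the intermediate multiplication.
def inclusiveRanges(lowerBound: Long, upperBound: Long,
                    numPartitions: Int): Seq[(Long, Long)] = {
  // Total number of values in the interval, e.g. 41 for [0, 40].
  val length = BigInt(1) + upperBound - lowerBound
  (0 until numPartitions).map { i =>
    val start = lowerBound + (BigInt(i) * length / numPartitions)
    val end   = lowerBound + (BigInt(i + 1) * length / numPartitions) - 1
    (start.toLong, end.toLong)
  }
}

inclusiveRanges(0, 40, 10).zipWithIndex.foreach { case ((lo, hi), i) =>
  println(s"($i) age >= $lo AND age <= $hi")
}
// Prints (0) age >= 0 AND age <= 3 ... (9) age >= 36 AND age <= 40,
// matching the list above: every value in [0, 40] falls in exactly one
// range, and nothing outside the bounds is read.
{code}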
> Is the Spark SQL version buggy, or is this some weird expected behaviour?
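As an illustration of the expected behaviour on the Spark SQL side, a hypothetical sketch of bound-respecting clause generation. This is not Spark's actual {{JDBCRelation.columnPartition}} code, and the whereClauses helper is invented for this example:

{code}
// Hypothetical sketch only: generate WHERE clauses that respect both
// bounds, unlike the open-ended "age >= 36" clause reported above.
// Assumes (upperBound - lowerBound) divides evenly by numPartitions,
// as in the 0/40/10 example.
def whereClauses(column: String, lowerBound: Long, upperBound: Long,
                 numPartitions: Int): Seq[String] = {
  val stride = (upperBound - lowerBound) / numPartitions
  (0 until numPartitions).map { i =>
    val lo = lowerBound + i * stride
    // Clamp the last partition to the upper bound instead of leaving it open.
    if (i == numPartitions - 1) s"$column >= $lo AND $column <= $upperBound"
    else s"$column >= $lo AND $column < ${lo + stride}"
  }
}

whereClauses("age", 0, 40, 10).zipWithIndex.foreach { case (clause, i) =>
  println(s"($i) $clause")
}
// (0) age >= 0 AND age < 4 ... (9) age >= 36 AND age <= 40, so ages
// outside [0, 40] are no longer silently included.
{code}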