spark-issues mailing list archives

From "Yin Huai (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-14343) Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
Date Wed, 01 Jun 2016 01:03:13 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-14343:
-----------------------------
    Target Version/s: 2.0.0
            Priority: Critical  (was: Blocker)

> Dataframe operations on a partitioned dataset (using partition discovery) return invalid results
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-14343
>                 URL: https://issues.apache.org/jira/browse/SPARK-14343
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.6.1, 2.0.0
>         Environment: Mac OS X 10.11.4 / Ubuntu 16.04 LTS
>            Reporter: Jurriaan Pruis
>            Priority: Critical
>
> When reading a dataset using {{sqlContext.read.text()}}, queries on the partitioned column return invalid results.
> h2. How to reproduce:
> h3. Generate datasets
> {code:title=repro.sh}
> #!/bin/sh
> mkdir -p dataset/year=2014
> mkdir -p dataset/year=2015
> echo "data from 2014" > dataset/year=2014/part01.txt
> echo "data from 2015" > dataset/year=2015/part01.txt
> {code}
> {code:title=repro2.sh}
> #!/bin/sh
> mkdir -p dataset2/month=june
> mkdir -p dataset2/month=july
> echo "data from june" > dataset2/month=june/part01.txt
> echo "data from july" > dataset2/month=july/part01.txt
> {code}
> h3. Using the first dataset
> {code:none}
> >>> df = sqlContext.read.text('dataset')
> ...
> >>> df
> DataFrame[value: string, year: int]
> >>> df.show()
> +--------------+----+
> |         value|year|
> +--------------+----+
> |data from 2014|2014|
> |data from 2015|2015|
> +--------------+----+
> >>> df.select('year').show()
> +----+
> |year|
> +----+
> |  14|
> |  14|
> +----+
> {code}
> This is clearly wrong: {{year}} should be 2014 and 2015. It seems to return the length of the {{value}} column (14 characters) instead.
> h3. Using the second dataset
> With another dataset it looks like this:
> {code:none}
> >>> df = sqlContext.read.text('dataset2')
> >>> df
> DataFrame[value: string, month: string]
> >>> df.show()
> +--------------+-----+
> |         value|month|
> +--------------+-----+
> |data from june| june|
> |data from july| july|
> +--------------+-----+
> >>> df.select('month').show()
> +--------------+
> |         month|
> +--------------+
> |data from june|
> |data from july|
> +--------------+
> {code}
> Here it returns the contents of the {{value}} column instead of the {{month}} partition value.
> h3. Workaround
> When I convert the DataFrame to an RDD and back to a DataFrame, I get the following result (which is the expected behaviour):
> {code:none}
> >>> df.rdd.toDF().select('month').show()
> +-----+
> |month|
> +-----+
> | june|
> | july|
> +-----+
> {code}
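
For reference, the expected partition values can be derived from the directory layout alone, without Spark. The following is a minimal sketch in plain Python, not Spark's partition discovery implementation: the helper names ({{write_dataset}}, {{expected_rows}}, {{demo}}) are hypothetical, and the type inference is assumed to be simply "all digits means int, otherwise string". It recreates the two datasets from the report in a temporary directory and returns the (value, partition) pairs that {{df.show()}} displays, which is also what a projection on the partition column should yield.

```python
import os
import tempfile


def write_dataset(root, column, values):
    """Create root/<column>=<value>/part01.txt, mirroring repro.sh and repro2.sh."""
    for value in values:
        part_dir = os.path.join(root, "%s=%s" % (column, value))
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part01.txt"), "w") as f:
            f.write("data from %s\n" % value)


def expected_rows(root):
    """Return (value, partition_value) tuples derived from the directory names.

    Simplified stand-in for partition discovery (an assumption, not Spark code):
    a directory named <column>=<raw> contributes <raw> as the partition value,
    cast to int when it consists only of digits.
    """
    rows = []
    for entry in sorted(os.listdir(root)):
        column, sep, raw = entry.partition("=")
        if not sep:
            continue  # not a partition directory
        part_value = int(raw) if raw.isdigit() else raw
        part_dir = os.path.join(root, entry)
        for name in sorted(os.listdir(part_dir)):
            with open(os.path.join(part_dir, name)) as f:
                for line in f:
                    rows.append((line.rstrip("\n"), part_value))
    return rows


def demo():
    """Build both datasets in a temporary directory and return their expected rows."""
    with tempfile.TemporaryDirectory() as tmp:
        dataset = os.path.join(tmp, "dataset")
        dataset2 = os.path.join(tmp, "dataset2")
        write_dataset(dataset, "year", ["2014", "2015"])
        write_dataset(dataset2, "month", ["june", "july"])
        return expected_rows(dataset), expected_rows(dataset2)


if __name__ == "__main__":
    rows, rows2 = demo()
    print([partition for _, partition in rows])
    print([partition for _, partition in rows2])
```

Every text line in the first dataset is 14 characters long, which is consistent with the reporter's guess that the faulty {{select('year')}} returns the length of the {{value}} column rather than the partition value.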



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
