spark-issues mailing list archives

From "david cottrell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-20012) spark.read.csv schemas effectively ignore headers
Date Sat, 18 Mar 2017 11:48:41 GMT
david cottrell created SPARK-20012:
--------------------------------------

             Summary: spark.read.csv schemas effectively ignore headers
                 Key: SPARK-20012
                 URL: https://issues.apache.org/jira/browse/SPARK-20012
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.1.0
         Environment: pyspark
            Reporter: david cottrell
            Priority: Minor


New to Spark, so please direct me elsewhere if there is another place for this kind of discussion.

To my understanding, schemas are ordered *named* structures; however, it seems the names are
not used when reading files with headers.

I had a quick look at the DataFrameReader code, and it seems like it might not be too hard to:
a) let the schema set the "global" order of the columns;
b) per file, map the columns *by name* to the schema ordering and apply the types on load.

A simple way of saying this is that the schema is an ordered dictionary, while the files with
headers only define plain (unordered) dictionaries.
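
To make points a) and b) concrete, here is a minimal plain-Python sketch of the idea. It does not use Spark and is not the DataFrameReader implementation; the `SCHEMA` list, the `read_by_name` helper, and the sample values are hypothetical and only illustrate mapping each file's header *by name* onto one schema order.

```python
import csv
import io

# Hypothetical schema: an ordered list of (column name, converter) pairs.
# It plays the role of the "ordered dictionary" that fixes the global order.
SCHEMA = [("a", str), ("b", float), ("c", float), ("d", float), ("e", float)]


def read_by_name(text, schema=SCHEMA):
    """Parse one CSV text with a header and return rows in schema order."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [name for name, _ in schema if name not in header]
    if missing:
        raise ValueError(f"columns missing from header: {missing}")
    # Look each value up by column name, then apply the schema's type.
    return [tuple(cast(rec[name]) for name, cast in schema) for rec in reader]


# Two made-up "files" with the same columns in different header orders.
file_1 = "a,b,e,d,c\ntest,1.0,5.0,4.0,3.0\n"
file_2 = "a,b,c,d,e\n0,1.5,3.5,4.5,5.5\n"

rows = read_by_name(file_1) + read_by_name(file_2)
```

Because every value is looked up by name, the order of the header in each file no longer matters; only the schema decides the output order.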

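A typical example showing what I think are the implications of this problem (note that test.csv.gz
has its columns in the order a, b, e, d, c, while 0.csv.gz has them in the order a, b, c, d, e):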

{code}
In [248]: a = spark.read.csv('./data/test.csv.gz', header=True, inferSchema=True).toPandas()

In [249]: b = spark.read.csv('./data/0.csv.gz', header=True, inferSchema=True).toPandas()

In [250]: d = pd.concat([a, b])

In [251]: df = spark.read.csv('./data/{test,0}.csv.gz', header=True, inferSchema=True).toPandas()

In [252]: df[['b', 'c', 'd', 'e']] = df[['b', 'c', 'd', 'e']].astype(float)

In [253]: a
Out[253]:
      a         b         e         d         c
0  test -0.874197  0.168660 -0.948726  0.479723
1  test  1.124383  0.620870  0.159186  0.993676
2  test -1.429108 -0.048814 -0.057273 -1.331702

In [254]: b
Out[254]:
   a         b         c         d         e
0  0 -1.671828 -1.259530  0.905029  0.487244
1  0 -0.024553 -1.750904  0.004466  1.978049
2  0  1.686806  0.175431  0.677609 -0.851670

In [255]: d
Out[255]:
      a         b         c         d         e
0  test -0.874197  0.479723 -0.948726  0.168660
1  test  1.124383  0.993676  0.159186  0.620870
2  test -1.429108 -1.331702 -0.057273 -0.048814
0     0 -1.671828 -1.259530  0.905029  0.487244
1     0 -0.024553 -1.750904  0.004466  1.978049
2     0  1.686806  0.175431  0.677609 -0.851670

In [256]: df
Out[256]:
      a         b         c         d         e
0  test -0.874197  0.168660 -0.948726  0.479723
1  test  1.124383  0.620870  0.159186  0.993676
2  test -1.429108 -0.048814 -0.057273 -1.331702
3     0 -1.671828 -1.259530  0.905029  0.487244
4     0 -0.024553 -1.750904  0.004466  1.978049
5     0  1.686806  0.175431  0.677609 -0.851670
{code}
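
The difference between Out[255] and Out[256] is consistent with one header being applied to every file by position. The sketch below is plain Python, not Spark's actual code path; it just takes the first row of test.csv.gz from Out[253] and labels it both ways. The variable names are made up for illustration.

```python
# Headers and first data row as shown in Out[253] and Out[254].
header_test = ["a", "b", "e", "d", "c"]   # header of test.csv.gz
row_test = ["test", -0.874197, 0.168660, -0.948726, 0.479723]
header_0 = ["a", "b", "c", "d", "e"]      # header of 0.csv.gz

# Positional: label the test.csv.gz row with the other file's header.
positional = dict(zip(header_0, row_test))

# By name: label the row with its own header, then reorder to header_0.
own = dict(zip(header_test, row_test))
by_name = {name: own[name] for name in header_0}
```

Here `positional` reproduces row 0 of Out[256] (c = 0.168660, e = 0.479723), while `by_name` reproduces row 0 of Out[255] (c = 0.479723, e = 0.168660).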

Example also posted here: http://stackoverflow.com/questions/42637497/pyspark-2-1-0-spark-read-csv-scrambles-columns



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


