spark-issues mailing list archives

From "L. C. Hsieh (Jira)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-32888) reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
Date Wed, 16 Sep 2020 15:37:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-32888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197046#comment-17197046 ]

L. C. Hsieh commented on SPARK-32888:
-------------------------------------

Reading CSV files is simple: we can just remove the first line of each file. But when we
read an RDD of strings or a Dataset of String containing CSV lines, we don't know which lines
were the first lines of files. So all we can do is remove every line that is identical to the
header.
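
The difference between the two strategies can be sketched in plain Python. This is only an
illustration of the logic described above, not Spark's actual implementation; the function
names drop_first_line and drop_lines_equal_to_header are hypothetical, and the header is
assumed to be the first element of the collection.

```python
from typing import List


def drop_first_line(lines: List[str]) -> List[str]:
    """File-style handling: the header is known to be the first line, so drop only it."""
    return lines[1:]


def drop_lines_equal_to_header(lines: List[str]) -> List[str]:
    """Collection-style handling: file boundaries are unknown, so drop every
    line that is identical to the header (assumed here to be the first element)."""
    if not lines:
        return []
    header = lines[0]
    return [line for line in lines if line != header]


# The two-row input from the report: the header and the only record are identical.
lines = ["aaa,bbb", "aaa,bbb"]

file_style = drop_first_line(lines)                    # one data row survives
collection_style = drop_lines_equal_to_header(lines)   # every row matches the header

print(len(file_style), len(collection_style))
```

With the input from the report, the second strategy leaves no rows, which matches the zero
count described in the issue.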

> reading a parallized rdd with two identical records results in a zero count df when read via spark.read.csv
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-32888
>                 URL: https://issues.apache.org/jira/browse/SPARK-32888
>             Project: Spark
>          Issue Type: Documentation
>          Components: Spark Core
>    Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>            Reporter: Punit Shah
>            Assignee: L. C. Hsieh
>            Priority: Minor
>             Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> * Imagine a two-row CSV file like so (where the header and the first record are duplicate rows):
> aaa,bbb
> aaa,bbb
>  * The following is PySpark code.
>  * Create a parallelized RDD like: prdd = spark.read.text("test.csv").rdd.flatMap(lambda x: x)
>  * Create a DataFrame like so: mydf = spark.read.csv(prdd, header=True)
>  * mydf.count() will result in a record count of zero (when it should be 1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


