spark-issues mailing list archives

From "Jayce Jiang (Jira)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-32515) Distinct Function Weird Bug
Date Wed, 05 Aug 2020 18:43:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-32515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171682#comment-17171682 ]

Jayce Jiang edited comment on SPARK-32515 at 8/5/20, 6:42 PM:
--------------------------------------------------------------

Okay,

The expected result is what I show with filter_df.toPandas()["username"].unique().

In that output, all usernames are in the correct format: the username column contains only
the characters [a-z], [A-Z], [0-9] and the underscore, for example "danielrainge" and
"dgreen_14".  !unknown.png|width=631,height=251!
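
A minimal sketch of how the format check described above could be automated in plain Python. It assumes "correct format" means one or more of the characters a-z, A-Z, 0-9 and underscore; the helper name invalid_usernames and the sample list are hypothetical and not taken from the notebook.

```python
import re

# Assumption: a valid username is one or more of a-z, A-Z, 0-9 or underscore.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]+")


def invalid_usernames(values):
    """Return the values that are not strings made up solely of allowed characters."""
    return [
        v for v in values
        if not (isinstance(v, str) and USERNAME_RE.fullmatch(v))
    ]


# Hypothetical sample: two usernames from the comment plus one unexpected value.
sample = ["danielrainge", "dgreen_14", "#classic"]
print(invalid_usernames(sample))
```

Running the same helper over both the pandas values and the values collected through Spark would show which of the two paths yields entries outside the expected character set.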

 

The problem appears when I use Spark functions instead of converting to a pandas DataFrame first.
As you can see in the image, in [134], when I call the collect() method, I get result strings
like [["#classic"|#classic"]] and other random results with brackets []. Those results shouldn't
be there: none of the strings in the username column contain brackets or hashtags (#).

!unknown1.png|width=576,height=272!
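
To narrow down which values differ between the two paths, the two result sets can be compared directly. The following is only a sketch: compare_value_sets is a hypothetical helper, the sample lists are made up for illustration, and the commented lines show one way the inputs might be obtained from filter_df.

```python
def compare_value_sets(spark_values, pandas_values):
    """Return (only_in_spark, only_in_pandas) as lists sorted by string form."""
    spark_set, pandas_set = set(spark_values), set(pandas_values)
    only_in_spark = sorted(spark_set - pandas_set, key=str)
    only_in_pandas = sorted(pandas_set - spark_set, key=str)
    return only_in_spark, only_in_pandas


# In the notebook, the inputs might be built like this (assumption, not run here):
#   spark_values = [row["username"]
#                   for row in filter_df.select("username").distinct().collect()]
#   pandas_values = filter_df.toPandas()["username"].unique().tolist()

# Hypothetical stand-in data for illustration only.
spark_values = ["danielrainge", "dgreen_14", "#classic"]
pandas_values = ["danielrainge", "dgreen_14"]

only_in_spark, only_in_pandas = compare_value_sets(spark_values, pandas_values)
print("only in Spark result:", only_in_spark)
print("only in pandas result:", only_in_pandas)
```

Values that appear only in the Spark result are the ones worth tracing back to the corresponding rows of the CSV file.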

I am trying it in Google Colab right now to see whether it is a Jupyter notebook problem. Will
keep you updated.

 


was (Author: tigaiii123):
Okay,

The expected result is what I show with filter_df.toPandas()["username"].unique().

In that output, all usernames are in the correct format: the username column contains only
the characters [a-z], [A-Z], [0-9] and the underscore. !unknown.png|width=631,height=251! For example,
"danielrainge", "dgreen_14",

> Distinct Function Weird Bug
> ---------------------------
>
>                 Key: SPARK-32515
>                 URL: https://issues.apache.org/jira/browse/SPARK-32515
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.6
>         Environment: Windows 10 and Mac, both have the same issues.
> Using Scala version 2.11.12
> Python 3.6.10
> java version "1.8.0_261"
>            Reporter: Jayce Jiang
>            Priority: Major
>         Attachments: Capture.PNG, Capture1.png, Capture2.PNG, image-2020-08-03-07-03-55-716.png,
unknown.png, unknown1.png, unknown2.png
>
>
> A weird Spark display and counting error. When I load my CSV file into Spark
and try to check all distinct values of a column in a DataFrame, everything
I try in Spark gives a wrong answer. But if I convert my Spark DataFrame into a pandas
DataFrame, it works. Please help. This bug only happens with this one CSV file; all my other
CSV files work properly. Here are the pictures.
>  
> !image-2020-08-01-21-19-06-402.png!!image-2020-08-01-21-19-03-289.png!!image-2020-08-01-21-18-58-625.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


