spark-issues mailing list archives

From "Rick Moritz (JIRA)" <>
Subject [jira] [Closed] (SPARK-8380) SparkR mis-counts
Date Tue, 16 Jun 2015 06:20:01 GMT


Rick Moritz closed SPARK-8380.
    Resolution: Invalid

I got my columns mixed up, late in the evening after a frustrating day with SparkR's documentation.
With the correct columns, the counts are equal in both expression types and via both platforms.

> SparkR mis-counts
> -----------------
>                 Key: SPARK-8380
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.4.0
>            Reporter: Rick Moritz
> On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count
> operations on the entirety of the dataset and get the correct value, as double-checked against
> the same code in Scala.
> When I start to add conditions or even do a simple partial ascending histogram, I get
> In particular, there are missing values in SparkR, and massively so:
> A top-6 count of a certain feature in my dataset yields numbers an order of magnitude
> smaller than those I get via Scala.
> The following logic, which I consider equivalent, is the basis for this report:
> counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name))
> head(arrange(counts, desc(counts$count)))
> versus:
> val table = sql("SELECT col_name, count(col_name) as value from df group by col_name order by value desc")
> The first, in particular, is taken directly from the SparkR programming guide. Since
> summarize isn't documented from what I can see, I'd hope it does what the programming guide
> indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise,
> there's the possibility that a lack of documentation and a badly worded example in the guide are
> behind my misperception of SparkR's functionality.
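
Both expressions in the report group rows by one column, count a column within each group, and sort the counts in descending order. The sketch below illustrates that logic with plain Scala collections rather than Spark, and shows how pointing the count at a different column than the one grouped on changes the result, since SQL's COUNT(col) skips NULLs. The names Record, colA, colB, and topCounts are hypothetical and appear in neither the report nor the Spark API, and the data is made up; this is not a reconstruction of what actually went wrong in the reporter's session.

```scala
// Hypothetical stand-in for a two-column dataset; None plays the role of NULL.
final case class Record(colA: Option[String], colB: Option[String])

object GroupCountSketch {
  // Group by `key`, count the rows in each group where `counted` is non-null,
  // sort by count descending (ties broken by key), and keep the first `k` groups.
  // This mirrors: SELECT key, COUNT(counted) ... GROUP BY key ORDER BY 2 DESC LIMIT k
  def topCounts(
      rows: Seq[Record],
      key: Record => Option[String],
      counted: Record => Option[String],
      k: Int
  ): Seq[(Option[String], Int)] =
    rows
      .groupBy(key)
      .map { case (group, members) => (group, members.count(r => counted(r).isDefined)) }
      .toSeq
      .sortBy { case (group, n) => (-n, group.getOrElse("")) }
      .take(k)

  // Made-up sample rows, for illustration only.
  val sample: Seq[Record] = Seq(
    Record(Some("a"), Some("x")),
    Record(Some("a"), None),
    Record(Some("a"), None),
    Record(Some("b"), Some("y")),
    Record(Some("b"), Some("y")),
    Record(Some("c"), None)
  )

  // Grouping and counting on the same column, as in the report's two expressions.
  val sameColumn: Seq[(Option[String], Int)] = topCounts(sample, _.colA, _.colA, 6)

  // Grouping on one column but counting another: NULLs in colB shrink the counts.
  val mixedColumns: Seq[(Option[String], Int)] = topCounts(sample, _.colA, _.colB, 6)
}
```

With the sample rows, sameColumn ranks the groups a, b, c with counts 3, 2, 1, while mixedColumns ranks them b, a, c with counts 2, 1, 0. Comparing results across platforms is only meaningful when both sides name the same grouping and counting columns.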

This message was sent by Atlassian JIRA
