spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Saif Addin Ellafi (JIRA)" <>
Subject [jira] [Created] (SPARK-11330) Filter operation on StringType after groupBy brings no results when there are
Date Mon, 26 Oct 2015 23:44:27 GMT
Saif Addin Ellafi created SPARK-11330:

             Summary: Filter operation on StringType after groupBy brings no results when
there are
                 Key: SPARK-11330
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 1.5.1
         Environment: Stand alone Cluster of five servers. sqlContext instance of HiveContext
(default in spark-shell)
No special options other than driver memory and executor memory.
Parquet partitions are 512 where there are 160 cores.
Data is nearly 2 billion rows.
            Reporter: Saif Addin Ellafi
            Priority: Blocker

val data ="/var/data/Saif/ppnr_pqt")
val res = data.groupBy("vintages", "yyyymm").count.persist()
res.dtypes >>> res25: Array[(String, String)] = Array((vintages,StringType), (yyyymm,IntegerType),
res.first >>> res24: org.apache.spark.sql.Row = [1967-06-01,200506,18750]
val z ="vintages", "yyyymm").filter("vintages = '1967-06-01'").select("yyyymm").distinct.collect
z.length >>> 0

This does not happen if using another type of field, eg IntType
val z ="vintages", "yyyymm").filter("yyyymm = 200506").select("yyyymm").distinct.collect
z: Array[org.apache.spark.sql.Row] = Array([200506])

Originally, the issue happened with a larger aggregation operation, the result was that data
was inconsistent bringing different results every call.

Reducing the operation step by step, I got into this issue.
In any case, the original operation was:

val data ="/var/Saif/data_pqt")

val res = data.groupBy("product", "band", "age", "vint", "mb", "yyyymm").agg(count($"account_id").as("N"),
sum($"balance").as("balance_eom"), sum($"balancec").as("balance_eoc"), sum($"spend").as("spend"),
sum($"payment").as("payment"), sum($"feoc").as("feoc"), sum($"cfintbal").as("cfintbal"), count($"newacct"
=== 1).as("newacct")).persist()

val z ="vint", "yyyymm").filter("vint = '2007-01-01'").select("yyyymm").distinct.collect


>>> res0: Int = 102


val z ="vint", "yyyymm").filter("vint = '2007-01-01'").select("yyyymm").distinct.collect


>>> res1: Int = 103

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message