spark-user mailing list archives

From <Saif.A.Ell...@wellsfargo.com>
Subject Results change in group by operation
Date Mon, 26 Oct 2015 22:40:17 GMT
Hello Everyone,

I need urgent help with a data consistency issue I am having.

Setup: a standalone cluster of five servers; sqlContext is an instance of HiveContext (the default in spark-shell); no special options set other than driver memory and executor memory. The Parquet data has 512 partitions across 160 cores and is nearly 2 billion rows.

The issue reproduces as follows:

val data = sqlContext.read.parquet("/var/Saif/data_pqt")

val res = data.groupBy("product", "band", "age", "vint", "mb", "yyyymm").agg(
  count($"account_id").as("N"),
  sum($"balance").as("balance_eom"),
  sum($"balancec").as("balance_eoc"),
  sum($"spend").as("spend"),
  sum($"payment").as("payment"),
  sum($"feoc").as("feoc"),
  sum($"cfintbal").as("cfintbal"),
  count($"newacct" === 1).as("newacct")
).persist()
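One aside on the aggregation above: in Spark SQL, count over a boolean expression counts non-null values, true and false alike, so count($"newacct" === 1) counts every row where newacct is non-null. If the intent is to count only the rows where newacct = 1, a conditional sum such as sum(when($"newacct" === 1, 1).otherwise(0)) may be what is meant. A minimal pure-Scala sketch of the distinction, with hypothetical sample values (None modeling SQL NULL):

```scala
// Hypothetical sample of a newacct column; None models SQL NULL.
val newacct: Seq[Option[Int]] = Seq(Some(1), Some(0), None, Some(1))

// count($"newacct" === 1) behaves like counting non-null comparison results:
val countNonNull = newacct.count(_.isDefined)   // 3 here

// sum(when($"newacct" === 1, 1).otherwise(0)) counts only the matches:
val countMatches = newacct.count(_.contains(1)) // 2 here
```

This is only a semantic illustration; whether it matters for the discrepancy below is a separate question.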

val z = res.select("vint", "yyyymm").filter("vint = '2007-01-01'").select("yyyymm").distinct.collect

z.length

>>> res0: Int = 102

res.unpersist()

val z = res.select("vint", "yyyymm").filter("vint = '2007-01-01'").select("yyyymm").distinct.collect

z.length

>>> res1: Int = 103
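When the two counts differ like this, diffing the collected values shows exactly which yyyymm key appeared or vanished between runs. A minimal sketch: run1 and run2 below are hypothetical stand-ins for the two z results (in the shell, something like z.map(_.get(0).toString).toSet would produce the real sets):

```scala
// Hypothetical yyyymm sets from the two runs above (102 vs 103 distinct values).
val run1 = Set("200701", "200702", "200703")
val run2 = Set("200701", "200702", "200703", "200704")

val gained = run2 -- run1  // values present only in the second run
val lost   = run1 -- run2  // values present only in the first run

println(s"gained: $gained, lost: $lost")
```

Knowing the specific key that comes and goes would make it possible to filter the raw Parquet data down to that key and inspect it directly.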

Please help,
Saif
