spark-dev mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject Will .count() always trigger an evaluation of each row?
Date Sat, 18 Feb 2017 03:15:33 GMT
Especially during development, people often use .count() or
.persist().count() to force evaluation of all rows — exposing any problems,
e.g. due to bad data — and to load data into cache to speed up subsequent
operations.
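
For concreteness, here is a minimal sketch of that pattern in Scala (the
app name, input path, and column name are just placeholders, not anything
from a real job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("force-eval").getOrCreate()

    // Hypothetical input path, just to illustrate the pattern.
    val df = spark.read.parquet("/path/to/data")

    // Mark the DataFrame for caching, then call .count() to force a full
    // pass over the data: today this materializes every row, surfacing
    // bad records and populating the cache.
    df.persist()
    df.count()

    // Subsequent actions can now read from the cached data.
    df.filter(col("value") > 0).show()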

But as the optimizer gets smarter, I’m guessing it will eventually learn
that it doesn’t have to do all that work to give the correct count. (This
blog post
<https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html>
suggests that something like this is already happening.) This will change
Spark’s practical behavior while technically preserving semantics.

What will people need to do then to force evaluation or caching?

Nick
