spark-dev mailing list archives

From nitin <>
Subject Spark SQL - Long running job
Date Sat, 21 Feb 2015 16:55:49 GMT
Hi All,

I intend to build a long-running Spark application that fetches data/tuples
from Parquet, does some time-consuming processing, and then caches the
processed table (InMemoryColumnarTableScan). My use case needs fast retrieval
for SQL queries (the benefit of the Spark SQL optimizer) and data
compression (built into the in-memory caching). The problem is that if my
driver goes down, I have to fetch the data again for all the tables,
recompute it, and cache it again, which is time-consuming.
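
To make the flow above concrete, here is a minimal sketch of it, assuming the
Spark 1.2-era PySpark SQLContext methods parquetFile, registerTempTable, sql
and cacheTable. The function name, the "_raw" suffix, and all arguments are
hypothetical placeholders, not the actual application code.

```python
def load_process_cache(sql_context, parquet_path, table_name, processing_sql):
    """Load a Parquet table, run the expensive processing query, cache the result.

    `sql_context` is assumed to behave like a Spark 1.2 SQLContext.
    All names and arguments here are illustrative placeholders.
    """
    # Read the raw tuples from Parquet and expose them to SQL.
    raw = sql_context.parquetFile(parquet_path)
    raw.registerTempTable(table_name + "_raw")

    # The time-consuming processing step, expressed as a SQL query
    # over the "<table_name>_raw" table.
    processed = sql_context.sql(processing_sql)
    processed.registerTempTable(table_name)

    # Mark the processed table for compressed in-memory columnar caching.
    sql_context.cacheTable(table_name)
    return processed
```

Nothing in this sketch writes the processed table anywhere durable, so all of
the work has to be repeated once the driver is gone.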

Is it possible to persist the processed/cached RDDs on disk so that my system
comes back up faster when it is restarted after a failure or shutdown?

On a side note, the data processing contains a shuffle step that creates
huge temporary shuffle files in the temp folder on local disk and, as per the
current logic, shuffle files don't get deleted while the executors are
running. Because this is a long-running Spark job, my local disk fills up
quickly and runs out of space. (I am running Spark in yarn-client mode, btw.)

