spark-dev mailing list archives

From Nicholas Chammas <nicholas.cham...@gmail.com>
Subject PySpark: Make persist() return a context manager
Date Fri, 05 Aug 2016 04:56:01 GMT
Context managers
<https://docs.python.org/3/reference/datamodel.html#context-managers> are a
natural way to capture closely related setup and teardown code in Python.

For example, they are commonly used when doing file I/O:

with open('/path/to/file') as f:
    contents = f.read()
    ...

Once the program exits the with block, f is automatically closed.

Does it make sense to apply this pattern to persisting and unpersisting
DataFrames and RDDs? I feel like there are many cases when you want to
persist a DataFrame for a specific set of operations and then unpersist it
immediately afterwards.

For example, take model training. Today, you might do something like this:

labeled_data.persist()
model = pipeline.fit(labeled_data)
labeled_data.unpersist()

If persist() returned a context manager, you could rewrite this as follows:

with labeled_data.persist():
    model = pipeline.fit(labeled_data)

Upon exiting the with block, labeled_data would automatically be
unpersisted.
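
Something close to this can be approximated today, without any change to PySpark, by wrapping the calls in a small helper built on contextlib. The following is only a minimal sketch: the helper name persisted and the stand-in class FakeFrame are hypothetical, and the only assumption is that the object exposes persist() and unpersist(). The stand-in is there so the example runs without a Spark installation.

```python
from contextlib import contextmanager


@contextmanager
def persisted(dataset):
    """Hypothetical helper: persist `dataset` for the duration of a with block."""
    dataset.persist()
    try:
        yield dataset
    finally:
        # Runs even if the body raises, so the cache is always released.
        dataset.unpersist()


class FakeFrame:
    """Stand-in for a DataFrame or RDD; it only tracks whether it is cached."""

    def __init__(self):
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self


labeled_data = FakeFrame()
with persisted(labeled_data) as df:
    cached_inside = df.is_cached       # state while inside the block
cached_after = labeled_data.is_cached  # state once the block has exited
```

The try/finally is what guarantees the unpersist step, which is the same guarantee the proposed with labeled_data.persist(): form would give.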

This can be done in a backwards-compatible way: persist() would still
return the parent DataFrame or RDD as it does today, and that object would
gain two methods, __enter__() and __exit__().
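
To make the idea concrete, here is a rough sketch of the shape this could take. It is not PySpark code: SketchFrame is a hypothetical stand-in class, and the is_cached flag is only a placeholder for real caching. What it illustrates is that persist() keeps returning the object itself while the object also implements the context manager protocol.

```python
class SketchFrame:
    """Hypothetical stand-in for a DataFrame or RDD."""

    def __init__(self):
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        return self  # same return value as before, so existing callers are unaffected

    def unpersist(self):
        self.is_cached = False
        return self

    def __enter__(self):
        # Nothing to set up here: persist() has already been called.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False  # do not suppress exceptions raised inside the block


# Existing style keeps working unchanged.
old_style = SketchFrame()
old_style.persist()
old_style.unpersist()

# New style: unpersist happens automatically when the block exits.
new_style = SketchFrame()
with new_style.persist() as df:
    same_object = df is new_style
    cached_inside = df.is_cached
cached_after = new_style.is_cached
```

One consequence of this particular sketch is that a bare with new_style: (without calling persist() first) would also unpersist on exit, so a real design would need to decide whether that is acceptable.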

Does this make sense? Is it attractive?

Nick