spark-user mailing list archives

From johndesuv <desu...@gmail.com>
Subject DataFrame from in memory datasets in multiple JVMs
Date Tue, 28 Feb 2017 16:02:51 GMT
Hi,

I have an application that runs on a series of JVMs, each of which holds a
subset of a large dataset in memory.  I'd like to use this data in Spark, and
I'm looking for a way to expose it as a data source without writing the data
to disk as a handoff.

SparkContext.parallelize doesn't work for me, since it distributes a
collection held by a single driver and I need the data from all the JVMs as
one DataFrame.

The only option I've come up with so far is to write a custom DataSource
that pulls the data from each of the JVMs over the network.  This seems like
overkill, though.
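
To make the network handoff idea concrete, here is a minimal PySpark sketch.
It is not a custom DataSource. It assumes, purely for illustration, that each
producer JVM exposes a TCP endpoint that writes one JSON object per line and
then closes the connection. That wire protocol and the names fetch_rows,
fetch_partition and build_dataframe are hypothetical. Only the small list of
endpoints is parallelized, with one endpoint per partition, so each Spark task
pulls rows straight from one JVM and the data itself never passes through the
driver.

```python
import json
import socket


def fetch_rows(endpoint):
    """Yield the rows served by one producer JVM as dicts.

    Assumed (hypothetical) protocol: the producer writes one JSON object
    per line and closes the connection when it has sent everything.
    """
    host, port = endpoint
    with socket.create_connection((host, port)) as conn:
        with conn.makefile("r", encoding="utf-8") as stream:
            for line in stream:
                line = line.strip()
                if line:
                    yield json.loads(line)


def fetch_partition(endpoints):
    """Pull rows from every endpoint assigned to this Spark partition."""
    for endpoint in endpoints:
        yield from fetch_rows(endpoint)


def build_dataframe(spark, endpoints):
    """Build one DataFrame from all producers.

    `spark` is an existing SparkSession; `endpoints` is a list of
    (host, port) tuples, one per producer JVM (an assumption of this sketch).
    """
    from pyspark.sql import Row  # imported lazily; only needed on the Spark side

    # One endpoint per partition, so each task talks to exactly one JVM.
    rdd = spark.sparkContext.parallelize(endpoints, len(endpoints))
    rows = rdd.mapPartitions(fetch_partition).map(lambda d: Row(**d))
    # Schema is inferred from the data; pass an explicit schema if rows vary.
    return spark.createDataFrame(rows)
```

Under these assumptions, something like
build_dataframe(spark, [("host-a", 9000), ("host-b", 9000)]) would return a
single DataFrame spanning both producers. The sketch has no retry or
failure handling: if a task is re-run, it reads that producer's data again, so
the producers would need to serve the same rows on every connection.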

Is there a simpler solution for getting this data into a DataFrame?

Thanks,
John



