Yep, that's definitely possible.  It's one of the workarounds I was considering.  I was just curious if there was a simpler (and perhaps more efficient) approach.

Keith

On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
Could you modify your function so that it streams through the files record by record and outputs them to hdfs, then read them all in as RDDs and take the union? That would only use bounded memory.

On 1 December 2014 at 17:19, Keith Simmons <keith@pulse.io> wrote:
Actually, I'm working with a binary format.  The api allows reading out a single record at a time, but I'm not sure how to get those records into spark (without reading everything into memory from a single file at once).  



On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
file => tranform file into a bunch of records

What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf the terasort example), but if your example just loads each file locally, then this may cause problems. Instead, you should load each file into an rdd with context.textFile(), flatmap that and union these rdds.

also see


On 1 December 2014 at 16:50, Keith Simmons <keith@pulse.io> wrote:
This is a long shot, but...

I'm trying to load a bunch of files spread out over hdfs into an RDD, and in most cases it works well, but for a few very large files, I exceed available memory.  My current workflow basically works like this:

context.parallelize(fileNames).flatMap { file =>
  tranform file into a bunch of records
}

I'm wondering if there are any APIs to somehow "flush" the records of a big dataset so I don't have to load them all into memory at once.  I know this doesn't exist, but conceptually:

context.parallelize(fileNames).streamMap { (file, stream) =>
 for every 10K records write records to stream and flush
}

Keith