This is a long shot, but...
I'm trying to load a bunch of files spread across HDFS into an RDD. In most
cases it works well, but for a few very large files I exceed available
memory. My current workflow basically looks like this:
context.parallelize(fileNames).flatMap { file =>
  // transform the file into a bunch of records
}
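
To make the memory problem concrete, the per-file transform amounts to
something like the sketch below. (The details are purely illustrative: I'm
pretending each record is just a line of text and that fileNames holds plain
HDFS paths.) Both the raw file contents and the full record collection for a
file end up in memory at the same time:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

context.parallelize(fileNames).flatMap { file =>
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path(file))
  // Read the whole file into a single String, then split it into "records".
  // For a very large file this is where memory runs out.
  val contents = try scala.io.Source.fromInputStream(in).mkString finally in.close()
  contents.split("\n").toSeq
}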
I'm wondering if there are any APIs to somehow "flush" the records of a big
dataset so I don't have to load them all into memory at once. I know this
doesn't exist, but conceptually:
context.parallelize(fileNames).streamMap { (file, stream) =>
  // for every 10K records, write them to the stream and flush
}
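
The closest approximation I can see with the existing API is to have flatMap
return a lazy Iterator instead of a fully built collection, so records are
pulled one at a time as Spark consumes them. A minimal sketch, again assuming
line-based records and the same Hadoop imports as above (and glossing over
closing the stream):

context.parallelize(fileNames).flatMap { file =>
  // Build the Configuration inside the closure so nothing non-serializable
  // is captured from the driver.
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path(file))
  // getLines() is a lazy Iterator[String]: records are produced on demand
  // rather than materialized per file. (Closing the stream once the
  // iterator is exhausted is omitted here for brevity.)
  scala.io.Source.fromInputStream(in).getLines()
}

But I'm not sure whether that actually keeps memory bounded end to end, or
whether there's a better API for this.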
Keith