spark-user mailing list archives

From Keith Simmons <ke...@pulse.io>
Subject Loading RDDs in a streaming fashion
Date Tue, 02 Dec 2014 00:50:12 GMT
This is a long shot, but...

I'm trying to load a bunch of files spread out over HDFS into an RDD. In most
cases this works well, but for a few very large files I exceed available
memory.  My current workflow basically works like this:

context.parallelize(fileNames).flatMap { file =>
  // transform the file into a bunch of records
}
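
To make that concrete, here's roughly the shape of the current code (Record,
parseRecords, and the paths are placeholders standing in for our actual
transform):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.Source

// Placeholder record type standing in for "a bunch of records".
case class Record(fields: Array[String])

// Placeholder parser: reads one HDFS file and materializes all of its records.
def parseRecords(path: String): Seq[Record] = {
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path(path))
  try Source.fromInputStream(in).getLines().map(line => Record(line.split('\t'))).toVector
  finally in.close()
}

val context = new SparkContext(new SparkConf().setAppName("load-files"))
val fileNames = Seq("hdfs:///data/big-file-1", "hdfs:///data/big-file-2")

// Each file is fully materialized as a Seq[Record] before Spark sees any of it,
// which is where the very large files blow past available memory.
val records = context.parallelize(fileNames).flatMap(parseRecords)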

I'm wondering if there's any API that would let me "flush" the records of a big
file as they're produced, so I don't have to hold them all in memory at once.  I
know nothing like this exists today, but conceptually:

context.parallelize(fileNames).streamMap { (file, stream) =>
  // for every 10K records, write them to the stream and flush
}
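
The closest approximation I can think of is handing flatMap a lazy Iterator
(continuing the sketch above, same imports, context, and fileNames, and returning
raw Array[String] fields to keep it short), so records are produced incrementally
as Spark consumes them, though it still doesn't give explicit flush points:

// Hypothetical helper that yields records lazily instead of materializing them.
def recordIterator(path: String): Iterator[Array[String]] = {
  val fs = FileSystem.get(new Configuration())
  val in = fs.open(new Path(path))
  // getLines() is lazy: records are produced only as the iterator is consumed
  // (closing the stream once the iterator is exhausted is glossed over here)
  Source.fromInputStream(in).getLines().map(_.split('\t'))
}

// flatMap accepts any TraversableOnce, so an Iterator works: records stream
// through the transformation instead of being held in memory per file.
val records = context.parallelize(fileNames).flatMap(recordIterator)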

Keith
