spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Twigg <andy.tw...@gmail.com>
Subject Re: Loading RDDs in a streaming fashion
Date Tue, 02 Dec 2014 01:07:34 GMT
>
> file => tranform file into a bunch of records


What does this function do exactly? Does it load the file locally?
Spark supports RDDs exceeding global RAM (cf the terasort example), but if
your example just loads each file locally, then this may cause problems.
Instead, you should load each file into an rdd with context.textFile(),
flatmap that and union these rdds.

also see
http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files


On 1 December 2014 at 16:50, Keith Simmons <keith@pulse.io> wrote:

> This is a long shot, but...
>
> I'm trying to load a bunch of files spread out over hdfs into an RDD, and
> in most cases it works well, but for a few very large files, I exceed
> available memory.  My current workflow basically works like this:
>
> context.parallelize(fileNames).flatMap { file =>
>   tranform file into a bunch of records
> }
>
> I'm wondering if there are any APIs to somehow "flush" the records of a
> big dataset so I don't have to load them all into memory at once.  I know
> this doesn't exist, but conceptually:
>
> context.parallelize(fileNames).streamMap { (file, stream) =>
>  for every 10K records write records to stream and flush
> }
>
> Keith
>

Mime
View raw message