spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Keith Simmons <ke...@pulse.io>
Subject Re: Loading RDDs in a streaming fashion
Date Tue, 02 Dec 2014 01:19:06 GMT
Actually, I'm working with a binary format.  The api allows reading out a
single record at a time, but I'm not sure how to get those records into
spark (without reading everything into memory from a single file at once).



On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.twigg@gmail.com> wrote:

> file => tranform file into a bunch of records
>
>
> What does this function do exactly? Does it load the file locally?
> Spark supports RDDs exceeding global RAM (cf the terasort example), but if
> your example just loads each file locally, then this may cause problems.
> Instead, you should load each file into an rdd with context.textFile(),
> flatmap that and union these rdds.
>
> also see
>
> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>
>
> On 1 December 2014 at 16:50, Keith Simmons <keith@pulse.io> wrote:
>
>> This is a long shot, but...
>>
>> I'm trying to load a bunch of files spread out over hdfs into an RDD, and
>> in most cases it works well, but for a few very large files, I exceed
>> available memory.  My current workflow basically works like this:
>>
>> context.parallelize(fileNames).flatMap { file =>
>>   tranform file into a bunch of records
>> }
>>
>> I'm wondering if there are any APIs to somehow "flush" the records of a
>> big dataset so I don't have to load them all into memory at once.  I know
>> this doesn't exist, but conceptually:
>>
>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>  for every 10K records write records to stream and flush
>> }
>>
>> Keith
>>
>
>

Mime
View raw message