spark-user mailing list archives

From Keith Simmons <ke...@pulse.io>
Subject Re: Loading RDDs in a streaming fashion
Date Tue, 02 Dec 2014 02:40:20 GMT
Yep, that's definitely possible.  It's one of the workarounds I was
considering.  I was just curious if there was a simpler (and perhaps more
efficient) approach.

Keith
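[Editor's note] One candidate for the "simpler approach" asked about here (an assumption, not something confirmed in the thread): Scala's `flatMap` accepts any `TraversableOnce`, so the function passed to it can return a lazy `Iterator`, and records are then pulled one at a time rather than materialized per file. A minimal plain-Scala sketch of that laziness (no Spark involved; the file names and record counts are made up):

```scala
var produced = 0

// Simulated per-file record reader that yields records on demand,
// standing in for a real binary-format reader.
def recordsOf(file: String): Iterator[String] =
  Iterator.tabulate(3) { i => produced += 1; s"$file-rec$i" }

val files = Iterator("a.bin", "b.bin")
val records = files.flatMap(recordsOf) // nothing has been read yet

val first = records.next() // pulls exactly one record into memory
println(s"first=$first, produced so far=$produced")
```

Because the same `flatMap` signature applies to RDDs, an `Iterator`-returning function should keep per-task memory bounded by one record (plus buffering), not one file.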

On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.twigg@gmail.com> wrote:

> Could you modify your function so that it streams through the files record
> by record and outputs them to hdfs, then read them all in as RDDs and take
> the union? That would only use bounded memory.
>
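[Editor's note] The spill-then-union idea above can be sketched in plain Scala, using local temp files in place of HDFS (the chunk size, file naming, and newline-delimited record format are all assumptions for illustration):

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

// Spill: stream through the records, writing each fixed-size chunk to its
// own file, so only one chunk is ever held in memory at a time.
def spill(records: Iterator[String], chunkSize: Int): Seq[Path] =
  records.grouped(chunkSize).map { chunk =>
    val p = Files.createTempFile("chunk-", ".txt")
    Files.write(p, chunk.mkString("\n").getBytes("UTF-8"))
    p
  }.toList

val chunks = spill(Iterator.tabulate(10)(i => s"rec$i"), chunkSize = 4)

// "Union": read every chunk back and concatenate. In real Spark this would
// be sc.union over per-chunk RDDs rather than a local flatMap.
val union = chunks.flatMap(p => Source.fromFile(p.toFile, "UTF-8").getLines().toList)
println(s"chunks=${chunks.size}, records=${union.size}")
```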
> On 1 December 2014 at 17:19, Keith Simmons <keith@pulse.io> wrote:
>
>> Actually, I'm working with a binary format.  The API allows reading out a
>> single record at a time, but I'm not sure how to get those records into
>> spark (without reading everything into memory from a single file at once).
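>>
>> [Editor's note] A hypothetical sketch (not from the thread) of getting such an API into a `flatMap`: wrap the record-at-a-time reader in an `Iterator`, so no file is ever fully in memory. The length-prefixed byte format below is invented purely for illustration:
>>
>> ```scala
>> import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
>>
>> // Build some length-prefixed test data in memory.
>> def writeRecords(records: Seq[String]): Array[Byte] = {
>>   val bos = new ByteArrayOutputStream()
>>   val out = new DataOutputStream(bos)
>>   records.foreach { r =>
>>     val b = r.getBytes("UTF-8")
>>     out.writeInt(b.length); out.write(b)
>>   }
>>   out.flush(); bos.toByteArray
>> }
>>
>> // Adapt a "read one record at a time" stream into a lazy Iterator.
>> def readRecords(in: DataInputStream): Iterator[String] =
>>   Iterator.continually {
>>     if (in.available() > 0) {
>>       val b = new Array[Byte](in.readInt())
>>       in.readFully(b)
>>       Some(new String(b, "UTF-8"))
>>     } else None
>>   }.takeWhile(_.isDefined).map(_.get)
>>
>> val bytes = writeRecords(Seq("r1", "r2", "r3"))
>> val recs = readRecords(new DataInputStream(new ByteArrayInputStream(bytes))).toList
>> println(recs)
>> ```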
>>
>>
>>
>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
>>
>>> file => transform file into a bunch of records
>>>
>>>
>>> What does this function do exactly? Does it load the file locally?
>>> Spark supports RDDs exceeding global RAM (cf the terasort example), but
>>> if your example just loads each file locally, then this may cause problems.
>>> Instead, you should load each file into an RDD with context.textFile(),
>>> flatMap that, and union these RDDs.
>>>
>>> also see
>>>
>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>>>
>>>
>>> On 1 December 2014 at 16:50, Keith Simmons <keith@pulse.io> wrote:
>>>
>>>> This is a long shot, but...
>>>>
>>>> I'm trying to load a bunch of files spread out over hdfs into an RDD,
>>>> and in most cases it works well, but for a few very large files, I exceed
>>>> available memory.  My current workflow basically works like this:
>>>>
>>>> context.parallelize(fileNames).flatMap { file =>
>>>>   transform file into a bunch of records
>>>> }
>>>>
>>>> I'm wondering if there are any APIs to somehow "flush" the records of a
>>>> big dataset so I don't have to load them all into memory at once.  I know
>>>> this doesn't exist, but conceptually:
>>>>
>>>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>>>  for every 10K records write records to stream and flush
>>>> }
>>>>
>>>> Keith
>>>>
>>>
>>>
>>
>
