spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andy Twigg <andy.tw...@gmail.com>
Subject Re: Loading RDDs in a streaming fashion
Date Tue, 02 Dec 2014 05:41:22 GMT
You may be able to construct RDDs directly from an iterator - not sure
- you may have to subclass your own.

On 1 December 2014 at 18:40, Keith Simmons <keith@pulse.io> wrote:
> Yep, that's definitely possible.  It's one of the workarounds I was
> considering.  I was just curious if there was a simpler (and perhaps more
> efficient) approach.
>
> Keith
>
> On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
>>
>> Could you modify your function so that it streams through the files record
>> by record and outputs them to hdfs, then read them all in as RDDs and take
>> the union? That would only use bounded memory.
>>
>> On 1 December 2014 at 17:19, Keith Simmons <keith@pulse.io> wrote:
>>>
>>> Actually, I'm working with a binary format.  The api allows reading out a
>>> single record at a time, but I'm not sure how to get those records into
>>> spark (without reading everything into memory from a single file at once).
>>>
>>>
>>>
>>> On Mon, Dec 1, 2014 at 5:07 PM, Andy Twigg <andy.twigg@gmail.com> wrote:
>>>>>
>>>>> file => tranform file into a bunch of records
>>>>
>>>>
>>>> What does this function do exactly? Does it load the file locally?
>>>> Spark supports RDDs exceeding global RAM (cf the terasort example), but
>>>> if your example just loads each file locally, then this may cause problems.
>>>> Instead, you should load each file into an rdd with context.textFile(),
>>>> flatmap that and union these rdds.
>>>>
>>>> also see
>>>>
>>>> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>>>>
>>>>
>>>> On 1 December 2014 at 16:50, Keith Simmons <keith@pulse.io> wrote:
>>>>>
>>>>> This is a long shot, but...
>>>>>
>>>>> I'm trying to load a bunch of files spread out over hdfs into an RDD,
>>>>> and in most cases it works well, but for a few very large files, I exceed
>>>>> available memory.  My current workflow basically works like this:
>>>>>
>>>>> context.parallelize(fileNames).flatMap { file =>
>>>>>   tranform file into a bunch of records
>>>>> }
>>>>>
>>>>> I'm wondering if there are any APIs to somehow "flush" the records of
a
>>>>> big dataset so I don't have to load them all into memory at once.  I
know
>>>>> this doesn't exist, but conceptually:
>>>>>
>>>>> context.parallelize(fileNames).streamMap { (file, stream) =>
>>>>>  for every 10K records write records to stream and flush
>>>>> }
>>>>>
>>>>> Keith
>>>>
>>>>
>>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Mime
View raw message