spark-user mailing list archives

From Jeremy Freeman <freeman.jer...@gmail.com>
Subject Re: Reading a large file (binary) into RDD
Date Thu, 02 Apr 2015 20:46:23 GMT
Hm, that will indeed be trickier because this method assumes records are the same byte size.
Is the file an arbitrary sequence of mixed types, or is there structure, e.g. short, long,
short, long, etc.? 

If you could post a gist with an example of the kind of file and how it should look once read in, that would be useful!
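
In the meantime, here's a rough, untested sketch of what I'd try if the file does turn out to have a repeating layout. It assumes each record is a 2-byte short followed by an 8-byte long (little-endian), and 'mixed.bin' is just a placeholder file name; the pair is still a fixed-size record, so binaryRecords plus struct.unpack applies:

import struct

fmt = '<hq'                          # short followed by long, little-endian; adjust to the real layout
recordsize = struct.calcsize(fmt)    # 2 + 8 = 10 bytes per record

data = sc.binaryRecords('mixed.bin', recordsize)
parsed = data.map(lambda v: struct.unpack(fmt, v))

parsed.first()                       # e.g. (short_value, long_value)

If the layout is more irregular than that, the fixed-record approach won't apply directly, which is why seeing an example file would help.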

-------------------------
jeremyfreeman.net
@thefreemanlab

On Apr 2, 2015, at 2:09 PM, Vijayasarathy Kannan <kvijay@vt.edu> wrote:

> Thanks for the reply. Unfortunately, in my case, the binary file is a mix of short and long integers. Is there any other way that could be of use here?
> 
> My current method happens to have a large overhead (much more than the actual computation time). Also, the driver runs short of memory when it has to read the entire file.
> 
> On Thu, Apr 2, 2015 at 1:44 PM, Jeremy Freeman <freeman.jeremy@gmail.com> wrote:
> If it’s a flat binary file and each record is the same length (in bytes), you can use Spark’s binaryRecords method (defined on the SparkContext), which loads records from one or more large flat binary files into an RDD. Here’s an example in Python to show how it works:
> 
>> # write data from an array
>> from numpy import random
>> dat = random.randn(100,5)
>> f = open('test.bin', 'wb')  # binary mode, since we're writing raw bytes
>> f.write(dat)
>> f.close()
> 
>> # load the data back in
>> from numpy import frombuffer
>> nrecords = 5                      # values per record (the 5 columns of dat)
>> bytesize = 8                      # each value is an 8-byte float64
>> recordsize = nrecords * bytesize  # 40 bytes per record
>> data = sc.binaryRecords('test.bin', recordsize)
>> parsed = data.map(lambda v: frombuffer(buffer(v, 0, recordsize), 'float'))
> 
>> # these should be equal
>> parsed.first()
>> dat[0,:]
> 
> 
> Does that help?
> 
> -------------------------
> jeremyfreeman.net
> @thefreemanlab
> 
>> On Apr 2, 2015, at 1:33 PM, Vijayasarathy Kannan <kvijay@vt.edu> wrote:
>> 
>> What are some efficient ways to read a large file into RDDs?
>> 
>> For example, have several executors read a specific/unique portion of the file and construct RDDs. Is this possible to do in Spark?
>> 
>> Currently, I am doing a line-by-line read of the file at the driver and constructing the RDD.
> 
> 
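
For reference, the driver-side, line-by-line approach described in the original question looks roughly like the sketch below (with a placeholder file name 'large.txt'); the whole file passes through the driver before anything is distributed, which is where the overhead and the driver memory pressure come from:

lines = []
with open('large.txt') as f:          # read entirely at the driver
    for line in f:
        lines.append(line.rstrip('\n'))
rdd = sc.parallelize(lines)           # only now does the data get distributed

binaryRecords, by contrast, has the executors read their own splits of the file, so nothing needs to fit on the driver.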

