spark-user mailing list archives

From Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
Subject Re: In Memory Caching blowing up the size
Date Fri, 07 Feb 2014 21:51:45 GMT
Strings are pretty bad in terms of blowup - you can check the
SizeEstimatorTest to get an idea of how much things will blow up. For
example, the string "abcdefgh" takes 72 bytes on a 64-bit architecture.
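
A quick sketch of how to check this yourself with SizeEstimator (note: in
some Spark versions the class is private[spark], so you may need to put
this under the org.apache.spark package):

    import org.apache.spark.util.SizeEstimator

    // Small strings are dominated by JVM overhead: the String object
    // header and fields plus the backing char[] header and padding.
    println(SizeEstimator.estimate("abcdefgh"))           // ~72 bytes on a 64-bit JVM
    println(SizeEstimator.estimate(("a" * 25, "b" * 10))) // a Tuple2 of two strings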

Shivaram

On Fri, Feb 7, 2014 at 1:45 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
> Have you tried caching the RDD in memory with serialization? How are you
> measuring the in-memory size?
>
> In general, a blowup of 2-3x for small rows would be expected, but 10x
> does seem excessive.
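>
> A minimal sketch of what I mean by caching with serialization (assuming
> an existing RDD named rdd; names illustrative):
>
>     import org.apache.spark.storage.StorageLevel
>
>     // Store the partitions as serialized byte arrays instead of
>     // deserialized Java objects, trading CPU time for memory.
>     val cached = rdd.persist(StorageLevel.MEMORY_ONLY_SER)
>     cached.count()  // force materialization; check the Storage tab of the web UI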
>
>
> On Fri, Feb 7, 2014 at 12:38 PM, Vipul Pandey <vipandey@gmail.com> wrote:
>>
>> Hi,
>>
>> I have a very small dataset that I need to join with some bigger ones
>> later. The data is around 75M in size on disk. When I load it and transform
>> it a little, it generates an RDD[(String,String)] where the first string is
>> on average 25 chars long and the second one is about 10.
>>
>> Now :
>> - When I save this new RDD as a file on HDFS, the output file size is
>> around 70M
>> - When I cache it on disk with java serialization, the size in memory is
>> around 55M.
>> - But, when I cache this RDD in memory without any serialization, the
>> cached size is 700M (??)
>>
>> Any idea why it is bloating up by a factor of 10? What's a typical size
>> factor by which uncompressed input grows when cached in memory?
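>>
>> For concreteness, a rough sketch of the three measurements (paths and
>> names illustrative; an RDD's storage level can only be set once, so each
>> persist below comes from a fresh run):
>>
>>     import org.apache.spark.storage.StorageLevel
>>
>>     // pairs stands in for the transformed RDD[(String, String)] above;
>>     // dummy data here just to keep the sketch self-contained
>>     val pairs = sc.parallelize(Seq.fill(1000000)(("a" * 25, "b" * 10)))
>>
>>     pairs.saveAsTextFile("hdfs:///tmp/pairs")         // ~70M on HDFS
>>     pairs.persist(StorageLevel.DISK_ONLY).count()     // serialized on disk: ~55M
>>     // separate run:
>>     // pairs.persist(StorageLevel.MEMORY_ONLY).count()  // deserialized in memory: ~700M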
>>
>> Vipul
>>
>
