spark-dev mailing list archives

From Reynold Xin <reyno...@gmail.com>
Subject Re: off-heap RDDs
Date Sun, 25 Aug 2013 23:39:30 GMT
This can be a good idea, especially for large heaps, and the changes to
Spark are potentially fairly small (we'd need to make BlockManager aware of
off-heap size and direct byte buffers in its size accounting). This is
especially attractive if the application can read directly from a byte
buffer without generic serialization (like Shark).
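
As a rough sketch of the accounting change (illustrative only; the names
here are hypothetical, not actual BlockManager code):

  import java.nio.ByteBuffer

  object OffHeapAccounting {
    // Direct buffers live outside the Java heap, so GC-based heap
    // accounting never sees them; their capacity has to be tracked
    // explicitly against an off-heap budget.
    var offHeapBytesUsed: Long = 0L

    def allocateOffHeap(size: Int): ByteBuffer = {
      val buf = ByteBuffer.allocateDirect(size)
      offHeapBytesUsed += buf.capacity()  // count toward the off-heap budget
      buf
    }
  }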

One caveat with off-heap storage in the JVM is that the OS might not be
very good at dealing with tons of small allocations, but this is not really
a big problem here, since RDD partitions are supposed to be large in Spark.

On Sun, Aug 25, 2013 at 3:26 PM, Imran Rashid <imran@therashids.com> wrote:

> Hi,
>
> I was wondering if anyone has thought about putting cached data in an
> RDD into off-heap memory, e.g. with direct byte buffers.  For really
> long-lived RDDs that use a lot of memory, this seems like a huge
> improvement, since all the memory is now totally ignored during GC.
> (And reading data from direct byte buffers is potentially faster as
> well, but that's just a nice bonus.)
>
> The easiest thing to do is to store memory-serialized RDDs in direct
> byte buffers, but I guess we could also store the serialized RDD on
> disk and use a memory-mapped file.  Serializing into off-heap buffers
> is a really simple patch; I just changed a few lines (I haven't done
> any real tests with it yet, though).  But I don't really have a ton of
> experience with off-heap memory, so I thought I would ask what others
> think of the idea, whether it makes sense, and whether there are any
> gotchas I should be aware of, etc.
>
> thanks,
> Imran
>
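
For concreteness, a minimal sketch of the serialize-into-a-direct-buffer
idea Imran describes above, assuming plain Java serialization (the actual
patch may well differ):

  import java.io.{ByteArrayOutputStream, ObjectOutputStream}
  import java.nio.ByteBuffer

  object OffHeapSerialization {
    // Serialize a partition's elements on-heap first, then copy the
    // bytes into an off-heap (direct) buffer that the GC will ignore.
    def serializeOffHeap(values: Iterator[AnyRef]): ByteBuffer = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      values.foreach(oos.writeObject)
      oos.close()
      val bytes = bos.toByteArray
      val buf = ByteBuffer.allocateDirect(bytes.length)
      buf.put(bytes)
      buf.flip()  // rewind so readers start at position 0
      buf
    }
  }

The memory-mapped-file variant would instead write the serialized bytes to
disk and map them with FileChannel.map(MapMode.READ_ONLY, ...), which
likewise keeps the data off the Java heap.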
