We had a visitor from Intel a month ago. 

One question from him was "What could you do if we gave you a server 2 years from now that had 16TB of memory?"

I went.... Eh... using Java....?

2 years is maybe unrealistic, but you can already get quite acceptable prices on servers in the 100GB memory range if you buy in larger quantities (30-50 servers or more in one go). 

I don't think it is unrealistic that we will start seeing high-end consumer (x64) servers with TBs of memory in a few years, and I really wonder where that puts Java-based software.

Terje

On Fri, Jul 1, 2011 at 2:25 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:


On Thu, Jun 30, 2011 at 12:44 PM, Daniel Doubleday <daniel.doubleday@gmx.net> wrote:
Hi all - or rather devs

we have been working on an alternative implementation to the existing row cache(s)

We have 2 main goals:

- Decrease memory -> get more rows in the cache without suffering a huge performance penalty
- Reduce gc pressure

This sounds a lot like we should be using the new serializing cache in 0.8.
Unfortunately our workload consists of loads of updates which would invalidate the cache all the time.

The second unfortunate thing is that the idea we came up with doesn't fit the new cache provider api...

It looks like this:

Like the serializing cache, we basically only cache the serialized byte buffer. We don't serialize the bloom filter, and we try to do some other minor compression tricks (varints etc., not done yet). The main difference is that we don't deserialize the row up front but use the normal sstable iterators and filters as in the regular uncached case.
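As a side note on the varint trick mentioned above (not yet done in our patch): the idea is the standard base-128 variable-length encoding, where small values take fewer bytes, which helps shrink the serialized rows. A minimal sketch (this is illustrative, not Cassandra's actual vint format):

```java
import java.io.*;

public class VarInt {
    // Encode an int in 1-5 bytes: low 7 bits per byte, high bit = "more follows".
    static void write(int value, OutputStream out) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // low 7 bits, continuation bit set
            value >>>= 7;
        }
        out.write(value); // final byte, continuation bit clear
    }

    // Decode the encoding produced by write().
    static int read(InputStream in) throws IOException {
        int value = 0, shift = 0, b;
        while (((b = in.read()) & 0x80) != 0) {
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value | (b << shift);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        write(300, buf); // 300 fits in 2 bytes instead of 4
        System.out.println(buf.size());
        System.out.println(read(new ByteArrayInputStream(buf.toByteArray())));
    }
}
```

For column names/values that are mostly small, this kind of encoding is where a good chunk of the remaining savings would come from.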

So the read path looks like this:

return filter.collectCollatedColumns(memtable iter, cached row iter)

The write path is not affected; it does not update the cache.

During flush we merge all memtable updates with the cached rows.
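To make the scheme above concrete, here is a minimal sketch of the two moving parts: reads collate the memtable iterator with columns read out of the cached serialized row, and flush folds the memtable updates back into the cached bytes instead of invalidating them. All names (Column, collate, mergeOnFlush) and the toy string serialization are illustrative, not Cassandra's actual classes or format:

```java
import java.util.*;

public class SerializedRowCacheSketch {
    record Column(String name, String value) {}

    // The cache stores rows as opaque bytes, as in the serializing cache.
    // A trivial "name=value;" string encoding stands in for real serialization.
    static byte[] serialize(Collection<Column> cols) {
        StringBuilder sb = new StringBuilder();
        for (Column c : cols) sb.append(c.name()).append('=').append(c.value()).append(';');
        return sb.toString().getBytes();
    }

    static List<Column> deserialize(byte[] bytes) {
        List<Column> cols = new ArrayList<>();
        for (String part : new String(bytes).split(";"))
            if (!part.isEmpty()) {
                String[] kv = part.split("=", 2);
                cols.add(new Column(kv[0], kv[1]));
            }
        return cols;
    }

    // Read path: collate memtable columns with cached columns; the memtable
    // wins on conflict because it holds the newer, not-yet-flushed updates.
    static List<Column> collate(Iterator<Column> memtable, Iterator<Column> cached) {
        TreeMap<String, Column> merged = new TreeMap<>();
        cached.forEachRemaining(c -> merged.put(c.name(), c));
        memtable.forEachRemaining(c -> merged.put(c.name(), c)); // newer overwrites
        return new ArrayList<>(merged.values());
    }

    // Flush path: fold the memtable updates into the cached serialized row,
    // so the cache stays valid across flushes instead of being invalidated.
    static byte[] mergeOnFlush(List<Column> memtable, byte[] cachedRow) {
        return serialize(collate(memtable.iterator(), deserialize(cachedRow).iterator()));
    }

    public static void main(String[] args) {
        byte[] cached = serialize(List.of(new Column("a", "old"), new Column("b", "1")));
        List<Column> updates = List.of(new Column("a", "new"));
        System.out.println(deserialize(mergeOnFlush(updates, cached)));
    }
}
```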

These are early test results:

- Depending on row width and value size, the serialized cache takes between 30% and 50% of the memory of the cached CF. This might be optimized further.
- Read times increase by 5 - 10%

We haven't tested the effects on GC but hope to see improvements there, because we only cache a fraction of the objects (in terms of numbers) in the old-gen heap, which should make GC cheaper. Of course there's also the option to use native memory like the serializing cache does.

We believe that this approach is quite promising but as I said it is not compatible with the current cache api.

So my question is: does that sound interesting enough to open a JIRA, or has that idea already been considered and rejected for some reason?

Cheers,
Daniel


The problem I see with the row cache implementation is more of a JVM problem. This problem is not Cassandra-specific (IMHO), as I hear HBase people have similar large cache / Xmx issues. Personally, I feel this is a sign of Java showing its age. "Let us worry about the pointers" was a great solution when systems had 32MB of memory, because the cost of walking the object graph was small and possible in small time windows. But JVMs already cannot handle 13+ GB of heap, and it is quite common to see systems with 32-64GB of physical memory. I am very curious to see how Java is going to evolve on systems with 128GB or even more memory.

The G1 collector will help somewhat, however I do not see it really pushing Xmx much higher than it is now. HBase has even gone the route of using an off-heap cache, https://issues.apache.org/jira/browse/HBASE-4018 , and some JIRA tickets mention Cassandra exploring this alternative as well.
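The principle behind those off-heap caches can be sketched with a direct ByteBuffer: the cached payload lives outside the Java heap, so the GC never has to walk it. The real implementations (Cassandra's serializing cache, the HBASE-4018 work) use their own allocators rather than this simplistic map-of-direct-buffers, so treat this only as an illustration of the idea:

```java
import java.nio.ByteBuffer;
import java.util.*;

public class OffHeapCacheSketch {
    // Keys and buffer handles stay on-heap, but they are tiny; the row
    // payload itself is allocated in native memory.
    private final Map<String, ByteBuffer> cache = new HashMap<>();

    void put(String key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocateDirect(value.length); // off-heap allocation
        buf.put(value).flip();
        cache.put(key, buf);
    }

    byte[] get(String key) {
        ByteBuffer buf = cache.get(key);
        if (buf == null) return null;
        byte[] copy = new byte[buf.remaining()];
        buf.duplicate().get(copy); // copy on-heap only for the duration of the read
        return copy;
    }

    public static void main(String[] args) {
        OffHeapCacheSketch cache = new OffHeapCacheSketch();
        cache.put("row1", "serialized-row-bytes".getBytes());
        System.out.println(new String(cache.get("row1")));
    }
}
```

The trade-off is the serialize/copy cost on every access, which is exactly why shrinking the serialized representation matters so much.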

Doing whatever is possible to shrink the current size of items in the cache is awesome. Anything that delivers more bang for the buck is +1. However, I feel that the VFS cache is the only way to effectively cache large datasets. I was quite disappointed when I upped a machine from 16GB to 48GB of physical memory. I said to myself, "Awesome! Now I can shave off a couple of GB for larger row caches." I changed Xmx from 9GB to 13GB, upped the caches, and restarted. I found the system spending a lot of time managing the heap, and also found that my compaction processes that used to do 200GB in 4 hours were now taking 6 or 8 hours.

I had heard that JVMs "top out around 20GB", but I found they "top out" much lower. VFS cache +1.