cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From onlinespending <>
Subject Inefficiency with large set of small documents?
Date Mon, 25 Nov 2013 21:18:49 GMT
I’m trying to decide what noSQL database to use, and I’ve certainly decided against mongodb
due to its use of mmap. I’m wondering if Cassandra would also suffer from a similar inefficiency
with small documents. In mongodb, if you have a large set of small documents (each much less
than the 4KB page size) you will require far more RAM to fit your working set into memory,
since a large percentage of a 4KB chunk could very easily include infrequently accessed data
outside of your working set. Cassandra doesn’t use mmap, but it would still have to intelligently
discard the excess data that does not pertain to a small document that exists in the same
allocation unit on the hard disk when reading it into RAM. As an example lets say your cluster
size is 4KB as well, and you have 1000 small 256 byte documents that are scattered on the
disk that you want to fetch on a given query (the total number of documents is over 1 billion).
I want to make sure it only consumes roughly 256,000 bytes for those 1000 documents and not
4,096,000 bytes. When it first fetches a cluster from disk it may consume 4KB of cache, but
it should ultimately only ideally consume the relevant amount of bytes in RAM. If Cassandra
just indiscriminately uses RAM in 4KB blocks than that is unacceptable to me, because if my
working set at any given time is just 20% of my huge collection of small sized documents,
I don’t want to have to use servers with 5X as much RAM. That’s a huge expense.


P.S. Here’s a detailed post I made this morning in the mongodb user group about this topic.

People have often complained that because mongodb memory maps everything and leaves memory
management to the OS's virtual memory system, the swapping algorithm isn't optimized for database
usage. I disagree with this. For the most part, the swapping or paging algorithm itself can't
be much better than the sophisticated algorithms (such as LRU based ones) that OSes have refined
over many years. Why reinvent the wheel? Yes, you could potentially ensure that certain data
(such as the indexes) never get swapped out to disk, because even if they haven't been accessed
recently the cost of reading them back into memory will be too costly when they are in fact
needed. But that's not the bigger issue.

It breaks down with small documents << than page size

This is where using virtual memory for everything really becomes an issue. Suppose you've
got a bunch of really tiny documents (e.g. ~256 bytes) that are much smaller than the virtual
memory page size (e.g. 4KB). Now let's say that you've determined that your working set (e.g.
those documents in your collection that constitute say 99% of those accessed in a given hour)
to be 20GB. But your entire collection size is actually 100GB (it's just that 20% of your
documents are much much likely to be accessed in a given time period. It's not uncommon that
a small minority of documents will be accessed a large majority of the time). If your collection
is randomly distributed (such as would happen if you simply inserted new documents into your
collection) then in this example only about 20% of the documents that fit onto a 4KB page
will be part of the working set (i.e. the data that you need frequent access to at the moment).
The rest of the data will be made up of much less frequently accessed documents, that should
ideally be sitting on disk. So there's a huge inefficiency here. 80% of the data that is in
RAM is not even something I need to frequently access. In this example, I would need 5X the
amount of RAM to accommodate my working set.

Now, as a solution to this problem, you could separate your documents into two (or even a
few) collections with the grouping done by access frequency. The problem with this, is that
your working set can often change as a function of time of day and day of week. If your application
is global, your working set will be far different during 12pm local in NY vs 12pm local in
Tokyo. But more even more likely is that the working set is constantly changing as new data
is inserted into the database. Popularity of a document is often viral. As an example, a photo
that's posted on a social network may start off infrequently accessed but then quickly after
hundreds of "likes" could become very frequently accessed and part of your working set. You'd
need to actively monitor your documents and manually move a document from one collection to
the other, which is very inefficient.

Quite frankly this is not a burden that should be placed on the user anyways. By punting the
problem of memory management to the OS, mongodb requires the user to essentially do its job
and group data in a way that patches the inefficiencies in its memory management. As far as
I'm concerned, not until mongodb steps up and takes control of memory management can it be
taken seriously for very large datasets that often require many small documents with ever
changing working sets.

View raw message