couchdb-dev mailing list archives

From Roger Binns <rog...@rogerbinns.com>
Subject Re: Space (in)efficiency
Date Fri, 08 Jan 2010 01:31:41 GMT
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Robert Newson wrote:
> The sequential uuids mentioned above are a configuration option. Since
> it's implicit in what you've said, you must be allowing couchdb to
> generate uuids on the server-side,

Sorry if I wasn't clear, but your statement above is the complete opposite of
what is happening.

My objects have references to each other (circular and recursive) and I have
a script that generates JSON for them, one per line.  A separate script
imports those into CouchDB.

My generation script also generates the _id as it is referenced by other
items that have been or will be emitted.  The algorithm I used to generate
the _id was the same as CouchDB's default - 16 random hex digits.  That also
resulted in a massive database.  Switching my algorithm to generate 4
"digits" resulted in a significantly smaller database.
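
For illustration, a minimal sketch of the kind of generation script described
above. The names make_ids and emit_docs, the "next" field, and the use of hex
digits for the short ids are all assumptions made for the example, not the
actual script:

```python
import json
import secrets

def make_ids(n_docs, n_hex_digits):
    """Return n_docs distinct random ids, each n_hex_digits hex characters long."""
    if n_docs > 16 ** n_hex_digits:
        raise ValueError("not enough distinct ids of that length")
    ids = set()
    while len(ids) < n_docs:
        # token_hex(k) returns 2*k hex characters, so over-generate and trim
        ids.add(secrets.token_hex((n_hex_digits + 1) // 2)[:n_hex_digits])
    return list(ids)

def emit_docs(n_docs, n_hex_digits):
    """Yield one JSON document per line; each document references another by _id."""
    ids = make_ids(n_docs, n_hex_digits)
    for i, doc_id in enumerate(ids):
        # The ids are chosen up front, so a document can refer to one that
        # has not been emitted yet; the last one points back to the first.
        doc = {"_id": doc_id, "next": ids[(i + 1) % n_docs]}
        yield json.dumps(doc)

if __name__ == "__main__":
    for line in emit_docs(5, 4):
        print(line)
```

Changing the second argument from 16 to 4 is the only difference between the
two runs being compared; the document bodies stay the same size apart from the
ids themselves.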

The bottom line is that CouchDB file size is very dependent on the size of
the _id.  Not only that, the file size seems to grow much faster than
linearly with the _id size.

That information does not appear to be documented anywhere, nor does it seem
to be a "good thing".

> The cause, if you're interested, is simply that the low locality of
> reference between identifiers causes lots more btree nodes to be
> updated on each insert, which increases the size of the file and
> requires more effort to traverse during compaction.

That still doesn't explain the major difference in file sizes, especially
post compaction, which is what I was measuring.  Even better, how about a
formula to describe how big the database should be?  It appears to be
something like:

  size = nrecords * (avg_record_size + len_id ^ 3)

The power is probably based on the log of len_id, but in any event the
formula shows just how dramatically database size can increase.
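
To make the scaling concrete, the guessed formula can be evaluated directly.
This is only a sketch of the empirical guess above, not a description of how
CouchDB actually lays out its file; the record count and average record size
are made-up inputs, and ^ is written as ** in Python:

```python
def estimated_size(nrecords, avg_record_size, len_id, power=3):
    """Evaluate the guessed formula: nrecords * (avg_record_size + len_id ** power)."""
    return nrecords * (avg_record_size + len_id ** power)

if __name__ == "__main__":
    # Hypothetical workload: one million records of about 100 bytes each.
    nrecords, avg_record_size = 1_000_000, 100
    for len_id in (4, 16):
        size = estimated_size(nrecords, avg_record_size, len_id)
        print(f"len_id={len_id:2d}  estimated size={size:,} bytes")
```

Under these assumed inputs the _id term is smaller than the record size at 4
characters and dwarfs it at 16, which is the kind of jump described above.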

> I saw similar amounts of "bloat", which is why I contributed the
> original sequential uuids patch some months ago. The uuid's generated
> by "sequential" (the default is called "random") are still very
> unlikely to collide but are much kinder to the btree algorithm.

You have done what I did - addressed the symptoms rather than the cause :-)

> Finally, for large documents, or modest amounts of attachments, this
> bloat, even with random 128-bit uuid's, is much reduced.

Are you saying that how different the uuids are from each other, rather than
their size, is the biggest determinant of space consumption?

Roger
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAktGiv0ACgkQmOOfHg372QRMRwCgryfiGdKLma7qjnGmJOwAEpcR
Io4AnjcSewuZjiFnKnrecxaHgWvnHOyL
=PTkC
-----END PGP SIGNATURE-----

