couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Anderson <jch...@apache.org>
Subject Re: Space (in)efficiency
Date Fri, 08 Jan 2010 06:17:48 GMT
On Thu, Jan 7, 2010 at 5:31 PM, Roger Binns <rogerb@rogerbinns.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Robert Newson wrote:
>> The sequential uuids mentioned above are a configuration option. Since
>> it's implicit in what you've said, you must be allowing couchdb to
>> generate uuids on the server-side,
>
> Sorry if I wasn't clear but your statement above is the complete opposite of
> what is happening.
>
> My objects have references to each other (circular and recursive) and I have
> a script that generates JSON for them, one per line.  A separate script
> imports those into CouchDB.
>
> My generation script also generates the _id as it is referenced by other
> items that have been or will be emitted.  The algorithm I used to generate
> the _id was the same as CouchDB's default - 16 random hex digits.  That also
> resulted in a massive database.  Switching my algorithm to generate 4
> "digits" resulted in a significantly smaller database.
>
> The bottom line is that CouchDB file size is very dependent on the size of
> the _id.  Not only that, it seems to have an exponential factor of the _id size.
>
> That information does not appear to be documented anywhere, nor does it seem
> to be a "good thing".
>
>> The cause, if you're interested, is simply that the low locality of
>> reference between identifiers causes lots more btree nodes to be
>> updated on each insert, which increases the size of the file and
>> requires more effort to traverse during compaction.
>
> That still doesn't explain the major difference in file sizes, especially
> post compaction which is what I was measuring.  Even better how about a
> formula to describe how big the database should be?  It appears to be
> something like:
>
>  size = nrecords * (avg_record_size + len_id ^ 3)
>
> The power is probably based on log of the len_id, but in any event shows
> just how dramatically database size can increase.
>
>> I saw similar amounts of "bloat", which is why I contributed the
>> original sequential uuids patch some months ago. The uuid's generated
>> by "sequential" (the default is called "random") are still very
>> unlikely to collide but are much kinder to the btree algorithm.
>
> You have done what I did - addressed the symptoms rather than the cause :-)
>
>> Finally, for large documents, or modest amounts of attachments, this
>> bloat, even with random 128-bit uuid's, is much reduced.
>
> Are you saying that how different the uuids are to each other is the biggest
> determinant of space consumption rather than their size?

Roger,

We've seen lots of cases where switching from random to sequential
uuids made big differences in insert performance and file size.

It's not surprising that shorter ids also made a similar difference.
If we wanted to experiment more, it'd be interesting to see the
difference between long and short sequential ids, as well as between
short random and short sequential.

If inserts of nearby keys are done close together in time, you're more
likely to have cache hits all the way down the memory hierarchy.

Chris

>
> Roger
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.9 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
>
> iEYEARECAAYFAktGiv0ACgkQmOOfHg372QRMRwCgryfiGdKLma7qjnGmJOwAEpcR
> Io4AnjcSewuZjiFnKnrecxaHgWvnHOyL
> =PTkC
> -----END PGP SIGNATURE-----
>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Mime
View raw message