From Roger Binns <>
Subject Re: Space (in)efficiency
Date Mon, 11 Jan 2010 01:59:16 GMT
Roger Binns wrote:
> I'll do the requisite experiments this weekend trying to see what has the
> most effect on file size 

And the answer is length.  It is quicker to add documents with sequential
(sorted) _ids.  The length of the _id field affects the final file size, and
the effect appears to be larger than the simple multiple of the _id size
suggested in earlier messages.  Somewhat amusingly, compaction increased file
sizes, and not by a trivial amount either.

To measure this, I wrote a simple Python script that created 65536 documents
with a 4-byte hex _id, then tried again with the _id padded with zeros to 8
and 16 bytes, plus various other permutations.  It is an embarrassingly small
script (and likely just as small in other languages).
[Sorry for not publishing the script - BitBucket and I are having some
mutual hatred issues at the moment.]
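
For anyone wanting to reproduce the setup, here is a minimal sketch of how
such _ids could be generated.  It is not the script mentioned above, which was
not published.  The names make_ids and bulk_payload are hypothetical, the
zeros are assumed to be padded on the left, and shuffling is only one guess at
what the "other permutations" might have been.  The payload has the
{"docs": [...]} shape that CouchDB's _bulk_docs endpoint accepts; actually
sending it to a server is left out.

```python
import json
import random

NUM_DOCS = 65536  # 16**4, so every 4-character hex string is used exactly once


def make_ids(width, shuffle=False, seed=0):
    """Return NUM_DOCS hex _ids, zero-padded on the left to `width` characters.

    Left padding is an assumption; the original post only says "padding".
    """
    ids = [format(i, "x").zfill(width) for i in range(NUM_DOCS)]
    if shuffle:
        # Unsorted insertion order, to compare against sequential _ids.
        random.Random(seed).shuffle(ids)
    return ids


def bulk_payload(ids):
    """Build a JSON request body of the form {"docs": [{"_id": ...}, ...]}."""
    return json.dumps({"docs": [{"_id": doc_id} for doc_id in ids]})


if __name__ == "__main__":
    for width in (4, 8, 16):
        ids = make_ids(width)
        print(width, ids[0], ids[-1], len(bulk_payload(ids)))
```

Loading each payload into a fresh database and comparing the file size before
and after compaction would then repeat the measurement described here, under
the assumptions stated above.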

The relationship between _id size, sparseness, file size and performance is
now best investigated by someone who understands the file format.

I've also started this page to help:


