incubator-couchdb-user mailing list archives

From Randall Leeds <randall.le...@gmail.com>
Subject Re: couchdb disk storage format - why so large overhead?
Date Wed, 28 Dec 2011 18:08:08 GMT
On Wed, Dec 28, 2011 at 11:11, Alexey Loshkarev <elf2001@gmail.com> wrote:
> Hello.
>
> I have been using CouchDB for over two years for our company's internal
> projects. It's a very good and reliable database, and I'm mostly
> satisfied with it.
>
> But CouchDB's on-disk size makes me cry. Let me describe the problem.
>
> My new project must store and manipulate simple documents (15-20
> integer/float/string fields, without attachments).
> The target document count may vary between 50M and 500M. We are using
> SSDs for the database now and need to count every gigabyte.
> Currently the project data is stored in MySQL.
> I know why the MySQL data is so compact: the data file contains only
> the data itself, not types and column names.
> But the CouchDB database carries a lot of disk overhead.
>
> Some examples:
>
> I have a snippet of the data (900K rows). The average row length is
> 200 bytes, so the total data size (on disk) is about 190MB.
>
> I imported all of this data into CouchDB and found that it occupies
> 800MB (4x more than MySQL). It was a bulk insert with incrementing
> keys, and the database was compacted after the import.
> I tried shortening field names from 8-10 characters to 1-2, with
> almost no effect.
> My data consists of Unicode strings. I realized that the Erlang
> external term format takes 5 bytes for every Unicode character
> (instead of 1-4 bytes for UTF-8), so I converted my Unicode characters
> to ASCII (just transliterating Cyrillic symbols, one Unicode symbol to
> one ASCII equivalent).
> The result: almost no difference.
>
> Then I tried to calculate the sum of the document sizes.
> I wrote an Erlang view:
>
> fun({Doc}) ->
>    Emit(<<"raw">>, size(term_to_binary(Doc))),
>    Emit(<<"compressed">>, size(term_to_binary(Doc, [{compressed, 9}])))
> end.
>
> According to this, the sum of the raw document sizes is about 725MB,
> so there is roughly 10% overhead (800MB on disk vs. 725MB of raw
> terms) for the id/rev index. That is almost OK, but still a lot!
> The compressed data takes 435MB. That is much better than 725MB, but
> still more than 2x MySQL. I can live with 2x overhead, but 4x makes
> me cry.
>
> Which serialization format does the CouchDB storage engine use?
> If it is term_to_binary, is it possible to enable data compression,
> either via the config file or via HTTP headers?
>
> Also, term_to_binary seems to carry a lot of overhead by itself. Every
> Unicode character is encoded with 4 bytes, whereas UTF-8 needs only
> 2 bytes for Cyrillic characters.
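
(For illustration, one way to see the difference from an Erlang shell;
the exact byte counts depend on the term layout and the OTP release, so
treat this as a sketch:

    %% "привет" as a list of Unicode codepoints: each codepoint above
    %% 255 is encoded as INTEGER_EXT, roughly 5 bytes per character.
    size(term_to_binary([1087,1088,1080,1074,1077,1090])).
    %% The same text as a UTF-8 binary has a 12-byte payload, so the
    %% serialized term is much smaller.
    size(term_to_binary(unicode:characters_to_binary([1087,1088,1080,1074,1077,1090]))).

Whether string fields actually hit the list encoding depends on how
CouchDB represents JSON strings internally.)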
>
> So, my questions are:
>
> 1. What can I do now to use less space for my data?
> 2. Can I add a compression option to term_to_binary (if that is indeed
> what CouchDB uses)?
> 3. Is there a way to provide charset information for the data, to make
> the Unicode-to-binary conversion more efficient?
> 4. Is there any progress in CouchDB development towards a less
> wasteful storage format?
>
>
> Also, I just noticed this at
> http://www.erlang.org/doc/apps/erts/erl_ext_dist.html, quote:
> ===============
> A float is stored in string format. The format used in sprintf to
> format the float is "%.20e" (there are more bytes allocated than
> necessary).
> ===============
> So every float requires 33 bytes of disk space. Not very efficient.
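
(For illustration, this is easy to check from an Erlang shell; which
encoding is used depends on the minor_version option and the OTP
default, so treat the numbers as approximate:

    size(term_to_binary(1.0)).
    %% 33 bytes with minor_version 0: FLOAT_EXT, the "%.20e" string format.
    size(term_to_binary(1.0, [{minor_version, 1}])).
    %% 10 bytes with NEW_FLOAT_EXT: an 8-byte IEEE double plus tags.

Whether CouchDB itself passes minor_version when serializing is a
separate question.)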
>
>
> --
> ----------------
> Best regards
> Alexey Loshkarev
> mailto:elf2001@gmail.com

Future releases of CouchDB, starting with the 1.2 release, will allow
for compression using Google's Snappy library, which should greatly
reduce the overhead you are seeing. Also, be sure to compact if the
ratio of disk usage to dataset size starts to grow too far. An
automatic compaction daemon is also coming.
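
If you want to experiment once 1.2 is out, the file compression setting
is expected to live in local.ini along these lines (option name and
values as planned for 1.2, so treat this as a sketch):

    [couchdb]
    file_compression = snappy    ; or deflate_1 .. deflate_9, or none

In the meantime, compaction can be triggered manually with an empty
POST to /dbname/_compact (sent with Content-Type: application/json).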

-R
