couchdb-user mailing list archives

From Adam Kocoloski <kocol...@apache.org>
Subject Re: Size of couchdb documents
Date Fri, 16 Mar 2012 13:31:54 GMT
On Mar 15, 2012, at 7:55 PM, Jason Smith wrote:

> On Thu, Mar 15, 2012 at 10:14 PM, Daniel Gonzalez <gonvaled@gonvaled.com> wrote:
>> Hi Matthieu,
>> 
>> This really seems to help. I am now using a base62-encoded, monotonically
>> increasing integer, which means my doc_id goes from "0" onwards, using the
>> alphabet:
>> 
>> ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz
>> 
>> I am now getting 3000 docs/s, more or less stable, and the size of my
>> documents has decreased from 3 KB to 0.4 KB.
>> I am not sure whether these metrics will worsen when the database grows, but
>> my feeling is that the situation has improved a lot just by changing the
>> doc_id.
> 
> Hi, Daniel. That's great news! Also, I have an update from a CouchDB 1.2.0 test.
> 
> I have a database here with 10 million documents, most of them several KB of
> English text. Upgrading to version 1.2 changed the database size from
> 38GB to 9.2GB, or 0.94 KB per document now.
> 
> So you should see an even greater improvement when 1.2.0 comes out
> Real Soon Now.
> 
>> I have one more question. Is the alphabet I have shown above "ordered" for
>> couchdb?
> 
> The sort order may not be quite what you expect, especially if you
> work with Unix or servers a lot.
> 
> It is described here:
> http://wiki.apache.org/couchdb/View_collation#Collation_Specification
> 
> Basically CouchDB follows (uses!) ICU. The major point is that
> different letter sequences are compared case-insensitively, but
> same-letter strings are case sensitive (lower case first). To me, it
> more or less follows how an English dictionary would do it.
> 
> -- 
> Iris Couch

If memory serves, the database's by_id tree uses Erlang term sorting for collation instead
of ICU.  ICU is, of course, the default collation option for MR views.  Regards,

Adam
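
To make the ordering question concrete, here is a minimal Python sketch. It assumes
that Erlang term sorting of doc ids amounts to a bytewise comparison of the id
strings, which is how Erlang compares binaries; whether that is exactly what the
by_id tree does is the "if memory serves" part above. The names base62,
is_byte_sorted and BYTE_ORDERED_ALPHABET are made up for illustration and are not
part of CouchDB.

```python
# Alphabet quoted in the thread: A-Z, then 0-9, then a-z.
THREAD_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz"
# The same 62 characters rearranged into ascending byte (ASCII) order:
# 0-9, then A-Z, then a-z. This rearrangement is an assumption of the sketch.
BYTE_ORDERED_ALPHABET = "".join(sorted(THREAD_ALPHABET))


def base62(n, alphabet, width=0):
    """Encode a non-negative integer; left-pad to `width` with the zero digit."""
    if n < 0:
        raise ValueError("n must be non-negative")
    base = len(alphabet)
    digits = []
    while True:
        n, r = divmod(n, base)
        digits.append(alphabet[r])
        if n == 0:
            break
    return "".join(reversed(digits)).rjust(width, alphabet[0])


def is_byte_sorted(ids):
    """True if the ids are strictly ascending under bytewise comparison."""
    encoded = [i.encode("ascii") for i in ids]
    return all(a < b for a, b in zip(encoded, encoded[1:]))


# Ids for 0..199 with the thread's alphabet, unpadded.
thread_ids = [base62(n, THREAD_ALPHABET) for n in range(200)]
# Ids for 0..199 with the byte-ordered alphabet, padded to a fixed width.
ordered_ids = [base62(n, BYTE_ORDERED_ALPHABET, width=6) for n in range(200)]

print(is_byte_sorted(thread_ids))   # False: "Z" (25) sorts after "0" (26)
print(is_byte_sorted(ordered_ids))  # True
```

Under that assumption, the alphabet as quoted is not "ordered": digits have lower
byte values than upper-case letters, so ids stop being ascending as soon as the
counter passes "Z". Unpadded ids of different lengths also sort out of numeric
order under bytewise comparison, which is why the second variant pads to a fixed
width.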

