couchdb-dev mailing list archives

From Daniel Ehrenberg
Subject Unicode collation
Date Sat, 14 Aug 2010 06:01:19 GMT

I noticed that CouchDB uses ICU for Unicode collation. Great job on
that decision! I've been interested in Unicode for a while, so I
looked into the implementation of this. I saw a couple things that
confused me, though.

In the Version 0.3.0 changelog, it says that locale-specific collation
is supported, but I don't see how this works in the current
implementation. couch_icu_driver.c initializes both a case-sensitive
collator and a case-insensitive collator by calling the ICU function
ucol_open("", &status). But from the ICU documentation, it
looks like passing "" as the locale (the first argument) selects the
default collation rules as specified in the UCA and DUCET. Is there
some other way that the locale is being passed to ICU?

It looks like strings are being compared in CouchDB using the
ucol_strcollIter call. From what I understand, this is fine for a
one-off comparison of two strings, but when the same strings are
compared many times (as in a B-tree), it can be more efficient to
pre-calculate a collation key using ucol_getSortKey (or, to be really
fancy, to calculate only the needed part of the collation key, on
demand, with ucol_nextSortKeyPart, though this may be difficult to
reconcile with an append-only file structure). Has anyone evaluated
this strategy within CouchDB to see if it might yield better
performance?
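
To make the sort-key idea concrete, here is a minimal sketch. It is not
CouchDB or ICU code: it assumes the C standard library's strxfrm and
strcoll as stand-ins for ucol_getSortKey and ucol_strcollIter, and the
helper names (make_sort_key, compare_keys, compare_direct) are
hypothetical. The point is only that a key is computed once per string,
after which every comparison is a plain strcmp.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a byte-comparable key for s once.
 * strxfrm stands in here for ICU's ucol_getSortKey.
 * Returns a malloc'd string the caller must free, or NULL on failure. */
char *make_sort_key(const char *s)
{
    /* With a size of 0, strxfrm only reports the length it needs. */
    size_t n = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(n);
    if (key != NULL)
        strxfrm(key, s, n);
    return key;
}

/* Reduce any comparison result to -1, 0 or 1. */
int sign(int x)
{
    return (x > 0) - (x < 0);
}

/* Compare two precomputed keys: no collation work, just strcmp. */
int compare_keys(const char *ka, const char *kb)
{
    assert(ka != NULL && kb != NULL);
    return sign(strcmp(ka, kb));
}

/* Compare the original strings directly, paying the collation cost on
 * every call (the analogue of calling ucol_strcollIter each time). */
int compare_direct(const char *a, const char *b)
{
    return sign(strcoll(a, b));
}
```

In a B-tree, the key from make_sort_key would be stored next to each
string, so a lookup that touches the same entry many times would only
ever call compare_keys. The trade-off is the extra space for the keys,
which is part of what I would like to see measured.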

