incubator-couchdb-user mailing list archives

From Jens Alfke <j...@couchbase.com>
Subject Re: Matching docs which values ending with a specific string
Date Thu, 13 Feb 2014 01:25:05 GMT

On Feb 12, 2014, at 5:08 PM, Tito Ciuro <tciuro@mac.com> wrote:

> This is taken verbatim from the "Getting Started with CouchDB" book, page 49:

Hm, I have not seen that book. But I agree that the general documentation situation is not
good. At least the online docs are better than they used to be.

> [...] If we want to restrict it to those starting with Apricot, we can use the UTF-8
> sorting to our advantage. If we add the UTF-8 character 007F to ‘Apricot’, the range will
> only include recipes with the title starting with Apricot, even if the document ID contains
> other characters.

I see what they're getting at: it's the same trick as adding a "z" as a suffix (endkey="apricotz")
so that the range stops before any key that sorts beyond the "apricot" prefix, except that they
intend \u007F as a sort of "super-z" that sorts greater than anything else.
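
To make the suffix trick concrete, here is a minimal sketch in Python. It assumes keys are ordered by Python's built-in string comparison (plain code point order), which is not CouchDB's ICU collation, and `keys_in_range` is a hypothetical helper, not a CouchDB API. It only illustrates how a startkey/endkey pair selects a prefix from a sorted list of keys.

```python
from bisect import bisect_left, bisect_right

def keys_in_range(sorted_keys, startkey, endkey):
    """Return the keys k with startkey <= k <= endkey from a sorted list."""
    lo = bisect_left(sorted_keys, startkey)   # first key >= startkey
    hi = bisect_right(sorted_keys, endkey)    # one past the last key <= endkey
    return sorted_keys[lo:hi]

# Hypothetical recipe titles, sorted by code point order (an assumption).
titles = sorted([
    "apple pie", "apricot", "apricot jam", "apricot tart",
    "apricotzest", "avocado toast", "banana bread",
])

# The "z" suffix catches most keys that start with "apricot" ...
matches = keys_in_range(titles, "apricot", "apricotz")
print(matches)
```

The made-up key "apricotzest" is left out of the result because it sorts after "apricotz". That is why a sentinel that sorts higher than "z" is attractive as the end of the range.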

But that's wrong, because CouchDB doesn't use UTF-8 sorting, it uses Unicode sorting. From
the wiki: "Comparison of strings is done using ICU which implements the Unicode Collation
Algorithm…" [1] So a \u007f character isn't a particularly high value; it's greater than
any ASCII character but lower than anything else including other non-English Roman characters.

On the same page[2] the wiki suggests using the character \ufff0 as a suffix for this purpose.
That sounds more reasonable, although the details depend on whether the collation is really
being done on true Unicode code points or a UTF-16 encoding. If the former, \ufff0 isn't at
the top of the range and things like emoji will sort after it.
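
The difference between the two raw orderings can be checked with a short Python sketch. It compares strings by code point (Python's default for `str`) and by UTF-16 code unit. Neither is ICU's Unicode Collation Algorithm, so this only illustrates where the sentinels fall in those two raw orders, not what CouchDB actually returns. The helper names are hypothetical.

```python
def utf16_units(s):
    """Return the string as a list of 16-bit UTF-16 code units."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def sorts_before_endkey(key, endkey, by_utf16=False):
    """True if key <= endkey under the chosen raw ordering (no ICU involved)."""
    if by_utf16:
        return utf16_units(key) <= utf16_units(endkey)
    return key <= endkey  # Python compares str values by code point

prefix = "Apricot"
accented = prefix + "\u00e9"    # "Apricot" + e-acute, a non-ASCII Latin letter
emoji = prefix + "\U0001F351"   # "Apricot" + peach emoji, outside the BMP

for endkey in (prefix + "\u007f", prefix + "\ufff0"):
    for key in (accented, emoji):
        print(ascii(endkey), ascii(key),
              "code point:", sorts_before_endkey(key, endkey),
              "UTF-16:", sorts_before_endkey(key, endkey, by_utf16=True))
```

In code point order the emoji key falls after the \ufff0 endkey. In UTF-16 code unit order its leading surrogate (0xD83C) is below 0xFFF0, so the key stays inside the range.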

I hope you see that this is simply a trick of string range comparisons, not a special CouchDB
feature: you could use the same trick in a SQL query (if the database were sufficiently
Unicode-savvy).

> Let’s see that in action:
> http://127.0.0.1:5984/recipes/_design/simple/_view/by_title?startkey=%22Apricot %22&endkey=%22Apricot%007F%22

This looks like a typo: they must have meant %7F but wrote %007F instead.
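
For illustration, the two escapes can be decoded with Python's standard `urllib.parse.unquote`. This is only a sketch of generic percent-decoding; how CouchDB itself parses the query string is not shown here.

```python
from urllib.parse import unquote

# endkey exactly as printed in the book's URL
as_written = unquote("%22Apricot%007F%22")
# endkey with the presumably intended escape
as_intended = unquote("%22Apricot%7F%22")

print(repr(as_written))
print(repr(as_intended))
```

With %007F the decoder reads %00 as a NUL character and leaves the following "7F" as literal text, so the endkey is two characters longer than intended and does not end in \u007F.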

—Jens

[1] http://wiki.apache.org/couchdb/View_collation#Collation_Specification
[2] http://wiki.apache.org/couchdb/View_collation#String_Ranges