couchdb-dev mailing list archives

From "Jens Alfke (JIRA)"
Subject [jira] [Created] (COUCHDB-2327) Add string/array prefix match option, for view queries
Date Thu, 11 Sep 2014 16:22:35 GMT
Jens Alfke created COUCHDB-2327:

             Summary: Add string/array prefix match option, for view queries
                 Key: COUCHDB-2327
             Project: CouchDB
          Issue Type: Improvement
      Security Level: public (Regular issues)
          Components: HTTP Interface
            Reporter: Jens Alfke

View querying provides no clean way to match a string prefix. The only advice I've seen is
to set startkey to the prefix, and endkey to the prefix with "some really high Unicode character"
appended, which is a total kludge*.

There's a similar issue with matching an array prefix, e.g. "all keys that start with [2014,
...]". Here the solution is less kludgy (append a "{}" to the endkey) but it's still very
unintuitive to people learning CouchDB. I've had to explain it to newbies many times.
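
For reference, the two workarounds look roughly like this when the query parameters are built by hand. This is a minimal Python sketch: the helper names and the choice of "\ufff0" as the "really high" character are assumptions for illustration, not anything CouchDB defines.

```python
import json
from urllib.parse import urlencode

# Assumption: one commonly suggested "really high" character; any such
# sentinel has the problems described in the footnote at the end.
HIGH_CHAR = "\ufff0"


def string_prefix_params(prefix):
    """Return startkey/endkey params for string keys that begin with prefix."""
    return {
        "startkey": json.dumps(prefix),
        "endkey": json.dumps(prefix + HIGH_CHAR),
    }


def array_prefix_params(prefix):
    """Return startkey/endkey params for array keys that begin with prefix."""
    return {
        "startkey": json.dumps(prefix),
        # An object collates after other JSON types, so [2014, {}] is
        # intended to sort after every key of the form [2014, ...].
        "endkey": json.dumps(prefix + [{}]),
    }


if __name__ == "__main__":
    # Query string for "all keys that start with [2014, ...]".
    print(urlencode(array_prefix_params([2014])))
```

Neither helper is hard to write, but every client ends up reinventing it, which is the point of the suggestion that follows.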

I suggest adding an explicit query option to enable prefix matching. This doesn't need to
mess with the actual query engine; all it has to do is modify the endkey by appending an
appropriate Unicode character (in the string case) or an empty object (in the array case). If
no `endkey` is given, it is derived from the `startkey`.

I've already implemented a comparable feature for Couchbase Lite:

Note that I made the `prefix_match` parameter an integer, not a boolean. This is to support
cases where you want to match a prefix of a _nested component_ of the key, for example "all
keys in 2014 whose product name starts with 'f'", where the startkey would be [2014, "f"]
and the prefix_match would be 2 to indicate that it's the nested string that should be prefix-matched,
not the array. But in the common case you'd just set the value to 1 to indicate that the top-level
key should be prefix-matched.
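
To make the depth semantics concrete, here is a rough Python sketch of how an endkey could be derived from the startkey. The function name `prefix_end_key`, the sentinel character, and the choice to leave non-string, non-array keys unchanged are all assumptions for illustration; this is not the Couchbase Lite code.

```python
# Assumption: stand-in for "an appropriate Unicode character".
HIGH_CHAR = "\ufff0"


def prefix_end_key(start_key, prefix_match=1):
    """Derive an endkey from start_key for a hypothetical prefix_match option.

    prefix_match=1 prefix-matches the top-level key; prefix_match=2
    prefix-matches the last element of a top-level array, and so on.
    """
    if prefix_match < 1:
        raise ValueError("prefix_match must be >= 1")
    if prefix_match == 1:
        if isinstance(start_key, str):
            return start_key + HIGH_CHAR   # string case: append a high character
        if isinstance(start_key, list):
            return start_key + [{}]        # array case: append an empty object
        return start_key                   # assumption: other types match exactly
    if not isinstance(start_key, list) or not start_key:
        raise ValueError("nested prefix matching needs a non-empty array key")
    # Keep the leading elements and recurse into the last one.
    nested = prefix_end_key(start_key[-1], prefix_match - 1)
    return start_key[:-1] + [nested]
```

With this sketch, `prefix_end_key([2014, "f"], 2)` gives `[2014, "f\ufff0"]`, which covers the "product name starts with 'f'" case, while `prefix_end_key([2014], 1)` gives `[2014, {}]`.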

* Why is adding "some high Unicode character" a kludge? Because Unicode is so complicated
and so inconsistently implemented. Doing this immediately opens the possibility of weird Unicode
issues in your development language's string type, in its HTTP client library, and in Erlang's
equivalents on the server side. Not to mention the swamp that is the Unicode specification
itself. For instance, I've seen advice to use a character like \uFFFE, which was correct
until Unicode went 32-bit, and tended to work alright for a while after that, but will now
fail with emoji characters (which are both very commonly used and well outside the 16-bit
range). Actually, whether it fails depends on whether your string implementation operates on
UTF-16 (very common) or true Unicode code points. Like I said, it's a kludge.
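
The UTF-16 versus code point difference is easy to see in a small Python sketch. It only compares raw code units and raw code points; it makes no claim about how CouchDB itself collates strings, and the helper name `utf16_units` is made up for this example.

```python
SENTINEL = "\ufffe"     # the "high" character from the advice above
EMOJI = "\U0001F600"    # a character outside the 16-bit range


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# Python compares str values by code point: 0x1F600 > 0xFFFE, so a key
# containing the emoji sorts AFTER the sentinel and falls outside the range.
emoji_after_sentinel_by_code_point = EMOJI > SENTINEL

# In UTF-16 the emoji becomes a surrogate pair (0xD83D, 0xDE00), and
# 0xD83D < 0xFFFE, so the same key sorts BEFORE the sentinel.
emoji_after_sentinel_by_utf16 = utf16_units(EMOJI) > utf16_units(SENTINEL)

if __name__ == "__main__":
    print(emoji_after_sentinel_by_code_point, emoji_after_sentinel_by_utf16)
```

The same endkey therefore includes or excludes emoji keys depending on which comparison the string implementation uses.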

