incubator-couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brian Candler <B.Cand...@pobox.com>
Subject Re: Efficient range queries
Date Tue, 05 May 2009 08:17:09 GMT
On Mon, May 04, 2009 at 05:18:13PM -0400, Paul Davis wrote:
> There's a data structure called a nested containment list that is
> specifically meant for this type of query. It's on my todo list, but
> no where near the top. If someone wants to add an indexer that'd be
> pretty awesome.

This one? http://bioinfo.mbi.ucla.edu/pygr/docs/draft5.pdf/view

Looks interesting. However I note:

"The price for efficiency of the NCList approach is that it is most
efficient in static environments where the database is updated in bulk only
rarely, since any update would require rebuild of the entire database"

So unfortunately we don't get incremental updates. But the paper does list
some other algorithms which might allow this.

In the mean time: looks like I need to divide my ranges into bins. With
care I think this can be done with varying bin widths.

e.g. document contains range 37 to 345. I could emit keys:

  - "37", "38", "39", "4x", "5x", "6x", "7x", "8x", "9x", "1xx", "2xx",
    "30x", "31x", "32x", "33x", "340", "341", "342", "343", "344", "345"

Then in order to query for value 123:

   - multi-key fetch "123", "12x", "1xx"

Not pretty but should scale reasonably well. Maybe it would work better in
binary:

e.g. document contains range 37 to 345. Emit keys:

  - "100101", "10011x", "101xxx", "11xxxx", "1xxxxxx", "1xxxxxxx",
    "100xxxxxx", "1010xxxxx", "101010xxx", "10101100x"

Query for value 123:

  - multi-key fetch "1111011", "111101x", "11110xx", "1111xxx",
    "111xxxx", "11xxxxx", "1xxxxxx"

That gives fewer keys to emit but more to fetch. A key like "1xxxxxx" could
be emitted as "64-127" which easier to understand, and slightly shorter.

Anyway, thanks for clarifying that I hadn't missed anything obvious.

Regards,

Brian.

Mime
View raw message