couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Davis <>
Subject Re: View Intersections
Date Wed, 11 Feb 2009 03:58:41 GMT
On Tue, Feb 10, 2009 at 10:19 PM, Jeff Hinrichs - DM&T
<> wrote:
> Just a few  comment to get things started.
> On Tue, Feb 10, 2009 at 5:59 PM, Paul Davis <> wrote:
>> I've been contemplating implementing a new feature that I've been
>> wanting for awhile. There's been  some talk of implementing view
>> intersections for a bit now so I figured I'd try and give a summary of
>> what the feature would entail in terms of functionality and then the
>> necessary bits required for an implementation.
>> So the original idea for view intersections was exactly what the name
>> entails: Show me the intersection between two views for a given set of
>> view query parameters. After thinking about different methods of
>> implementation I think we can extend this to be more powerful and
>> generally applicable.
>> Major Hurdle 1
>> ============
>> The first necessary bit of ground work would be to implement an
>> optional value index on views. The more I thought about intersecting
>> views the more I realized it was starting to look pointless. Ignoring
>> something along the lines of group_level=N in that we can join on
>> array prefixes, all views being joined would require exactly the same
>> key. Which begs the question, why not just create 1 view that emits
>> the output of the two you want intersected.
>  I would argue that returning a simple list of docids that meet the
> requirement should suffice -- in fact, the views a and b need not be
> homogenous so returning anything beyond docids could end up being a
> bigger problem than the intersection itself.
> For instance, say we want the intersection of the documents who have
> both "blue" and "fuzzy" tags so we use
> a = /_view/tags/byval?key="blue"
> b = /_view/tags/byval?key="fuzzy"
> intersection(a,b)
> Now we want to limit that to things named Harold.
> c=/_view/name/first?key="Harold"
> intersection(intersection(a,b),c)
> Which gives us a list of docid's that contain Blue, Fuzzy things named Harold.
> However, the values returned by view a and view b are the same,
> however the values returned by view c might be completely different.
> So returning a view with varying values might not be very helpful
> (This is where I am not seeing why more than returning a list of
> docid's would be appropriate. Of course I am most likely missing the
> point.)  Only returning intersections of similar views would not be as
> interesting a returning intersections of dissimilar views.
>> I couldn't get past this for a long time until I heard Chris Anderson
>> pondering adding a btree index to the values in a view. The obvious
>> downfalls of the extra space and computation usage are there, but
>> making it optional should solve any qualms in that respect.
>> Given an index on a value we're now able to chain together arbitrary
>> views using either the key or value as well as limit the intersection
>> by any combination of key and value.
>> As a side benefit, we would also get the select views by value
>> restriction as well. I'm thinking it'd be as transparent as adding a
>> [start|end]value and [start|end]value_docid set of URL parameters. I
>> haven't followed this train of thought too far into the code yet, but
>> something approximating that should be fairly doable. A thought occurs
>> that limiting view results by both key and value could be interesting
>> in terms of implementation. Not sure if I'd force it through the
>> intersection API or not.
>> Caveats that come to mind are that this would break binary
>> compatibility for all generated views. It wouldn't require a
>> dump/reload, but it might come as a surprise to people upgrading that
>> all their views are regenerating.
>> Major Hurdle 2
>> ============
>> Implementing the view intersection API. First off, it probably needs a
>> new name. Once we have intersections working, unions, subtractions,
>> and the NxM one who's name escapes me (cross product floats up but
>> sounds not right) should be trivially implementable.
>> The underlying implementation for this is basically a large merge sort
>> running over the view btree's. If you read about the merge step in
>> map/reduce/merge that's basically what I've got in my head.
>> The biggest issue that I've found in getting this implemented
>> (excluding a value index) is that I'd need to write a new btree
>> traversal method that used iterators instead of a fold mechanism. This
>> shouldn't be overly difficult to implement.
>> Beyond that then it's basically up to the HTTP interface in parameter
>> parsing and error checking. For passing parameters I'm thinking along
>> the line of a JSON body posted (Premptive: any RESTafarians should
>> reference the long discussion on multi-get before writing about how
>> this isn't RESTful).
>  posting json documents seems to be required and beyond argument given
> the technical size limits of a GET request
>> Also, not sure if it's obvious but I'd plan on allowing arbitrarily
>> nested conditions, ie, "intersection(union(a, b), c)" type of
>> operations. There's a subtle detail in the sort order and thus
>> corresponding btree traversal that might come into play there. I can
>> punt and make the entire request use one sort order, as in the
>> previous example can't specify different sort directions for the two
>> nested operations because you'd get a (presumably) zero overlap in the
>> end. I'm pretty sure if we force all btrees to be traversed in the
>> same direction for each request we don't lose any functionality
>> though.
> 3) Since it is a set operation, as long as operator precedence is
> observed, the order of processing or traversing an index within the
> set should make no difference to the result.  The intersection of the
> sets {1, 2, 3} and {2, 3, 4} == {3, 1, 2} and {4, 2, 3}  == {2, 3} ==
> {3, 2}
>> Comments
>> =========
>> That's the general outline I've got in my head right now. I'm pretty
>> sure I can see 95% of the implementation, but it's possible I'm
>> missing a finer detail somewhere. If you've got questions or comments
>> let's hear them. If there's no general objection then I can probably
>> get to starting an implementation at the end of this week.
>> Thanks,
>> Paul Davis
> Hat in hand,
> Jeff


You bring up two good points I forgot to mention:

1. What is actually returned?

I see two possibilities for this. Either we return some json structure
that includes the key/value pairs from all views involved (per row) or
just a list of docids. I can see either being useful, and the ideas I
have for implementation doesn't rule out either approach. Perhaps this
is a query time parameter?

2. Do we support intersection of docid's somehow?

More specifically, do we concentrate on just getting a list of docid's
that meet the criterion, or allow for the possibility that we can have
multiple rows from a single docid? The ambiguity being that a single
row may include more than one docid as well as a single docid may be
repeated. For now I can see arguments for both approaches, but I have
suspicions that going for rows that may possibly contain repeated
docids will be more efficient because we don't have to track which
docids have been seen etc. This might affect #1 as well.

Implementation details related to the operation ordering should
prevent inequivalent sets with different orderings. Ie, all operations
are on sorted sets, so all results should be sorted. I haven't done
the math, but on the surface I think that's right.

Also, the fuzzy ideas I have for the posted JSON body will probably
lead to an exactly specified operator ordering. Ie, for every operator
you'll have to specify exactly 2 operands. Using the result of an
operand as an operand is permitted though. Ie. You can never say A + B
+ C - D, only (((A+B)+C)-D). Though that's probably me just being
unimaginitive with the JSON constraints.

Paul Davis

View raw message