Subject: Re: A permanent view for user-entered query with complex boolean expressions?
From: Robert Newson
To: user@couchdb.apache.org
Date: Tue, 10 Feb 2009 18:01:18 -0500
Message-ID: <46aeb24f0902101501t79315fedx67a4f347fda42b2a@mail.gmail.com>

couchdb-lucene uses [externals] to receive queries from the client and
it currently polls all_docs_by_seq for updates. This seems to match
Lucene's batch-oriented model anyway, so I've not looked deeply into
the update_notification option, etc.

B.

On Tue, Feb 10, 2009 at 5:16 PM, Barry Wark wrote:
> Paul,
>
> Thanks for the very interesting response. CouchDB is looking like a
> huge win for us in the long run. A couple of quick follow-ups inline
> below...
>
> On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis wrote:
>> Barry,
>>
>> On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark wrote:
>>> Hi all,
>>>
>>> I'm in the planning stage for a frontend to a large set of
>>> physiology data. I'm new to CouchDB and would like to get some
>>> feedback on the feasibility of some ideas before I dig too far into
>>> the implementation.
>>>
>>> The data:
>>> Conceptually, the important parts of the data set can be modeled as
>>> a set of trials. Each trial has one or more stimulus settings,
>>> which are key-value pairs. Not all trials have the same set of
>>> settings, and not all trials with the same setting have the same
>>> value for that setting. CouchDB documents appear well-suited to
>>> this form of data.
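[As a concrete illustration of the trial-document shape Barry describes, such
a document might look like the following sketch. All field and parameter
names here are hypothetical, not from the thread:]

```javascript
// Hypothetical trial document: the per-trial stimulus settings live
// under "parameters"; the large numeric datasets would be stored as
// attachments on the same document rather than inline.
var trialDoc = {
  _id: "trial-0001",   // hypothetical document id
  type: "trial",       // hypothetical type marker
  parameters: {
    parameter1: 10,
    parameter2: 42
  }
  // numeric datasets uploaded separately as attachments, e.g.
  // PUT /db/trial-0001/dataset1.bin
};
```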
>>> In addition, each trial has one or more numeric datasets, each on
>>> the order of 1 MB but up to 100 MB. It seems that having CouchDB
>>> documents that contain a key-value pair like
>>>
>>> "parameters" : {
>>>   "parameter1" : value1,
>>>   "parameter2" : value2
>>>   // etc.
>>> }
>>>
>>> and with attachments for the numeric datasets is the CouchDB way
>>> to go.
>>>
>>
>> This is exactly the layout I'd recommend using.
>>
>>> Users will want to query this data set for all trials whose
>>> settings satisfy some boolean expression. So, for example, "trials
>>> where (parameters['parameter1'] == 10 AND
>>> parameters['parameter2'] >= 42)".
>>>
>>> So, now a few questions:
>>>
>>> 1. Is there a way to create a permanent view that supports queries
>>> like the one above? I got as far as a view like
>>>
>>> // Map
>>> function (doc) {
>>>   for (var parameter in doc.parameters) {
>>>     emit([parameter, doc.parameters[parameter]], doc._id);
>>>   }
>>> }
>>>
>>> // Reduce ("union" is an undefined helper here)
>>> function (keys, values, rereduce) {
>>>   if (rereduce) {
>>>     return union(values);
>>>   }
>>>   return values;
>>> }
>
> In fact, I think I messed up; I don't really need the reduce
> function in this view, do I?
>>>
>>> I believe this will give a view which, when queried with
>>> group=true, will give a set of rows keyed by
>>> [parameter, parameterValue], each with a list of the IDs of the
>>> trial documents that have that parameter:parameterValue pair. Is
>>> this correct?
>>>
>>> Given this, I could take the union of the values of the rows with
>>> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to
>>> get the set of trial document IDs that match the query.
>>>
>>> But is there a way to structure the view's map/reduce so that I
>>> don't have to do the union in my own code (i.e. CouchDB does it as
>>> part of the map/reduce)? The approach outlined above leads to an
>>> HTTP GET for each term in the boolean expression, for example.
>>>
>>
>> Unfortunately, this is one of the aspects of CouchDB that is hard
>> to overcome.
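[The per-term union Barry describes would have to happen client-side; a
minimal sketch, assuming each query term has already been resolved, via the
view above, to a plain array of doc IDs:]

```javascript
// Union of several arrays of doc ids, e.g. one array per term of a
// user query. Order of first appearance is preserved and duplicates
// are dropped.
function union(idLists) {
  var seen = {};
  var out = [];
  idLists.forEach(function (ids) {
    ids.forEach(function (id) {
      if (!seen[id]) {
        seen[id] = true;
        out.push(id);
      }
    });
  });
  return out;
}
```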
>> Lots of user-specifiable queries can lead to complications without
>> some limitation. Hopefully by the time 1.0 rolls around we'll have
>> made much more progress on dynamic query capabilities, but until
>> then the method I'd recommend is something along these lines:
>>
>> The first step is to know how many doc IDs you have for each
>> parameter. Here we'll set that up:
>>
>> // Map
>> function (doc) {
>>   for (var prop in doc) {
>>     if (prop.substr(0, 1) != "_") emit(prop, 1);
>>   }
>> }
>>
>> // Reduce
>> function (keys, values) {
>>   return sum(values);
>> }
>>
>> Now you can query this with a multi-get, so that you know the
>> number of doc IDs for each input parameter in your query, by
>> posting a JSON body to the view:
>>
>> curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
>>   'http://127.0.0.1:5984/db_name/_view/vname?group=true'
>>
>> Now that we know the relative number of doc IDs, we can start
>> searching for the result set by applying each boolean clause using
>> set math. We just apply them from the smallest number of doc IDs to
>> the largest, to try and keep resource usage to a minimum.
>
> This seems like a very common pattern. Is there any chance of
> getting it implemented in CouchDB?
>
>>
>> At the moment, that's the pure CouchDB way. In real life, for your
>> query interface, I'd most likely write a small slave process that
>> uses the _external interface. Hopefully in the next months a couple
>> of feature ideas I have rattling around will coalesce into an
>> implementation that makes things like this easier from directly
>> within CouchDB. But for right now, that's all hand-waving.
>
> I'm not familiar with the _external interface yet. Is there some
> documentation? Is this how the Lucene index that Robert mentions
> works?
>
> User-specifiable queries like this are going to be a critical
> feature for us, whether we go with CouchDB or not, so I'm very
> interested in keeping up with related developments.
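[The "set math" step Paul describes, applied smallest-first, can be
sketched as a client-side helper. This again assumes each AND clause has
already been resolved to an array of matching doc IDs via the views:]

```javascript
// Intersect per-clause doc-id lists, starting from the smallest list
// so the candidate set shrinks as early as possible (Paul's
// smallest-to-largest ordering). Expects at least one list.
function intersectAll(idLists) {
  var sorted = idLists.slice().sort(function (a, b) {
    return a.length - b.length;
  });
  return sorted.reduce(function (acc, ids) {
    var present = {};
    ids.forEach(function (id) { present[id] = true; });
    return acc.filter(function (id) { return present[id]; });
  });
}
```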
> Feel free to contact me offline if you're interested in more
> specific use cases, etc.
>
> Thanks again,
> Barry
>
>>
>>> 2. What is the (practical) limit on attachment size? Is it
>>> reasonable to store multi-MB attachments in the database? If not,
>>> I will go with external file(s) for the numeric data and store a
>>> reference in the trial document.
>>>
>>> Thanks for any insight,
>>>
>>> Barry
>>>
>>
>> Trunk has support for streaming writes when a Content-Length header
>> is present. Chris Anderson was just working the other day on
>> streaming writes to disk in the absence of a Content-Length header.
>> That basically means that if your HTTP client sends a
>> Content-Length header, the sky's the limit. If you don't send a
>> Content-Length header, you'll be limited by the available RAM on
>> the machine running CouchDB until Chris finishes his patch.
>>
>> A small caveat for the current implementation is that larger
>> attachments can end up causing a fair bit of RAM usage on the
>> receiving end. I doubt that 100 MiB attachments are big enough to
>> cause an issue, but you may want to test that before relying on it.
>> Hopefully this is taken care of pre-0.9 (the bits and pieces appear
>> to be falling into place, at least).
>>
>> HTH,
>> Paul Davis
>>
>