Subject: Re: A permanent view for user-entered query with complex boolean expressions?
From: Barry Wark <barrywark@gmail.com>
To: user@couchdb.apache.org
Date: Tue, 10 Feb 2009 14:16:08 -0800

Paul,

Thanks for the very interesting response. CouchDB is looking like a
huge win for us in the long run. A couple of quick follow-ups inline
below...

On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis wrote:
> Barry,
>
> On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark wrote:
>> Hi all,
>>
>> I'm in the planning stage for a frontend to a large data set of
>> physiology data. I'm new to CouchDB and would like to get some
>> feedback on the feasibility of some ideas before I dig too far into
>> implementation.
>>
>> The data:
>> Conceptually, the important parts of the data set can be modeled as a
>> set of trials. Each trial has one or more stimulus settings, which are
>> key-value pairs. Not all trials have the same set of settings, and not
>> all trials with the same setting have the same value for that setting.
>> CouchDB documents appear well-suited for this form of data. In
>> addition, each trial has one or more numeric datasets, each on the
>> order of 1MB but up to 100MB. It seems that having CouchDB documents
>> that contain a key-value pair like
>>
>> "parameters" : {
>>   "parameter1" : value1,
>>   "parameter2" : value2,
>>   // etc.
>> }
>>
>> and with attachments for the numeric data sets is the CouchDB way to go.
>>
>
> This is exactly the layout I'd recommend using.
>
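For concreteness, a trial document along these lines might look roughly
like this (names and values are only illustrative; the attachment entry
is shown as the stub CouchDB returns when you GET the document):

{
  "_id": "trial-0001",
  "parameters": {
    "parameter1": 10,
    "parameter2": 42.5
  },
  "_attachments": {
    "response.dat": {
      "stub": true,
      "content_type": "application/octet-stream",
      "length": 1048576
    }
  }
}
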
>> Users will want to query this data set for all trials whose settings
>> satisfy some boolean expression. So, for example, "trials where
>> (parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"
>>
>> So, now a few questions:
>>
>> 1. Is there a way to create a permanent view that supports queries
>> like the one above? I got as far as a view like
>>
>> map:
>> function map(doc) {
>>   for (var parameter in doc.parameters) {
>>     emit([parameter, doc.parameters[parameter]], doc._id)
>>   }
>> }
>>
>> reduce:
>> function reduce(keys, values, rereduce) {
>>   if(rereduce) {
>>     return union(values)
>>   }
>>
>>   return values
>> }

In fact, I think I messed up; I don't really need the reduce function
in this view, do I?

>>
>> I believe this will give a view which, when queried with group=true,
>> will give a set of rows keyed by [parameter, parameterValue], each
>> with a list of trial document IDs that have that
>> parameter:parameterValue. Is this correct?
>>
>> Given this, I could do a union of the values of rows with
>> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
>> the set of trial document ids that match the query.
>>
>> But is there a way to structure the view's map/reduce so that I don't
>> have to do the union in my code (i.e. CouchDB does it as part of the
>> map/reduce)? The approach outlined above leads to an HTTP GET for each
>> term in the boolean expression, for example.
>>
>
> Unfortunately, this is one of the aspects of CouchDB that is hard to
> overcome. Lots of user-specifiable queries can lead to complications
> without some limitation. Hopefully by the time 1.0 rolls through we'll
> have made much more progress in dynamic query capabilities, but until
> then the method I'd recommend would be something along the lines of
> this:
>
> The first step is to know how many doc ids you have for each
> parameter. Here we'll set that up:
>
> // Map
> function(doc)
> {
>   for(var prop in doc) if(prop.substr(0,1) != "_") emit(prop, 1);
> }
>
> // Reduce
> function(keys, values)
> {
>   return sum(values);
> }
>
> Now you can query this with multi-get so that you know the number of
> docids for each input parameter in your query by posting a JSON body
> to the view:
>
> curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
>     http://127.0.0.1:5984/db_name/_view/vname?group=true
>
> Now that we know the relative number of docids we can start searching
> for the result set by applying each boolean clause using set math. We
> just apply from the smallest number of docids to the largest to try
> and make sure we keep resource usage to a minimum.

This seems like a very common pattern. Is there any chance of getting
it implemented in CouchDB?
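If I'm reading the approach right, the client-side loop might look
something like the sketch below. This is only an illustration, not from
the thread: the view names, the db_name/_view/vname URL shape, and the
clause format are made up, and it handles only equality clauses (a range
clause like >= 42 would use startkey/endkey instead of an exact key).

// Sketch of the "set math" query: count docids per parameter, then
// evaluate clauses from the smallest id set to the largest and
// intersect the results client-side.
const DB = "http://127.0.0.1:5984/db_name";

async function queryView(view, body, params) {
  const qs = new URLSearchParams(params).toString();
  const resp = await fetch(`${DB}/_view/${view}${qs ? "?" + qs : ""}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return (await resp.json()).rows;
}

// clauses: e.g. [["parameter1", 10], ["parameter2", 42]]
async function trialsMatchingAll(clauses) {
  // 1. How many docids does each parameter have? (the count view above)
  const rows = await queryView("param_counts",
    { keys: clauses.map(([p]) => p) }, { group: "true" });
  const countByParam = Object.fromEntries(rows.map(r => [r.key, r.value]));

  // 2. Apply clauses from the smallest docid set to the largest.
  const ordered = [...clauses].sort(
    (a, b) => (countByParam[a[0]] || 0) - (countByParam[b[0]] || 0));

  let result = null;
  for (const clause of ordered) {
    // [parameter, value] keys match the map-only view from question 1.
    const hits = await queryView("params_to_ids", { keys: [clause] }, {});
    const ids = new Set(hits.map(r => r.id));
    // 3. Intersect with the running result set; stop early if empty.
    result = result === null
      ? ids
      : new Set([...result].filter(id => ids.has(id)));
    if (result.size === 0) break;
  }
  return result || new Set();
}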
> At the moment, that's the pure CouchDB way. In real life, for your
> query interface, I'd most likely write a small slave process that uses
> the _external interface. Hopefully in the next months a couple of
> feature ideas I have rattling around will coalesce into an
> implementation that will make things like this easier from directly
> within CouchDB. But for right now, that's all hand waving.

I'm not familiar with the _external interface yet. Is there some
documentation? Is this how the Lucene index that Robert mentions works?
User-specifiable queries like this are going to be a critical feature
for us, whether we go with CouchDB or not, so I'm very interested in
keeping up with related developments. Feel free to contact me offline
if you're interested in more specific use cases etc.

Thanks again,
Barry

>
>> 2. What is the (practical) limit on attachment size? Is it reasonable
>> to store multi-MB attachments in the database? If not, I will go with
>> external file(s) for the numeric data and store a reference in the
>> trial document.
>>
>> Thanks for any insight,
>>
>> Barry
>>
>
> Trunk has support for streaming writes when a Content-Length header is
> present. Chris Anderson was just working the other day on streaming
> writes to disk in the absence of a Content-Length header. That
> basically means that if your HTTP client sends a Content-Length
> header, the sky's the limit. If you don't send a Content-Length
> header, you'll be limited by the available RAM on the machine running
> CouchDB until Chris finishes his patch.
>
> A small caveat for the current implementation is that larger
> attachments can end up causing a bit of RAM usage on the receiving
> end. I doubt that 100MiB attachments are big enough to cause an
> issue, but you may want to test that before relying on it. Hopefully
> this is taken care of pre-0.9 (the bits and pieces appear to be
> falling into place at least).
>
> HTH,
> Paul Davis
>
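For reference, the kind of attachment upload discussed above can be
done with a plain PUT; curl supplies the Content-Length header
automatically when given a file with --data-binary. The database name,
document id, attachment name, and revision here are hypothetical:

curl -X PUT \
  "http://127.0.0.1:5984/db_name/trial-0001/response.dat?rev=<current-rev>" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @response.dat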