MIME-Version: 1.0
In-Reply-To: <cd7634ce0902101133i1a7dcc79j88f14455cd2f750f@mail.gmail.com>
References: <cd7634ce0902101133i1a7dcc79j88f14455cd2f750f@mail.gmail.com>
Date: Tue, 10 Feb 2009 15:09:31 -0500
Message-ID: <e2111bbb0902101209h262cc81fra42551c95699bc92@mail.gmail.com>
Subject: Re: A permanent view for user-entered query with complex boolean
	expressions?
From: Paul Davis <paul.joseph.davis@gmail.com>
To: user@couchdb.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit

Barry,

On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark <barrywark@gmail.com> wrote:
> Hi all,
>
> I'm in the planning stage for a frontend to a large  data set of
> physiology data. I'm new to CouchDB and would like to get some
> feedback on the feasibility of some ideas before I dig too far into
> implementation.
>
> The data:
> Conceptually, the important parts of the data set can be modeled as a
> set of trials. Each trial has one or more stimulus settings which are
> key-value pairs. Not all trials have the same set of settings and not
> all trials with the same setting have the same value for that setting.
> CouchDB documents appear well-suited for this form of data. In
> addition, each trial has one or more numeric datasets, each order 1MB,
> but up to 100MB. It seems that having CouchDB documents that contain a
> key-value pair like
>
> "parameters" : {
>    "parameter1" : value1,
>    "parameter2" : value 2,
>    //etc.
> }
>
> and with attachments for the numeric data sets is the CouchDB way to go.
>

This is exactly the layout I'd recommend using.

> Users will want to query this data set for all trials whose settings
> satisfy some boolean expression. So, for example "trials where
> (parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"
>
> So, now a few questions:
>
> 1. Is there a way to create a permanent view that supports queries
> like that above? I got as far as a view like
>
> map:
> function map(doc) {
>    for parameter in doc.parameters {
>        emit([parameter, doc.parameters[parameter]], doc._id)
>    }
> }
>
> reduce:
> function reduce(keys, values, rereduce) {
>    if(rereduce) {
>        return union(values)
>    }
>
>    return values
> }
>
> I believe this will give a view which, when queried with group=True
> will give a set of rows with keyed by [parameter, parameterValue] and
> with a list of trial document IDs that have that
> parameter:parameterValue. Is this correct?
>
> Given this, I could do a union of the values of rows with
> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
> the set of trial document ids that match the query.
>
> But is there a way to structure the view's map/reduce so that I don't
> have to do the union in my code (i.e. CouchDB does it as part of the
> map/reduce)? The approach outlined above leads to an HTTP GET for each
> term in the boolean expression, for example.
>

Unfortunately, this is one of the aspects of CouchDB that is hard to
overcome. Supporting arbitrary user-specified queries gets complicated
unless you limit what can be asked. Hopefully by the time 1.0 rolls
around we'll have made much more progress on dynamic query
capabilities, but until then the method I'd recommend is something
along these lines:

The first step is to know how many doc IDs you have for each
parameter. Here we'll set that up:

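// Map: emit one row per top-level property, skipping CouchDB's
// internal fields (_id, _rev, ...). If the settings live under
// doc.parameters, iterate over doc.parameters instead.
function(doc)
{
    // Note: !prop.substr(0,1) == "_" would never be true, because the
    // negation is applied before the comparison.
    for(var prop in doc) if(prop.substr(0,1) != "_") emit(prop, 1);
}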

// Reduce
function(keys, values)
{
    return sum(values);
}

Now you can query this with multi-get so that you know the number of
docids for each input parameter in your query by posting a JSON body
to the view:

curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
    'http://127.0.0.1:5984/db_name/_view/vname?group=true'

Now that we know the relative number of docids we can start searching
for the result set by applying each boolean clause using set math. We
just apply from the smallest number of docids to the largest to try
and make sure we keep resource usage to a minimum.
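
To make the ordering idea concrete, here is a rough client-side sketch
in plain JavaScript. It is only an illustration under assumptions:
intersectSmallestFirst and the shape of its input (one array of doc IDs
per AND clause, already fetched from the views) are hypothetical names
of my own, not anything CouchDB provides, and it only covers AND.

```javascript
// Hypothetical helper: combine AND clauses by intersecting doc ID lists,
// starting with the smallest list so intermediate results stay small.
// `clauses` is an array of arrays of doc IDs, one array per clause,
// assumed to have been fetched from the view beforehand.
function intersectSmallestFirst(clauses) {
    if (clauses.length === 0) return [];
    // Copy before sorting so the caller's array is left untouched.
    const ordered = clauses.slice().sort(function (a, b) {
        return a.length - b.length;
    });
    const result = new Set(ordered[0]);
    // Stop early once nothing is left to match.
    for (let i = 1; i < ordered.length && result.size > 0; i++) {
        const current = new Set(ordered[i]);
        result.forEach(function (id) {
            if (!current.has(id)) result.delete(id);
        });
    }
    // Sort the output so the result order is predictable.
    return Array.from(result).sort();
}
```

An OR clause would be a union of the two lists instead of an
intersection, so a mixed expression needs a bit more bookkeeping than
this.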

At the moment, that's the pure CouchDB way. In real life, for your
query interface I'd most likely write a small helper process that uses
the _external interface. Hopefully in the next few months a couple of
feature ideas I have rattling around will coalesce into an
implementation that will make things like this easier from directly
within CouchDB. But for right now, that's all hand waving.

> 2. What is the (practical) limit on attachment size? Is it reasonable
> to store multi-MB attachments in the database? If not, I will go with
> an external file(s) for the numeric data and storing a reference in
> the trial document.
>
> Thanks for any insight,
>
> Barry
>

Trunk has support for streaming writes when a Content-Length header is
present. Chris Anderson was just working the other day on streaming
writes to disk in the absence of a Content-Length header. That
basically means that if your HTTP client sends a Content-Length
header, the sky's the limit. If you don't send a Content-Length
header, you'll be limited by the available RAM on the machine running
CouchDB until Chris finishes his patch.
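
As a rough illustration of sending the size up front, here is a sketch
that builds the request options for an attachment upload with an
explicit Content-Length. Everything in it is an assumption made for the
example: attachmentPutOptions is a hypothetical helper, the host and
port are placeholders, and the /db/docid/attachment?rev=... URL form
may not match the CouchDB version you end up running.

```javascript
// Hypothetical helper: build request options for uploading an attachment
// with an explicit Content-Length, so the server knows the body size
// before the body arrives. Host and port are placeholders.
function attachmentPutOptions(db, docId, rev, name, byteLength) {
    return {
        method: "PUT",
        host: "127.0.0.1",
        port: 5984,
        // Assumed URL form: /db/docid/attachment?rev=...
        path: "/" + encodeURIComponent(db) +
              "/" + encodeURIComponent(docId) +
              "/" + encodeURIComponent(name) +
              "?rev=" + encodeURIComponent(rev),
        headers: {
            "Content-Type": "application/octet-stream",
            "Content-Length": byteLength
        }
    };
}
```

With Node, for example, the size could come from fs.statSync(path).size
and the options could be handed to http.request, with the file piped in
from fs.createReadStream so the client never holds the whole dataset in
memory either.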

A small caveat for the current implementation is that larger
attachments can end up using some extra RAM on the receiving end. I
doubt that 100MiB attachments are big enough to cause an issue, but
you may want to test that before relying on it. Hopefully this is
taken care of pre-0.9 (the bits and pieces appear to be falling into
place at least).

HTH,
Paul Davis