Date: Tue, 10 Feb 2009 15:09:31 -0500
Subject: Re: A permanent view for user-entered query with complex boolean expressions?
From: Paul Davis
To: user@couchdb.apache.org

Barry,

On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark wrote:
> Hi all,
>
> I'm in the planning stage for a frontend to a large data set of
> physiology data. I'm new to CouchDB and would like to get some
> feedback on the feasibility of some ideas before I dig too far into
> implementation.
>
> The data:
> Conceptually, the important parts of the data set can be modeled as a
> set of trials. Each trial has one or more stimulus settings which are
> key-value pairs. Not all trials have the same set of settings and not
> all trials with the same setting have the same value for that setting.
> CouchDB documents appear well-suited for this form of data. In
> addition, each trial has one or more numeric datasets, each order 1MB,
> but up to 100MB. It seems that having CouchDB documents that contain a
> key-value pair like
>
>     "parameters" : {
>         "parameter1" : value1,
>         "parameter2" : value2,
>         //etc.
>     }
>
> and with attachments for the numeric data sets is the CouchDB way to go.
>

This is exactly the layout I'd recommend using.

> Users will want to query this data set for all trials whose settings
> satisfy some boolean expression.
> So, for example "trials where
> (parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"
>
> So, now a few questions:
>
> 1. Is there a way to create a permanent view that supports queries
> like that above? I got as far as a view like
>
> map:
> function map(doc) {
>     for parameter in doc.parameters {
>         emit([parameter, doc.parameters[parameter]], doc._id)
>     }
> }
>
> reduce:
> function reduce(keys, values, rereduce) {
>     if(rereduce) {
>         return union(values)
>     }
>
>     return values
> }
>
> I believe this will give a view which, when queried with group=True
> will give a set of rows with keyed by [parameter, parameterValue] and
> with a list of trial document IDs that have that
> parameter:parameterValue. Is this correct?
>
> Given this, I could do a union of the values of rows with
> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
> the set of trial document ids that match the query.
>
> But is there a way to structure the view's map/reduce so that I don't
> have to do the union in my code (i.e. CouchDB does it as part of the
> map/reduce)? The approach outlined above leads to an HTTP GET for each
> term in the boolean expression, for example.
>

Unfortunately, this is one of the aspects of CouchDB that is hard to
overcome. Allowing lots of user-specifiable queries can lead to
complications unless you impose some limitation. Hopefully by the time
1.0 rolls through we'll have made much more progress in dynamic query
capabilities, but until then the method I'd recommend would be
something along the lines of this:

The first step is to know how many doc ids you have for each parameter.
Here we'll set that up:

    // Map
    function(doc) {
        for(var prop in doc) {
            // Skip CouchDB's internal fields such as _id and _rev.
            if(prop.substr(0,1) != "_")
                emit(prop, 1);
        }
    }

    // Reduce
    function(keys, values) {
        return sum(values);
    }

Now you can query this with multi-get so that you know the number of
docids for each input parameter in your query by posting a JSON body
to the view:

    curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' 'http://127.0.0.1:5984/db_name/_view/vname?group=true'

Now that we know the relative number of docids we can start searching
for the result set by applying each boolean clause using set math. We
just apply the clauses from the smallest number of docids to the
largest to try and make sure we keep resource usage to a minimum.

At the moment, that's the pure CouchDB way. In real life for your
query interface I'd most likely write a small slave process that uses
the _external interface. Hopefully in the next few months a couple of
feature ideas I have rattling around will coalesce into an
implementation that will make things like this easier from directly
within CouchDB. But for right now, that's all hand waving.

> 2. What is the (practical) limit on attachment size? Is it reasonable
> to store multi-MB attachments in the database? If not, I will go with
> an external file(s) for the numeric data and storing a reference in
> the trial document.
>
> Thanks for any insight,
>
> Barry
>

Trunk has support for streaming writes when a Content-Length header is
present. Chris Anderson was just working the other day on streaming
writes to disk in the absence of a Content-Length header. That
basically means that if your HTTP client sends a Content-Length
header, the sky's the limit. If you don't send a Content-Length
header, you'll be limited by the available RAM on the machine running
CouchDB until Chris finishes his patch. A small caveat for the current
implementation is that larger attachments can end up causing a bit of
RAM usage on the receiving end.
I would doubt that 100MiB attachments are big enough to cause an
issue, but you may want to test that before relying on it. Hopefully
this is taken care of pre-0.9 (the bits and pieces appear to be
falling into place at least).

HTH,
Paul Davis
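
A minimal sketch of the client-side set-math step described in the
answer to question 1, not something CouchDB provides. It handles only
AND-ed clauses (OR would need a union instead). The names
intersectClauses, fetchIds, counts and the toy index are all
hypothetical: counts stands in for the result of the count view, and
fetchIds stands in for whatever lookup returns the doc ids matching
one clause.

```javascript
// Hypothetical helper: intersect AND-ed clauses, smallest candidate set first.
// counts:   parameter name -> number of docs having it (from the count view).
// fetchIds: assumed callback returning the doc ids that match one clause.
function intersectClauses(clauses, counts, fetchIds) {
  // Visit the rarest parameters first so the working set shrinks early.
  var ordered = clauses.slice().sort(function (a, b) {
    return (counts[a.param] || 0) - (counts[b.param] || 0);
  });

  var result = null;
  for (var i = 0; i < ordered.length; i++) {
    var ids = new Set(fetchIds(ordered[i]));
    if (result === null) {
      result = ids;
    } else {
      result = new Set(Array.from(result).filter(function (id) {
        return ids.has(id);
      }));
    }
    if (result.size === 0) break; // an AND cannot recover from an empty set
  }
  // With no clauses at all this sketch returns an empty list.
  return result === null ? [] : Array.from(result).sort();
}

// Toy in-memory stand-in for the view lookups (illustrative data only).
var index = {
  parameter1: { t1: 10, t2: 10, t3: 7 },
  parameter2: { t1: 50, t3: 42 }
};

function fetchIds(clause) {
  var values = index[clause.param] || {};
  return Object.keys(values).filter(function (id) {
    return clause.test(values[id]);
  });
}

var counts = { parameter1: 3, parameter2: 2 };

// parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42
var matches = intersectClauses([
  { param: "parameter1", test: function (v) { return v === 10; } },
  { param: "parameter2", test: function (v) { return v >= 42; } }
], counts, fetchIds);

console.log(matches);
```

Against a real database, fetchIds would be replaced by one view
request per clause; ordering by count only decides which request's
result seeds the working set.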