Subject: Re: A permanent view for user-entered query with complex boolean expressions?
From: Barry Wark <barrywark@gmail.com>
To: user@couchdb.apache.org
Date: Tue, 10 Feb 2009 14:16:08 -0800

Paul,

Thanks for the very interesting response. CouchDB is looking like a
huge win for us in the long run. A couple of quick follow-ups inline
below...

On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis wrote:
> Barry,
>
> On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark wrote:
>> Hi all,
>>
>> I'm in the planning stage for a frontend to a large data set of
>> physiology data. I'm new to CouchDB and would like to get some
>> feedback on the feasibility of some ideas before I dig too far into
>> implementation.
>>
>> The data:
>> Conceptually, the important parts of the data set can be modeled as a
>> set of trials. Each trial has one or more stimulus settings, which are
>> key-value pairs. Not all trials have the same set of settings, and not
>> all trials with the same setting have the same value for that setting.
>> CouchDB documents appear well-suited for this form of data. In
>> addition, each trial has one or more numeric datasets, each on the
>> order of 1MB but up to 100MB. It seems that having CouchDB documents
>> that contain a key-value pair like
>>
>> "parameters" : {
>>   "parameter1" : value1,
>>   "parameter2" : value2,
>>   // etc.
>> }
>>
>> and with attachments for the numeric data sets is the CouchDB way to go.
>>
>
> This is exactly the layout I'd recommend using.
>
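For concreteness, a trial document along these lines might look roughly
like this (names and values are only illustrative; the attachment entry
is shown as the stub CouchDB returns when you GET the document):

{
  "_id": "trial-0001",
  "parameters": {
    "parameter1": 10,
    "parameter2": 42.5
  },
  "_attachments": {
    "response.dat": {
      "stub": true,
      "content_type": "application/octet-stream",
      "length": 1048576
    }
  }
}
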
>> Users will want to query this data set for all trials whose settings
>> satisfy some boolean expression. So, for example, "trials where
>> (parameters['parameter1'] == 10 AND parameters['parameter2'] >= 42)"
>>
>> So, now a few questions:
>>
>> 1. Is there a way to create a permanent view that supports queries
>> like the one above? I got as far as a view like
>>
>> map:
>> function map(doc) {
>>   for (var parameter in doc.parameters) {
>>     emit([parameter, doc.parameters[parameter]], doc._id)
>>   }
>> }
>>
>> reduce:
>> function reduce(keys, values, rereduce) {
>>   if(rereduce) {
>>     return union(values)
>>   }
>>
>>   return values
>> }

In fact, I think I messed up; I don't really need the reduce function
in this view, do I?

>>
>> I believe this will give a view which, when queried with group=true,
>> will give a set of rows keyed by [parameter, parameterValue], each
>> with a list of trial document IDs that have that
>> parameter:parameterValue. Is this correct?
>>
>> Given this, I could do a union of the values of rows with
>> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to get
>> the set of trial document ids that match the query.
>>
>> But is there a way to structure the view's map/reduce so that I don't
>> have to do the union in my code (i.e. CouchDB does it as part of the
>> map/reduce)? The approach outlined above leads to an HTTP GET for each
>> term in the boolean expression, for example.
>>
>
> Unfortunately, this is one of the aspects of CouchDB that is hard to
> overcome. Lots of user-specifiable queries can lead to complications
> without some limitation. Hopefully by the time 1.0 rolls through we'll
> have made much more progress in dynamic query capabilities, but until
> then the method I'd recommend would be something along the lines of
> this:
>
> The first step is to know how many doc ids you have for each
> parameter. Here we'll set that up:
>
> // Map
> function(doc)
> {
>   for(var prop in doc) if(prop.substr(0,1) != "_") emit(prop, 1);
> }
>
> // Reduce
> function(keys, values)
> {
>   return sum(values);
> }
>
> Now you can query this with multi-get so that you know the number of
> docids for each input parameter in your query by posting a JSON body
> to the view:
>
> curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
>     http://127.0.0.1:5984/db_name/_view/vname?group=true
>
> Now that we know the relative number of docids we can start searching
> for the result set by applying each boolean clause using set math. We
> just apply from the smallest number of docids to the largest to try
> and make sure we keep resource usage to a minimum.

This seems like a very common pattern. Is there any chance of getting
it implemented in CouchDB?
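If I'm reading the approach right, the client-side loop might look
something like the sketch below. This is only an illustration, not from
the thread: the view names, the db_name/_view/vname URL shape, and the
clause format are made up, and it handles only equality clauses (a range
clause like >= 42 would use startkey/endkey instead of an exact key).

// Sketch of the "set math" query: count docids per parameter, then
// evaluate clauses from the smallest id set to the largest and
// intersect the results client-side.
const DB = "http://127.0.0.1:5984/db_name";

async function queryView(view, body, params) {
  const qs = new URLSearchParams(params).toString();
  const resp = await fetch(`${DB}/_view/${view}${qs ? "?" + qs : ""}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  return (await resp.json()).rows;
}

// clauses: e.g. [["parameter1", 10], ["parameter2", 42]]
async function trialsMatchingAll(clauses) {
  // 1. How many docids does each parameter have? (the count view above)
  const rows = await queryView("param_counts",
    { keys: clauses.map(([p]) => p) }, { group: "true" });
  const countByParam = Object.fromEntries(rows.map(r => [r.key, r.value]));

  // 2. Apply clauses from the smallest docid set to the largest.
  const ordered = [...clauses].sort(
    (a, b) => (countByParam[a[0]] || 0) - (countByParam[b[0]] || 0));

  let result = null;
  for (const clause of ordered) {
    // [parameter, value] keys match the map-only view from question 1.
    const hits = await queryView("params_to_ids", { keys: [clause] }, {});
    const ids = new Set(hits.map(r => r.id));
    // 3. Intersect with the running result set; stop early if empty.
    result = result === null
      ? ids
      : new Set([...result].filter(id => ids.has(id)));
    if (result.size === 0) break;
  }
  return result || new Set();
}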
> At the moment, that's the pure CouchDB way. In real life, for your
> query interface, I'd most likely write a small slave process that uses
> the _external interface. Hopefully in the next months a couple of
> feature ideas I have rattling around will coalesce into an
> implementation that will make things like this easier from directly
> within CouchDB. But for right now, that's all hand waving.

I'm not familiar with the _external interface yet. Is there some
documentation? Is this how the Lucene index that Robert mentions works?
User-specifiable queries like this are going to be a critical feature
for us, whether we go with CouchDB or not, so I'm very interested in
keeping up with related developments. Feel free to contact me offline
if you're interested in more specific use cases etc.

Thanks again,
Barry

>
>> 2. What is the (practical) limit on attachment size? Is it reasonable
>> to store multi-MB attachments in the database? If not, I will go with
>> external file(s) for the numeric data and store a reference in the
>> trial document.
>>
>> Thanks for any insight,
>>
>> Barry
>>
>
> Trunk has support for streaming writes when a Content-Length header is
> present. Chris Anderson was just working the other day on streaming
> writes to disk in the absence of a Content-Length header. That
> basically means that if your HTTP client sends a Content-Length
> header, the sky's the limit. If you don't send a Content-Length
> header, you'll be limited by the available RAM on the machine running
> CouchDB until Chris finishes his patch.
>
> A small caveat for the current implementation is that larger
> attachments can end up causing a bit of RAM usage on the receiving
> end. I doubt that 100MiB attachments are big enough to cause an
> issue, but you may want to test that before relying on it. Hopefully
> this is taken care of pre-0.9 (the bits and pieces appear to be
> falling into place at least).
>
> HTH,
> Paul Davis
>
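For reference, the kind of attachment upload discussed above can be
done with a plain PUT; curl supplies the Content-Length header
automatically when given a file with --data-binary. The database name,
document id, attachment name, and revision here are hypothetical:

curl -X PUT \
  "http://127.0.0.1:5984/db_name/trial-0001/response.dat?rev=<current-rev>" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @response.dat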