Subject: Re: A permanent view for user-entered query with complex boolean expressions?
From: Robert Newson
To: user@couchdb.apache.org
Date: Tue, 10 Feb 2009 18:01:18 -0500
Message-ID: <46aeb24f0902101501t79315fedx67a4f347fda42b2a@mail.gmail.com>

couchdb-lucene uses [externals] to receive queries from the client and
it currently polls all_docs_by_seq for updates. This seems to match
Lucene's batch-oriented model anyway, so I've not looked deeply into
the update_notification option, etc.

B.

On Tue, Feb 10, 2009 at 5:16 PM, Barry Wark wrote:
> Paul,
>
> Thanks for the very interesting response. CouchDB is looking like a
> huge win for us in the long run. A couple of quick follow-ups inline
> below...
>
> On Tue, Feb 10, 2009 at 12:09 PM, Paul Davis wrote:
>> Barry,
>>
>> On Tue, Feb 10, 2009 at 2:33 PM, Barry Wark wrote:
>>> Hi all,
>>>
>>> I'm in the planning stage for a frontend to a large set of
>>> physiology data. I'm new to CouchDB and would like to get some
>>> feedback on the feasibility of some ideas before I dig too far into
>>> the implementation.
>>>
>>> The data:
>>> Conceptually, the important parts of the data set can be modeled as
>>> a set of trials. Each trial has one or more stimulus settings,
>>> which are key-value pairs. Not all trials have the same set of
>>> settings, and not all trials with the same setting have the same
>>> value for that setting. CouchDB documents appear well-suited to
>>> this form of data.
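[As a concrete illustration of the trial-document shape Barry describes, such
a document might look like the following sketch. All field and parameter
names here are hypothetical, not from the thread:]

```javascript
// Hypothetical trial document: the per-trial stimulus settings live
// under "parameters"; the large numeric datasets would be stored as
// attachments on the same document rather than inline.
var trialDoc = {
  _id: "trial-0001",   // hypothetical document id
  type: "trial",       // hypothetical type marker
  parameters: {
    parameter1: 10,
    parameter2: 42
  }
  // numeric datasets uploaded separately as attachments, e.g.
  // PUT /db/trial-0001/dataset1.bin
};
```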
>>> In addition, each trial has one or more numeric datasets, each on
>>> the order of 1 MB but up to 100 MB. It seems that having CouchDB
>>> documents that contain a key-value pair like
>>>
>>> "parameters" : {
>>>   "parameter1" : value1,
>>>   "parameter2" : value2
>>>   // etc.
>>> }
>>>
>>> and with attachments for the numeric datasets is the CouchDB way
>>> to go.
>>>
>>
>> This is exactly the layout I'd recommend using.
>>
>>> Users will want to query this data set for all trials whose
>>> settings satisfy some boolean expression. So, for example, "trials
>>> where (parameters['parameter1'] == 10 AND
>>> parameters['parameter2'] >= 42)".
>>>
>>> So, now a few questions:
>>>
>>> 1. Is there a way to create a permanent view that supports queries
>>> like the one above? I got as far as a view like
>>>
>>> // Map
>>> function (doc) {
>>>   for (var parameter in doc.parameters) {
>>>     emit([parameter, doc.parameters[parameter]], doc._id);
>>>   }
>>> }
>>>
>>> // Reduce ("union" is an undefined helper here)
>>> function (keys, values, rereduce) {
>>>   if (rereduce) {
>>>     return union(values);
>>>   }
>>>   return values;
>>> }
>
> In fact, I think I messed up; I don't really need the reduce
> function in this view, do I?
>>>
>>> I believe this will give a view which, when queried with
>>> group=true, will give a set of rows keyed by
>>> [parameter, parameterValue], each with a list of the IDs of the
>>> trial documents that have that parameter:parameterValue pair. Is
>>> this correct?
>>>
>>> Given this, I could take the union of the values of the rows with
>>> startkey=[parameter1, 10],count=1 and startkey=[parameter2, 42] to
>>> get the set of trial document IDs that match the query.
>>>
>>> But is there a way to structure the view's map/reduce so that I
>>> don't have to do the union in my own code (i.e. CouchDB does it as
>>> part of the map/reduce)? The approach outlined above leads to an
>>> HTTP GET for each term in the boolean expression, for example.
>>>
>>
>> Unfortunately, this is one of the aspects of CouchDB that is hard
>> to overcome.
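[The per-term union Barry describes would have to happen client-side; a
minimal sketch, assuming each query term has already been resolved, via the
view above, to a plain array of doc IDs:]

```javascript
// Union of several arrays of doc ids, e.g. one array per term of a
// user query. Order of first appearance is preserved and duplicates
// are dropped.
function union(idLists) {
  var seen = {};
  var out = [];
  idLists.forEach(function (ids) {
    ids.forEach(function (id) {
      if (!seen[id]) {
        seen[id] = true;
        out.push(id);
      }
    });
  });
  return out;
}
```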
>> Lots of user-specifiable queries can lead to complications without
>> some limitation. Hopefully by the time 1.0 rolls around we'll have
>> made much more progress on dynamic query capabilities, but until
>> then the method I'd recommend is something along these lines:
>>
>> The first step is to know how many doc IDs you have for each
>> parameter. Here we'll set that up:
>>
>> // Map
>> function (doc) {
>>   for (var prop in doc) {
>>     if (prop.substr(0, 1) != "_") emit(prop, 1);
>>   }
>> }
>>
>> // Reduce
>> function (keys, values) {
>>   return sum(values);
>> }
>>
>> Now you can query this with a multi-get, so that you know the
>> number of doc IDs for each input parameter in your query, by
>> posting a JSON body to the view:
>>
>> curl -X POST -d '{"keys": ["param1", "param2", "paramN"]}' \
>>   'http://127.0.0.1:5984/db_name/_view/vname?group=true'
>>
>> Now that we know the relative number of doc IDs, we can start
>> searching for the result set by applying each boolean clause using
>> set math. We just apply them from the smallest number of doc IDs to
>> the largest, to try and keep resource usage to a minimum.
>
> This seems like a very common pattern. Is there any chance of
> getting it implemented in CouchDB?
>
>>
>> At the moment, that's the pure CouchDB way. In real life, for your
>> query interface, I'd most likely write a small slave process that
>> uses the _external interface. Hopefully in the next months a couple
>> of feature ideas I have rattling around will coalesce into an
>> implementation that makes things like this easier from directly
>> within CouchDB. But for right now, that's all hand-waving.
>
> I'm not familiar with the _external interface yet. Is there some
> documentation? Is this how the Lucene index that Robert mentions
> works?
>
> User-specifiable queries like this are going to be a critical
> feature for us, whether we go with CouchDB or not, so I'm very
> interested in keeping up with related developments.
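[The "set math" step Paul describes, applied smallest-first, can be
sketched as a client-side helper. This again assumes each AND clause has
already been resolved to an array of matching doc IDs via the views:]

```javascript
// Intersect per-clause doc-id lists, starting from the smallest list
// so the candidate set shrinks as early as possible (Paul's
// smallest-to-largest ordering). Expects at least one list.
function intersectAll(idLists) {
  var sorted = idLists.slice().sort(function (a, b) {
    return a.length - b.length;
  });
  return sorted.reduce(function (acc, ids) {
    var present = {};
    ids.forEach(function (id) { present[id] = true; });
    return acc.filter(function (id) { return present[id]; });
  });
}
```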
> Feel free to contact me offline if you're interested in more
> specific use cases, etc.
>
> Thanks again,
> Barry
>
>>
>>> 2. What is the (practical) limit on attachment size? Is it
>>> reasonable to store multi-MB attachments in the database? If not,
>>> I will go with external file(s) for the numeric data and store a
>>> reference in the trial document.
>>>
>>> Thanks for any insight,
>>>
>>> Barry
>>>
>>
>> Trunk has support for streaming writes when a Content-Length header
>> is present. Chris Anderson was just working the other day on
>> streaming writes to disk in the absence of a Content-Length header.
>> That basically means that if your HTTP client sends a
>> Content-Length header, the sky's the limit. If you don't send a
>> Content-Length header, you'll be limited by the available RAM on
>> the machine running CouchDB until Chris finishes his patch.
>>
>> A small caveat for the current implementation is that larger
>> attachments can end up causing a fair bit of RAM usage on the
>> receiving end. I doubt that 100 MiB attachments are big enough to
>> cause an issue, but you may want to test that before relying on it.
>> Hopefully this is taken care of pre-0.9 (the bits and pieces appear
>> to be falling into place, at least).
>>
>> HTH,
>> Paul Davis
>>
>