From: "Nicholas Retallack" <nickretallack@gmail.com>
To: couchdb-user@incubator.apache.org
Subject: Re: Reduce is Really Slow!
Date: Wed, 20 Aug 2008 14:45:33 -0700

Oh, clever. I was considering a solution like this, but I was worried I
wouldn't know where to stop, and might end up chopping it between some
documents that should be grouped together.
"some_value_that_sorts_after_last_possible_date" solves that problem.
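Here's roughly what I take the suggestion to be. I'm guessing at the
field names (doc.name, doc.date), so my sketch may not match what's
actually in the paste:

  // map only, no reduce: a compound key makes all the rows for one
  // name sort next to each other in the view
  function(doc) {
    emit([doc.name, doc.date], doc);
  }

Then the slice for one name is startkey=["docname"]&endkey=["docname",
{}]. An empty object should work as the value that sorts after any
possible date, since objects collate after strings in view keys.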
There's another problem though, for when I want to do pagination. Say I
want to display exactly 100 of these on a page. How do I know I've
fetched 100 of them, if any number of documents could be in a group?
Also, how would I know what document name appears 100 documents ahead
of this one? This gets messy...

Essentially I figured this should be a task the database is capable of
doing on its own. I don't want every action in my web application to
have to solve the caching problem itself, after doing serious
data-munging on all this ugly stuff I got back from the database. How
do I know when the cache should be invalidated anyway, without insider
knowledge from the database?

Hm, cleverness. I guess I could figure out what every hundredth name is
by making a view for just the names and querying that. Any efficient
way to reduce that list for uniqueness? Perhaps group=true and a reduce
of function(){return true}, something like the sketch below.

There should be a wiki page devoted to these silly tricks, like this
hackish way to put together pagination. And tag clouds.
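Untested, and just my guess at how to write it, but I imagine the view
would look like this:

  // map: one row per document, keyed by name
  function(doc) {
    emit(doc.name, null);
  }

  // reduce: the value is a throwaway; ?group=true does the real work
  // by collapsing the rows to one per distinct key
  function(keys, values) {
    return true;
  }

Querying with ?group=true should then give one row per unique name, and
count/skip would page through names rather than raw documents.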
On Wed, Aug 20, 2008 at 1:56 PM, Paul Davis wrote:
> If I'm not mistaken, you have a number of documents that all have a
> given 'name', and you want the list of elements for each value of
> 'name'. To accomplish this in db-land, you could use a design
> document like [1].
>
> Then to get the data for any given doc name, you'd query your map like
> [2]. This gets you everything emitted with a given doc name. The
> underlying idea to remember in getting data out of couch is that your
> maps should emit things that sort together. Then you can use 'slice'
> operations to pull out the documents you need.
>
> Your values aren't magically in one array, but merging the arrays in
> app-land is easy enough.
>
> If I've completely screwed up what you were going after, let me know.
>
> [1] http://www.friendpaste.com/2AHz3ahr
> [2] http://localhost:5984/dbname/_view/design_docid/index?startkey=["docname"]&endkey=["docname",
> some_value_that_sorts_after_last_possible_date]
>
> Paul
>
> On Wed, Aug 20, 2008 at 4:32 PM, Nicholas Retallack wrote:
>> Replacing 'return values' with 'return values.length' shows you're
>> right: 4 minutes for the first query, milliseconds afterward, as
>> opposed to forever.
>>
>> I guess I was expecting reduce to do things it wasn't designed to do.
>> I notice ?group=true&group_level=1 is ignored unless a reduce
>> function of some sort exists, though. Is there any way to get this
>> grouping behavior without such extreme reductions in result size and
>> performance?
>>
>> The view I was using here (http://www.friendpaste.com/2AHz3ahr) was
>> designed to simply take each document with the same name and merge
>> them into one document, turning same-named fields into lists (here's
>> a more general version: http://www.friendpaste.com/Ud6ELaXC). This
>> reduces the document size, but only by whatever overhead the repeated
>> field names would add. The fields I was reducing only contained
>> integers, so reduction did shrink documents by quite a bit. It was
>> pretty handy, but the query took 25 seconds to return one result even
>> when called repeatedly.
>>
>> Is there some technical reason for this limitation?
>>
>> I had assumed reduce was just an ordinary post-processing step that I
>> could run once and have something akin to a brand new generated table
>> to query on, so I wrote my views to transform my data to fit the
>> various ways I wanted to view it. It worked fine for small amounts of
>> data in little experiments, but as soon as I used it on my real
>> database, I hit this wall.
>>
>> Are there plans to make reduce work for these more general
>> data-mangling tasks? Or should I be approaching the problem a
>> different way? Perhaps write my map calls differently so they produce
>> more rows for reduce to compact? Or do something special if the third
>> parameter to reduce is true?
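For reference, a sketch (not code from the thread) of a counting reduce
that stays fixed-size and uses that third parameter. It is true on
re-reduce, when the values are the outputs of earlier reduce calls
rather than rows from the map:

  function(keys, values, rereduce) {
    if (rereduce) {
      // values are partial counts from earlier reduce calls
      var total = 0;
      for (var i = 0; i < values.length; i++) total += values[i];
      return total;
    }
    // values are raw mapped rows; count them
    return values.length;
  }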
>>
>> On Tue, Aug 19, 2008 at 5:41 PM, Damien Katz wrote:
>>> You can return arrays and objects, whatever JSON allows. But if the
>>> object keeps getting bigger the more rows it reduces, then it simply
>>> won't work.
>>>
>>> The exception is that the size of the reduce value can be
>>> logarithmic with respect to the rows. The simplest example of
>>> logarithmic growth is the summing of a row value. With Erlang's
>>> bignums, the size on disk is Log2(Sum(Rows)), which is perfectly
>>> acceptable growth.
>>>
>>> -Damien
>>>
>>> On Aug 19, 2008, at 8:14 PM, Nicholas Retallack wrote:
>>>
>>>> Oh! I didn't realize that was a rule. I had used 'return values' in
>>>> an attempt to run the simplest test possible on my data. But hey,
>>>> values is an array. Does that mean you're not allowed to return
>>>> objects like arrays from reduce at all? Because I was kind of
>>>> hoping I could. I was able to do it with smaller amounts of data,
>>>> after all. Perhaps this is due to re-reduce kicking in?
>>>>
>>>> For the record, couchdb is still working on this query I started
>>>> hours ago, and chewing up all my cpu. I am going to have to kill it
>>>> so I can get some work done.
>>>>
>>>> On Tue, Aug 19, 2008 at 4:21 PM, Damien Katz wrote:
>>>>
>>>>> I think the problem with your reduce is that it looks like it's
>>>>> not actually reducing to a single value, but instead using reduce
>>>>> for grouping data. That will cause severe performance problems.
>>>>>
>>>>> For reduce to work properly, you should end up with a fixed-size
>>>>> data structure regardless of the number of values being reduced
>>>>> (not strictly true, but that's the general rule).
>>>>>
>>>>> -Damien
>>>>>
>>>>> On Aug 19, 2008, at 6:55 PM, Nicholas Retallack wrote:
>>>>>
>>>>>> Okay, I got it built on gentoo instead, but I'm still having
>>>>>> performance issues with reduce.
>>>>>>
>>>>>> Erlang (BEAM) emulator version 5.6.3 [source] [64-bit] [async-threads:0]
>>>>>> couchdb - Apache CouchDB 0.8.1-incubating
>>>>>>
>>>>>> Here's a query I tried to do:
>>>>>>
>>>>>> I freshly imported about 191MB of data in 155399 documents. 29090
>>>>>> are not discarded by map. Map produces one row with 5 fields for
>>>>>> each of these documents. After grouping, each group should have
>>>>>> four rows. Reduce is a simple function(keys,values){return values}.
>>>>>>
>>>>>> Here's the query call:
>>>>>>
>>>>>> time curl -X GET 'http://localhost:5984/clickfund/_view/offers/index?count=1&group=true&group_level=1'
>>>>>>
>>>>>> This is running on a 512MB slicehost account. http://www.slicehost.com/
>>>>>>
>>>>>> I'd love to give you this command's execution time, since I ran
>>>>>> it last night before I went to bed, but it must have taken over
>>>>>> an hour, because my laptop went to sleep and severed the
>>>>>> connection. Trying it again.
>>>>>>
>>>>>> Considering it's blazing fast without the reduce function, I can
>>>>>> only assume what's taking all this time is overhead setting up
>>>>>> and tearing down the simple function(keys,values){return values}.
>>>>>>
>>>>>> I can give you guys the python source to set up this database so
>>>>>> you can try it yourself if you like.
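The canonical fixed-size reduce Damien describes above, summing a row
value, might look like this. A sketch only, assuming the map emits a
number as each row's value:

  function(keys, values, rereduce) {
    // the same code covers both cases: on re-reduce the values are
    // partial sums, and a sum of partial sums is still the sum
    var total = 0;
    for (var i = 0; i < values.length; i++) total += values[i];
    return total;
  }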