From: "Ralf Nieuwenhuijsen" <ralf.nieuwenhuijsen@gmail.com>
To: couchdb-user@incubator.apache.org
Date: Tue, 19 Aug 2008 10:16:55 +0200
Subject: Re: flexible filtering needed, with speed.

Don't take Futon as a speed measure, since it might also be slowing down in the rendering part if your documents are big (there is a lot of stuff going on client-side as well).

The truth is, for all the data that is being searched, people only care about 3-5 different types of search. You can, of course, go nuts with the indexing and just generate every possible index you could possibly need. Here is one of my favorites; it creates an index for every unique field:

    function(doc) {
      for (var k in doc) {
        emit([k, 1, doc[k]], doc);
      }
    }

You can query it with startkey=["someField", 1, null] and endkey=["someField", 2, null] to get the index for 'someField'. Of course, this baby is going to create a huge index if used with too many or too-big documents, but I would at least try something like that. I use the above view function to make sure I can get the data sorted however I want.
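To make the key layout concrete, here is a small stand-alone sketch (plain JavaScript with a stubbed emit, run outside CouchDB; the sample document is made up) of what that map function emits, and why the [field, 1, ...] to [field, 2, ...] range selects exactly one field's rows:

```javascript
// Stub of CouchDB's emit(): collect rows instead of writing to an index.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// The generic one-index-per-field map function from above.
function map(doc) {
  for (var k in doc) {
    emit([k, 1, doc[k]], doc);
  }
}

map({ _id: "abc", state: "GA", breakfast: true });

// Every field gets its own row, keyed [fieldName, 1, fieldValue].
// Because 1 sorts before 2, the range startkey=["state", 1, null] to
// endkey=["state", 2, null] covers all "state" rows and nothing else.
var stateRows = rows.filter(function (r) {
  return r.key[0] === "state";
});
console.log(stateRows[0].key); // ["state", 1, "GA"]
```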
2008/8/19 Brad Anderson:
> Howdy,
>
> I have 12K docs that look like this:
>
> {
>   "_id": "000111bf7a8515da822b05ebbb8cd257",
>   "_rev": "94750440",
>   "month": 17,
>   "store": {
>     "store_num": 123,
>     "city": "Atlanta",
>     "state": "GA",
>     "zip": "30301",
>     "exterior": true,
>     "interior": true,
>     "restroom": true,
>     "breakfast": true,
>     "sunday": true,
>     "adi_name": "Atlanta, GA",
>     "adi_num": 123,
>     "ownership": "Company",
>     "playground": "Indoor",
>     "seats": 123,
>     "parking_spaces": 123
>   },
>   "raw": {
>     "Other Hourly Pay": 0.28,
>     "Workers Comp - State Funds Exp": 401.65,
>     "Rent Expense - Company": -8,
>     "Archives Expense": 82.81,
>     "Revised Hours allowed per": 860.22,
>     "Merch Standard": 174.78,
>     "Total Property Tax": 1190.91
>     ...
>   }
> }
>
> I truncated 'raw', but it's usually much longer, and the average doc size is 5K.
>
> I'm trying to see how I will query them with views. I want to be able to
> filter down by various store subfields, i.e. all the breakfast = true stores
> in Georgia that are owned by franchisees. However, this will differ for
> just about every query.
>
> The 'reduce' function would then average each line in the 'raw' field.
>
> I have played around with views that take the store filters, but just
> returning the 'raw' field as the value from the map function is brutally
> slow in Futon. This is because the view is accessed right away, so it
> builds, which takes about 3-4 minutes (on a MBP with 4 GB RAM, 2.2 GHz
> dual core, 7200 RPM disk). I understand that the next time this specific
> store group is requested it's fast... but the queries will all be so
> dynamic that this seems prohibitively slow.
>
> So, I thought, should I be doing this in two steps? Set up the key to be
> store and whatever else I might want to query on (month or whatever
> timeframe), and return the doc ids as the values from the original query?
> I would then send in a complex key to do the filtering.
> This would require waiting for the _bulk_get functionality, and I'd send
> that list of ids into a 2nd query to get the raw data to send to 'map'.
>
> This is slow now on 12K docs... It needs to be stupid-fast at that low
> number of docs, because the plan is for *way* more data.
>
> The filtering part is tailor-made for an RDBMS, but the doc handling (all
> the 'raw' fields will be different store-by-store, industry-by-industry,
> will change over time, and in general be free-form) is perfect for CouchDB.
> Thoughts? I want to use the right tool for the job, and that's looking like
> an RDBMS, sadly. That is, unless I'm completely misusing Couch, in which
> case swift blows to the head are welcome.
>
> Cheers,
> BA
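For the filtering-and-averaging problem in the quoted message, one view per filter combination could be sketched like this (again plain JavaScript with a stubbed emit; the hand-rolled grouping loop stands in for group=true, the field names are borrowed from the sample document, and the averaging reduce is simplified: a real CouchDB reduce also has to handle rereduce, e.g. by carrying {sum, count} rather than a plain average):

```javascript
// Stub of CouchDB's emit(): collect rows instead of writing to an index.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Map: key on the filterable store fields plus the 'raw' line name,
// emitting the line's value. One row per 'raw' line per document.
function map(doc) {
  for (var name in doc.raw) {
    emit([doc.store.state, doc.store.ownership, doc.store.breakfast, name],
         doc.raw[name]);
  }
}

map({ store: { state: "GA", ownership: "Franchise", breakfast: true },
      raw: { "Archives Expense": 82.81, "Merch Standard": 174.78 } });
map({ store: { state: "GA", ownership: "Franchise", breakfast: true },
      raw: { "Archives Expense": 100.00 } });

// Simplified reduce: average the values that share a key.
function average(values) {
  var sum = 0;
  for (var i = 0; i < values.length; i++) sum += values[i];
  return sum / values.length;
}

// Group rows by key, as group=true would, then reduce each group.
var groups = {};
rows.forEach(function (r) {
  var k = JSON.stringify(r.key);
  (groups[k] = groups[k] || []).push(r.value);
});

// Averages the two "Archives Expense" lines (82.81 and 100.00).
console.log(average(groups['["GA","Franchise",true,"Archives Expense"]']));
```

A complex-key query against such a view (startkey/endkey on the store fields) then returns pre-averaged lines without a second round trip, at the cost of building one index per filter combination.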