From: J Chris Anderson
Subject: Re: Large lists of data
Date: Sat, 24 Jul 2010 11:15:32 -0700
References: <976425960.1183231279982519811.JavaMail.root@zimbra5-e1.priv.proxad.net>
To: user@couchdb.apache.org

On Jul 24, 2010, at 7:41 AM, mickael.bailly@free.fr wrote:

> Hello,
>
> 1/ It's a little hard to answer this question; your setup is certainly
> a little more complex than what you describe in your email :-) However,
> CouchDB handles thousands of documents gracefully.
>
> 2/ At first sight your documents will look like:
>
> { "_id": "0123456789", "list": "mylist", "type": "NP",
>   "status": "portedIn", "operatorId": 1234 }
>
> That way you can query your document by phone number:
>
> GET /database/0123456789
>
> and get all documents belonging to the list "mylist" by creating a
> view that emits the "list" field:
>
> function (doc) {
>   if (doc.list && doc.type == "NP") {
>     emit(doc.list, null);
>   }
> }
>
> and fetching them with something like:
>
> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>
> 3/ When updating a document: the document is of course immediately
> available. However, the view index won't be updated. In CouchDB, view
> indexes are updated on view query (not on document update). When you
> ask CouchDB "give me all the documents of the view NP", Couch takes all
> documents that have changed (added, updated, deleted) since the last
> time you queried the view and updates the index accordingly. You have
> the option of fetching the view without updating the index, with the
> "stale" parameter, but in this case, of course, you won't see the
> changes. While the index is being updated, subsequent view queries are
> queued until the index is up to date.
>
> 4/ I set up CouchDB to parse network logs. A view took something like
> 25 minutes for 100 million documents, on a Dell PowerEdge 2950 Xen
> virtual machine with two dedicated processors and 4 GB of RAM.
> Numbers can vary heavily according to the complexity of the view, so
> it's always hard (and dangerous) to give numbers. Moreover, my indexes
> were not only numbers, but also strings.
>

This is a good response. I'd only follow up to say that there are some
techniques you can use to further tune view-generation performance.
One: key size and entropy can make a big difference. The view by list,
as above, looks pretty good on that front.

CouchDB can also be configured to store view indexes on a separate disk
from the database file, which can reduce IO contention if you are at
the edge of what your hardware can do.

Also, there is the option to query views with stale=ok, which returns
results based on the latest index snapshot, with low latency, so
clients aren't blocked waiting for generation to complete. Then you can
use a cron job that runs a regular view query with limit=1 to keep the
index up to date, so clients always see a fairly recent snapshot, with
low latency.

>
> What you should be aware of is that CouchDB requires maintenance tasks
> to keep performance good. The task is called "compact" and should be
> run on databases (to rebuild the db file, which is append-only) and on
> database views (to rebuild the index file, which is append-only).
> During compaction, the database is still available but performance is
> degraded (from my personal experience).
> Also, a new replication engine is in the pipeline and should greatly
> improve the replication experience.
>
>
> Mickael
>
> ----- Original Message -----
> From: "John"
> To: user@couchdb.apache.org
> Sent: Saturday, July 24, 2010 11:37:56 GMT +01:00 Amsterdam /
> Berlin / Bern / Rome / Stockholm / Vienna
> Subject: Large lists of data
>
> Hi
>
> I'm currently evaluating couchdb as a candidate to replace the
> relational databases as used in our Telecom Applications.
> For most of our data I can see a good fit, and we already expose our
> service provisioning as JSON over REST, so we're well positioned for a
> migration.
> One area that concerns me though is whether this technology is
> suitable for our list data. An example of this is Mobile Number
> Portability, where we have millions of rows of data representing
> ported numbers with some attributes against each.
>
> We use the standard relational approach to this and have an entries
> table that has a foreign key reference to a parent list.
>
> On our web services we do something like this:
>
> Create a list:
>
> PUT /cie-rest/provision/accounts/netdev/lists/mylist
> { "type": "NP"}
>
> To add a row to a list:
>
> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
> { "status":"portedIn", "operatorId":1234}
>
> If we want to add a lot of rows we just POST a document to the list.
>
> The list data is used when processing calls and it requires a fast
> lookup on the entries table, which is obviously indexed.
>
> Anyway, I'd be interested in getting some opinions on:
>
> 1) Is couchdb the *right* technology for this job? (I know it can do
> it!)
>
> 2) I presume that the relationship I currently have in my relational
> database would remain the same for couch, i.e. the entry document would
> reference the list document, but maybe there's a better way to do this?
>
> 3) Number portability requires 15 min, 1 hour and daily syncs with a
> central number portability database. This can result in bulk updates
> of thousands of numbers. I'm concerned with how long it takes to build
> a couchdb index and to incrementally update it when the number of
> changes is large (adds/removes).
> What does this mean for the availability of the number? i.e. is the
> entry in the db but unavailable to the application because its entry
> in the index hasn't been built yet?
>
> 4) Telephone numbers like btrees, so the index building should be
> quite fast and efficient I would have thought, but does someone have
> anything more concrete in terms of how long it would typically take? I
> think that the bottleneck is the disk I/O and therefore it may be
> vastly different between my laptop and one of our beefy production
> servers, but again I'd be interested in other people's experience.
>
> Bit of a long one, so thanks if you've read it to this point! There's
> a lot to like with couchdb (esp. the replication for our use case) so
> I'm hoping that what I've asked above is feasible!
>
> Thanks
>
> John
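
For reference, the stale=ok plus cron-job approach from the reply could be sketched roughly as below. This is only an illustration, not anything posted in the thread: the server address, the helper names (view_url, client_url, warm_url, fetch) and the choice of Python are assumptions, while the view path and the query parameters (key, include_docs, stale=ok, limit=1) are the ones mentioned above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: the server address and database name are placeholders.
BASE = "http://localhost:5984/database"
VIEW = "_design/portability/_view/NP"  # view path as used in the thread

# Parameters whose values CouchDB expects as JSON (strings need quotes).
JSON_PARAMS = {"key", "startkey", "endkey"}


def view_url(base, view, **params):
    """Build a view query URL, JSON-encoding keys, booleans and numbers."""
    encoded = {}
    for name, value in params.items():
        if name in JSON_PARAMS or not isinstance(value, str):
            encoded[name] = json.dumps(value)  # "mylist" gains quotes, True -> true
        else:
            encoded[name] = value  # e.g. stale=ok is passed literally
    return f"{base}/{view}?{urlencode(encoded)}"


def client_url(list_name):
    """Low-latency read: answer from the last index snapshot (stale=ok)."""
    return view_url(BASE, VIEW, key=list_name, include_docs=True, stale="ok")


def warm_url():
    """Regular query with limit=1: makes CouchDB bring the index up to date."""
    return view_url(BASE, VIEW, limit=1)


def fetch(url):
    """Run a query against a live server and decode the JSON response."""
    with urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URLs; call fetch(...) when a CouchDB server is reachable.
    print(client_url("mylist"))
    print(warm_url())
```

A cron entry could call fetch(warm_url()) at whatever interval suits the sync schedule, while call-processing code uses fetch(client_url(...)). The interval is a tuning choice that the thread does not specify.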