Subject: Re: Large lists of data
From: John
Reply-To: user@couchdb.apache.org
In-Reply-To: <885B3949-C9C4-4656-8A48-C55D5F080103@apache.org>
Date: Sat, 24 Jul 2010 22:14:37 +0100
Message-Id:
<42FE9041-D5E9-43FF-BD5F-1EEE3C5A35CC@netdev.co.uk>
To: user@couchdb.apache.org

Thanks for the help Chris, much appreciated. I'll give it a whirl tomorrow and I'll put a bit more work into phrasing my problem next time!

John

On 24 Jul 2010, at 21:51, J Chris Anderson wrote:

>
> On Jul 24, 2010, at 1:36 PM, John Logsdon wrote:
>
>> In SQL it would be something like:
>>
>> Select * from entries e
>> Where e.list_id = 1234
>> and key in ('0123456789', '012345678', '01234567', '0123456', '012345', '01234')
>>
>
> oh well in that case (you know all the keys in advance) you can do a multi-key query.
>
> we tunnel a semantic GET over an HTTP POST, so you POST
>
> {
>   "keys" : [array of keys]
> }
>
> to the view.
>
> For some reason I thought you were looking for a special kind of range query.
>
> Chris
>
>>
>> On 24 Jul 2010, at 21:20, J Chris Anderson wrote:
>>
>>>
>>> On Jul 24, 2010, at 1:13 PM, John wrote:
>>>
>>>> I'm not sure if that will give me what I want exactly.
>>>>
>>>> I want to search for these exact numbers only, nothing in between:
>>>>
>>>>>> 0123456789
>>>>>> 012345678
>>>>>> 01234567
>>>>>> 0123456
>>>>>> 012345
>>>>>> 01234
>>>>
>>>> Therefore the most documents it could possibly return is 6, i.e. even though startkey is "01234" and endkey is "0123456789", the number "012346" is not a valid match.
>>>>
>>>
>>> startkey "01234" endkey "012346" with inclusive_end = false should work, except it will bring in e.g. "0123457"
>>>
>>> If "0123457" is not OK then I'm flummoxed as to how to describe your key requirements in a simple way, regardless of technology.
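Since John knows all the keys in advance (the full number plus every prefix down to a minimum width), the body for Chris's multi-key POST can be generated. A minimal sketch; the helper name `prefixKeys` and the minimum width of 5 are illustrative, taken from the examples in the thread rather than any CouchDB API:

```javascript
// Build the "keys" body for a multi-key view query: the full number
// plus every prefix down to a minimum width, as in John's example.
function prefixKeys(number, minWidth) {
  const keys = [];
  for (let len = number.length; len >= minWidth; len--) {
    keys.push(number.slice(0, len));
  }
  return keys;
}

// Body to POST to the view, e.g. /database/_design/portability/_view/NP
const body = JSON.stringify({ keys: prefixKeys("0123456789", 5) });
```

Posting this returns at most one row per key, so the "widest match" is simply the longest key that came back.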
>>>
>>>> Hope that makes sense!
>>>>
>>>> John
>>>>
>>>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>>>>
>>>>>
>>>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Thanks to you both for the answers so far. Indeed my setup is far more complex than I have exposed to date, but I'm breaking it into bite-sized chunks around the use cases that I think are the most challenging for me.
>>>>>>
>>>>>> Although your answers were useful they don't quite hit the mark, and that's probably because I didn't explain my problem well enough to start with!
>>>>>>
>>>>>> The database will contain entries from multiple lists (many thousands perhaps) so the _id will never be unique on a telephone number. Perhaps this might work though:
>>>>>>
>>>>>> GET /database/#0123456789
>>>>>>
>>>>>> or I could just keep the _id as a uuid and move this problem (find by list id and number) to the view.
>>>>>>
>>>>>
>>>>> for now I'd say just go with uuids and you can have a view by telephone number for direct (or starts_with) lookups.
>>>>>
>>>>>> The view by list won't work for me. I need to be able to query the view with something like:
>>>>>>
>>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>
>>>>>>
>>>>>> In fact in some cases the problem is more complex than this, as I need to search for the "widest match":
>>>>>>
>>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>&min_width=5
>>>>>>
>>>>>> which would return the widest match in:
>>>>>>
>>>>>> 0123456789
>>>>>> 012345678
>>>>>> 01234567
>>>>>> 0123456
>>>>>> 012345
>>>>>> 01234
>>>>>>
>>>>>>
>>>>>> I even have another use case where I need to do a STARTS_WITH, e.g. provide a key of 01234 and return true if there are any numbers that start with 01234.
>>>>>>
>>>>>
>>>>> this is easy.
>>>>> have a view like:
>>>>>
>>>>> function(doc) {
>>>>>   emit([doc.list_id, doc.number], null);
>>>>> }
>>>>>
>>>>> Then you can query with
>>>>>
>>>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>>>>
>>>>> to get everything with a prefix of "012" in the "mylist" list. you can mess around with inclusive_end = true / false to not get the exact number "013" in your result set.
>>>>>
>>>>> from this technique you can see how you could do starts-with against just phone numbers also, with a view like
>>>>>
>>>>> function(doc) {
>>>>>   emit(doc.number, null);
>>>>> }
>>>>>
>>>>> Note I have telephone numbers as strings in this example, as a regular number 012 is the same as 12.
>>>>>
>>>>>> This is a typical telecom problem and it would be good to document a design pattern for this use case. In fact there's a discussion for another day on how/where we could document these patterns and get peer reviews on them.
>>>>>>
>>>>>> Thanks again
>>>>>>
>>>>>> John
>>>>>>
>>>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jul 24, 2010, at 7:41 AM, mickael.bailly@free.fr wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> 1/ it's a little hard to answer this question, your setup is certainly a little more complex than what you expose in your email :-) However thousands of documents are gracefully handled by CouchDB.
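The startkey/endkey pair Chris uses ("012" up to "013") generalizes: the endkey is the prefix with its final character incremented. A small sketch of that derivation; the helper name `prefixEndkey` is mine, not from the thread or the CouchDB API:

```javascript
// Derive the endkey for a starts-with query by bumping the last
// character of the prefix, so ["mylist","012"] .. ["mylist","013"]
// covers every key beginning with "012" when inclusive_end=false.
function prefixEndkey(prefix) {
  const bumped = String.fromCharCode(prefix.charCodeAt(prefix.length - 1) + 1);
  return prefix.slice(0, -1) + bumped;
}

const startkey = JSON.stringify(["mylist", "012"]);
const endkey = JSON.stringify(["mylist", prefixEndkey("012")]);
// Query: ?startkey=["mylist","012"]&endkey=["mylist","013"]&inclusive_end=false
```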
>>>>>>>>
>>>>>>>> 2/ At first sight your documents will look like:
>>>>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn", "operatorId": 1234 }
>>>>>>>>
>>>>>>>> That way you can query your document by phone number:
>>>>>>>>
>>>>>>>> GET /database/0123456789
>>>>>>>>
>>>>>>>> and have all documents belonging to the list "mylist" by creating a view that emits the "list" field:
>>>>>>>>
>>>>>>>> function (doc) {
>>>>>>>>   if ( doc.list && doc.type == "NP" ) {
>>>>>>>>     emit(doc.list, null);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> and fetching them with something like:
>>>>>>>>
>>>>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>>>>
>>>>>>>> 3/ When updating a document: the document is of course immediately available. However, the view index won't be updated. In CouchDB, view indexes are rebuilt on view query (not on document update). When you query CouchDB for "give me all the documents of the view NP", Couch will take all documents that have changed (been added, updated, or deleted) since the last time you asked for the view, and will update the index accordingly. You have the option of fetching the view without rebuilding the index, with the "stale" parameter, but in that case, of course, you won't see the changes. During the rebuild of the index, subsequent view queries are queued until the index is up to date.
>>>>>>>>
>>>>>>>> 4/ I set up CouchDB to parse network logs. A view took something like 25 minutes for 100 million documents, on a Dell PowerEdge 2950 Xen virtual machine with two dedicated processors and 4 GB of RAM. Numbers can vary heavily according to the complexity of the view, so it's always hard (and dangerous) to give numbers. Moreover my indexes were not only numbers, but also strings.
>>>>>>>>
>>>>>>>
>>>>>>> this is a good response.
>>>>>>> I'd only follow up to say that there are some techniques you can use to further tune view-generation performance. one: key size and entropy can make a big difference. the view by list, as above, looks pretty good on that front.
>>>>>>>
>>>>>>> CouchDB can also be configured to store view indexes on a separate disk from the database file, which can reduce IO contention if you are at the edge of what your hardware can do.
>>>>>>>
>>>>>>> Also, there is the option to query views with stale=ok, which will return a query based on the latest snapshot, with low latency, so clients aren't blocked waiting for generation to complete. then you can use a cron job with a regular view query and limit=1 to keep the index up to date. so clients always see a fairly recent snapshot, with low latency.
>>>>>>>
>>>>>>>>
>>>>>>>> What you should be aware of is that CouchDB requires a maintenance task to keep performance up: it's called "compact" and should be run on databases (to rebuild the db file, which is append-only) and on views (to rebuild the index file, which is append-only). During the compact, the database is still available but performance is degraded (from my personal experience).
>>>>>>>> Also, a new replication engine is in the pipeline and should greatly improve the replication experience.
>>>>>>>>
>>>>>>>>
>>>>>>>> Mickael
>>>>>>>>
>>>>>>>> ----- Original Mail -----
>>>>>>>> From: "John"
>>>>>>>> To: user@couchdb.apache.org
>>>>>>>> Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienna
>>>>>>>> Subject: Large lists of data
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I'm currently evaluating CouchDB as a candidate to replace the relational databases used in our telecom applications.
>>>>>>>> For most of our data I can see a good fit, and we already expose our service provisioning as JSON over REST, so we're well positioned for a migration.
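Chris's stale=ok pattern above boils down to two query URLs: a low-latency read path for clients and a periodic warming query that triggers the index rebuild. A sketch of the two URLs, using the view name from earlier in the thread (the variable names are illustrative):

```javascript
// The stale=ok pattern: clients read the last built snapshot, while a
// cron-driven query keeps the index fresh.
const view = "/database/_design/portability/_view/NP";

// Client read path: returns immediately from the existing snapshot,
// never waiting for view generation.
const readUrl = view + '?key="mylist"&stale=ok';

// Cron-driven warmer: a normal view query with limit=1 forces the
// index to catch up with recent document changes.
const warmUrl = view + "?limit=1";
```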
>>>>>>>> One area that concerns me, though, is whether this technology is suitable for our list data. An example of this is Mobile Number Portability, where we have millions of rows of data representing ported numbers with some attributes against each.
>>>>>>>>
>>>>>>>> We use the standard relational approach to this and have an entries table that has a foreign key reference to a parent list.
>>>>>>>>
>>>>>>>> On our web services we do something like this:
>>>>>>>>
>>>>>>>> Create a list:
>>>>>>>>
>>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>>>> { "type": "NP" }
>>>>>>>>
>>>>>>>> To add a row to a list:
>>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>>>> { "status": "portedIn", "operatorId": 1234 }
>>>>>>>>
>>>>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>>>>
>>>>>>>> The list data is used when processing calls, and it requires a fast lookup on the entries table, which is obviously indexed.
>>>>>>>>
>>>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>>>>
>>>>>>>> 1) Is CouchDB the *right* technology for this job? (I know it can do it!)
>>>>>>>>
>>>>>>>> 2) I presume that the relationship I currently have in my relational database would remain the same for Couch, i.e. the entry document would reference the list document, but maybe there's a better way to do this?
>>>>>>>>
>>>>>>>> 3) Number portability requires 15-minute, 1-hour and daily syncs with a central number portability database. This can result in bulk updates of thousands of numbers. I'm concerned with how long it takes to build a CouchDB index and to incrementally update it when the number of changes is large (adds/removes).
>>>>>>>> What does this mean for the availability of the number? i.e. is the entry in the db but unavailable to the application because its entry in the index hasn't been built yet?
>>>>>>>>
>>>>>>>> 4) Telephone numbers suit B-trees, so the index building should be quite fast and efficient I would have thought, but does someone have anything more concrete in terms of how long it would typically take? I think that the bottleneck is the disk I/O, and therefore it may be vastly different between my laptop and one of our beefy production servers, but again I'd be interested in other people's experience.
>>>>>>>>
>>>>>>>> Bit of a long one, so thanks if you've read it to this point! There's a lot to like with CouchDB (esp. the replication for our use case) so I'm hoping that what I've asked above is feasible!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
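The bulk syncs John describes in question 3 map naturally onto CouchDB's _bulk_docs endpoint, which accepts many documents in one POST. A sketch of building such a payload; the field names follow the documents sketched earlier in the thread, and `bulkBody` is an illustrative helper, not part of CouchDB:

```javascript
// Build a _bulk_docs payload for a batch sync of ported-number entries.
// Each entry becomes one document tagged with its parent list.
function bulkBody(listId, entries) {
  return JSON.stringify({
    docs: entries.map(e => ({
      list: listId,
      type: "NP",
      number: e.number,
      status: e.status,
      operatorId: e.operatorId
    }))
  });
}

// POST this to /database/_bulk_docs in a single round-trip.
const payload = bulkBody("mylist",
  [{ number: "0123456789", status: "portedIn", operatorId: 1234 }]);
```

After the batch lands, the next regular view query (or the cron warmer mentioned above) folds all the new documents into the index in one incremental pass.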