Subject: Re: Large lists of data
From: J Chris Anderson
Date: Sat, 24 Jul 2010 13:51:38 -0700
To: user@couchdb.apache.org

On Jul 24, 2010, at 1:36 PM, John Logsdon wrote:

> In SQL it would be something like:
> 
> Select * from entries e
> Where e.list-id = 1234
> and key in ('0123456789', '012345678', '01234567', '0123456', '012345', '01234')
> 

Oh well, in that case (you know all the keys in advance) you can do a multi-key query. We tunnel a semantic GET over an HTTP POST, so you POST { "keys": [array of keys] } to the view.

For some reason I thought you were looking for a special kind of range query.

Chris

> 
> On 24 Jul 2010, at 21:20, J Chris Anderson wrote:
> 
>> 
>> On Jul 24, 2010, at 1:13 PM, John wrote:
>> 
>>> I'm not sure that will give me exactly what I want.
>>> 
>>> I want to search for these exact numbers only, nothing in between:
>>> 
>>>>> 0123456789
>>>>> 012345678
>>>>> 01234567
>>>>> 0123456
>>>>> 012345
>>>>> 01234
>>> 
>>> Therefore the most documents it could possibly return is 6, i.e. even though startkey is "01234" and endkey is "0123456789", the number "012346" is not a valid match.
>>> 
>> 
>> startkey "01234" endkey "012346" with inclusive_end = false should work, except it will bring in e.g. "0123457".
>> 
>> If "0123457" is not OK then I'm flummoxed as to how to describe your key requirements in a simple way, regardless of technology.
>> 
>>> Hope that makes sense!
>>> 
>>> John
>>> 
>>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>>> 
>>>> 
>>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>>> 
>>>>> Hi
>>>>> 
>>>>> Thanks to you both for the answers so far.
>>>>> Indeed my setup is far more complex than I have exposed to date, but I'm breaking it into bite-sized chunks around the use cases that I think are the most challenging for me.
>>>>> 
>>>>> Although your answers were useful, they don't quite hit the mark, and that's probably because I didn't explain my problem well enough to start with!
>>>>> 
>>>>> The database will contain entries from multiple lists (many thousands perhaps), so the _id will never be unique on a telephone number. Perhaps this might work though:
>>>>> 
>>>>> GET /database/#0123456789
>>>>> 
>>>>> or I could just keep the _id as a UUID and move this problem (find by list id and number) to the view.
>>>>> 
>>>> 
>>>> For now I'd say just go with UUIDs, and you can have a view by telephone number for direct (or starts-with) lookups.
>>>> 
>>>>> The view by list won't work for me. I need to be able to query the view with something like:
>>>>> 
>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>
>>>>> 
>>>>> In fact in some cases the problem is more complex than this, as I need to search for the "widest match":
>>>>> 
>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>&min_width=5
>>>>> 
>>>>> which would return the widest match in:
>>>>> 
>>>>> 0123456789
>>>>> 012345678
>>>>> 01234567
>>>>> 0123456
>>>>> 012345
>>>>> 01234
>>>>> 
>>>>> I even have another use case where I need to do a STARTS_WITH, e.g. provide a key of 01234 and return true if there are any numbers that start 01234.
>>>>> 
>>>> 
>>>> This is easy. Have a view like:
>>>> 
>>>> function(doc) {
>>>>   emit([doc.list_id, doc.number], null);
>>>> }
>>>> 
>>>> Then you can query with
>>>> 
>>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>>> 
>>>> to get everything with a prefix of "012" in the "mylist" list.
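A quick way to sanity-check the startkey/endkey trick above is to simulate it. This is only a sketch: the JSON-string comparison below stands in for CouchDB's view collation (an assumption that happens to hold for these ASCII digit strings, not the real ICU rules), and the `rows` array stands in for the view's b-tree.

```javascript
// Hypothetical simulation of a view keyed on [list_id, number] and a
// startkey/endkey range query -- illustrative only, not the real engine.
const rows = [
  { key: ["mylist", "01234"] },
  { key: ["mylist", "012345"] },
  { key: ["mylist", "012346"] },   // also matches the "012" prefix
  { key: ["mylist", "013000"] },   // outside the range
  { key: ["otherlist", "012399"] } // different list, outside the range
].sort((a, b) => (JSON.stringify(a.key) < JSON.stringify(b.key) ? -1 : 1));

// startkey is inclusive; endkey is excluded when inclusive_end = false.
function query(rows, startkey, endkey, inclusiveEnd = true) {
  const s = JSON.stringify(startkey), e = JSON.stringify(endkey);
  return rows.filter(({ key }) => {
    const k = JSON.stringify(key);
    return k >= s && (inclusiveEnd ? k <= e : k < e);
  });
}

const hits = query(rows, ["mylist", "012"], ["mylist", "013"], false);
console.log(hits.map(r => r.key[1])); // [ '01234', '012345', '012346' ]
```

Note how "013000" and the other list's number fall outside the range, which is exactly the prefix behaviour the compound key gives you.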
>>>> You can mess around with inclusive_end = true / false to avoid getting the exact number "013" in your result set.
>>>> 
>>>> From this technique you can see how you could also do starts-with against just phone numbers, with a view like:
>>>> 
>>>> function(doc) {
>>>>   emit(doc.number, null);
>>>> }
>>>> 
>>>> Note I have telephone numbers as strings in this example, since as a regular number 012 is the same as 12.
>>>> 
>>>>> This is a typical telecom problem and it would be good to document a design pattern for this use case. In fact there's a discussion for another day on how/where we could document these patterns and get peer reviews on them.
>>>>> 
>>>>> Thanks again
>>>>> 
>>>>> John
>>>>> 
>>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>>> 
>>>>>> 
>>>>>> On Jul 24, 2010, at 7:41 AM, mickael.bailly@free.fr wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> 1/ It's a little hard to answer this question; your setup is certainly a little more complex than what you expose in your email :-) However, thousands of documents are handled gracefully by CouchDB.
>>>>>>> 
>>>>>>> 2/ At first sight your documents will look like:
>>>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn", "operatorId": 1234 }
>>>>>>> 
>>>>>>> That way you can query a document by phone number:
>>>>>>> 
>>>>>>> GET /database/0123456789
>>>>>>> 
>>>>>>> and get all documents belonging to the list "mylist" by creating a view that emits the "list" field:
>>>>>>> 
>>>>>>> function (doc) {
>>>>>>>   if ( doc.list && doc.type == "NP" ) {
>>>>>>>     emit(doc.list, null);
>>>>>>>   }
>>>>>>> }
>>>>>>> 
>>>>>>> and fetching them with something like:
>>>>>>> 
>>>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>>> 
>>>>>>> 3/ When updating a document: the document is of course immediately available. However, the view index won't be updated.
>>>>>>> In CouchDB, view indexes are rebuilt on view query, not on document update. When you query CouchDB for "all the documents of the view NP", Couch takes all documents that have changed (added, updated, deleted) since the last time you asked for the view, and updates the index accordingly. You have the option of fetching the view without rebuilding the index, with the "stale" parameter, but in that case, of course, you won't see the changes. During the rebuild of the index, subsequent view queries are queued until the index is up to date.
>>>>>>> 
>>>>>>> 4/ I set up CouchDB to parse network logs. A view build took something like 25 minutes for 100 million documents, on a Dell PowerEdge 2950 Xen virtual machine with two dedicated processors and 4 GB of RAM. Numbers can vary heavily according to the complexity of the view, so it's always hard (and dangerous) to give numbers. Moreover, my indexes were not only numbers but also strings.
>>>>>>> 
>>>>>> 
>>>>>> This is a good response. I'd only follow up to say that there are some techniques you can use to further tune view-generation performance. One: key size and entropy can make a big difference. The view by list, as above, looks pretty good on that front.
>>>>>> 
>>>>>> CouchDB can also be configured to store view indexes on a separate disk from the database file, which can reduce I/O contention if you are at the edge of what your hardware can do.
>>>>>> 
>>>>>> Also, there is the option to query views with stale=ok, which returns results based on the latest snapshot, with low latency, so clients aren't blocked waiting for generation to complete. Then you can use a cron job with a regular view query and limit=1 to keep the index up to date, so clients always see a fairly recent snapshot, with low latency.
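The "rebuild on query, stale=ok for snapshots" behaviour described above can be illustrated with a toy model. To be clear, this is not CouchDB's actual implementation, just a hypothetical sketch of the contract: writes never touch the index, a normal query folds pending changes in first, and stale=ok serves whatever snapshot already exists.

```javascript
// Toy model (an assumption, not the real engine): documents accumulate
// in `pending`; a normal view query indexes them before answering,
// while stale=ok returns the existing snapshot without indexing.
class ToyView {
  constructor() { this.index = []; this.pending = []; }
  update(doc) { this.pending.push(doc); }        // doc saved; index untouched
  query({ stale } = {}) {
    if (stale !== "ok") {                        // incremental rebuild
      for (const doc of this.pending) this.index.push(doc.number);
      this.index.sort();
      this.pending = [];
    }
    return [...this.index];                      // snapshot for the caller
  }
}

const view = new ToyView();
view.update({ number: "0123456789" });
console.log(view.query({ stale: "ok" }));        // [] -- snapshot, no rebuild
console.log(view.query());                       // [ '0123456789' ] -- rebuilt
```

The cron-warming trick then amounts to making sure some process issues the non-stale query regularly, so the stale=ok readers always see a recent snapshot.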
>>>>>> 
>>>>>>> 
>>>>>>> What you should be aware of is that CouchDB requires a maintenance task to keep performance up. It's called "compact" and should be run on databases (to rebuild the db file, which is append-only) and on database views (to rebuild the index file, which is also append-only). During a compact the database is still available, but performance is degraded (from my personal experience).
>>>>>>> Also, a new replication engine is in the pipeline and should greatly improve the replication experience.
>>>>>>> 
>>>>>>> 
>>>>>>> Mickael
>>>>>>> 
>>>>>>> ----- Original Mail -----
>>>>>>> From: "John"
>>>>>>> To: user@couchdb.apache.org
>>>>>>> Sent: Saturday, 24 July 2010, 11:37:56 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienna
>>>>>>> Subject: Large lists of data
>>>>>>> 
>>>>>>> Hi
>>>>>>> 
>>>>>>> I'm currently evaluating CouchDB as a candidate to replace the relational databases used in our telecom applications.
>>>>>>> For most of our data I can see a good fit, and we already expose our service provisioning as JSON over REST, so we're well positioned for a migration.
>>>>>>> One area that concerns me, though, is whether this technology is suitable for our list data. An example of this is Mobile Number Portability, where we have millions of rows of data representing ported numbers with some attributes against each.
>>>>>>> 
>>>>>>> We use the standard relational approach to this and have an entries table with a foreign-key reference to a parent list.
>>>>>>> 
>>>>>>> On our web services we do something like this:
>>>>>>> 
>>>>>>> Create a list:
>>>>>>> 
>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>>> { "type": "NP" }
>>>>>>> 
>>>>>>> To add a row to a list:
>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>>> { "status": "portedIn", "operatorId": 1234 }
>>>>>>> 
>>>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>>> 
>>>>>>> The list data is used when processing calls, and it requires a fast lookup on the entries table, which is obviously indexed.
>>>>>>> 
>>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>>> 
>>>>>>> 1) Is CouchDB the *right* technology for this job? (I know it can do it!)
>>>>>>> 
>>>>>>> 2) I presume that the relationship I currently have in my relational database would remain the same for Couch, i.e. the entry document would reference the list document, but maybe there's a better way to do this?
>>>>>>> 
>>>>>>> 3) Number portability requires 15-minute, 1-hour and daily syncs with a central number portability database. This can result in bulk updates of thousands of numbers. I'm concerned with how long it takes to build a CouchDB index and to incrementally update it when the number of changes (adds/removes) is large.
>>>>>>> What does this mean for the availability of a number? I.e. is the entry in the db but unavailable to the application because its entry in the index hasn't been built yet?
>>>>>>> 
>>>>>>> 4) Telephone numbers suit b-trees, so the index building should be quite fast and efficient, I would have thought, but does anyone have anything more concrete in terms of how long it would typically take?
>>>>>>> I think the bottleneck is disk I/O, and therefore it may be vastly different between my laptop and one of our beefy production servers, but again I'd be interested in other people's experience.
>>>>>>> 
>>>>>>> Bit of a long one, so thanks if you've read it to this point! There's a lot to like with CouchDB (especially the replication for our use case), so I'm hoping that what I've asked above is feasible!
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> John
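Pulling the thread together: John's "widest match" can be done client-side with the multi-key POST Chris describes, by generating every prefix of the dialled number down to min_width, sending those as the { "keys": [...] } body, and keeping the longest key that comes back. A sketch (the names `prefixes`, `widestMatch` and the `viewKeys` array are illustrative; the array simulates the view response rather than making a real HTTP call):

```javascript
// Build the candidate prefixes that would go into the { "keys": [...] }
// body of a multi-key POST to the view, longest first.
function prefixes(number, minWidth) {
  const out = [];
  for (let w = number.length; w >= minWidth; w--) out.push(number.slice(0, w));
  return out;
}

// Pick the longest candidate the (simulated) view actually contains.
function widestMatch(number, minWidth, viewKeys) {
  const keys = prefixes(number, minWidth);      // would be the POST body
  const found = new Set(viewKeys);              // rows the view returned
  return keys.find(k => found.has(k)) ?? null;  // keys are longest-first
}

// Simulated view contents: only some prefixes exist as ported numbers.
const ported = ["01234", "0123456"];
console.log(widestMatch("0123456789", 5, ported)); // "0123456"
```

Because the POST only ever carries at most (length - min_width + 1) keys, the round trip stays small no matter how large the list is.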