Subject: Re: Large lists of data
From: John
Reply-To: user@couchdb.apache.org
In-Reply-To: <885B3949-C9C4-4656-8A48-C55D5F080103@apache.org>
Date: Sat, 24 Jul 2010 22:14:37 +0100
Message-Id:
<42FE9041-D5E9-43FF-BD5F-1EEE3C5A35CC@netdev.co.uk>
To: user@couchdb.apache.org

Thanks for the help Chris, much appreciated. I'll give it a whirl tomorrow and I'll put a bit more work into phrasing my problem next time!

John

On 24 Jul 2010, at 21:51, J Chris Anderson wrote:

>
> On Jul 24, 2010, at 1:36 PM, John Logsdon wrote:
>
>> In SQL it would be something like:
>>
>> Select * from entries e
>> Where e.list_id = 1234
>> and key in ('0123456789', '012345678', '01234567', '0123456', '012345', '01234')
>>
>
> oh well in that case (you know all the keys in advance) you can do a multi-key query.
>
> we tunnel a semantic GET over an HTTP POST, so you POST
>
> {
>   "keys" : [array of keys]
> }
>
> to the view.
>
> For some reason I thought you were looking for a special kind of range query.
>
> Chris
>
>>
>> On 24 Jul 2010, at 21:20, J Chris Anderson wrote:
>>
>>>
>>> On Jul 24, 2010, at 1:13 PM, John wrote:
>>>
>>>> I'm not sure if that will give me what I want exactly.
>>>>
>>>> I want to search for these exact numbers only, nothing in between:
>>>>
>>>>>> 0123456789
>>>>>> 012345678
>>>>>> 01234567
>>>>>> 0123456
>>>>>> 012345
>>>>>> 01234
>>>>
>>>> Therefore the most documents it could possibly return is 6, i.e. even though startkey is "01234" and endkey is "0123456789", the number "012346" is not a valid match.
>>>>
>>>
>>> startkey "01234" endkey "012346" with inclusive_end = false should work, except it will bring in e.g. "0123457"
>>>
>>> If "0123457" is not OK then I'm flummoxed as to how to describe your key requirements in a simple way, regardless of technology.
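Since John knows all the keys in advance (the full number plus every prefix down to a minimum width), the body for Chris's multi-key POST can be generated. A minimal sketch; the helper name `prefixKeys` and the minimum width of 5 are illustrative, taken from the examples in the thread rather than any CouchDB API:

```javascript
// Build the "keys" body for a multi-key view query: the full number
// plus every prefix down to a minimum width, as in John's example.
function prefixKeys(number, minWidth) {
  const keys = [];
  for (let len = number.length; len >= minWidth; len--) {
    keys.push(number.slice(0, len));
  }
  return keys;
}

// Body to POST to the view, e.g. /database/_design/portability/_view/NP
const body = JSON.stringify({ keys: prefixKeys("0123456789", 5) });
```

Posting this returns at most one row per key, so the "widest match" is simply the longest key that came back.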
>>>
>>>> Hope that makes sense!
>>>>
>>>> John
>>>>
>>>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>>>>
>>>>>
>>>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> Thanks to you both for the answers so far. Indeed my setup is far more complex than I have exposed to date, but I'm breaking it into bite-sized chunks around the use cases that I think are the most challenging for me.
>>>>>>
>>>>>> Although your answers were useful they don't quite hit the mark, and that's probably because I didn't explain my problem well enough to start with!
>>>>>>
>>>>>> The database will contain entries from multiple lists (many thousands perhaps) so the _id will never be unique on a telephone number. Perhaps this might work though:
>>>>>>
>>>>>> GET /database/#0123456789
>>>>>>
>>>>>> or I could just keep the _id as a uuid and move this problem (find by list id and number) to the view.
>>>>>>
>>>>>
>>>>> for now I'd say just go with uuids and you can have a view by telephone number for direct (or starts_with) lookups.
>>>>>
>>>>>> The view by list won't work for me. I need to be able to query the view with something like:
>>>>>>
>>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>
>>>>>>
>>>>>> In fact in some cases the problem is more complex than this, as I need to search for the "widest match":
>>>>>>
>>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>&min_width=5
>>>>>>
>>>>>> which would return the widest match in:
>>>>>>
>>>>>> 0123456789
>>>>>> 012345678
>>>>>> 01234567
>>>>>> 0123456
>>>>>> 012345
>>>>>> 01234
>>>>>>
>>>>>>
>>>>>> I even have another use case where I need to do a STARTS_WITH, e.g. provide a key of 01234 and return true if there are any numbers that start with 01234.
>>>>>>
>>>>>
>>>>> this is easy.
>>>>> have a view like:
>>>>>
>>>>> function(doc) {
>>>>>   emit([doc.list_id, doc.number], null);
>>>>> }
>>>>>
>>>>> Then you can query with
>>>>>
>>>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>>>>
>>>>> to get everything with a prefix of "012" in the "mylist" list. you can mess around with inclusive_end = true / false to not get the exact number "013" in your result set.
>>>>>
>>>>> from this technique you can see how you could do starts-with against just phone numbers also, with a view like
>>>>>
>>>>> function(doc) {
>>>>>   emit(doc.number, null);
>>>>> }
>>>>>
>>>>> Note I have telephone numbers as strings in this example, as a regular number 012 is the same as 12.
>>>>>
>>>>>> This is a typical telecom problem and it would be good to document a design pattern for this use case. In fact there's a discussion for another day on how/where we could document these patterns and get peer reviews on them.
>>>>>>
>>>>>> Thanks again
>>>>>>
>>>>>> John
>>>>>>
>>>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>>>>
>>>>>>>
>>>>>>> On Jul 24, 2010, at 7:41 AM, mickael.bailly@free.fr wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> 1/ it's a little hard to answer this question, your setup is certainly a little more complex than what you expose in your email :-) However thousands of documents are gracefully handled by CouchDB.
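The startkey/endkey pair Chris uses ("012" up to "013") generalizes: the endkey is the prefix with its final character incremented. A small sketch of that derivation; the helper name `prefixEndkey` is mine, not from the thread or the CouchDB API:

```javascript
// Derive the endkey for a starts-with query by bumping the last
// character of the prefix, so ["mylist","012"] .. ["mylist","013"]
// covers every key beginning with "012" when inclusive_end=false.
function prefixEndkey(prefix) {
  const bumped = String.fromCharCode(prefix.charCodeAt(prefix.length - 1) + 1);
  return prefix.slice(0, -1) + bumped;
}

const startkey = JSON.stringify(["mylist", "012"]);
const endkey = JSON.stringify(["mylist", prefixEndkey("012")]);
// Query: ?startkey=["mylist","012"]&endkey=["mylist","013"]&inclusive_end=false
```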
>>>>>>>>
>>>>>>>> 2/ At first sight your documents will look like:
>>>>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn", "operatorId": 1234 }
>>>>>>>>
>>>>>>>> That way you can query your document by phone number:
>>>>>>>>
>>>>>>>> GET /database/0123456789
>>>>>>>>
>>>>>>>> and have all documents belonging to the list "mylist" by creating a view that emits the "list" field:
>>>>>>>>
>>>>>>>> function (doc) {
>>>>>>>>   if ( doc.list && doc.type == "NP" ) {
>>>>>>>>     emit(doc.list, null);
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> and fetching them with something like:
>>>>>>>>
>>>>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>>>>
>>>>>>>> 3/ When updating a document: the document is of course immediately available. However, the view index won't be updated. In CouchDB, view indexes are rebuilt on view query (not on document update). When you query CouchDB for "give me all the documents of the view NP", Couch will take all documents that have changed (been added, updated, or deleted) since the last time you asked for the view, and will update the index accordingly. You have the option of fetching the view without rebuilding the index, with the "stale" parameter, but in that case, of course, you won't see the changes. During the rebuild of the index, subsequent view queries are queued until the index is up to date.
>>>>>>>>
>>>>>>>> 4/ I set up CouchDB to parse network logs. A view took something like 25 minutes for 100 million documents, on a Dell PowerEdge 2950 Xen virtual machine with two dedicated processors and 4 GB of RAM. Numbers can vary heavily according to the complexity of the view, so it's always hard (and dangerous) to give numbers. Moreover my indexes were not only numbers, but also strings.
>>>>>>>>
>>>>>>>
>>>>>>> this is a good response.
>>>>>>> I'd only follow up to say that there are some techniques you can use to further tune view-generation performance. one: key size and entropy can make a big difference. the view by list, as above, looks pretty good on that front.
>>>>>>>
>>>>>>> CouchDB can also be configured to store view indexes on a separate disk from the database file, which can reduce IO contention if you are at the edge of what your hardware can do.
>>>>>>>
>>>>>>> Also, there is the option to query views with stale=ok, which will return a query based on the latest snapshot, with low latency, so clients aren't blocked waiting for generation to complete. then you can use a cron job with a regular view query and limit=1 to keep the index up to date. so clients always see a fairly recent snapshot, with low latency.
>>>>>>>
>>>>>>>>
>>>>>>>> What you should be aware of is that CouchDB requires a maintenance task to keep performance up: it's called "compact" and should be run on databases (to rebuild the db file, which is append-only) and on views (to rebuild the index file, which is append-only). During the compact, the database is still available but performance is degraded (from my personal experience).
>>>>>>>> Also, a new replication engine is in the pipeline and should greatly improve the replication experience.
>>>>>>>>
>>>>>>>>
>>>>>>>> Mickael
>>>>>>>>
>>>>>>>> ----- Original Mail -----
>>>>>>>> From: "John"
>>>>>>>> To: user@couchdb.apache.org
>>>>>>>> Sent: Saturday 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienna
>>>>>>>> Subject: Large lists of data
>>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I'm currently evaluating CouchDB as a candidate to replace the relational databases used in our telecom applications.
>>>>>>>> For most of our data I can see a good fit, and we already expose our service provisioning as JSON over REST, so we're well positioned for a migration.
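Chris's stale=ok pattern above boils down to two query URLs: a low-latency read path for clients and a periodic warming query that triggers the index rebuild. A sketch of the two URLs, using the view name from earlier in the thread (the variable names are illustrative):

```javascript
// The stale=ok pattern: clients read the last built snapshot, while a
// cron-driven query keeps the index fresh.
const view = "/database/_design/portability/_view/NP";

// Client read path: returns immediately from the existing snapshot,
// never waiting for view generation.
const readUrl = view + '?key="mylist"&stale=ok';

// Cron-driven warmer: a normal view query with limit=1 forces the
// index to catch up with recent document changes.
const warmUrl = view + "?limit=1";
```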
>>>>>>>> One area that concerns me, though, is whether this technology is suitable for our list data. An example of this is Mobile Number Portability, where we have millions of rows of data representing ported numbers with some attributes against each.
>>>>>>>>
>>>>>>>> We use the standard relational approach to this and have an entries table that has a foreign key reference to a parent list.
>>>>>>>>
>>>>>>>> On our web services we do something like this:
>>>>>>>>
>>>>>>>> Create a list:
>>>>>>>>
>>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>>>> { "type": "NP" }
>>>>>>>>
>>>>>>>> To add a row to a list:
>>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>>>> { "status": "portedIn", "operatorId": 1234 }
>>>>>>>>
>>>>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>>>>
>>>>>>>> The list data is used when processing calls, and it requires a fast lookup on the entries table, which is obviously indexed.
>>>>>>>>
>>>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>>>>
>>>>>>>> 1) Is CouchDB the *right* technology for this job? (I know it can do it!)
>>>>>>>>
>>>>>>>> 2) I presume that the relationship I currently have in my relational database would remain the same for Couch, i.e. the entry document would reference the list document, but maybe there's a better way to do this?
>>>>>>>>
>>>>>>>> 3) Number portability requires 15-minute, 1-hour and daily syncs with a central number portability database. This can result in bulk updates of thousands of numbers. I'm concerned with how long it takes to build a CouchDB index and to incrementally update it when the number of changes is large (adds/removes).
>>>>>>>> What does this mean for the availability of the number? i.e. is the entry in the db but unavailable to the application because its entry in the index hasn't been built yet?
>>>>>>>>
>>>>>>>> 4) Telephone numbers suit B-trees, so the index building should be quite fast and efficient I would have thought, but does someone have anything more concrete in terms of how long it would typically take? I think that the bottleneck is the disk I/O, and therefore it may be vastly different between my laptop and one of our beefy production servers, but again I'd be interested in other people's experience.
>>>>>>>>
>>>>>>>> Bit of a long one, so thanks if you've read it to this point! There's a lot to like with CouchDB (esp. the replication for our use case) so I'm hoping that what I've asked above is feasible!
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
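The bulk syncs John describes in question 3 map naturally onto CouchDB's _bulk_docs endpoint, which accepts many documents in one POST. A sketch of building such a payload; the field names follow the documents sketched earlier in the thread, and `bulkBody` is an illustrative helper, not part of CouchDB:

```javascript
// Build a _bulk_docs payload for a batch sync of ported-number entries.
// Each entry becomes one document tagged with its parent list.
function bulkBody(listId, entries) {
  return JSON.stringify({
    docs: entries.map(e => ({
      list: listId,
      type: "NP",
      number: e.number,
      status: e.status,
      operatorId: e.operatorId
    }))
  });
}

// POST this to /database/_bulk_docs in a single round-trip.
const payload = bulkBody("mylist",
  [{ number: "0123456789", status: "portedIn", operatorId: 1234 }]);
```

After the batch lands, the next regular view query (or the cron warmer mentioned above) folds all the new documents into the index in one incremental pass.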