incubator-couchdb-user mailing list archives

From J Chris Anderson <jch...@apache.org>
Subject Re: Large lists of data
Date Sat, 24 Jul 2010 20:51:38 GMT

On Jul 24, 2010, at 1:36 PM, John Logsdon wrote:

> In SQL it would be something like:
> 
> Select * from entries e
> Where e.list-id=1234 
> and key in ('0123456789', '012345678', '01234567', '0123456', '012345', '01234')
> 

oh well in that case (you know all the keys in advance) you can do a multi-key query.

we tunnel a semantic GET over an HTTP POST, so you POST

{
"keys" : [array of keys]
}

to the view.
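As a concrete sketch of that POST body, here is a small Python helper that builds the {"keys": [...]} document for the six numbers above. The prefix-generation logic and the min_width parameter are illustrative assumptions for this thread's use case, not part of CouchDB:

```python
import json

def multikey_body(number, min_width=5):
    """Build the {"keys": [...]} body for a multi-key view POST, listing
    every prefix of `number` down to `min_width` digits."""
    keys = [number[:i] for i in range(len(number), min_width - 1, -1)]
    return json.dumps({"keys": keys})

# POST this body to the view, e.g. with curl (placeholder db/view names):
#   curl -X POST -H 'Content-Type: application/json' \
#     -d "$(python -c 'print(multikey_body("0123456789"))')" \
#     http://localhost:5984/db/_design/portability/_view/NP
print(multikey_body("0123456789"))
```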

For some reason I thought you were looking for a special kind of range query.

Chris

> 
> On 24 Jul 2010, at 21:20, J Chris Anderson <jchris@couch.io> wrote:
> 
>> 
>> On Jul 24, 2010, at 1:13 PM, John wrote:
>> 
>>> I'm not sure if that will give me what I want exactly.
>>> 
>>> I want to search for these exact numbers only, nothing in between
>>> 
>>>>> 0123456789
>>>>> 012345678
>>>>> 01234567
>>>>> 0123456
>>>>> 012345
>>>>> 01234
>>> 
>>> Therefore the most documents it could possibly return is 6, i.e. even though startkey is "01234" and endkey is "0123456789", the number "012346" is not a valid match.
>>> 
>> 
>> startkey "01234" endkey "012346" with inclusive_end = false should work, except it will bring in e.g. "0123457"
>> 
>> If "0123457" is not OK then I'm flummoxed as to how to describe your key requirements in a simple way, regardless of technology.
>> 
>>> Hope that makes sense!
>>> 
>>> John
>>> 
>>> On 24 Jul 2010, at 20:49, J Chris Anderson wrote:
>>> 
>>>> 
>>>> On Jul 24, 2010, at 12:35 PM, John wrote:
>>>> 
>>>>> Hi
>>>>> 
>>>>> Thanks to you both for the answers so far. Indeed my setup is far more complex than I have exposed to date, but I'm breaking it into bite-sized chunks around the use cases that I think are the most challenging for me.
>>>>> 
>>>>> Although your answers were useful they don't quite hit the mark, and that's probably because I didn't explain my problem well enough to start with!
>>>>> 
>>>>> The database will contain entries from multiple lists (many thousands perhaps) so the _id will never be unique on a telephone number. Perhaps this might work though:
>>>>> 
>>>>> GET /database/<list _id>#0123456789
>>>>> 
>>>>> or I could just keep the _id as a uuid and move this problem (find by list id and number) to the view.
>>>>> 
>>>> 
>>>> for now I'd say just go with uuids and you can have a view by telephone number for direct (or starts_with) lookups.
>>>> 
>>>>> The view by list won't work for me. I need to be able to query the view with something like:
>>>>> 
>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>
>>>>> 
>>>>> In fact in some cases the problem is more complex than this, as I need to search for the "widest match":
>>>>> 
>>>>> GET /database/_design/portability/_view/NP?key=0123456789&list=<_id of list>&min_width=5
>>>>> 
>>>>> which would return the widest match in:
>>>>> 
>>>>> 0123456789
>>>>> 012345678
>>>>> 01234567
>>>>> 0123456
>>>>> 012345
>>>>> 01234
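The widest-match lookup described above can be sketched in Python. The in-memory set here stands in for the keys actually returned by a view query; the function and parameter names are illustrative, not an existing API:

```python
def widest_match(number, known_numbers, min_width=5):
    """Return the longest prefix of `number`, at least `min_width` digits
    long, that appears in `known_numbers` (e.g. keys a view returned)."""
    for width in range(len(number), min_width - 1, -1):
        candidate = number[:width]
        if candidate in known_numbers:
            return candidate
    return None

ported = {"01234", "0123456"}
print(widest_match("0123456789", ported))  # -> 0123456
```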
>>>>> 
>>>>> 
>>>>> I even have another use case where I need to do a STARTS_WITH, e.g. provide a key of 01234 and return true if there are any numbers that start with 01234.
>>>>> 
>>>> 
>>>> this is easy. have a view like:
>>>> 
>>>> function(doc) {
>>>> emit([doc.list_id, doc.number], null) 
>>>> }
>>>> 
>>>> Then you can query with 
>>>> 
>>>> ?startkey=["mylist", "012"]&endkey=["mylist", "013"]
>>>> 
>>>> to get everything with a prefix of "012" in the "mylist" list. You can set inclusive_end=false to keep the exact number "013" out of your result set.
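The prefix-to-endkey step above ("012" becomes "013") generalizes to incrementing the last character of the prefix. A small helper, as a plain-Python sketch rather than anything CouchDB provides:

```python
def prefix_range(prefix):
    """Return (startkey, endkey) bounding every string that starts with
    `prefix`, by incrementing the prefix's last character."""
    endkey = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, endkey

start, end = prefix_range("012")
# query the view with:
#   ?startkey=["mylist", start]&endkey=["mylist", end]&inclusive_end=false
```

For digit-only keys this is safe; for arbitrary string keys a common trick is instead to append a high character such as u"\ufff0" to the prefix to form the endkey.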
>>>> 
>>>> from this technique you can see how you could do starts-with against just phone numbers also, with a view like
>>>> 
>>>> function(doc) {
>>>> emit(doc.number, null)
>>>> }
>>>> 
>>>> Note I have telephone numbers as strings in this example, since as a regular number 012 is the same as 12.
>>>> 
>>>>> This is a typical telecom problem and it would be good to document a design pattern for this use case. In fact there's a discussion for another day on how/where we could document these patterns and get peer reviews on them.
>>>>> 
>>>>> Thanks again
>>>>> 
>>>>> John
>>>>> 
>>>>> On 24 Jul 2010, at 19:15, J Chris Anderson wrote:
>>>>> 
>>>>>> 
>>>>>> On Jul 24, 2010, at 7:41 AM, mickael.bailly@free.fr wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> 1/ it's a little hard to answer this question; your setup is certainly a little more complex than what you expose in your email :-) However, thousands of documents are gracefully handled by CouchDB.
>>>>>>> 
>>>>>>> 2/ At first sight your documents will look like:
>>>>>>> { "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn", "operatorId": 1234 }
>>>>>>> 
>>>>>>> That way you can query your document by phone number:
>>>>>>> 
>>>>>>> GET /database/0123456789
>>>>>>> 
>>>>>>> and have all documents belonging to the list "mylist" by creating a view that emits the "list" field:
>>>>>>> 
>>>>>>> function (doc) {
>>>>>>>   if (doc.list && doc.type == "NP") {
>>>>>>>     emit(doc.list, null);
>>>>>>>   }
>>>>>>> }
>>>>>>> 
>>>>>>> and fetching them with something like:
>>>>>>> 
>>>>>>> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
>>>>>>> 
>>>>>>> 3/ When updating a document: the document is of course immediately available. However the view index won't be updated. In CouchDB, view indexes are rebuilt on view query (not on document update). When you query CouchDB for "give me all the documents of the view NP", Couch will take all documents that have changed (added, updated, deleted) since the last time you asked Couch for the view, and will update indexes accordingly. You have the option of fetching the view without rebuilding the index, with the "stale" parameter, but in this case, of course, you won't see the changes. During the rebuild of the index, subsequent view queries are queued until the index is up to date.
>>>>>>> 
>>>>>>> 4/ I set up CouchDB to parse network logs. A view took something like 25 minutes for 100 million documents, on a Dell PowerEdge 2950 Xen virtual machine with two dedicated processors and 4 GB of RAM. Numbers can vary heavily according to the complexity of the view, so it's always hard (and dangerous) to give numbers. Moreover, my indexes were not only numbers but also strings.
>>>>>>> 
>>>>>> 
>>>>>> this is a good response. I'd only follow up to say that there are some techniques you can use to further tune view-generation performance. One: key size and entropy can make a big difference. The view by list, as above, looks pretty good on that front.
>>>>>> 
>>>>>> CouchDB can also be configured to store view indexes on a separate disk from the database file, which can reduce IO contention if you are at the edge of what your hardware can do.
>>>>>> 
>>>>>> Also, there is the option to query views with stale=ok, which will return results based on the latest snapshot, with low latency, so clients aren't blocked waiting for generation to complete. Then you can use a cron job with a regular view query and limit=1 to keep the index up to date, so clients always see a fairly recent snapshot, with low latency.
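As an illustrative sketch of that cron job (the host, database, and view names are placeholders, not values from this thread): clients read with ?stale=ok, while a scheduled task issues a normal query that triggers the incremental index update.

```python
from urllib.request import urlopen

def refresh_url(base, db, ddoc, view):
    """URL for a normal (index-updating) view query; hit this from cron
    every minute or so, so clients querying with ?stale=ok see a
    recently built snapshot."""
    return f"{base}/{db}/_design/{ddoc}/_view/{view}?limit=1"

# e.g. from a minutely cron job (requires a running CouchDB):
#   urlopen(refresh_url("http://localhost:5984", "numberdb", "portability", "NP"))
```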
>>>>>> 
>>>>>>> 
>>>>>>> What you should be aware of is that CouchDB requires a maintenance task to keep performance up: it's called "compact" and should be run on databases (to rebuild the db file, which is append-only) and on views (to rebuild the index file, which is append-only). During the compact, the database is still available but performance is degraded (from my personal experience).
>>>>>>> Also, a new replication engine is in the pipeline and should greatly improve the replication experience.
>>>>>>> 
>>>>>>> 
>>>>>>> Mickael
>>>>>>> 
>>>>>>> ----- Original Message -----
>>>>>>> From: "John" <john.logsdon@netdev.co.uk>
>>>>>>> To: user@couchdb.apache.org
>>>>>>> Sent: Saturday, 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / Berne / Rome / Stockholm / Vienna
>>>>>>> Subject: Large lists of data
>>>>>>> 
>>>>>>> Hi 
>>>>>>> 
>>>>>>> I'm currently evaluating CouchDB as a candidate to replace the relational databases used in our telecom applications.
>>>>>>> For most of our data I can see a good fit, and we already expose our service provisioning as JSON over REST, so we're well positioned for a migration.
>>>>>>> One area that concerns me though is whether this technology is suitable for our list data. An example of this is Mobile Number Portability, where we have millions of rows of data representing ported numbers, with some attributes against each.
>>>>>>> 
>>>>>>> We use the standard relational approach to this and have an entries table that has a foreign key reference to a parent list.
>>>>>>> 
>>>>>>> On our web services we do something like this:
>>>>>>> 
>>>>>>> Create a List:
>>>>>>> 
>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist
>>>>>>> { "type": "NP"}
>>>>>>> 
>>>>>>> To add a row to a list 
>>>>>>> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
>>>>>>> { "status":"portedIn", "operatorId":1234}
>>>>>>> 
>>>>>>> If we want to add a lot of rows we just POST a document to the list.
>>>>>>> 
>>>>>>> The list data is used when processing calls, and it requires a fast lookup on the entries table, which is obviously indexed.
>>>>>>> 
>>>>>>> Anyway, I'd be interested in getting some opinions on:
>>>>>>> 
>>>>>>> 1) Is CouchDB the *right* technology for this job? (I know it can do it!)
>>>>>>> 
>>>>>>> 2) I presume that the relationship I currently have in my relational database would remain the same for Couch, i.e. the entry document would reference the list document, but maybe there's a better way to do this?
>>>>>>> 
>>>>>>> 3) Number portability requires 15-minute, hourly, and daily syncs with a central number portability database. This can result in bulk updates of thousands of numbers. I'm concerned with how long it takes to build a CouchDB index and to incrementally update it when the number of changes is large (adds/removes).
>>>>>>> What does this mean for the availability of the number? i.e. is the entry in the db but unavailable to the application because its entry in the index hasn't been built yet?
>>>>>>> 
>>>>>>> 4) Telephone numbers like btrees, so the index building should be quite fast and efficient I would have thought, but does anyone have anything more concrete in terms of how long it would typically take? I think the bottleneck is the disk I/O, and therefore it may be vastly different between my laptop and one of our beefy production servers, but again I'd be interested in other people's experience.
>>>>>>> 
>>>>>>> Bit of a long one, so thanks if you've read it to this point! There's a lot to like with CouchDB (esp. the replication for our use case) so I'm hoping that what I've asked above is feasible!
>>>>>>> 
>>>>>>> Thanks
>>>>>>> 
>>>>>>> John
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 

