incubator-couchdb-user mailing list archives

From J Chris Anderson <jch...@apache.org>
Subject Re: Large lists of data
Date Sat, 24 Jul 2010 18:15:32 GMT

On Jul 24, 2010, at 7:41 AM, mickael.bailly@free.fr wrote:

> Hello,
> 
> 1/ It's a little hard to answer this question; your setup is certainly a
> little more complex than what you describe in your email :-) However,
> CouchDB handles thousands of documents gracefully.
> 
> 2/ At first sight your documents will look like this (the "_id" must be a
> string, which also preserves the leading zero):
> { "_id": "0123456789", "list": "mylist", "type": "NP", "status": "portedIn", "operatorId": 1234 }
> 
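As a rough illustration of the document shape above, the list name and the attributes from the REST call can be folded into one CouchDB document per number. This is only a sketch: `toEntryDoc` is a hypothetical helper name, not part of CouchDB, and the field names are simply those of the example document.

```javascript
// Hypothetical helper (not a CouchDB API): builds one document per ported
// number, using the phone number itself as the document _id.
function toEntryDoc(listName, listType, phoneNumber, attributes) {
  // The fixed fields are applied last so that attributes cannot override them.
  return Object.assign({}, attributes, {
    _id: String(phoneNumber), // a string, so the leading zero is kept
    list: listName,
    type: listType,
  });
}

const doc = toEntryDoc("mylist", "NP", "0123456789", {
  status: "portedIn",
  operatorId: 1234,
});
console.log(JSON.stringify(doc));
```

Keeping the list name on every entry, rather than in a separate parent document, is what lets a single view group the entries by list.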
> That way you can fetch a document by phone number:
> 
> GET /database/0123456789
> 
> and get all documents belonging to the list "mylist" by creating a view
> that emits the "list" field:
> function (doc) {
>   // Index only number-portability entries that belong to a list.
>   if (doc.list && doc.type === "NP") {
>     emit(doc.list, null);
>   }
> }
> 
> and fetching them with something like:
> 
> GET /database/_design/portability/_view/NP?key="mylist"&include_docs=true
> 
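To make the view and the query above more concrete, here is a small sketch that runs an equivalent map function locally against sample documents and builds the query URL. The `emit` stub, `runMap`, `viewUrl` and the sample documents are assumptions made for illustration; inside CouchDB, `emit` is provided by the view server and the rows are returned sorted by key.

```javascript
// A design document holding the view; CouchDB stores map functions as strings.
const designDoc = {
  _id: "_design/portability",
  views: {
    NP: {
      map: 'function (doc) { if (doc.list && doc.type === "NP") { emit(doc.list, null); } }',
    },
  },
};

// Local stand-in for the view server: evaluates the map source with a stub emit.
function runMap(mapSource, docs) {
  const rows = [];
  let current = null;
  const emit = (key, value) => rows.push({ id: current._id, key, value });
  const map = new Function("emit", "return (" + mapSource + ");")(emit);
  for (const doc of docs) {
    current = doc;
    map(doc);
  }
  return rows;
}

// Builds the view URL; the key must be JSON-encoded and then URL-encoded.
function viewUrl(db, ddoc, view, params) {
  const query = Object.entries(params)
    .map(([k, v]) => encodeURIComponent(k) + "=" + encodeURIComponent(v))
    .join("&");
  return "/" + db + "/_design/" + ddoc + "/_view/" + view + "?" + query;
}

const sampleDocs = [
  { _id: "0123456789", list: "mylist", type: "NP", status: "portedIn", operatorId: 1234 },
  { _id: "0987654321", list: "otherlist", type: "NP", status: "portedOut", operatorId: 5678 },
  { _id: "some-other-doc", type: "account" }, // no "list" field, so not emitted
];

const rows = runMap(designDoc.views.NP.map, sampleDocs);
const url = viewUrl("database", "portability", "NP", {
  key: JSON.stringify("mylist"),
  include_docs: true,
});
console.log(rows);
console.log(url);
```

Emitting `null` as the value keeps the index small; `include_docs=true` then pulls in the full documents at query time.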
> 3/ When you update a document, the document itself is of course
> immediately available. However, the view index is not updated at that
> point. In CouchDB, view indexes are updated on view query (not on document
> update). When you ask CouchDB "give me all the documents of the view NP",
> Couch takes all documents that have changed (added, updated, deleted)
> since the last time the view was queried and updates the index
> accordingly. You have the option of fetching the view without updating
> the index, with the "stale" parameter, but in that case, of course, you
> won't see the changes. While the index is being updated, subsequent view
> queries are queued until the index is up to date.
> 
> 4/ I set up CouchDB to parse network logs. Building a view took something
> like 25 minutes for 100 million documents, on a Dell PowerEdge 2950 Xen
> virtual machine with two dedicated processors and 4 GB of RAM. Numbers can
> vary heavily with the complexity of the view, so it's always hard (and
> dangerous) to give numbers. Moreover, my index keys were not only numbers
> but also strings.
> 
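For a very rough sense of scale, the figures quoted above (100 million documents in about 25 minutes) can be turned into a throughput number. This is back-of-the-envelope arithmetic only: the 10,000-document batch is a hypothetical size, and, as noted above, real numbers depend heavily on the view and the hardware.

```javascript
// Figures taken from the message above.
const docsIndexed = 100e6;      // 100 million documents
const buildSeconds = 25 * 60;   // about 25 minutes

// Average indexing throughput implied by those figures.
const docsPerSecond = docsIndexed / buildSeconds;

// Naive linear estimate for an incremental update of `changedDocs` documents.
function estimateSeconds(changedDocs) {
  return changedDocs / docsPerSecond;
}

console.log(Math.round(docsPerSecond));   // documents per second, rounded
console.log(estimateSeconds(10000));      // seconds for a hypothetical 10,000-doc sync
```

At that rate a sync of a few thousand numbers would add well under a second of indexing work, but this extrapolation assumes the same hardware and a view of similar complexity.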

This is a good response. I'd only follow up to say that there are some
techniques you can use to further tune view-generation performance. One:
key size and entropy can make a big difference. The view by list, as above,
looks pretty good on that front.

CouchDB can also be configured to store view indexes on a separate disk from the database
file, which can reduce IO contention if you are at the edge of what your hardware can do.

Also, there is the option to query views with stale=ok, which returns
results from the latest snapshot of the index, with low latency, so clients
aren't blocked waiting for generation to complete. You can then use a cron
job that runs a regular view query with limit=1 to keep the index up to
date, so clients always see a fairly recent snapshot.
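The two kinds of request described above can be sketched as follows. The helper names are hypothetical and the view path is the one used earlier in the thread; `stale=ok` is the parameter as it exists at the time of this message, so check the documentation for your CouchDB version.

```javascript
// View path from the example earlier in the thread.
const VIEW_PATH = "/database/_design/portability/_view/NP";

// Client-facing query: answers from the existing index without waiting
// for pending changes to be indexed.
function staleQueryUrl(listName) {
  const key = encodeURIComponent(JSON.stringify(listName));
  return VIEW_PATH + "?key=" + key + "&include_docs=true&stale=ok";
}

// Warm-up query for a scheduled job: a regular (non-stale) query makes
// CouchDB bring the index up to date; limit=1 keeps the response tiny.
function warmUpUrl() {
  return VIEW_PATH + "?limit=1";
}

console.log(staleQueryUrl("mylist"));
console.log(warmUpUrl());
```

A cron job would issue a GET for the warm-up URL at whatever interval matches how fresh the clients' snapshot needs to be.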

> 
> What you should be aware of is that CouchDB requires maintenance tasks to
> keep performance good. The task is called "compact" and should be run on
> databases (to rebuild the append-only db file) and on views (to rebuild
> the append-only index file). During compaction the database is still
> available, but performance is degraded (in my personal experience).
> Also, a new replication engine is in the pipeline and should greatly
> improve the replication experience.
> 
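The compaction tasks mentioned above are triggered over HTTP. The sketch below only builds the request descriptions, using the database and design document names from this thread as placeholders; `compactionRequests` is a hypothetical helper, and actually sending the requests (with admin credentials where required) is left to whatever HTTP client you use.

```javascript
// Builds the HTTP requests that trigger compaction of a database and of
// the view indexes of the given design documents.
function compactionRequests(db, designDocNames) {
  const headers = { "Content-Type": "application/json" };
  // Database compaction: POST /{db}/_compact
  const requests = [{ method: "POST", path: "/" + db + "/_compact", headers }];
  // View compaction, one request per design document: POST /{db}/_compact/{ddoc}
  for (const name of designDocNames) {
    requests.push({ method: "POST", path: "/" + db + "/_compact/" + name, headers });
  }
  return requests;
}

const requests = compactionRequests("database", ["portability"]);
for (const r of requests) {
  console.log(r.method + " " + r.path);
}
```

Given the degraded performance during compaction noted above, scheduling these requests for quiet periods seems a sensible default.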
> 
> Mickael
> 
> ----- Original Message -----
> From: "John" <john.logsdon@netdev.co.uk>
> To: user@couchdb.apache.org
> Sent: Saturday, 24 July 2010 11:37:56 GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
> Subject: Large lists of data
> 
> Hi 
> 
> I'm currently evaluating CouchDB as a candidate to replace the relational
> databases used in our telecom applications.
> For most of our data I can see a good fit, and we already expose our
> service provisioning as JSON over REST, so we're well positioned for a
> migration.
> One area that concerns me, though, is whether this technology is suitable
> for our list data. An example of this is Mobile Number Portability, where
> we have millions of rows of data representing ported numbers with some
> attributes against each.
> 
> We use the standard relational approach to this and have an entries table
> that has a foreign key reference to a parent list.
> 
> On our web services we do something like this:
> 
> Create a List:
> 
> PUT /cie-rest/provision/accounts/netdev/lists/mylist
> { "type": "NP"}
> 
> To add a row to a list 
> PUT /cie-rest/provision/accounts/netdev/lists/mylist/entries/0123456789
> { "status":"portedIn", "operatorId":1234}
> 
> If we want to add a lot of rows we just POST a document to the list.
> 
> The list data is used when processing calls, and it requires a fast lookup
> on the entries table, which is obviously indexed.
> 
> Anyway, I'd be interested in getting some opinions on:
> 
> 1) Is couchdb the *right* technology for this job? (I know it can do it!)
> 
> 2) I presume that the relationship I currently have in my relational
> database would remain the same for Couch, i.e. the entry document would
> reference the list document, but maybe there's a better way to do this?
> 
> 3) Number portability requires 15-minute, 1-hour and daily syncs with a
> central number portability database. This can result in bulk updates of
> thousands of numbers. I'm concerned with how long it takes to build a
> CouchDB index and to incrementally update it when the number of changes
> (adds/removes) is large.
> What does this mean for the availability of the number? I.e. is the entry
> in the db but unavailable to the application because its entry in the
> index hasn't been built yet?
> 
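For the bulk syncs described in point 3, one option not discussed in the replies above is CouchDB's `_bulk_docs` endpoint, which accepts many documents in a single POST. The sketch below only prepares the request bodies; `toBulkRequests`, the batch size of 1000 and the generated sample numbers are assumptions for illustration, and updates to existing documents would also need each document's current `_rev`.

```javascript
// Splits documents into batches and builds one _bulk_docs request per batch.
function toBulkRequests(db, docs, batchSize) {
  const requests = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    requests.push({
      method: "POST",
      path: "/" + db + "/_bulk_docs",
      headers: { "Content-Type": "application/json" },
      // _bulk_docs expects a JSON object with a "docs" array.
      body: JSON.stringify({ docs: docs.slice(i, i + batchSize) }),
    });
  }
  return requests;
}

// Hypothetical sync of 2500 ported numbers (generated sample data).
const syncDocs = [];
for (let n = 0; n < 2500; n++) {
  syncDocs.push({
    _id: "0" + String(123450000 + n),
    list: "mylist",
    type: "NP",
    status: "portedIn",
    operatorId: 1234,
  });
}

const bulkRequests = toBulkRequests("database", syncDocs, 1000);
console.log(bulkRequests.length + " requests prepared");
```

Each document is readable by `_id` as soon as its batch is written; only view results wait for the index to catch up on the next non-stale query.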
> 4) Telephone numbers suit B-trees well, so I would have thought the index
> building should be quite fast and efficient, but does someone have
> anything more concrete in terms of how long it would typically take? I
> think that the bottleneck is the disk I/O, and therefore it may be vastly
> different between my laptop and one of our beefy production servers, but
> again I'd be interested in other people's experience.
> 
> Bit of a long one, so thanks if you've read it to this point! There's a
> lot to like with CouchDB (especially the replication for our use case), so
> I'm hoping that what I've asked above is feasible!
> 
> Thanks
> 
> John
> 
> 

