couchdb-user mailing list archives

From J Chris Anderson <jch...@gmail.com>
Subject Re: couchdb and millions of records
Date Mon, 26 Jul 2010 18:02:14 GMT

On Jul 26, 2010, at 10:41 AM, Simon Metson wrote:

> Hi,
> We've done things at this scale with CouchDB. The key thing is to do bulk inserts, and
> to trigger view indexing as you go. For instance our code by default will bulk insert 5000
> records, then hit a view, then do the next 5000 then hit the view etc. Of course the batch
> size is something you'd want to tune, since it'll depend on your documents and views. It's
> much quicker to do the view index incrementally than hit all N million records at once. You
> might also want to hit view and db compaction occasionally, especially if you're also doing
> bulk deletes.
> Cheers
> Simon
> 
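
A rough sketch of the batching pattern described above, in Python, in case it helps. This is an illustration under assumptions, not Simon's actual code: the database URL, the view path, and the helper names (`chunked`, `http_request`, `load_in_batches`) are all hypothetical. It only relies on the `_bulk_docs` endpoint and on an ordinary view query, which waits until the index has caught up before it returns.

```python
import json
import urllib.request


def chunked(docs, size):
    """Yield successive lists of at most `size` documents."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def http_request(method, url, body=None):
    """Minimal JSON-over-HTTP helper (hypothetical) using only the stdlib."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def load_in_batches(docs, db_url, view_path, batch_size=5000,
                    request=http_request):
    """Bulk insert `docs`, querying the view after every batch so the
    index is built incrementally rather than all at once at the end.
    Returns the number of batches sent."""
    batches = 0
    for batch in chunked(docs, batch_size):
        request("POST", db_url + "/_bulk_docs", {"docs": batch})
        # limit=1 keeps the response small; the query itself is what
        # makes CouchDB bring the view index up to date.
        request("GET", db_url + "/" + view_path + "?limit=1")
        batches += 1
    return batches
```

Usage would look something like `load_in_batches(docs, "http://localhost:5984/wex", "_design/dates/_view/by_date")`, where both the database name and the design document are made up for the example. The batch size of 5000 is just the default Simon mentions and would need tuning per dataset.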

Also, 1.0 should be significantly faster for your use case.

Chris

> On 26 Jul 2010, at 18:00, Norman Barker wrote:
> 
>> Hi,
>> 
>> I have sampled the Wikipedia TSV collection from Freebase
>> (http://wiki.freebase.com/wiki/WEX/Documentation#articles). I ran this
>> through awk to drop the xml field and then did a simple conversion to
>> JSON. I then call _bulk_docs 150 docs at a time into CouchDB 0.11.
>> 
>> I wrote a simple view in Erlang that emits the date as a key (I am
>> actually using this to test the free text search couchdb-clucene). The
>> views are fast once computed.
>> 
>> The amount of disk storage used by CouchDB is an issue, and the write
>> times are slow. I changed my view, and my 2.3 million record view
>> computation is still running!
>> 
>>       "request_time": {
>>           "description": "length of a request inside CouchDB without MochiWeb",
>>           "current": 2253451.122,
>>           "sum": 2253451.122,
>>           "mean": 501.212,
>>           "stddev": 12275.385,
>>           "min": 0.5,
>>           "max": 798124.0
>>       },
>> 
>> For my use case, once the system is up there are only a few updates per
>> hour, but doing the initial harvest takes a long time.
>> 
>> Does 1.0 make substantial gains on this, and if so, how? Are there any
>> other areas that I should be looking at to improve this? I am happy
>> writing Erlang code.
>> 
>> thanks,
>> 
>> Norman
> 

