couchdb-user mailing list archives

From James Dingwall <james.dingw...@zynstra.com>
Subject Re: distributed user case
Date Mon, 16 Nov 2015 09:26:02 GMT
Dale Scott wrote:
> Without intentionally obfuscating, I have 128GB of data collected from an
> experiment, roughly equivalent to a large set of 640x480 PNG images. Images
> are independent and analyzed image-by-image by an image recognition
> algorithm. I was thinking of dividing the set of images into sub-sets by a
> scheduler and have a new EC2 instance analyze each sub-set.
You may find that replicating subsets of your data to the analyzing
instance is unnecessary.  If I want to process a set of documents in
parallel and it isn't important where they are processed, I write a view
function which assigns each document a random number from 1..n, e.g.:

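function(doc) {
     // Number of analyzer instances (n); emitted keys are '1'..'n'.
     var instances_count = 3;
     // Only index documents that have not been analyzed yet.
     if(!doc.analyzer_result) {
         // floor()+1 gives a uniform integer in 1..n; round() would give 0..n.
         emit('' + (Math.floor(Math.random() * instances_count) + 1), null);
     }
}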

Assign each analyzer a number, which is the key it will process from
the database.  If the analysis time is greater than the round-trip time
of talking to CouchDB this should be OK as is.  If the network time
becomes significant in relation to the processing, you could buffer
documents at the fetch stage (query the view with include_docs=true and
limit=200) and at the save stage (use bulk update).
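
As a rough sketch of that fetch / analyze / save cycle, something like the
following could run on each analyzer (Node 18+ for the global fetch).  The
database URL, the design document and view names, and the analyze callback
are all made up for illustration; only the include_docs, limit and key
query parameters and the _bulk_docs endpoint are CouchDB's own.

```javascript
// Sketch only: the database URL and design doc / view names are hypothetical.
const DB_URL = "http://localhost:5984/images";
const VIEW_PATH = "_design/work/_view/by_analyzer";

// Build the view query for one analyzer's key. View keys must be JSON-encoded,
// so the string key "2" is sent as %222%22.
function viewUrl(analyzerKey, limit = 200) {
  const params = new URLSearchParams({
    key: JSON.stringify(String(analyzerKey)),
    include_docs: "true",
    limit: String(limit),
  });
  return `${DB_URL}/${VIEW_PATH}?${params}`;
}

// Attach a result to each fetched doc and wrap them for POST /{db}/_bulk_docs.
// The docs keep their _id and _rev, so the bulk request updates them in place.
function bulkPayload(rows, analyze) {
  return {
    docs: rows.map((row) => ({ ...row.doc, analyzer_result: analyze(row.doc) })),
  };
}

// One fetch / analyze / save cycle; resolves to the number of docs handled.
async function processBatch(analyzerKey, analyze) {
  const res = await fetch(viewUrl(analyzerKey));
  const { rows } = await res.json();
  if (rows.length === 0) return 0;
  await fetch(`${DB_URL}/_bulk_docs`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(bulkPayload(rows, analyze)),
  });
  return rows.length;
}
```

Because saved documents now carry analyzer_result, they drop out of the
view, so calling processBatch again with the same key returns the next
batch until it resolves to 0.  The sketch does not inspect the per-document
statuses that _bulk_docs returns, so conflicts would need handling in real
use.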

James
